EMViT-DDPM: An Equilibrium-Based ViT Diffusion Framework for Data Augmentation in Multispectral Land Cover Classification

The scarcity and imbalance of labeled samples, common in remote sensing datasets, pose significant challenges for accurate analysis and classification and often lead to substantial bias against minority classes. To address these issues, we propose EMViT-DDPM, an equilibrium-based data augmentation framework built on vision transformer (ViT)-based denoising diffusion probabilistic models (DDPMs). Unlike the generative adversarial network (GAN)-based augmentation techniques common in the literature, our framework is not tied to any specific classifier. It uses a ViT architecture enhanced with an adaptive layer normalization (AdaLN) block, designed to minimize computational cost while still capturing the complexity of the data. By adopting DDPMs rather than GANs, the framework achieves more robust training, improved generalization, and better control over the quality of generated samples. To address class imbalance, we introduce equilibrium-based data augmentation, which assigns each class its own augmentation proportion according to the size of that class. In addition, we propose a superpixel-based segmentation preprocessing step for patch generation, which tailors the data augmentation method to high-spatial-resolution multispectral remote sensing imagery. Finally, we propose a novel strategy for evaluating data augmentation quality, based on a new judge model trained on balanced classes. This allows a more precise evaluation of the Fréchet inception distance, precision (fidelity), and recall (diversity).

keywords: bias mitigation, class imbalance, data augmentation, data scarcity, diffusion models, Earth observation (EO)
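
As a rough illustration of the idea behind equilibrium-based data augmentation, the sketch below computes how many synthetic samples each class would receive. The abstract does not state the actual allocation rule, so the rule used here (top every class up to the size of the largest class) is an assumption, and the function name `augmentation_plan` and the class names are hypothetical.

```python
from collections import Counter


def augmentation_plan(labels, target=None):
    """Return {class: number of synthetic samples to generate}.

    Assumption (not taken from the paper): every class is topped up to
    `target`, which defaults to the size of the largest class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target receive no synthetic samples.
    return {cls: max(target - n, 0) for cls, n in counts.items()}


# Hypothetical, imbalanced label set for three land cover classes.
labels = ["water"] * 500 + ["urban"] * 120 + ["wetland"] * 30
plan = augmentation_plan(labels)

for cls, n_new in plan.items():
    print(f"{cls}: generate {n_new} synthetic samples")
```

Under this assumed rule, minority classes receive proportionally more generated samples than majority classes, which is the balancing effect the framework aims for; a different target (for example, a fixed per-class budget) can be passed through `target`.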