Section 01
Latent Space Denoising: A New Paradigm for Enhancing Visual Alignment of Multimodal Large Models (Introduction)
This paper proposes a latent space denoising framework that enhances the internal visual representation alignment capability of multimodal large models through a saliency-aware token masking and Gaussian noise mixing strategy. The method achieves significant improvements in standard benchmark tests (such as VQA-v2, GQA) and compositional robustness tests (such as NaturalBench), with zero additional overhead during inference.