Section 01
[Main Post/Introduction] RemoteShield: Building a Robust Multimodal Large Model for Earth Observation
Existing remote sensing multimodal large language models (MLLMs) degrade under real-world noise: visual noise such as cloud occlusion and haze, and text noise such as colloquial phrasing and ambiguous instructions. The RemoteShield framework addresses this by constructing semantic equivalence clusters and applying cross-condition preference learning, aligning the semantics of clean and perturbed inputs. Across three Earth observation tasks (scene classification, object detection, and visual question answering), it achieves stronger robustness and cross-condition consistency while remaining competitive on clean data.
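To make the two core ideas concrete, here is a minimal, hypothetical sketch of what a "semantic equivalence cluster" might look like as a data structure: a clean input grouped with noisy variants that carry the same meaning, from which cross-condition (clean vs. perturbed) pairs can be drawn for preference learning. All names and the structure itself are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the class name, fields, and pairing logic
# are assumptions, not RemoteShield's actual code.
from dataclasses import dataclass, field

@dataclass
class EquivalenceCluster:
    clean: str                                      # clean instruction
    perturbed: list = field(default_factory=list)   # noisy same-meaning variants

    def preference_pairs(self):
        # Each (clean, noisy) pair can supervise a preference objective:
        # the model's behavior under noise should match its clean behavior.
        return [(self.clean, p) for p in self.perturbed]

cluster = EquivalenceCluster(
    clean="Classify the land-cover type in this image.",
    perturbed=[
        "uh, what kind of land is this pic showing?",   # colloquial text noise
        "Clasify the landcover typ in this image.",     # typo noise
    ],
)
pairs = cluster.preference_pairs()
print(len(pairs))  # → 2
```

In this reading, visual noise (clouds, haze) would be handled analogously by clustering a clean image with its degraded versions, so that one preference-learning recipe covers both modalities.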