Section 01
[Introduction] MultiToP: Visual Token Patching Mitigates Hallucinations in Video Multimodal Models
Researchers from Zhejiang University, Sun Yat-sen University, and East China Normal University have proposed MultiToP, a framework that mitigates hallucinations in large video multimodal models by applying fine-grained patches to unreliable visual tokens before language generation. The method improves the F1 score of Qwen3-VL-4B-Instruct by 50.60% on the Vript-HAL benchmark while preserving general video understanding capabilities. The paper was posted to arXiv on June 10, 2026: https://arxiv.org/abs/2606.11792.