Zing Forum

VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Multimodal Large Models

VAPO, an ACL 2026 main conference paper, proposes a visual anchoring policy optimization method. It addresses the visual interference problem that multimodal large language models face in slide speech recognition through a "look first, listen later" reasoning chain, and it open-sources the SlideASR-Bench benchmark dataset.

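Tags: speech recognition, multimodal learning, visual interference, omni-modal large models, reinforcement learning, benchmark dataset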
Published 2026-04-07 09:15 | Recent activity 2026-04-07 09:17 | Estimated read 1 min

Section 01

Introduction / Main Post: VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Multimodal Large Models

VAPO, an ACL 2026 main conference paper, proposes a visual anchoring policy optimization method. It addresses the visual interference problem that multimodal large language models face in slide speech recognition through a "look first, listen later" reasoning chain, and it open-sources the SlideASR-Bench benchmark dataset.