Section 01
[Introduction] VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Visual Interference in Omni-modal Large Language Models (OLLMs)
This ACL 2026 main conference paper proposes Visually-Anchored Policy Optimization (VAPO), a method that addresses the visual interference problem of omni-modal large language models (OLLMs) in slide speech recognition by enforcing a "look first, listen later" reasoning chain. The authors also open-source the SlideASR-Bench benchmark dataset and report improved performance on key tasks such as technical term recognition.