VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Omni-Modal Large Language Models (OLLMs)

VAPO, an ACL 2026 main conference paper, proposes visually anchored policy optimization, which addresses the visual interference problem of omni-modal large language models (OLLMs) in slide speech recognition through a "look first, listen later" reasoning chain. The authors also open-source the SlideASR-Bench benchmark dataset.

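Speech recognition, multimodal learning, visual interference, omni-modal large models, reinforcement learning, benchmark datasets
Published 2026-04-07 09:15 | Recent activity 2026-04-07 15:19 | Estimated read 5 min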

Section 01

[Introduction] VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Omni-Modal Large Language Models (OLLMs)

The ACL 2026 main conference paper proposes Visually Anchored Policy Optimization (VAPO), which addresses the visual interference problem of omni-modal large language models (OLLMs) in slide speech recognition through a "look first, listen later" reasoning chain. It also open-sources the SlideASR-Bench benchmark dataset and improves performance on key tasks such as technical term recognition.

Section 02

Research Background and Core Problem: The Dilemma of Visual Interference in Slide Speech Recognition

Research Background

In settings such as meetings and academic talks, recognizing speech in slide-assisted presentations requires integrating audio and visual information. However, omni-modal large language models (OLLMs) suffer from visual interference: the model tends to copy text from the slides instead of transcribing what was actually said, which produces hallucinated transcripts. For example, if the slide shows "deep learning" while the speaker says "machine learning", the model may still output "deep learning".
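The "deep learning" versus "machine learning" case can be made concrete with a toy check. The sketch below is not taken from the paper; the function name `slide_copied_words` and the whitespace tokenization are assumptions chosen for illustration. It flags words in a transcript that appear on the slide but were never spoken.

```python
def slide_copied_words(slide_text, reference, hypothesis):
    """Return hypothesis words that are on the slide but absent from the speech.

    Hypothetical helper: a crude proxy for visual interference, using
    lowercase whitespace tokens rather than a real ASR text normalizer.
    """
    def tokens(text):
        return text.lower().split()

    on_slide = set(tokens(slide_text))
    spoken = set(tokens(reference))
    return [w for w in tokens(hypothesis) if w in on_slide and w not in spoken]


slide = "Deep learning overview"
reference = "today we talk about machine learning"   # what the speaker said
hypothesis = "today we talk about deep learning"     # what the model produced
print(slide_copied_words(slide, reference, hypothesis))  # ['deep']
```

A real evaluation would need proper text normalization and phrase-level matching; this only shows the kind of error the section describes.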

Root Cause of Interference

Vision-language pre-training gives OLLMs a bias toward visual input, which conflicts with the task goal of faithfully transcribing speech.

Section 03

VAPO Method: The Core Idea of Visually Anchored Policy Optimization

The core of VAPO (Visually Anchored Policy Optimization) is to reshape the reasoning chain into "look first, listen later":

  1. Temporal Decoupling Strategy: First extract visual priors from the slide as semantic anchors, then combine them with the audio to generate the transcription;
  2. Multi-Objective Reinforcement Learning Optimization: Balance the help that visual information provides against fidelity to the audio, reducing interference while improving entity recognition (especially for technical terms).
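The balance described in the second point can be sketched as a weighted reward. This is a minimal illustration under stated assumptions, not the reward used in the paper: the two terms, the function names, and the weights `w_fidelity` and `w_entity` are all hypothetical, and `difflib.SequenceMatcher` is used only as a convenient stand-in for an audio-fidelity score.

```python
from difflib import SequenceMatcher


def fidelity(reference, hypothesis):
    """Word-level similarity in [0, 1]; a stand-in for an audio-fidelity term."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    return SequenceMatcher(None, ref, hyp).ratio()


def entity_recall(entities, hypothesis):
    """Fraction of reference entities that appear verbatim in the hypothesis."""
    if not entities:
        return 1.0
    text = hypothesis.lower()
    return sum(e.lower() in text for e in entities) / len(entities)


def combined_reward(reference, hypothesis, entities,
                    w_fidelity=0.7, w_entity=0.3):
    """Weighted sum of the two objectives (the weights are illustrative only)."""
    return (w_fidelity * fidelity(reference, hypothesis)
            + w_entity * entity_recall(entities, hypothesis))


reference = "we use machine learning here"
entities = ["machine learning"]
faithful = combined_reward(reference, "we use machine learning here", entities)
copied = combined_reward(reference, "we use deep learning here", entities)
print(faithful > copied)  # the slide-copied transcript scores lower
```

Under this toy reward, a transcript that copies the slide term loses on both objectives, which is the behavior a policy-optimization step would then discourage.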

Section 04

SlideASR-Bench: A Comprehensive Benchmark Dataset for Slide Speech Recognition

To address the scarcity of entity-rich data, the team built SlideASR-Bench:

  • Synthetic Corpus (SlideASR-S): Built with precise control over content distribution, noise, and similar factors; used for model training;
  • Real Test Set (SlideASR-R): Drawn from actual speeches; used to evaluate performance in real scenarios.

The dataset has been open-sourced on Hugging Face.

Section 05

Experimental Validation: Significant Improvements of VAPO in Performance and Interference Mitigation

Experimental results show the advantages of VAPO:

  1. End-to-End Performance: Reduced Word Error Rate (WER) and improved entity-level F1 score;
  2. Interference Mitigation: Significant decrease in the frequency of visual hallucinations;
  3. Domain Adaptability: Enhanced recognition capability for technical terms in professional fields (such as medicine and law).
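For readers unfamiliar with the two metrics named above, the following sketch shows their standard definitions: WER as word-level edit distance divided by reference length, and F1 over sets of entities. The paper's exact text normalization and entity-matching rules are not given here, so the whitespace tokenization and exact-match sets are assumptions.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def entity_f1(reference_entities, predicted_entities):
    """F1 over exact-match entity sets (assumed matching rule)."""
    ref, pred = set(reference_entities), set(predicted_entities)
    if not ref and not pred:
        return 1.0
    true_positives = len(ref & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


print(wer("we use machine learning", "we use deep learning"))  # 0.25
```

Lower WER means a more faithful transcript overall, while entity-level F1 isolates how well rare terms such as technical vocabulary are recovered.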

Section 06

Open Source Contributions and Application Prospects: Value from Tools to Real-World Scenarios

Open Source Contributions

  • Models: VAPO models with 3B and 7B parameters, open-sourced on Hugging Face;
  • Tools: Complete training/evaluation code, preprocessing scripts, etc., supporting reproduction and extension.

Application Prospects

Candidate scenarios include online education (automatic subtitles), corporate meetings (intelligent minutes), and accessibility (assistance for hearing-impaired users). On the technical side, the work offers a new approach to the modality competition problem in multimodal fusion.

Section 07

Conclusion: Laying the Foundation for Multimodal Speech Recognition Research

VAPO addresses the visual interference problem by reordering the reasoning chain into "look first, listen later" and optimizing it with multi-objective reinforcement learning. Together with the SlideASR-Bench dataset, it gives follow-up research on slide speech recognition a method and a benchmark to build on, and supports the use of multimodal AI in related applications.