# VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Omni-modal Large Language Models (OLLMs)

> VAPO, an ACL 2026 main conference paper, proposes Visually-Anchored Policy Optimization, a method that addresses the visual interference problem of omni-modal large language models (OLLMs) in slide speech recognition through a "look first, listen later" reasoning chain. The authors also open-source the SlideASR-Bench benchmark dataset.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-07T01:15:21.000Z
- Last activity: 2026-04-07T07:19:07.972Z
- Popularity: 140.9
- Keywords: speech recognition, multimodal learning, visual interference, omni-modal large models, reinforcement learning, benchmark dataset
- Page link: https://www.zingnex.cn/en/forum/thread/vaposlideasr-bench
- Canonical: https://www.zingnex.cn/forum/thread/vaposlideasr-bench
- Markdown source: floors_fallback

---

## [Introduction] VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Omni-modal Large Language Models (OLLMs)

The paper introduces Visually-Anchored Policy Optimization (VAPO), which restructures the model's reasoning into a "look first, listen later" chain so that slide content supports transcription instead of overriding it. Together with the open-sourced SlideASR-Bench benchmark, the method targets the cases where slide speech recognition is hardest, most notably the recognition of technical terms.

## Research Background and Core Problem: The Dilemma of Visual Interference in Slide Speech Recognition

### Research Background
In meetings, lectures, and academic talks, speech recognition for slide-assisted presentations has to integrate audio and visual information. Omni-modal large language models (OLLMs), however, suffer from **visual interference**: the model tends to copy text from the slide instead of transcribing what was actually said, which produces "hallucinations". For example, if the slide shows "deep learning" while the speaker says "machine learning", the model may still output "deep learning".
### Root Cause of Interference
Visual-language pre-training leaves OLLMs with a bias toward visual input, and that bias conflicts with the task goal of faithfully transcribing speech.
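
To make the failure mode concrete, the following is a minimal sketch of how one might flag slide-copied words in a transcript. It is an illustration only, not the paper's detection method: the function name `slide_copied_words` and the whitespace tokenization are assumptions made for this example.

```python
def slide_copied_words(hypothesis: str, reference: str, slide_text: str) -> list[str]:
    """Return hypothesis words that appear on the slide but were never spoken.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting on lowercased text.
    """
    spoken = set(reference.lower().split())
    on_slide = set(slide_text.lower().split())
    return [
        word
        for word in hypothesis.lower().split()
        if word in on_slide and word not in spoken
    ]


# The example from the text: the slide says "deep learning",
# the speaker says "machine learning".
reference = "today we discuss machine learning"
slide_text = "Deep Learning Overview"
hypothesis = "today we discuss deep learning"

print(slide_copied_words(hypothesis, reference, slide_text))  # ['deep']
```

Only "deep" is flagged, because "learning" is both on the slide and in the spoken reference, so it cannot be attributed to the slide alone.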

## VAPO Method: Innovative Idea of Visually Anchored Policy Optimization

The core of VAPO (Visually-Anchored Policy Optimization) is to reshape the reasoning chain into "look first, listen later":
1. **Temporal Decoupling Strategy**: The model first extracts visual priors from the slide as semantic anchors, and only then combines them with the audio to generate the transcription;
2. **Multi-Objective Reinforcement Learning Optimization**: The training objective balances the help that visual information provides against fidelity to the audio, which reduces interference while improving entity recognition, especially for technical terms.
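
The two ideas above can be sketched as a toy reward function. This is not the reward defined in the paper: the `<look>`/`<listen>` tags, the three reward terms, the weights, and the use of `difflib` word similarity as a stand-in for audio fidelity are all assumptions chosen to keep the example self-contained.

```python
import re
from difflib import SequenceMatcher

# Hypothetical output format: visual anchors first, transcription second.
OUTPUT_PATTERN = re.compile(
    r"^\s*<look>(.*?)</look>\s*<listen>(.*?)</listen>\s*$", re.DOTALL
)


def parse_output(output: str):
    """Split a model output into (visual_anchors, transcript), or None if malformed."""
    match = OUTPUT_PATTERN.match(output)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def fidelity(reference: str, transcript: str) -> float:
    """Word-level similarity in [0, 1]; a stand-in for a WER-based term."""
    ref_words = reference.lower().split()
    hyp_words = transcript.lower().split()
    return SequenceMatcher(None, ref_words, hyp_words).ratio()


def entity_recall(entities: list[str], transcript: str) -> float:
    """Fraction of reference entities that appear verbatim in the transcript."""
    if not entities:
        return 1.0
    text = transcript.lower()
    return sum(entity.lower() in text for entity in entities) / len(entities)


def reward(output: str, reference: str, entities: list[str],
           w_format: float = 0.2, w_asr: float = 0.5, w_entity: float = 0.3) -> float:
    """Weighted sum of format, fidelity, and entity terms (weights are assumed)."""
    parsed = parse_output(output)
    if parsed is None:
        return 0.0  # no "look first, listen later" structure, no reward
    _, transcript = parsed
    return (w_format
            + w_asr * fidelity(reference, transcript)
            + w_entity * entity_recall(entities, transcript))


reference = "today we discuss machine learning"
faithful = "<look>Deep Learning Overview</look><listen>today we discuss machine learning</listen>"
copied = "<look>Deep Learning Overview</look><listen>today we discuss deep learning</listen>"

# The faithful transcript should score higher than the slide-copied one.
print(reward(faithful, reference, ["machine learning"]))
print(reward(copied, reference, ["machine learning"]))
```

In this sketch the slide-copied output is penalized twice, once through the fidelity term and once through the missed entity, which is one simple way to express the balance between using the slide and staying faithful to the audio.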

## SlideASR-Bench: A Comprehensive Benchmark Dataset for Slide Speech Recognition

To address the scarcity of entity-rich data, the team built SlideASR-Bench, which has two parts:
- **Synthetic Corpus (SlideASR-S)**: Generated with precise control over content distribution, noise, and similar factors, and used for model training;
- **Real Test Set (SlideASR-R)**: Drawn from actual presentations and used to evaluate performance in real scenarios.

The dataset has been open-sourced on Hugging Face.

## Experimental Validation: Significant Improvements of VAPO in Performance and Interference Mitigation

The experiments report improvements from VAPO in three areas:
1. **End-to-End Performance**: Lower word error rate (WER) and higher entity-level F1 score;
2. **Interference Mitigation**: A significant drop in the frequency of visual hallucinations;
3. **Domain Adaptability**: Better recognition of technical terms in professional fields such as medicine and law.
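
For readers unfamiliar with the two headline metrics, here is a minimal sketch of how WER and an entity-level F1 can be computed. The exact definitions used in the paper's evaluation scripts may differ; in particular, treating entities as sets of strings produced by some upstream extractor is an assumption of this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Single-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def entity_f1(gold: set[str], predicted: set[str]) -> float:
    """Set-based F1 between gold entities and entities found in the hypothesis."""
    if not gold and not predicted:
        return 1.0
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One substituted word out of five reference words.
print(word_error_rate("today we discuss machine learning",
                      "today we discuss deep learning"))  # 0.2

# One of two gold entities recovered, one spurious entity predicted.
print(entity_f1({"machine learning", "gradient descent"},
                {"deep learning", "gradient descent"}))  # 0.5
```

The example shows why both metrics are reported: a single slide-copied word barely moves WER, yet it halves the entity-level F1 because the substituted word sits inside a technical term.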

## Open Source Contributions and Application Prospects: Value from Tools to Real-World Scenarios

### Open Source Contributions
- Models: VAPO models with 3B and 7B parameters, open-sourced on Hugging Face;
- Tools: Complete training and evaluation code, preprocessing scripts, and related utilities that support reproduction and extension.
### Application Prospects
Likely application scenarios include online education (automatic subtitles), corporate meetings (intelligent minutes), and accessibility (assistance for hearing-impaired users). On the technical side, the work offers a new approach to the problem of competition between modalities in multimodal fusion.

## Conclusion: Laying the Foundation for Multimodal Speech Recognition Research

VAPO tackles visual interference by decoupling "looking" from "listening" and optimizing the two with multi-objective reinforcement learning. Combined with the SlideASR-Bench dataset, it gives follow-up research on slide speech recognition a solid starting point and supports the use of multimodal AI in related applications.
