VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Omni-modal Large Language Models (OLLMs)

The ACL 2026 main conference paper on VAPO proposes a visually anchored policy optimization method. It addresses the visual interference problem of omni-modal large language models (OLLMs) in slide speech recognition through a "look first, listen later" reasoning chain, and open-sources the SlideASR-Bench benchmark dataset.

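Speech recognition, multimodal learning, visual interference, omni-modal large models, reinforcement learning, benchmark dataset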

Published 2026-04-07 09:15 | Recent activity 2026-04-07 15:19 | Estimated read 5 min

Section 01

[Introduction] VAPO and SlideASR-Bench: An End-to-End Slide Speech Recognition Solution to Address Visual Interference in Omni-modal Large Language Models (OLLMs)

The ACL 2026 main conference paper proposes Visually-Anchored Policy Optimization (VAPO). The method addresses the visual interference problem of omni-modal large language models (OLLMs) in slide speech recognition through a "look first, listen later" reasoning chain. The paper also open-sources the SlideASR-Bench benchmark dataset and reports improved performance on key tasks such as technical term recognition.

Section 02

Research Background and Core Problem: The Dilemma of Visual Interference in Slide Speech Recognition

Research Background

In settings such as meetings and academic talks, recognizing speech from slide-assisted presentations requires integrating audio with visual information. However, omni-modal large language models (OLLMs) suffer from visual interference: the model tends to copy text from the slide instead of transcribing what was actually said, which produces hallucinated transcripts. For example, if the slide shows "deep learning" while the speaker says "machine learning", the model may output "deep learning".
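The failure mode described above can be made concrete with a small check. The following is a minimal sketch, not the paper's method: it flags words that the model output shares with the slide but that are absent from the spoken reference. The function names `tokenize` and `copied_from_slide` are hypothetical, and plain word overlap is an assumed simplification of how such copying might be detected.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def copied_from_slide(hypothesis: str, reference: str, slide_text: str) -> set[str]:
    """Return words the output shares with the slide but not with the speech.

    Assumption: word overlap is a rough proxy for visual interference;
    the source does not describe a specific detection procedure.
    """
    hyp = set(tokenize(hypothesis))
    ref = set(tokenize(reference))
    slide = set(tokenize(slide_text))
    return (hyp & slide) - ref


slide = "Deep Learning"
spoken = "today we will talk about machine learning"
output = "today we will talk about deep learning"

print(copied_from_slide(output, spoken, slide))  # {'deep'}
```

Here "learning" is not flagged because the speaker actually said it; only "deep" came from the slide alone.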

Root Cause of Interference

Vision-language pre-training gives OLLMs a bias toward visual input, which conflicts with the task goal of faithfully transcribing speech.

Section 03

VAPO Method: Innovative Idea of Visually Anchored Policy Optimization

The core of VAPO (Visually-Anchored Policy Optimization) is to reshape the reasoning chain into "look first, listen later":

Temporal Decoupling Strategy: The model first extracts visual priors from the slide to serve as semantic anchors, then combines them with the audio to generate the transcription;
Multi-Objective Reinforcement Learning Optimization: Training balances the help that visual information provides against fidelity to the audio, which mitigates interference while improving entity recognition, especially for technical terms.
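To illustrate how the two ideas above could fit together, here is a minimal sketch of a composite reward for a "look first, listen later" response. Everything specific in it is an assumption for illustration: the `<look>`/`<listen>` tags, the three reward terms (format, audio fidelity, entity recall), and the weights are hypothetical and are not taken from the paper. The policy optimization step itself is not shown.

```python
import re

# Hypothetical output format: visual anchors first, transcript second.
RESPONSE_PATTERN = re.compile(
    r"^<look>(.*?)</look>\s*<listen>(.*?)</listen>$", re.DOTALL
)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def reward(response: str, reference: str, spoken_entities: list[str],
           weights: tuple[float, float, float] = (0.2, 0.5, 0.3)) -> float:
    """Combine format, audio fidelity, and entity recall (assumed weights)."""
    w_format, w_fidelity, w_entity = weights
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # the response skipped the look-then-listen structure
    transcript = match.group(2).strip()
    fidelity = max(0.0, 1.0 - word_error_rate(reference, transcript))
    if spoken_entities:
        hits = sum(e.lower() in transcript.lower() for e in spoken_entities)
        entity_recall = hits / len(spoken_entities)
    else:
        entity_recall = 1.0
    return w_format + w_fidelity * fidelity + w_entity * entity_recall


reference = "we train the model with machine learning"
entities = ["machine learning"]
faithful = "<look>deep learning</look><listen>we train the model with machine learning</listen>"
copied = "<look>deep learning</look><listen>we train the model with deep learning</listen>"

print(reward(faithful, reference, entities) > reward(copied, reference, entities))  # True
```

In this sketch the slide text appears only in the anchor section, so a response that copies it into the transcript loses both fidelity and entity reward, while a response that uses it as context and still transcribes the audio is scored highest.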

Section 04

SlideASR-Bench: A Comprehensive Benchmark Dataset for Slide Speech Recognition

To address the scarcity of entity-rich data, the team built SlideASR-Bench:

Synthetic Corpus (SlideASR-S): Precisely control content distribution, noise, etc., for model training;
Real Test Set (SlideASR-R): Derived from actual speeches to evaluate performance in real scenarios.

The dataset has been open-sourced on Hugging Face.

Section 05

Experimental Validation: Significant Improvements of VAPO in Performance and Interference Mitigation

Experimental results show the advantages of VAPO:

End-to-End Performance: Reduced Word Error Rate (WER) and improved entity-level F1 score;
Interference Mitigation: Significant decrease in the frequency of visual hallucinations;
Domain Adaptability: Enhanced recognition capability for technical terms in professional fields (such as medicine and law).
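For readers unfamiliar with the entity-level metric mentioned above, the following is a minimal sketch of one way to compute entity-level F1 between a reference and a hypothesis transcript. It assumes entities are found by case-insensitive substring matching against a given lexicon; the source does not say how the benchmark extracts or matches entities, and the names `find_entities` and `entity_f1` are hypothetical.

```python
def find_entities(text: str, lexicon: set[str]) -> set[str]:
    """Return the lexicon entries that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return {entity for entity in lexicon if entity.lower() in lowered}


def entity_f1(reference: str, hypothesis: str, lexicon: set[str]) -> float:
    """Harmonic mean of entity precision and recall (assumed matching rule)."""
    gold = find_entities(reference, lexicon)
    predicted = find_entities(hypothesis, lexicon)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


lexicon = {"myocardial infarction", "atrial fibrillation", "stent"}
reference = "the patient had a myocardial infarction and received a stent"
hypothesis = "the patient had atrial fibrillation and received a stent"

print(entity_f1(reference, hypothesis, lexicon))  # 0.5
```

In this example one of the two spoken entities is recovered and one of the two predicted entities is correct, so precision and recall are both 0.5. A transcript can score well on WER while still missing the technical terms, which is why an entity-level score is reported alongside it.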

Section 06

Open Source Contributions and Application Prospects: Value from Tools to Real-World Scenarios

Open Source Contributions

Models: VAPO models with 3B and 7B parameters, open-sourced on Hugging Face;
Tools: Complete training/evaluation code, preprocessing scripts, etc., supporting reproduction and extension.

Application Prospects

Potential applications include online education (automatic subtitles), corporate meetings (intelligent minutes), and accessibility (assistance for hearing-impaired users). On the technical side, the work offers a new approach to the problem of modality competition in multimodal fusion.

Section 07

Conclusion: Laying the Foundation for Multimodal Speech Recognition Research

VAPO addresses the visual interference problem by restructuring the reasoning chain and optimizing it with multi-objective reinforcement learning. Together with the SlideASR-Bench dataset, it provides a foundation for further research on slide speech recognition and for applying multimodal AI in related scenarios.
