Zing Forum

RAVE: Reallocating Visual Attention in Large Multimodal Models

This article introduces RAVE, a lightweight pairwise gating mechanism that adds learnable query-key biases to the pre-softmax attention scores of visual keys, achieving an average improvement of 3 percentage points across multiple multimodal benchmarks.

Multimodal models, attention mechanism, visual understanding, OCR, VQA, pairwise gating
Published 2026-05-18 21:12 | Recent activity 2026-05-19 11:28 | Estimated read 6 min

Section 01

Introduction: RAVE, a Lightweight Solution for Optimizing Visual Attention in Multimodal Models

RAVE is a lightweight pairwise gating mechanism that addresses uneven visual attention allocation in Large Multimodal Models (LMMs) by adding learnable query-key biases to the pre-softmax attention scores of visual keys. It requires no changes to the backbone architecture, can be trained end-to-end, and adds almost no inference overhead. It achieves an average improvement of 3 percentage points across multiple multimodal benchmarks, with the largest gains on perception-intensive tasks.

Section 02

Background: Two Major Shortcomings of Standard Attention in Multimodal Scenarios

The standard self-attention mechanism is optimized for text-only input. When it is extended to multimodal input, two issues appear:

  1. Cross-modal Misallocation: Attention weight is distributed incorrectly between text and visual evidence. For example, tasks that depend on the image focus too heavily on the text prompt.
  2. Intra-visual Imbalance: Attention is spread unevenly among visual tokens, so key tokens are ignored, which hurts tasks that need precise visual localization.
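
One way to make these two issues measurable is sketched below. This is not from the RAVE article: the function name `visual_attention_stats` and both metrics (share of attention on visual keys, and normalised entropy within the visual keys) are illustrative assumptions, applied to a single head's post-softmax attention matrix.

```python
import numpy as np

def visual_attention_stats(weights, visual_mask):
    """Summarise how each query spreads attention over visual keys.

    weights:     (n_queries, n_keys) post-softmax attention, rows sum to 1.
    visual_mask: (n_keys,) bool, True where the key is a visual token.
    Returns (visual_share, normalised_entropy), each of shape (n_queries,).
    """
    visual = weights[:, visual_mask]
    # Cross-modal view: fraction of each query's attention that lands on vision.
    visual_share = visual.sum(axis=1)
    # Intra-visual view: renormalise over visual keys, then measure the spread.
    p = visual / np.clip(visual_share[:, None], 1e-12, None)
    entropy = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=1)
    return visual_share, entropy / np.log(visual_mask.sum())

# Toy example (hypothetical numbers): key 0 is text, keys 1-3 are visual.
mask = np.array([False, True, True, True])
weights = np.array([
    [0.70, 0.10, 0.10, 0.10],   # mostly on the text key, uniform over vision
    [0.10, 0.88, 0.01, 0.01],   # mostly on vision, concentrated on one token
])
share, spread = visual_attention_stats(weights, mask)
print(share, spread)
```

A low visual share on a vision-dependent query would hint at cross-modal misallocation. The normalised entropy only describes how concentrated the visual attention is; whether the concentration falls on the right tokens still has to be checked against the task.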

Section 03

Method: Core Design of RAVE—Pairwise Gating Mechanism

The core of RAVE is the pairwise gating mechanism, with steps as follows:

  1. Take the query and key features before RoPE (Rotary Position Embedding) as input;
  2. Compute bias values that reflect the relevance of each visual key to each query;
  3. Add the bias to the pre-softmax attention scores to shift how attention is allocated.

Key features: plug-and-play (no backbone modification needed), trained end-to-end, lightweight (few parameters), and applied only to visual keys.
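
The three steps can be sketched for a single attention head as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not specify the bias function, so the low-rank bilinear form with hypothetical projections `w_q` and `w_k` is a stand-in, RoPE is skipped in the toy data, and the causal mask is omitted for brevity.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_attention_weights(q_pre, k_pre, q_rot, k_rot, visual_mask, w_q, w_k):
    """Attention weights with a pairwise bias added on visual key columns.

    q_pre, k_pre: (T, d) query/key features before RoPE (inputs to the gate).
    q_rot, k_rot: (T, d) the same features after RoPE (used for the usual scores).
    visual_mask:  (T,) bool, True where the key is a visual token.
    w_q, w_k:     (d, r) hypothetical low-rank gate projections (an assumption).
    """
    d = q_rot.shape[-1]
    scores = q_rot @ k_rot.T / np.sqrt(d)       # standard pre-softmax scores
    bias = (q_pre @ w_q) @ (k_pre @ w_k).T      # one bias per (query, key) pair
    bias = bias * visual_mask[None, :]          # zero the bias on non-visual keys
    return softmax(scores + bias)

rng = np.random.default_rng(0)
n_tokens, d, r = 6, 8, 2
visual_mask = np.array([True, True, True, False, False, False])
q_pre, k_pre = rng.normal(size=(2, n_tokens, d))
q_rot, k_rot = q_pre, k_pre                     # stand-in: RoPE omitted in this toy
w_q, w_k = 0.1 * rng.normal(size=(2, d, r))
zeros = np.zeros((d, r))

baseline = gated_attention_weights(q_pre, k_pre, q_rot, k_rot, visual_mask, zeros, zeros)
gated = gated_attention_weights(q_pre, k_pre, q_rot, k_rot, visual_mask, w_q, w_k)
```

With the gate projections set to zero the function reduces to standard attention, which is one way a plug-and-play module can start from the pretrained model's behaviour. Because only visual key columns receive a bias, the relative weighting among text keys within a row is unchanged, while the split between text and vision, and among visual tokens, can move.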

Section 04

Experimental Evidence: Significant Improvements of RAVE on Multimodal Benchmarks

  • Overall Performance: An average improvement of 3 percentage points over standard attention;
  • Largest Gains on Perception-Intensive Tasks:
    • Multilingual OCR: More accurate localization of text regions in images;
    • Chart Understanding: Better focus on key data elements;
    • Document VQA: Better retrieval of relevant information in complex layouts;
    • Scene Text VQA: Improved localization and understanding of scene text.

Section 05

Technical Details: Key Implementation Points of RAVE

  1. Utilization of Pre-RoPE Features: Preserves original semantic information without interference from position encoding, so biases better reflect semantic relevance;
  2. Bias Function Design: Chooses lightweight and flexible neural network structures to balance computational overhead and learning ability;
  3. Training Strategy: End-to-end training on multimodal data, with parameters updated via backpropagation along with other parts of the model.
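
The first point can be illustrated with a small RoPE sketch. It assumes the interleaved pairing of feature dimensions and the common base of 10000; real implementations differ in layout, and nothing here is taken from the RAVE code. It shows that a post-RoPE score changes with the relative position of the two tokens, whereas the pre-RoPE dot product involves no positions at all.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (shape (d,), d even) by angles that grow with pos."""
    d = x.shape[-1]
    angles = pos * base ** (-np.arange(0, d, 2) / d)   # one angle per feature pair
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))

pre_rope_score = q @ k                       # depends only on the features
near = rope(q, 5) @ rope(k, 3)               # relative offset 2
near_shifted = rope(q, 12) @ rope(k, 10)     # same offset, other absolute positions
far = rope(q, 5) @ rope(k, 0)                # relative offset 5

print(pre_rope_score, near, near_shifted, far)
```

Here `near` and `near_shifted` agree up to floating-point error while `far` differs, so a gate fed with post-RoPE features would mix positional distance into the bias. Feeding it pre-RoPE features keeps the bias a function of content only, which matches the rationale stated in point 1.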

Section 06

Comparison with Related Work: Three Major Advantages of RAVE

Compared with existing multimodal attention improvement methods:

  1. Simplicity: Only adds bias terms, no need to modify the core attention structure or introduce complex modules;
  2. Generality: Does not rely on specific model architectures or training data, applicable to various LMMs;
  3. Efficiency: Minimal parameter and computational overhead, with negligible impact on inference speed.
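
To give a rough sense of scale for the efficiency claim, the arithmetic below counts parameters for a hypothetical low-rank gate. All figures are assumptions chosen for illustration (a 4096-wide model with 32 heads, standard multi-head attention without biases, and a rank-8 gate with one query and one key projection per head); the article does not report RAVE's actual parameter count.

```python
def gate_overhead(d_model, n_heads, rank):
    """Parameter count of a hypothetical low-rank gate relative to one attention layer."""
    d_head = d_model // n_heads
    # One (d_head x rank) projection for queries and one for keys, per head.
    gate_params = n_heads * 2 * d_head * rank
    # W_q, W_k, W_v and W_o, each d_model x d_model, biases ignored.
    attn_params = 4 * d_model * d_model
    return gate_params, attn_params, gate_params / attn_params

gate_params, attn_params, ratio = gate_overhead(d_model=4096, n_heads=32, rank=8)
print(gate_params, attn_params, f"{ratio:.4%}")   # prints: 65536 67108864 0.0977%
```

Under these assumptions the gate adds about one parameter for every thousand already present in the attention layer. The added compute at inference is a separate question that depends on how the bias is fused with the attention kernel.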

Section 07

Conclusion and Future Directions

Conclusion: RAVE addresses the visual attention allocation problem with a concise and effective pairwise gating mechanism, making it a practical improvement for multimodal models, with particular value in industrial perception-intensive tasks (e.g., OCR, document understanding).

Future Directions:

  1. Extend to cross-modal attention adjustment between text and vision;
  2. Explore dynamic bias strategies (adjust based on input content);
  3. Extend to other modalities like audio and video to build a general multimodal attention framework.