Section 01
【Introduction】RAVE: A Lightweight Solution for Optimizing Visual Attention in Multimodal Models
RAVE is a lightweight pairwise gating mechanism that addresses uneven visual attention allocation in Large Multimodal Models (LMMs). It adds a learnable query-key bias to the pre-softmax attention scores of visual keys. The mechanism requires no changes to the backbone architecture, can be trained end-to-end, and adds almost no inference overhead. It achieves an average improvement of 3 percentage points across multiple multimodal benchmarks and performs especially well on perception-intensive tasks.
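To make the idea of a pre-softmax bias on visual keys concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not RAVE's actual implementation: the text above does not specify how the bias is parameterized, so the low-rank bilinear form and the names `gated_attention`, `w_q`, `w_k`, and `visual_mask` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, visual_mask, w_q, w_k):
    """Single-head attention with a pairwise bias on visual keys.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    visual_mask: (n_k,) bool, True where the key is a visual token.
    w_q, w_k: (d, r) gate projections. ASSUMPTION: a low-rank bilinear
    bias is used here only for illustration.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # standard scaled dot-product scores
    bias = (q @ w_q) @ (k @ w_k).T       # one learnable bias per (query, key) pair
    # Apply the bias to visual keys only; text keys keep their original scores.
    scores = scores + bias * visual_mask[None, :]
    weights = softmax(scores, axis=-1)   # bias is added before the softmax
    return weights @ v, weights
```

In this sketch, setting `w_q` to zeros makes the bias vanish, so the layer reduces to ordinary attention. That is one plausible way to start training from the unmodified backbone. Because only the visual-key columns are shifted, the relative weights among text keys for a given query stay the same, while attention mass is redistributed between and within the visual tokens.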