Zing Forum

VideoRouter: Dual-Route Framework Enables Efficient Long Video Understanding with 67.9% Token Reduction

VideoRouter employs a dual-route mechanism consisting of semantic routing and image routing to adaptively allocate visual token budgets based on queries. It preserves high-resolution details in key evidence frames while aggressively compressing irrelevant frames, achieving up to 67.9% token reduction on benchmarks like VideoMME.

Tags: VideoRouter, long video understanding, visual token compression, query-adaptive, multimodal models, InternVL, video Q&A, token budget
Published 2026-05-07 16:23 | Recent activity 2026-05-08 11:55 | Estimated read 7 min

Section 01

VideoRouter Core Guide: Dual-Route Framework Solves Long Video Token Crisis with 67.9% Token Reduction

Long video understanding faces a scalability bottleneck due to the explosion of visual token sequences. VideoRouter uses a dual-route mechanism (semantic routing and image routing) to adaptively allocate visual token budgets based on queries. It preserves high-resolution details in key evidence frames while aggressively compressing irrelevant frames, achieving up to 67.9% token reduction on benchmarks like VideoMME while maintaining or even improving understanding accuracy.

Section 02

Visual Token Crisis in Long Video Understanding and Limitations of Existing Methods

Root Cause of the Problem

Long videos contain hundreds to thousands of frames, which an encoder turns into visual token sequences tens or even hundreds of thousands of tokens long. Because self-attention cost in Transformer architectures grows quadratically with sequence length, memory and compute rise steeply, and the sequence often exceeds the model's context window.
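To make the scaling concrete, here is a small back-of-the-envelope sketch. The frame count and tokens-per-frame values are assumptions chosen for illustration, not figures from the paper; only the 67.9% reduction comes from the article. It counts only the pairwise self-attention term, which scales with the square of sequence length.

```python
def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens when every frame is encoded at the same resolution."""
    return num_frames * tokens_per_frame


def relative_attention_cost(kept_fraction: float) -> float:
    """Self-attention cost relative to the uncompressed sequence.

    Pairwise attention scales with the square of sequence length, so keeping
    a fraction f of the tokens leaves roughly f**2 of the attention cost.
    """
    return kept_fraction ** 2


# Assumed values for illustration only (not taken from the paper).
frames = 1024
tokens_per_frame = 256

total = visual_token_count(frames, tokens_per_frame)
kept = 1.0 - 0.679  # fraction of tokens left after the reported best-case reduction

print(total)                                     # 262144
print(round(relative_attention_cost(kept), 3))   # 0.103
```

Under these assumptions, removing about two thirds of the tokens leaves roughly a tenth of the attention cost. Components that scale linearly with sequence length shrink less.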

Limitations of Existing Methods

  • Weak query awareness: Frames are encoded without knowledge of the user's question, so a uniform compression strategy cannot be tuned to what the question needs;
  • Fixed compression strategies: Applying the same strategy to every frame ignores the uneven temporal distribution of visual evidence;
  • Information loss: Aggressive compression can easily discard key details, reducing answer accuracy.

Section 03

VideoRouter's Dual-Route Framework and Training Data Construction

Dual-Route Mechanism

  • Semantic Router: Makes the macro decision. From the query's semantic features, it predicts which selection strategy to use: broad temporal coverage or adaptive high-resolution preservation;
  • Image Router: Makes the micro, frame-level decision. It uses early LLM layers to score each frame's relevance to the query, then handles high-relevance and low-relevance frames differently.
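The two routing levels can be sketched as follows. This is a toy illustration, not the authors' implementation: the keyword rule stands in for the learned semantic router, cosine similarity over made-up feature vectors stands in for relevance scores from early LLM layers, and all names (`Strategy`, `GLOBAL_CUES`, `semantic_route`, `image_route`, the 0.5 threshold) are hypothetical.

```python
import math
from enum import Enum


class Strategy(Enum):
    BROAD_COVERAGE = "broad temporal coverage"
    HIGH_RES_EVIDENCE = "adaptive high-resolution preservation"


# Hypothetical cue words; the real semantic router is a learned predictor.
GLOBAL_CUES = {"summarize", "overall", "mainly", "throughout"}


def semantic_route(query: str) -> Strategy:
    """Macro decision: pick a selection strategy from the query text (toy rule)."""
    words = set(query.lower().replace("?", "").split())
    if words & GLOBAL_CUES:
        return Strategy.BROAD_COVERAGE
    return Strategy.HIGH_RES_EVIDENCE


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def image_route(frame_feats, query_feat, threshold=0.5):
    """Micro decision: split frame indices into high- and low-relevance lists."""
    high, low = [], []
    for i, feat in enumerate(frame_feats):
        (high if cosine(feat, query_feat) >= threshold else low).append(i)
    return high, low


# Made-up 2-D features, for illustration only.
query_feat = [1.0, 0.0]
frame_feats = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]

print(semantic_route("What is written on the sign?"))  # Strategy.HIGH_RES_EVIDENCE
print(image_route(frame_feats, query_feat))            # ([0, 2], [1])
```

In this sketch, frames in the high-relevance list would keep more detail, while frames in the low-relevance list would be compressed aggressively.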

Budget-Constrained Allocation

Token budgets are allocated dynamically under an overall limit: key frames receive more tokens, each frame's resolution adapts to its importance, and temporal sampling is chosen adaptively.
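One simple way to realize a budget-constrained allocation is a greedy scheme, sketched below. This is an assumption for illustration rather than the paper's algorithm: every frame first receives a low-resolution token count, then the most relevant frames are upgraded to a high-resolution count while the budget allows. The function name and the `low`/`high` token counts are hypothetical.

```python
def allocate_tokens(relevance, budget, low=16, high=256):
    """Greedy allocation under a total token budget.

    Every frame gets `low` tokens; frames are then visited from most to least
    relevant and upgraded to `high` tokens whenever the remaining budget
    covers the upgrade.
    """
    n = len(relevance)
    if n * low > budget:
        raise ValueError("budget too small to keep every frame at low resolution")
    alloc = [low] * n
    remaining = budget - n * low
    upgrade = high - low
    for i in sorted(range(n), key=lambda k: relevance[k], reverse=True):
        if upgrade <= remaining:
            alloc[i] = high
            remaining -= upgrade
    return alloc


# Made-up relevance scores for four frames.
scores = [0.1, 0.9, 0.3, 0.8]
alloc = allocate_tokens(scores, budget=600)

print(alloc)       # [16, 256, 16, 256]
print(sum(alloc))  # 544
```

The two most relevant frames keep full detail and the total stays within the budget of 600, compared with 1024 tokens if all four frames were kept at high resolution.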

Training Data

  • Video-QTR-10K: 10K video-query pairs with annotations of optimal allocation strategies;
  • Video-FLR-200K: 200K video-query pairs with frame-level relevance score annotations.
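The article does not give the record format of either dataset. As a rough sketch of what such annotations might look like, the classes below use hypothetical field names and label values, and the sample record is invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QTRSample:
    """Hypothetical Video-QTR-style record: a query plus its strategy label."""
    video_id: str
    query: str
    strategy: str  # assumed label names, e.g. "broad_coverage" or "high_res_evidence"


@dataclass
class FLRSample:
    """Hypothetical Video-FLR-style record: per-frame relevance for one query."""
    video_id: str
    query: str
    frame_relevance: List[float]  # one score per sampled frame, assumed to lie in [0, 1]

    def top_frames(self, k: int) -> List[int]:
        """Indices of the k most relevant frames, highest score first."""
        order = sorted(range(len(self.frame_relevance)),
                       key=lambda i: self.frame_relevance[i], reverse=True)
        return order[:k]


# Invented example record.
sample = FLRSample("vid_0001", "Who opens the door?", [0.05, 0.92, 0.40, 0.88])
print(sample.top_frames(2))  # [1, 3]
```

Under this reading, strategy labels would supervise the semantic router and frame-level scores would supervise the image router.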

Section 04

Experimental Results: 67.9% Token Reduction and Performance Preservation

Benchmark Datasets

VideoMME (comprehensive), MLVU (multi-task long video understanding), and LongVideoBench (ultra-long videos).

Core Results

  • Token reduction: Up to 67.9% on benchmarks such as VideoMME;
  • Accuracy: Comparable to or better than the InternVL baseline;
  • Efficiency: Lower latency and better memory efficiency.

Baseline Comparison

VideoRouter outperforms uniform sampling, heuristic compression, and end-to-end learned baselines. Its query-adaptive strategy is reported to be both more accurate and more interpretable.

Section 05

Technical Depth: Key Reasons for Dual-Route Effectiveness

Advantages of Hierarchical Decision-Making

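Splitting the decision into a macro stage and a micro stage decouples complexity, makes each routing decision easier to interpret, and keeps the design modular so that each router can be optimized on its own.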

Value of Early LLM Layers

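Scoring relevance with early LLM layers is computationally efficient, draws on semantically rich features, and keeps the relevance criterion consistent with the standards of the downstream task.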

Budget-Constrained Optimization

A fixed budget makes resource usage predictable, helps guarantee service quality, and gives the optimization a clear objective.

Section 06

Practical Application Scenarios of VideoRouter

  • Video Q&A: Adjust the strategy dynamically to the question, for example whether it asks about the overall process or about specific details;
  • Content moderation: Quickly filter irrelevant content and analyze suspicious segments in detail;
  • Educational video analysis: Locate relevant segments, generate summaries, support adaptive learning;
  • Surveillance video retrieval: Quickly retrieve events, locate key frames, support natural language interaction.

Section 07

Limitations and Future Research Directions

Limitations

Current limitations include the limited scale of the training data, insufficient multimodal fusion, no online learning capability, ultra-long video processing that still needs optimization, and weak support for causal reasoning.

Future Directions

Future work could explore more efficient visual encoders, hierarchical video representations, and domain-specific routing strategies, and extend the approach to other modalities such as long documents and audio.

Section 08

Conclusion: Insights from Intelligent Token Allocation

VideoRouter shows that intelligent, query-adaptive token allocation can significantly outperform uniform compression, cutting tokens by up to 67.9% while maintaining accuracy. The result matters for the video understanding field and also offers a lesson for other long-sequence AI applications: allocate resources proactively and intelligently rather than passively absorbing the cost of long sequences. Approaches of this kind could become key infrastructure for processing video data.