AdaCodec: Predictive Visual Coding Boosts Video Multimodal Large Model Efficiency by 7x

AdaCodec applies predictive visual coding to exploit temporal redundancy in video: it sends full reference frames only when necessary and otherwise describes inter-frame changes with compact P-tokens, improving both the efficiency and the benchmark performance of video MLLMs.

Published 2026-06-02 01:56 | Recent activity 2026-06-02 13:52 | Estimated read 8 min

Section 01

AdaCodec: Predictive Visual Coding Boosts Video Multimodal Large Model Efficiency by 7x (Introduction)

Core Insights: AdaCodec uses predictive visual coding to exploit temporal redundancy in video. It sends full reference frames only when necessary and uses compact P-tokens to describe changes the rest of the time. With a 7x smaller visual token budget (32k vs. 224k tokens), it matches or improves on baseline performance for video MLLMs.

Original Authors & Sources:

  • Research Team: Paper author team (arXiv submission)
  • Source Platform: arXiv
  • Original Title: AdaCodec: A Predictive Visual Code for Video MLLMs
  • Original Link: http://arxiv.org/abs/2606.02569v1
  • Publication Date: June 1, 2026

Keywords: Video understanding, multimodal large models, visual coding, predictive coding, efficiency optimization, video MLLM, token compression, temporal redundancy

Section 02

Problem Background: Efficiency Bottlenecks of Video Multimodal Large Models

Video data has inherent temporal redundancy (adjacent frames share objects, backgrounds, etc.), but existing video MLLMs process each frame independently as RGB images, leading to a large number of redundant visual tokens.

Consequences of Inefficient Processing:

  1. Wasted Computing Resources: Redundant tokens consume a limited compute budget
  2. Increased Inference Latency: A large number of tokens significantly prolongs the time to first token

For example, when sampling multiple frames per second for long videos, the cumulative tokens can reach hundreds of thousands, restricting real-time performance and scalability.
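
To make the scale concrete, here is a back-of-the-envelope sketch that counts visual tokens when every sampled frame is encoded independently. The video length, sampling rate, and tokens-per-frame values are illustrative assumptions, not figures from the paper.

```python
def total_visual_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Tokens produced when every sampled frame is encoded independently."""
    num_frames = int(duration_s * fps)
    return num_frames * tokens_per_frame


# Hypothetical setting: a 10-minute video sampled at 2 frames per second,
# with 256 visual tokens per frame (all three values are assumptions).
tokens = total_visual_tokens(duration_s=600, fps=2, tokens_per_frame=256)
print(tokens)  # 307200
```

Even under these modest assumptions the count lands in the hundreds of thousands, and it grows linearly with both video length and sampling rate.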

Section 03

Core Idea and Technical Architecture: Innovative Application of Predictive Visual Coding

Core Insight: Leverage temporal correlations between video frames to send full reference frames only when the scene is unpredictable; otherwise, transmit compact change descriptions (drawing on inter-frame prediction in video compression but applied to MLLM visual coding).

Technical Architecture:

  1. Conditional Prediction Cost Evaluation: Evaluate prediction error for each frame to decide whether to use a reference frame
  2. Dual-Mode Coding Strategy:
    • Reference Frame Mode: Allocate full visual tokens when prediction cost is high
    • P-token Mode: When the scene is predictable, use P-tokens to describe motion, residuals, and scene changes (far fewer tokens than a full reference frame)
  3. Seamless LLM Integration: Encoded tokens can be directly input into Transformers without large-scale model modifications.
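
The decision logic described above can be sketched roughly as follows. This is a toy illustration, not the paper's method: the prediction cost is approximated by the mean absolute difference to the most recent reference frame, and the threshold, `FULL_TOKENS`, `P_TOKENS`, and `plan_frame_modes` are all hypothetical names and values chosen for the example.

```python
import numpy as np

FULL_TOKENS = 256  # assumed token count for a full reference frame
P_TOKENS = 16      # assumed token count for a P-token change description


def plan_frame_modes(frames, threshold=0.1):
    """Return one ("ref" | "p", token_count) decision per frame.

    Prediction cost is approximated by the mean absolute difference to the
    most recent reference frame; this stands in for whatever predictor a
    real system would use.
    """
    plan = []
    reference = None
    for frame in frames:
        if reference is None:
            cost = float("inf")  # the first frame has nothing to predict from
        else:
            cost = float(np.mean(np.abs(frame - reference)))
        if cost > threshold:
            plan.append(("ref", FULL_TOKENS))  # unpredictable: send a full frame
            reference = frame
        else:
            plan.append(("p", P_TOKENS))       # predictable: send compact changes
    return plan


# Toy video: three near-identical frames, a scene cut, then one more frame.
scene_a = np.zeros((8, 8))
scene_b = np.ones((8, 8))
frames = [scene_a, scene_a + 0.01, scene_a + 0.02, scene_b, scene_b + 0.01]

plan = plan_frame_modes(frames)
print([mode for mode, _ in plan])     # ['ref', 'p', 'p', 'ref', 'p']
print(sum(count for _, count in plan))  # 560, versus 5 * 256 = 1280 for all-full frames
```

The resulting token sequence is an ordinary list of per-frame token groups, which is in line with the claim that the encoded tokens can be fed to a Transformer without large architectural changes.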

Section 04

Experimental Results: Dual Breakthroughs in Efficiency and Performance

Benchmark Coverage: 11 video understanding benchmarks (long video understanding, general video question answering, fine-grained analysis)

Key Results:

  1. Performance Improvement with Same Budget: Under the same token budget as the Qwen3-VL-8B baseline, performance improved across all 11 benchmarks
  2. Extreme Compression Performance: Using only 1/7 of the budget (32k vs. 224k tokens):
    • Long video benchmarks exceeded the full-budget baseline
    • 5 general video benchmarks maintained or improved performance
    • Time to first token reduced from 9.26 seconds to 1.62 seconds (about 5.7x faster)

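Suggested reasons for the gains: removing redundant tokens filters out noise, concentrates attention on informative content, and leaves room to process a longer stretch of video within the same context.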
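
The headline ratios follow directly from the reported numbers. The short check below assumes "32k" and "224k" mean 32,000 and 224,000 tokens; the ratio is the same if they are powers-of-two multiples.

```python
full_budget = 224_000       # baseline token budget (reported as 224k)
compressed_budget = 32_000  # AdaCodec token budget (reported as 32k)
ttft_full_s = 9.26          # baseline time to first token, in seconds
ttft_compressed_s = 1.62    # AdaCodec time to first token, in seconds

budget_ratio = full_budget / compressed_budget
ttft_speedup = ttft_full_s / ttft_compressed_s

print(f"token budget reduction: {budget_ratio:.1f}x")       # 7.0x
print(f"time-to-first-token speedup: {ttft_speedup:.2f}x")  # 5.72x
```

Note that the "7x" in the title refers to the token budget, while the measured latency improvement is somewhat smaller.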

Section 05

Technical Significance and Application Prospects: From Blind Token Stacking to Intelligent Selection

Domain Impact: The work suggests a direction for video MLLMs: shifting from blind token stacking to intelligent information selection. It may inspire follow-up research on fine-grained sampling, adaptive coding, and cross-modal compression.

Practical Applications:

  1. Reduce inference costs
  2. Improve response speed
  3. Support longer videos
  4. Make edge deployment more feasible

Comparison with Traditional Video Compression:

Feature               | Traditional Video Compression | AdaCodec
Goal                  | Pixel-level reconstruction    | Semantic-level understanding
Evaluation Metric     | PSNR/SSIM                     | Downstream task performance
Information Retention | Full fidelity                 | Task-relevant retention
Compression Ratio     | Fixed                         | Adaptive

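The key difference is that AdaCodec's compression is task-oriented: it keeps what the downstream model needs to answer questions, not what a viewer needs to see.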

Section 06

Limitations and Future Directions: Room for Continuous Optimization

Current Limitations:

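  1. Highly dynamic scenes may frequently switch to reference frame mode, which reduces the token savings
  2. Results depend on the quality of the pre-trained visual encoder
  3. The predictor and the LLM are not yet optimized jointly, leaving room for end-to-end optimization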
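
The first limitation can be quantified with a simple model: if a fraction of frames falls back to reference frame mode, the average cost per frame is a weighted mix of the two modes. The token counts below (256 per reference frame, 16 per P-token frame) are hypothetical values for illustration, not numbers from the paper.

```python
def average_tokens_per_frame(ref_fraction: float,
                             full_tokens: int = 256,
                             p_tokens: int = 16) -> float:
    """Expected tokens per frame when `ref_fraction` of frames are reference frames."""
    if not 0.0 <= ref_fraction <= 1.0:
        raise ValueError("ref_fraction must be between 0 and 1")
    return ref_fraction * full_tokens + (1.0 - ref_fraction) * p_tokens


def compression_ratio(ref_fraction: float,
                      full_tokens: int = 256,
                      p_tokens: int = 16) -> float:
    """Token savings relative to encoding every frame as a full frame."""
    return full_tokens / average_tokens_per_frame(ref_fraction, full_tokens, p_tokens)


# Savings shrink quickly as more frames need a full reference.
for fraction in (0.05, 0.25, 0.75):
    print(f"{fraction:.2f} -> {compression_ratio(fraction):.2f}x")
```

Under these assumed values, a mostly static video compresses roughly 9x, while a video where three quarters of the frames are unpredictable compresses only about 1.3x.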

Future Directions:

  1. Hierarchical coding (handling changes at different time scales)
  2. Cross-modal prediction (audio/text-assisted video prediction)
  3. Dynamic budget allocation (adjusted based on task difficulty)
  4. End-to-end learning (joint training of predictor and LLM)

Section 07

Summary: A New Paradigm for Video MLLM Efficiency Optimization

AdaCodec addresses the efficiency bottleneck of video MLLMs through predictive visual coding. Its results show that exploiting the inherent structure of the data (temporal redundancy) can significantly improve efficiency without sacrificing performance.

As the volume of video content keeps growing, approaches like AdaCodec can lower the cost of AI video understanding and make it practical in more settings.