Zing Forum

Tango: A New Token Pruning Framework for Faster and More Accurate Video Large Models

Tango achieves a 1.88x inference speedup while retaining 98.9% of original performance with only 10% of video tokens preserved, thanks to diversity-driven attention selection and Spatio-Temporal Rotary Position Encoding (ST-RoPE).

Video LLM, token pruning, attention mechanism, efficient inference, multimodal AI, Tango, visual understanding
Published 2026-04-11 01:59 · Recent activity 2026-04-13 10:50 · Estimated read 4 min

Section 01

[Main Floor] Tango Framework: A New Breakthrough in Efficient Inference for Video Large Models

Tango is a token pruning framework proposed to address the efficiency bottlenecks of video large models. Its core innovations are a diversity-driven attention selection strategy and Spatio-Temporal Rotary Position Encoding (ST-RoPE). When only 10% of video tokens are retained, it maintains 98.9% of the original performance and achieves a 1.88x inference speedup, offering a new path toward efficient inference for video large models.

Section 02

Background: Efficiency Dilemma of Video Large Models and Token Pruning Technology

Video Large Language Models (Video LLMs) have outstanding capabilities, but the spatio-temporal nature of video makes token sequences grow explosively, causing slow inference and high memory usage. Token pruning is a mainstream remedy: select the key tokens and discard the rest to cut computational load. Existing approaches include attention-based selection and similarity clustering.
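As a concrete reference point, the attention-based baseline mentioned above can be sketched as follows: score each video token by its (unnormalized) attention from a query vector and keep the top-k. This is a minimal illustration, not the exact scoring used by any particular model.

```python
import numpy as np

def topk_attention_prune(q, keys, k):
    # q: (d,) query/text vector; keys: (n, d) video token keys.
    # Score each token by scaled dot-product attention logits,
    # then keep the indices of the k highest-scoring tokens.
    d = q.shape[-1]
    scores = keys @ q / np.sqrt(d)   # (n,) attention logits
    keep = np.argsort(scores)[-k:]   # top-k token indices
    return np.sort(keep)             # return in original order
```

Note that this criterion looks only at scores; two near-duplicate tokens with high scores are both kept, which is exactly the redundancy problem the next section describes.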

Section 03

Two Major Limitations of Existing Token Pruning Methods

The Tango team identified two shortcomings in existing strategies: 1. traditional top-k attention selection tends to miss regions carrying complementary information, leading to incomplete understanding; 2. similarity clustering tends to produce fragmented small clusters whose pooled representations are distorted, hurting downstream tasks.

Section 04

Two Key Innovations of the Tango Framework

To address these issues, Tango proposes: 1. diversity-driven attention selection, which balances attention scores against regional diversity so that the kept tokens cover different spatio-temporal segments; 2. Spatio-Temporal Rotary Position Encoding (ST-RoPE), which explicitly models spatio-temporal continuity and preserves the original geometric structure of the retained tokens.
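The article does not give Tango's exact selection rule, but "balancing scores against diversity" can be sketched with a greedy, MMR-style criterion: at each step pick the token whose attention score is high *and* whose similarity to already-selected tokens is low. The trade-off weight `lam` below is an assumed hyperparameter for illustration.

```python
import numpy as np

def diverse_topk(scores, feats, k, lam=0.5):
    # Greedy diversity-aware selection (a sketch; Tango's exact
    # criterion may differ). scores: (n,) attention scores;
    # feats: (n, d) token features; k: budget; lam: score/diversity weight.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    selected = [int(np.argmax(scores))]          # seed with the best token
    max_sim = feats @ feats[selected[0]]         # cosine sim to selected set
    for _ in range(k - 1):
        gain = lam * scores - (1.0 - lam) * max_sim
        gain[selected] = -np.inf                 # never re-pick a token
        j = int(np.argmax(gain))
        selected.append(j)
        max_sim = np.maximum(max_sim, feats @ feats[j])
    return sorted(selected)
```

With two near-duplicate high-score tokens and one distinct lower-score token, plain top-k keeps both duplicates, while this rule keeps one duplicate plus the distinct token, covering more of the scene.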
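The article also does not spell out ST-RoPE's formulation. One common way to extend rotary encoding to video, sketched below under assumption, is to give each token a (time, height, width) coordinate, split the channels into three groups, and apply standard 1D RoPE per axis; the channel split and axis order here are assumptions, not Tango's published design.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    # Standard 1D RoPE: rotate each channel pair of x by pos * inv_freq.
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def st_rope(x, t, h, w):
    # Assumed layout: channels split evenly into three groups, one per
    # axis (time, height, width), each rotated by its own coordinate.
    d = x.shape[-1]
    assert d % 6 == 0, "need an even per-axis channel split"
    g = d // 3
    parts = [rope_1d(x[..., i * g:(i + 1) * g], p)
             for i, p in enumerate((t, h, w))]
    return np.concatenate(parts, axis=-1)
```

Because each rotation is applied per coordinate, the relative spatio-temporal offset between two tokens survives even after pruning removes the tokens between them, which is the geometric-structure property the section describes.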

Section 05

Experimental Validation: Balance Between Efficiency and Accuracy

Across mainstream Video LLM architectures and benchmarks, Tango with a 10% token budget keeps the LLaVA-OV model at 98.9% of its original performance while delivering a 1.88x speedup. This makes it practical for real-time video applications (such as live Q&A), and the method generalizes well across architectures.

Section 06

Technical Insights and Future Outlook

Tango emphasizes balancing information diversity, geometric structure, and efficiency. Its pruning approach could become a standard component of video large model deployment, helping researchers and engineers make better-informed architecture choices.