# Tango: A New Token Pruning Framework for Faster and More Accurate Video Large Models

> Tango achieves a 1.88x inference speedup while retaining 98.9% of the original performance with only 10% of video tokens preserved, thanks to diversity-driven attention selection and Spatio-Temporal Rotary Position Encoding (ST-RoPE).

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-10T17:59:56.000Z
- Last activity: 2026-04-13T02:50:57.831Z
- Popularity: 83.2
- Keywords: Video LLM, token pruning, attention mechanism, efficient inference, multimodal AI, Tango, visual understanding
- Page link: https://www.zingnex.cn/en/forum/thread/tango
- Canonical: https://www.zingnex.cn/forum/thread/tango

---

## [Main Floor] Tango Framework: A New Breakthrough in Efficient Inference for Video Large Models

Tango is a token pruning framework that targets the inference cost of video large models. Its two core innovations are a diversity-driven attention selection strategy and Spatio-Temporal Rotary Position Encoding (ST-RoPE). With only 10% of video tokens retained, it keeps 98.9% of the original performance on LLaVA-OV and delivers a 1.88x inference speedup, offering a new path toward efficient inference for video large models.

## Background: Efficiency Dilemma of Video Large Models and Token Pruning Technology

Video Large Language Models (Video LLMs) are highly capable, but because video spans both space and time, token sequences grow very long, which slows inference and drives up memory usage. Token pruning is the mainstream remedy: keep only the most important tokens so that less computation is needed. Existing approaches fall into two families, attention-based selection and similarity clustering.
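For readers new to the idea, attention-based selection can be sketched in a few lines. This is a generic top-k baseline, not Tango's method; the function name `topk_prune`, the `keep_ratio` parameter, and the scores are all hypothetical values chosen for illustration.

```python
def topk_prune(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in original order."""
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token.
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so the kept tokens stay in sequence.
    return sorted(ranked[:k])


# Hypothetical attention scores for 10 video tokens (not real data).
scores = [0.02, 0.30, 0.25, 0.01, 0.03, 0.05, 0.20, 0.04, 0.06, 0.04]
kept = topk_prune(scores, keep_ratio=0.3)
print(kept)  # [1, 2, 6]
```

Only the tokens at the kept indices would be passed on to the language model; everything else is dropped before the expensive layers run.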

## Two Major Limitations of Existing Token Pruning Methods

The Tango team identified two shortcomings in existing strategies:

1. Traditional top-k attention selection tends to miss regions that carry complementary information, so the model's understanding of the video is incomplete.
2. Similarity clustering easily produces fragmented small clusters, and the representation that results from pooling each cluster is distorted, which hurts downstream tasks.
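The pooling problem in the second point can be seen in a deliberately extreme toy case. The sketch below assumes mean pooling and two orthogonal 2-dimensional embeddings that ended up in the same cluster; the helper names and vectors are made up for illustration and do not come from the Tango work.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two toy token embeddings merged into one cluster (hypothetical values).
token_a = [1.0, 0.0]
token_b = [0.0, 1.0]

pooled = mean_pool([token_a, token_b])
sim_a = cosine(pooled, token_a)
sim_b = cosine(pooled, token_b)
print(round(sim_a, 3), round(sim_b, 3))  # 0.707 0.707
```

The pooled vector matches neither original token closely, which is the kind of distortion the second limitation refers to. Real clusters are rarely this badly matched, but the effect grows as clusters become less coherent.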

## Two Key Innovations of the Tango Framework

To address these issues, Tango proposes:

1. Diversity-driven attention selection: balances attention scores against regional diversity so that the kept tokens cover different spatio-temporal segments.
2. Spatio-Temporal Rotary Position Encoding (ST-RoPE): explicitly models spatio-temporal continuity and preserves the original geometric structure of the video.
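The thread does not spell out either algorithm, so the following is only a minimal sketch of how the two ideas could fit together, under loudly stated assumptions: diversity is approximated by a greedy pick with a per-frame penalty, and "preserving geometric structure" is read as letting each kept token keep its original (frame, row, column) coordinate when rotary angles are computed. The names `diverse_select`, `original_coords`, `rotary_angles`, the `penalty` value, and the scores are all hypothetical and are not Tango's actual implementation.

```python
def diverse_select(scores, segment_ids, k, penalty=0.1):
    """Greedy pick: a high score wins, but crowded segments are penalized."""
    chosen = []
    picked_per_segment = {}
    remaining = set(range(len(scores)))
    for _ in range(min(k, len(scores))):
        best = max(
            sorted(remaining),  # sorted so ties resolve to the lowest index
            key=lambda i: scores[i]
            - penalty * picked_per_segment.get(segment_ids[i], 0),
        )
        chosen.append(best)
        remaining.remove(best)
        seg = segment_ids[best]
        picked_per_segment[seg] = picked_per_segment.get(seg, 0) + 1
    return sorted(chosen)


def original_coords(index, height, width):
    """Map a flat, frame-major token index back to (frame, row, col)."""
    frame, rest = divmod(index, height * width)
    row, col = divmod(rest, width)
    return frame, row, col


def rotary_angles(coord, pairs_per_axis=2, base=10000.0):
    """Rotary angles for one token; each axis gets its own frequency bands."""
    angles = []
    for position in coord:
        for j in range(pairs_per_axis):
            angles.append(position * base ** (-j / pairs_per_axis))
    return angles


# Hypothetical scores: 3 frames of 2x2 tokens; frame 0 dominates the ranking.
HEIGHT, WIDTH = 2, 2
scores = [0.30, 0.28, 0.26, 0.24,
          0.10, 0.22, 0.05, 0.02,
          0.20, 0.04, 0.03, 0.01]
segment_ids = [original_coords(i, HEIGHT, WIDTH)[0] for i in range(len(scores))]

kept = diverse_select(scores, segment_ids, k=3)
coords = [original_coords(i, HEIGHT, WIDTH) for i in kept]
print(kept)    # [0, 5, 8]
print(coords)  # [(0, 0, 0), (1, 0, 1), (2, 0, 0)]
```

Plain top-3 on the same scores would return `[0, 1, 2]`, all from frame 0. In this sketch the kept tokens are not renumbered 0, 1, 2 after pruning: the angles from `rotary_angles` are computed from the original coordinates, so the gaps left by pruned tokens remain visible to the position encoding.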

## Experimental Validation: Balance Between Efficiency and Accuracy

Tango was evaluated on mainstream Video LLM architectures and benchmarks. With only 10% of video tokens retained, the LLaVA-OV model keeps 98.9% of its original performance and runs 1.88x faster. This makes the method practical for real-time video applications such as live Q&A, and the results across architectures suggest it generalizes well.
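One rough way to read the 1.88x figure is through Amdahl's law, which relates an overall speedup to the share of runtime that was actually accelerated. The helper `prunable_fraction` is hypothetical, and the 10x local speedup is an assumption made only for this back-of-envelope estimate (attention cost falls faster than linearly with token count, while other stages may not shrink at all); it is not a measurement from the source.

```python
def prunable_fraction(overall_speedup, local_speedup):
    """Solve Amdahl's law, overall = 1 / ((1 - p) + p / s), for p."""
    return (1 - 1 / overall_speedup) / (1 - 1 / local_speedup)


# 1.88x is the speedup reported for LLaVA-OV at 10% token retention.
# ASSUMPTION: work that depends on video tokens becomes exactly 10x cheaper.
p = prunable_fraction(overall_speedup=1.88, local_speedup=10.0)
print(f"{p:.2f}")  # 0.52
```

Under that assumption, roughly half of the original runtime would be work that scales with the number of video tokens, which is one plausible reason a 10x token reduction yields less than a 2x end-to-end speedup.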

## Technical Insights and Future Outlook

Tango's central lesson is that pruning has to balance three things: diverse coverage of the information in the video, preservation of its geometric structure, and efficiency. Its pruning approach may well become a standard component of video large model deployment, and it gives researchers and engineers a reference point when making architecture choices.
