# SEATS: Phased Adaptive Token Selection Optimization for Multimodal Large Language Models

> Omni-modal large language models (om-LLMs) incur heavy computational overhead when processing dense non-text tokens. To address this, researchers propose SEATS, a phased adaptive token selection method. Building on the observation that inter-layer token dependencies decay with depth in a block-level pattern, SEATS speeds up inference without any retraining. When retaining only 10% of audio-visual tokens, it reduces FLOPs by 9.3x while maintaining 96.3% of the original performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-19T15:55:16.000Z
- Last activity: 2026-05-20T02:52:10.134Z
- Popularity: 138.1
- Keywords: omni-modal large language models, token pruning, inference efficiency, multimodal fusion, attention mechanism, training-free optimization, vision-language models
- Page link: https://www.zingnex.cn/en/forum/thread/seats-token
- Canonical: https://www.zingnex.cn/forum/thread/seats-token
- Markdown source: floors_fallback

---

## Introduction: SEATS, an Efficient Inference Optimization Scheme for Multimodal Large Language Models

Omni-modal large language models (om-LLMs) can understand video, audio, and text simultaneously, but they incur heavy computational overhead when processing dense non-text tokens. Researchers propose SEATS, a phased adaptive token selection method that exploits the block-level decay of inter-layer token dependencies to speed up inference without any retraining. When retaining only 10% of audio-visual tokens, SEATS reduces FLOPs by 9.3x while maintaining 96.3% of the original performance, which makes it a relevant optimization for the practical deployment of om-LLMs.

## Computational Dilemma of Multimodal Large Language Models

om-LLMs encode video and audio into temporally ordered token sequences and interleave them with the text input to achieve deep fusion. This comes at a computational cost: a 10-second video (30fps) plus its audio and text can produce thousands of tokens, and pushing all of them through dozens of Transformer layers makes compute and memory usage prohibitive. Existing token pruning methods have three limitations: they are vision-centric and ignore the distinct characteristics of audio, they use static strategies that cannot adapt to how token importance changes across layers, and they neglect how cross-modal fusion evolves with depth.
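To make the "thousands of tokens" figure concrete, the following back-of-the-envelope sketch counts tokens for a 10-second clip. The frame sampling rate, tokens per frame, audio token rate, and text length are all hypothetical values chosen for illustration; the article does not specify them, and real encoders differ.

```python
def estimate_sequence_length(duration_s, sampled_fps, tokens_per_frame,
                             audio_tokens_per_s, text_tokens):
    """Return (video, audio, total) token counts for one clip."""
    video = int(duration_s * sampled_fps) * tokens_per_frame
    audio = int(duration_s * audio_tokens_per_s)
    return video, audio, video + audio + text_tokens


def attention_pairs(seq_len):
    """Query-key pairs computed by full self-attention in one layer and head."""
    return seq_len * seq_len


# Hypothetical settings (not from the article): the encoder samples 2 of the
# 30 frames per second, emits 144 tokens per frame and 25 audio tokens per second.
video, audio, total = estimate_sequence_length(
    duration_s=10, sampled_fps=2, tokens_per_frame=144,
    audio_tokens_per_s=25, text_tokens=50)
print(video, audio, total)  # 2880 250 3180

# Keeping 10% of the non-text tokens shrinks the sequence and, quadratically,
# the attention cost of every layer that sees the shorter sequence.
pruned_total = (video + audio) // 10 + 50
print(pruned_total, attention_pairs(total) / attention_pairs(pruned_total))
```

Under these assumed numbers, non-text tokens make up almost the entire sequence, which is why pruning them has such a large effect on cost.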

## Theoretical Basis of SEATS: Block-Level Decay Pattern of Inter-Layer Dependencies

The researchers found that the inter-layer dependencies of visual and audio tokens in om-LLMs follow a block-level pattern and decay with depth. Shallow layers need to retain more tokens to support cross-modal alignment. In the middle layers, modal information is gradually fused, so redundant tokens can be removed. In the deep layers, once cross-modal fusion is complete, most non-text tokens can be discarded. This observation motivates a depth-dependent token retention strategy: conservative in shallow layers, progressive in middle layers, and aggressive in deep layers.
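The depth-dependent strategy can be illustrated with a simple per-layer retention schedule. This is a minimal sketch, not the authors' schedule: the block boundaries, the linear decay in the middle layers, and the function name `retention_schedule` are assumptions made for illustration.

```python
def retention_schedule(num_layers, shallow_frac=0.25, deep_frac=0.25,
                       shallow_keep=1.0, middle_end_keep=0.1):
    """Fraction of non-text tokens kept at each layer.

    Shallow block: keep `shallow_keep` (conservative).
    Middle block: decay linearly down to `middle_end_keep` (progressive).
    Deep block: keep nothing (aggressive).
    All boundaries and ratios are hypothetical defaults.
    """
    shallow_end = int(num_layers * shallow_frac)
    deep_start = num_layers - int(num_layers * deep_frac)
    span = deep_start - shallow_end
    schedule = []
    for layer in range(num_layers):
        if layer < shallow_end:
            schedule.append(shallow_keep)
        elif layer < deep_start:
            t = (layer - shallow_end + 1) / span  # reaches 1.0 at the last middle layer
            schedule.append(shallow_keep + t * (middle_end_keep - shallow_keep))
        else:
            schedule.append(0.0)
    return schedule


schedule = retention_schedule(num_layers=28)
for layer, keep in enumerate(schedule):
    print(f"layer {layer:2d}: keep {keep:.2f}")
```

A linear decay is only one possible choice; the source describes budgets that are allocated dynamically, so a real schedule would depend on the input rather than on fixed constants.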

## Three-Stage Adaptive Token Selection Architecture of SEATS

SEATS is a training-free, phased framework with three stages:

1. Pre-input spatio-temporal redundancy elimination: before tokens enter the LLM, remove redundant visual tokens (spatio-temporal redundancy) and audio tokens (time-frequency redundancy) via attention-weighted diversity selection.
2. Progressive pruning inside the LLM: after each Transformer block, dynamically allocate a token budget based on query relevance scores and gradually remove unimportant tokens.
3. Full pruning in deep layers: remove all non-text tokens in the deep layers of the LLM (the last 1/4 to 1/3 of the layers), since the fused information has already been encoded into the text representations.
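The in-LLM stages can be sketched as a ranking step over non-text tokens. The snippet below is a simplified illustration, not the SEATS implementation: it uses a plain dot product with a query vector as the relevance score and a fixed keep ratio instead of a dynamically allocated budget, and the names `Token` and `prune_by_query_relevance` are hypothetical.

```python
from typing import List, Sequence, Tuple

# (modality, hidden vector); modality is "text", "video", or "audio"
Token = Tuple[str, Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def prune_by_query_relevance(tokens: List[Token], query: Sequence[float],
                             keep_ratio: float) -> List[Token]:
    """Keep all text tokens plus the most query-relevant non-text tokens.

    The relevance score (dot product) and the fixed `keep_ratio` are
    simplifying assumptions. Original token order is preserved.
    """
    non_text = [i for i, (modality, _) in enumerate(tokens) if modality != "text"]
    budget = int(len(non_text) * keep_ratio)
    ranked = sorted(non_text, key=lambda i: dot(tokens[i][1], query), reverse=True)
    kept = set(ranked[:budget])
    return [tok for i, tok in enumerate(tokens) if tok[0] == "text" or i in kept]


tokens = [
    ("video", [1.0, 0.0]),
    ("video", [0.0, 1.0]),
    ("audio", [0.9, 0.1]),
    ("audio", [0.1, 0.2]),
    ("text", [0.5, 0.5]),
]
query = [1.0, 0.0]

# Stage 2 (progressive pruning): keep half of the non-text tokens.
middle = prune_by_query_relevance(tokens, query, keep_ratio=0.5)
print([m for m, _ in middle])  # ['video', 'audio', 'text']

# Stage 3 (deep layers): a keep ratio of zero leaves only text tokens.
deep = prune_by_query_relevance(middle, query, keep_ratio=0.0)
print([m for m, _ in deep])  # ['text']
```

Ranking video and audio tokens in one pool lets the budget shift toward whichever modality is more relevant to the query, which is one way to avoid a vision-centric bias; whether SEATS pools modalities this way is not stated in the source.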

## Experimental Results of SEATS: Balance Between Efficiency and Performance

SEATS was evaluated on Qwen2.5-Omni and Qwen3-Omni. The aggressive configuration (retaining 10% of audio-visual tokens) achieves a 9.3x reduction in FLOPs and a 4.8x prefill speedup while maintaining 96.3% of the original performance. The trade-off at different retention rates is as follows:

| Audio-visual tokens retained | Performance retained | FLOPs reduction |
| --- | --- | --- |
| 50% | 98.5% | 3.2x |
| 25% | 97.4% | 5.8x |
| 10% | 96.3% | 9.3x |
| 5% | 92.1% | 12.6x |

The performance loss is concentrated in fine-grained perception tasks, while higher-level semantic tasks retain more of their performance.
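The reported figures can be turned into a quick comparison of how much compute remains and how much performance is given up at each retention rate. The sketch below only rearranges the numbers quoted above; the derived columns (remaining FLOPs as a share of the baseline, performance drop in percentage points) are simple arithmetic and not results from the source.

```python
# (token retention rate, performance retained in %, FLOPs reduction factor),
# as reported in the text above.
reported = [
    (0.50, 98.5, 3.2),
    (0.25, 97.4, 5.8),
    (0.10, 96.3, 9.3),
    (0.05, 92.1, 12.6),
]

remaining_flops = {}
performance_drop = {}
for retention, performance, reduction in reported:
    remaining_flops[retention] = 100.0 / reduction       # % of baseline FLOPs
    performance_drop[retention] = 100.0 - performance    # percentage points lost

for retention, _, _ in reported:
    print(f"keep {retention:4.0%}: "
          f"{remaining_flops[retention]:5.1f}% of baseline FLOPs, "
          f"{performance_drop[retention]:4.1f} points lost")
```

Read this way, the step from 10% to 5% retention costs noticeably more performance than the earlier steps for a comparatively small additional saving in FLOPs.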

## Implications of SEATS for Multimodal System Design

SEATS has three implications for multimodal system design:

1. It improves on the usual trade-off between computation and quality, finding better points on the Pareto frontier.
2. Because it requires no retraining, it lowers the integration threshold and can be used in a plug-and-play fashion.
3. It reveals the layered functions of cross-modal fusion (alignment in shallow layers, fusion in middle layers, reasoning in deep layers), which provides guidance for model architecture design.

## Limitations of SEATS and Future Research Directions

SEATS has several limitations: it is task-sensitive (fine-grained perception tasks degrade more under pruning), complex long videos remain difficult to handle, and its audio processing is relatively simple. Future directions include task-aware pruning (dynamically adjusting the strategy to the task), hierarchical temporal modeling (multi-scale temporal representations), and joint optimization (combining token selection with architecture search).
