SEATS: Phased Adaptive Token Selection Optimization for Multimodal Large Language Models

To cut the computational overhead that omni-modal large language models (om-LLMs) incur when processing dense non-text tokens, researchers propose SEATS, a phased adaptive token selection method. By exploiting the block-level decay pattern of inter-layer token dependencies, SEATS speeds up inference without any additional training. When retaining only 10% of audio-visual tokens, it reduces FLOPs by 9.3x while maintaining 96.3% of the original performance.

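Tags: omni-modal large language models, token pruning, inference efficiency, multimodal fusion, attention mechanism, training-free optimization, vision-language models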

Published 2026-05-19 23:55 | Recent activity 2026-05-20 10:52 | Estimated read 6 min

Section 01

Introduction: SEATS—An Efficient Inference Optimization Scheme for Multimodal Large Language Models

Omni-modal large language models (om-LLMs) can understand video, audio, and text simultaneously, but processing dense non-text tokens incurs heavy computational overhead. Researchers propose SEATS, a phased adaptive token selection method that exploits the block-level decay pattern of inter-layer token dependencies to speed up inference without additional training. When retaining only 10% of audio-visual tokens, SEATS reduces FLOPs by 9.3x while maintaining 96.3% of the original performance, a meaningful step toward practical deployment of om-LLMs.

Section 02

Computational Dilemma of Multimodal Large Language Models

om-LLMs encode video and audio into temporal token sequences and interleave them with text inputs to achieve deep fusion. This creates a computational problem: a 10-second, 30 fps video plus its audio and text can produce thousands of tokens, and pushing all of them through dozens of Transformer layers makes compute and memory costs prohibitive. Existing token pruning methods fall short in three ways: they are vision-centric (ignoring the distinct characteristics of audio), they use static strategies (which cannot adapt to how token importance changes across layers), and they overlook how cross-modal fusion evolves with depth.
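To make the scale concrete, here is a rough back-of-the-envelope sketch of how the sequence length adds up. The function name `estimate_tokens` and every rate in the example (sampled frames per second, tokens per frame, audio tokens per second, prompt length) are illustrative assumptions, not figures from the SEATS work or from any specific model.

```python
def estimate_tokens(duration_s, fps, tokens_per_frame, audio_tokens_per_s, text_tokens):
    """Rough sequence length for one clip.

    All rates are hypothetical inputs; real encoders differ in how they
    sample frames and how many tokens they emit per frame or per second.
    """
    video = duration_s * fps * tokens_per_frame
    audio = duration_s * audio_tokens_per_s
    return video + audio + text_tokens


# Assumed setup: the 30 fps source is subsampled to 2 frames/s,
# 64 tokens per frame, 25 audio tokens per second, 50 prompt tokens.
print(estimate_tokens(10, 2, 64, 25, 50))  # 1580
```

Even with heavy frame subsampling, the non-text tokens (1,530 in this assumed setup) far outnumber the text tokens, which is why pruning targets the audio-visual part of the sequence.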

Section 03

Theoretical Basis of SEATS: Block-Level Decay Pattern of Inter-Layer Dependencies

Researchers found that the inter-layer dependencies of visual and audio tokens in om-LLMs follow a block-level pattern and decay with depth. Shallow layers need to retain more tokens to support cross-modal alignment. In middle layers, modal information is gradually fused, so redundant tokens can be removed. In deep layers, once cross-modal fusion is complete, most non-text tokens can be discarded. This motivates a depth-dependent retention strategy: conservative in shallow layers, progressive in middle layers, and aggressive in deep layers.
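The conservative, progressive, aggressive pattern can be pictured as a per-layer retention schedule. The sketch below is only one possible shape under stated assumptions: the function `retention_schedule`, the block boundaries (first and last quarter of layers), and the linear decay in the middle block are hypothetical choices for illustration, not the schedule used by SEATS.

```python
def retention_schedule(num_layers, shallow_frac=0.25, deep_frac=0.25,
                       shallow_keep=1.0, mid_end_keep=0.1):
    """Fraction of non-text tokens to keep at each layer (hypothetical shape).

    Shallow block: keep `shallow_keep` of the tokens (conservative).
    Middle block:  decay linearly down to `mid_end_keep` (progressive).
    Deep block:    keep nothing (aggressive).
    """
    shallow_end = int(num_layers * shallow_frac)
    deep_start = num_layers - int(num_layers * deep_frac)
    schedule = []
    for layer in range(num_layers):
        if layer < shallow_end:
            keep = shallow_keep
        elif layer < deep_start:
            span = deep_start - shallow_end
            t = (layer - shallow_end + 1) / span  # reaches 1.0 at the last middle layer
            keep = shallow_keep + t * (mid_end_keep - shallow_keep)
        else:
            keep = 0.0
        schedule.append(round(keep, 3))
    return schedule


print(retention_schedule(8))
```

For an 8-layer toy model this keeps every non-text token in the first two layers, decays to 10% by the sixth layer, and drops to zero in the last two.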

Section 04

Three-Stage Adaptive Token Selection Architecture of SEATS

SEATS is a training-free, three-stage framework:
1. Pre-input spatio-temporal redundancy elimination: before tokens enter the LLM, remove redundant visual tokens (spatio-temporal redundancy) and audio tokens (time-frequency redundancy) via attention-weighted diversity selection.
2. Progressive pruning inside the LLM: after each Transformer block, dynamically allocate a token budget based on query relevance scores and gradually remove unimportant tokens.
3. Full pruning in deep layers: remove all non-text tokens in the deep layers of the LLM (the last 1/4 to 1/3 of layers), since the fused information has already been encoded into the text representations.
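The in-LLM pruning step can be sketched as a top-k selection over non-text tokens. This is a minimal illustration, assuming relevance scores have already been computed somehow (the source does not say how); the function `prune_non_text` and the toy scores are hypothetical, and the sketch does not cover the pre-input diversity selection stage.

```python
import math


def prune_non_text(modalities, scores, keep_ratio):
    """Return sorted indices to keep: every text token plus the
    highest-scoring share of non-text tokens.

    `scores` stands in for query relevance scores (assumed precomputed).
    `keep_ratio` is relative to the non-text tokens currently present.
    """
    non_text = [i for i, m in enumerate(modalities) if m != "text"]
    k = math.ceil(len(non_text) * keep_ratio)
    top = sorted(non_text, key=lambda i: scores[i], reverse=True)[:k]
    text = [i for i, m in enumerate(modalities) if m == "text"]
    return sorted(set(top) | set(text))  # preserve original token order


modalities = ["video", "video", "audio", "audio", "text", "text"]
scores = [0.9, 0.1, 0.4, 0.8, 0.0, 0.0]

print(prune_non_text(modalities, scores, 0.5))  # [0, 3, 4, 5]
print(prune_non_text(modalities, scores, 0.0))  # [4, 5]
```

A `keep_ratio` between 0 and 1 corresponds to the progressive stage, while `keep_ratio=0.0` reproduces the deep-layer behavior in which only text tokens remain. Returning indices in their original order keeps positional information intact for the layers that follow.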

Section 05

Experimental Results of SEATS: Balance Between Efficiency and Performance

SEATS was evaluated on Qwen2.5-Omni and Qwen3-Omni. The aggressive configuration (retaining 10% of audio-visual tokens) achieves a 9.3x reduction in FLOPs and a 4.8x prefill speedup while maintaining 96.3% of the original performance. Other retention rates trade off as follows: 50% retention gives 98.5% performance at a 3.2x FLOPs reduction, 25% gives 97.4% at 5.8x, and 5% gives 92.1% at 12.6x. The performance loss is concentrated in fine-grained perception tasks, while higher-level semantic tasks retain more of their performance.
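The reported reduction factors are easier to compare when converted into the share of FLOPs that remains and the performance given up. The short sketch below does only that arithmetic on the numbers quoted above; the helper `remaining_flops_fraction` and the table layout are illustrative assumptions.

```python
def remaining_flops_fraction(reduction_factor):
    """An N-fold FLOPs reduction leaves 1/N of the original compute."""
    return 1.0 / reduction_factor


# retention rate -> (relative performance in %, FLOPs reduction factor),
# copied from the figures reported in the text.
reported = {
    0.50: (98.5, 3.2),
    0.25: (97.4, 5.8),
    0.10: (96.3, 9.3),
    0.05: (92.1, 12.6),
}

for keep, (perf, reduction) in reported.items():
    remaining = remaining_flops_fraction(reduction)
    loss = 100.0 - perf
    print(f"keep {keep:>4.0%}: {remaining:.1%} of FLOPs remain, "
          f"{loss:.1f} points of performance lost")
```

Read this way, going from 10% to 5% retention costs noticeably more performance than any of the earlier steps, which is consistent with the 10% setting being presented as the aggressive configuration.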

Section 06

Implications of SEATS for Multimodal System Design

SEATS has three implications for system design:
1. It shifts the usual trade-off between computation and quality, reaching better points on the Pareto frontier.
2. Because it requires no training, it lowers the barrier to integration and can be used in a plug-and-play fashion.
3. It reveals the layered roles of cross-modal fusion (alignment in shallow layers, fusion in middle layers, reasoning in deep layers), which offers guidance for model architecture design.

Section 07

Limitations of SEATS and Future Research Directions

SEATS has three limitations: task sensitivity (fine-grained perception tasks degrade more under pruning), difficulty with long, complex videos, and relatively simple handling of audio. Future directions include task-aware pruning (adjusting the strategy dynamically per task), hierarchical temporal modeling (multi-scale temporal representations), and joint optimization (combining token selection with architecture search).
