Section 01
Introduction: SEATS, an Efficient Inference Optimization Scheme for Multimodal Large Language Models
Multimodal large language models (om-LLMs) can process video, audio, and text jointly, but the dense non-text tokens they ingest make inference computationally expensive. To address this, researchers propose SEATS, a phased adaptive token selection method. It is training-free: it selects tokens at inference time by exploiting the block-level decay pattern of inter-layer token dependencies. When retaining only 10% of audio-visual tokens, SEATS reduces FLOPs by 9.3x while maintaining 96.3% of the original performance, which makes it a practical optimization for deploying om-LLMs.
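To make "retaining only 10% of audio-visual tokens" concrete, here is a minimal single-step sketch in Python. It is not the SEATS algorithm: this section does not say how tokens are scored or how the phases are scheduled, so the function name `select_tokens`, the `scores` input, and the rule that text tokens are always kept are all assumptions made for illustration.

```python
import math


def select_tokens(scores, is_text, keep_ratio=0.10):
    """Return the indices of tokens to keep, in their original order.

    scores     -- hypothetical per-token importance scores (an assumption;
                  the actual SEATS scoring rule is not given in this section)
    is_text    -- one boolean per token; text tokens are always kept here
    keep_ratio -- fraction of non-text (audio-visual) tokens to retain
    """
    if len(scores) != len(is_text):
        raise ValueError("scores and is_text must have the same length")

    # Only non-text tokens are candidates for pruning.
    non_text = [i for i, text in enumerate(is_text) if not text]

    # Round up so a non-empty input keeps at least one token; the small
    # epsilon guards against floating-point overshoot (e.g. 0.1 * 30).
    n_keep = math.ceil(keep_ratio * len(non_text) - 1e-9) if non_text else 0

    # Rank candidates by descending score, breaking ties by position.
    ranked = sorted(non_text, key=lambda i: (-scores[i], i))
    kept = set(ranked[:n_keep])

    # Preserve the original sequence order of the surviving tokens.
    return [i for i in range(len(scores)) if is_text[i] or i in kept]


# Usage: 20 audio-visual tokens followed by 2 text tokens.
example_scores = [float(i) for i in range(20)] + [0.0, 0.0]
example_is_text = [False] * 20 + [True] * 2
kept_indices = select_tokens(example_scores, example_is_text, keep_ratio=0.10)
```

In this toy input, 2 of the 20 audio-visual tokens survive alongside both text tokens. A phased scheme would presumably apply a selection step like this at several layer blocks rather than once, but that schedule is not specified here.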