Multimodal large language models such as GPT-4V, Gemini, and Qwen2.5-Omni can now understand text, images, audio, and video simultaneously. This capability comes at an enormous computational cost: an input containing just a few minutes of video can generate tens of thousands of visual tokens, plus audio tokens on top, easily exceeding the model's context window.
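A quick back-of-envelope calculation shows how fast this adds up. The rates below are assumptions chosen to be typical, not figures measured from any specific model:

```python
# Illustrative token counts; fps, tokens_per_frame, and the audio rate
# are assumed values, not numbers from any particular encoder.
fps = 1                     # sample one video frame per second
tokens_per_frame = 256      # ViT-style patch tokens per frame
audio_tokens_per_sec = 25   # a common audio-encoder frame rate
minutes = 3

video_tokens = minutes * 60 * fps * tokens_per_frame   # 46,080
audio_tokens = minutes * 60 * audio_tokens_per_sec     #  4,500
print(video_tokens + audio_tokens)                     # ~50k tokens before any text
```

At roughly 50k tokens for three minutes of audio-visual input, even a generous context window fills up quickly.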
Traditional solutions compress tokens from all modalities uniformly, but this ignores a key fact: different queries depend on audio and video to different degrees. Some questions mainly require visual information ("What color is the car?"), some rely on audio ("What song is playing?"), and others need a combination of both.
How can multimodal tokens be compressed efficiently while maintaining model performance? The OmniSelect project answers this with a dynamic, modality-aware compression scheme.
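To make the idea concrete, here is a minimal sketch of what query-conditioned, modality-aware token selection could look like. This illustrates the general technique only, not OmniSelect's actual algorithm: the function names, the cosine-similarity relevance score, the softmax budget split, and the `temperature` parameter are all assumptions.

```python
import numpy as np

def modality_budgets(query_emb, modality_tokens, total_budget, temperature=1.0):
    """Split a total token budget across modalities in proportion to
    query relevance (cosine similarity of query vs. pooled tokens)."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = []
    for tokens in modality_tokens.values():
        pooled = tokens.mean(axis=0)
        scores.append(q @ (pooled / np.linalg.norm(pooled)))
    weights = np.exp(np.array(scores) / temperature)
    weights /= weights.sum()
    return {m: int(round(w * total_budget))
            for m, w in zip(modality_tokens, weights)}

def select_tokens(query_emb, tokens, k):
    """Keep the k tokens most similar to the query, in original order."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = (tokens / np.linalg.norm(tokens, axis=1, keepdims=True)) @ q
    keep = np.sort(np.argsort(sims)[::-1][:k])  # top-k, then restore order
    return tokens[keep]

# Toy usage: 8,000 video tokens and 1,500 audio tokens, budget of 1,024.
rng = np.random.default_rng(0)
dim = 64
query = rng.normal(size=dim)
tokens = {"video": rng.normal(size=(8000, dim)),
          "audio": rng.normal(size=(1500, dim))}
budgets = modality_budgets(query, tokens, total_budget=1024)
compressed = {m: select_tokens(query, tokens[m], k) for m, k in budgets.items()}
print(budgets)
print({m: t.shape for m, t in compressed.items()})
```

Proportional allocation means an audio-heavy query automatically keeps more audio tokens without any hard-coded split; in a real system, the pooled-similarity score here would presumably be replaced by whatever relevance signal the model exposes, such as cross-attention weights.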