Zing Forum

iLLaVA: Compress Visual Tokens of Multimodal Large Models to Below 1/3, Accepted by ICLR 2026

The Tianjin University team proposes the iLLaVA method, which achieves end-to-end acceleration by recursively merging redundant visual tokens in both the visual encoder and LLM stages. It doubles throughput and reduces prefill time by 4x while maintaining model performance.

Multimodal Large Models · Vision-Language Models · Token Compression · Model Acceleration · ICLR 2026 · Qwen3-VL · Visual Encoder Optimization
Published 2026-03-28 18:44 · Recent activity 2026-03-28 18:49 · Estimated read: 8 min

Section 01

Introduction: iLLaVA, End-to-End Optimization of Multimodal Large Model Efficiency (Accepted by ICLR 2026)

The Tianjin University team proposes the iLLaVA method, which achieves end-to-end acceleration by recursively merging redundant visual tokens in both the visual encoder and LLM stages: it doubles throughput and reduces prefill time by 4x while maintaining model performance. This research has been accepted by ICLR 2026, and the code is open-sourced.

Section 02

Research Background and Motivation

Large Vision-Language Models (LVLMs) have made significant progress, but high redundancy in visual inputs limits their efficiency. Existing acceleration methods mostly focus on reducing image tokens in the LLM stage, but ignore the visual encoder as a computational bottleneck. The visual encoder is the main source of input tokens for the LLM; reducing redundancy in the encoder stage can accelerate the encoder itself and reduce the LLM's load. Based on this, the Tianjin University team proposes the iLLaVA method, aiming to jointly optimize the visual encoder and LLM for end-to-end acceleration.

Section 03

Core Method: Recursive Token Merging and Information Recovery Mechanism

Visual Encoder Stage (ViT Stage)

By default, tokens are merged at layers 5, 6, 7, and 8, with a retention ratio of 0.85 per layer, reducing the number of visual tokens entering the LLM at the source.

LLM Stage

By default, tokens are merged at layers 19, 21, 23, and 25, with a retention ratio of 0.9 per layer, further compressing the visual token sequence.
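A quick back-of-the-envelope check shows how these two default schedules combine into the "below 1/3" figure in the title, assuming the per-layer retention ratio applies multiplicatively at each merge layer:

```python
def cumulative_retention(ratio: float, num_merge_layers: int) -> float:
    """Fraction of tokens surviving after merging at `num_merge_layers` layers,
    assuming the same retention ratio is applied at each of them."""
    return ratio ** num_merge_layers

vit = cumulative_retention(0.85, 4)   # ViT stage: layers 5, 6, 7, 8
llm = cumulative_retention(0.90, 4)   # LLM stage: layers 19, 21, 23, 25
overall = vit * llm

print(f"ViT stage keeps ~{vit:.1%} of tokens")   # ~52.2%
print(f"LLM stage keeps ~{llm:.1%} of those")    # ~65.6%
print(f"End to end:     ~{overall:.1%}")         # ~34.2%, roughly a third
```

So with the defaults, roughly a third of the original visual tokens survive end to end; tightening either ratio pushes the fraction below 1/3.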

Information Recovery Mechanism

When merging, useful information is extracted from discarded tokens and integrated into the retained tokens, so that key visual information is not lost; this recovery step is central to maintaining performance.
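As a minimal sketch of this idea (not the paper's exact algorithm): keep the highest-norm tokens, and instead of discarding the rest, fold each dropped token into its most similar retained token via a running mean. The scoring rule and similarity measure here are illustrative assumptions.

```python
import numpy as np

def merge_with_recovery(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` highest-norm tokens; fold each discarded token into
    its most similar kept token by running-mean averaging (a simple stand-in
    for iLLaVA's information-recovery step)."""
    norms = np.linalg.norm(tokens, axis=1)
    order = np.argsort(-norms)                  # highest-norm tokens first
    kept_idx, dropped_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    counts = np.ones(keep)                      # tokens absorbed per slot
    for i in dropped_idx:
        # cosine similarity between the dropped token and every kept token
        sims = kept @ tokens[i] / (np.linalg.norm(kept, axis=1) * norms[i] + 1e-8)
        j = int(np.argmax(sims))
        # absorb the dropped token into the nearest kept token
        kept[j] = (kept[j] * counts[j] + tokens[i]) / (counts[j] + 1)
        counts[j] += 1
    return kept

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
y = merge_with_recovery(x, keep=12)
print(y.shape)  # (12, 8)
```

The sequence shrinks from 16 to 12 tokens, but every input token still contributes to some surviving token, which is the intuition behind "no loss of key visual information".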

Section 04

Technical Implementation and Parameter Configuration

iLLaVA is implemented based on Qwen3-VL and LLaVA-OneVision, providing flexible configuration options:

  • enable_illava_vit: Whether to enable ViT stage merging (default True)
  • illava_vit_k: ViT merging layers (default "5-6-7-8")
  • illava_vit_r: Retention ratio per ViT layer (default 0.85)
  • illava_vit_mode: ViT merging mode (default 3, clustering based on Pv^i/Pv^c)
  • enable_illava_llm: Whether to enable LLM stage merging (default True)
  • illava_llm_k: LLM merging layers (default "19-21-23-25")
  • illava_llm_r: Retention ratio per LLM layer (default 0.9)
  • illava_llm_mode: LLM merging mode (default 3)

Users can adjust the strategy according to the scenario to balance efficiency and performance.
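For illustration, the options above could be carried in a plain config dict, with the "5-6-7-8"-style layer strings parsed into integer indices. The dict keys mirror the documented option names, but this dict and the `parse_layers` helper are hypothetical, not part of the released code:

```python
def parse_layers(spec: str) -> list[int]:
    """Turn a layer spec like "5-6-7-8" into [5, 6, 7, 8]."""
    return [int(part) for part in spec.split("-")]

# Defaults as documented; how the real code ingests them may differ.
illava_config = {
    "enable_illava_vit": True,
    "illava_vit_k": "5-6-7-8",
    "illava_vit_r": 0.85,
    "illava_vit_mode": 3,
    "enable_illava_llm": True,
    "illava_llm_k": "19-21-23-25",
    "illava_llm_r": 0.9,
    "illava_llm_mode": 3,
}

vit_layers = parse_layers(illava_config["illava_vit_k"])
llm_layers = parse_layers(illava_config["illava_llm_k"])
print(vit_layers, llm_layers)  # [5, 6, 7, 8] [19, 21, 23, 25]
```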

Section 05

Experimental Results: Significant Efficiency Improvement and Performance Preservation

Efficiency Improvement

  • Throughput increased by 2x
  • Prefill time reduced by 4x
  • Memory usage reduced by 1.7x to 2x

Performance Preservation

Token compression maintains accuracy comparable to the original model; larger models (e.g., InternVL-2.5 26B) optimized with iLLaVA outperform smaller models (e.g., InternVL-2.5 8B) in both accuracy and efficiency, breaking the traditional trade-off.

Benchmark Coverage

The evaluation covers multiple image-understanding benchmarks (MMMU, MME, etc.) and video-understanding benchmarks (Video-MME, InternVid, etc.).

Section 06

Comparison with Existing Methods and Visualization Tools

Compared to existing methods like FastV, iLLaVA's two-stage joint optimization strategy comprehensively compresses visual information from source to end, leading to more significant efficiency improvements. Additionally, iLLaVA provides visualization tools that allow intuitive observation of the token merging process, providing insights for future research.

Section 07

Practical Applications and Deployment Support

iLLaVA provides complete deployment support:

  • Fast Inference: run_inference_once_qwen3vl.py supports single/multiple image and video inference
  • Offline Demo: demo_qwen3vl.py provides a Gradio interface, default listening on port 7862
  • Multi-GPU Support: Multi-card parallel inference via torchrun
  • Model Compatibility: The main branch supports Qwen3-VL; there are also Qwen2-VL and LLaVA-OneVision branches

Project Link: https://github.com/hulianyuyy/iLLaVA
Paper Link: https://arxiv.org/abs/2412.06263

Section 08

Research Significance, Summary, and Future Outlook

Research Significance

  • Theoretical Aspect: Reveals the key role of the visual encoder in LVLMs efficiency optimization, proves the necessity of end-to-end joint optimization, and provides new ideas for architecture design.
  • Practical Aspect: Provides a feasible solution for deploying LVLMs in resource-constrained environments (mobile, edge computing), significantly improving user experience.

Summary

iLLaVA achieves end-to-end acceleration through two-stage recursive merging of redundant visual tokens; the information recovery mechanism ensures performance, and flexible configuration and deployment support make it suitable for practical applications.

Future Outlook

As multimodal model applications expand, similar compression technologies will become more important; open-source code and visualization tools lay the foundation for further exploration by the community.