# iLLaVA: Compress Visual Tokens of Multimodal Large Models to Below 1/3, Accepted by ICLR 2026

> The Tianjin University team proposes iLLaVA, which achieves end-to-end acceleration by recursively merging redundant visual tokens in both the visual encoder and the LLM. It doubles throughput and shortens prefill time by about 4x while maintaining model performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-28T10:44:16.000Z
- Last activity: 2026-03-28T10:49:22.894Z
- Popularity: 157.9
- Keywords: multimodal large models, vision-language models, token compression, model acceleration, ICLR 2026, Qwen3-VL, visual encoder optimization
- Page link: https://www.zingnex.cn/en/forum/thread/illava-token1-3-iclr-2026
- Canonical: https://www.zingnex.cn/forum/thread/illava-token1-3-iclr-2026

---

## Introduction: iLLaVA, End-to-End Efficiency Optimization for Multimodal Large Models (ICLR 2026)

iLLaVA, proposed by a team at Tianjin University, accelerates large vision-language models end to end by recursively merging redundant visual tokens in both the visual encoder and the LLM. The reported gains are about 2x higher throughput and about 4x shorter prefill time, with accuracy comparable to the original model. The work has been accepted to ICLR 2026, and the code is open source.

## Research Background and Motivation

Large Vision-Language Models (LVLMs) have made significant progress, but high redundancy in visual inputs limits their efficiency. Existing acceleration methods mostly reduce image tokens in the LLM stage and overlook the visual encoder, which is itself a computational bottleneck. Because the visual encoder produces most of the LLM's input tokens, removing redundancy in the encoder both speeds up the encoder and lightens the LLM's load. On this basis, the Tianjin University team proposes iLLaVA, which jointly optimizes the visual encoder and the LLM for end-to-end acceleration.

## Core Method: Recursive Token Merging and Information Recovery Mechanism

### Visual Encoder Stage (ViT Stage)
By default, tokens are merged at layers 5, 6, 7, and 8 with a retention ratio of 0.85 per layer, which reduces the number of visual tokens at the source, before they reach the LLM.

### LLM Stage
By default, tokens are merged at layers 19, 21, 23, and 25 with a retention ratio of 0.9 per layer, further compressing the visual information.
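To get a feel for how these defaults add up, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not the repository's code: it assumes that each merge layer keeps the stated fraction of the tokens it receives and that the ratios simply compound across layers. The function name and the starting count of 1000 tokens are made up for illustration.

```python
def cumulative_retention(keep_ratio: float, num_merge_layers: int) -> float:
    """Fraction of tokens left after merging at `num_merge_layers` layers,
    assuming every layer keeps `keep_ratio` of the tokens it receives."""
    return keep_ratio ** num_merge_layers


# Defaults quoted above: four merge layers per stage.
vit_keep = cumulative_retention(0.85, 4)   # ViT layers 5, 6, 7, 8
llm_keep = cumulative_retention(0.90, 4)   # LLM layers 19, 21, 23, 25
total_keep = vit_keep * llm_keep           # both stages combined

# Hypothetical starting count, chosen only to make the numbers concrete.
initial_tokens = 1000
final_tokens = round(initial_tokens * total_keep)

print(f"kept after ViT stage:   {vit_keep:.3f}")
print(f"kept after both stages: {total_keep:.3f}")
print(f"tokens: {initial_tokens} -> about {final_tokens}")
```

Under these assumptions the ViT stage keeps about 52% of its tokens and the two stages together keep about 34%, i.e. roughly one third. The exact counts in practice depend on how the implementation rounds and on the configuration used.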

### Information Recovery Mechanism
During merging, useful information is extracted from the discarded tokens and integrated into the retained tokens, so that key visual information is carried forward rather than lost. This is what allows the model to maintain its performance.
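The idea of folding discarded tokens into retained ones can be illustrated with a small, self-contained sketch. It is not iLLaVA's actual algorithm: the importance scores, the cosine-similarity assignment, and the plain averaging are all assumptions chosen for simplicity, and `merge_tokens` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens and average each dropped token
    into its most similar kept token (illustrative assumption only)."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:n_keep])      # preserve the original token order
    dropped = order[n_keep:]

    # Each kept token starts a group containing only itself.
    groups = {i: [tokens[i]] for i in kept}
    for j in dropped:
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[target].append(tokens[j])

    # A merged token is the mean of its group, so dropped tokens still contribute.
    merged = []
    for i in kept:
        members = groups[i]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


# Toy input: two pairs of near-duplicate 2-D tokens with made-up scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.1, 0.8, 0.2]
merged = merge_tokens(tokens, scores, keep_ratio=0.5)
print(merged)
```

In this toy case, tokens 0 and 2 are kept and each absorbs its near-duplicate neighbor, so the output has two tokens that sit between the originals instead of simply dropping half of the input.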

## Technical Implementation and Parameter Configuration

iLLaVA is implemented on top of Qwen3-VL and LLaVA-OneVision and exposes the following configuration options:
- `enable_illava_vit`: Whether to enable ViT stage merging (default True)
- `illava_vit_k`: ViT merging layers (default "5-6-7-8")
- `illava_vit_r`: Retention ratio per ViT layer (default 0.85)
- `illava_vit_mode`: ViT merging mode (default 3, clustering based on Pv^i/Pv^c)
- `enable_illava_llm`: Whether to enable LLM stage merging (default True)
- `illava_llm_k`: LLM merging layers (default "19-21-23-25")
- `illava_llm_r`: Retention ratio per LLM layer (default 0.9)
- `illava_llm_mode`: LLM merging mode (default 3)

Users can adjust the strategy according to the scenario to balance efficiency and performance.
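As an illustration of how these options fit together, the defaults listed above can be collected in a small container and checked before use. The `ILLaVAConfig` class, `parse_layers`, and `validate` are hypothetical helpers written for this example; only the option names and default values come from the list above, and the way the repository actually accepts these options is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ILLaVAConfig:
    """Hypothetical container mirroring the documented options and defaults."""
    enable_illava_vit: bool = True
    illava_vit_k: str = "5-6-7-8"
    illava_vit_r: float = 0.85
    illava_vit_mode: int = 3
    enable_illava_llm: bool = True
    illava_llm_k: str = "19-21-23-25"
    illava_llm_r: float = 0.9
    illava_llm_mode: int = 3


def parse_layers(spec: str) -> list:
    """Turn a dash-separated layer string such as "5-6-7-8" into [5, 6, 7, 8]."""
    return [int(part) for part in spec.split("-") if part]


def validate(cfg: ILLaVAConfig) -> None:
    """Reject retention ratios outside (0, 1]; this check is our own assumption."""
    for name in ("illava_vit_r", "illava_llm_r"):
        ratio = getattr(cfg, name)
        if not 0.0 < ratio <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {ratio}")


cfg = ILLaVAConfig()
validate(cfg)
print(parse_layers(cfg.illava_vit_k), parse_layers(cfg.illava_llm_k))

# A more conservative variant: merge only in the LLM stage.
llm_only = ILLaVAConfig(enable_illava_vit=False)
```

Disabling one stage or raising a retention ratio toward 1.0 trades some speed for a smaller change to the model's inputs, which is the kind of adjustment the options are meant to allow.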

## Experimental Results: Significant Efficiency Improvement and Performance Preservation

### Efficiency Improvement
- Throughput: about 2x higher
- Prefill time: about 4x shorter
- Memory usage: 1.7x to 2x lower

### Performance Preservation
With token compression, accuracy stays comparable to the original model. In addition, a larger model (e.g., InternVL-2.5 26B) optimized with iLLaVA outperforms a smaller model (e.g., InternVL-2.5 8B) in both accuracy and efficiency, which breaks the usual trade-off between model size and speed.

### Benchmark Coverage
iLLaVA supports multiple benchmarks for image understanding (MMMU, MME, etc.) and video understanding (Video-MME, InternVid, etc.).

## Comparison with Existing Methods and Visualization Tools

Compared with existing methods such as FastV, iLLaVA's joint two-stage strategy compresses visual information in both the visual encoder and the LLM, which leads to larger efficiency gains. iLLaVA also provides visualization tools for inspecting the token merging process, which can inform future research.

## Practical Applications and Deployment Support

iLLaVA provides complete deployment support:
- **Fast Inference**: `run_inference_once_qwen3vl.py` supports single/multiple image and video inference
- **Offline Demo**: `demo_qwen3vl.py` provides a Gradio interface, listening on port 7862 by default
- **Multi-GPU Support**: Multi-GPU parallel inference via torchrun
- **Model Compatibility**: The main branch supports Qwen3-VL; there are also Qwen2-VL and LLaVA-OneVision branches

- Project link: https://github.com/hulianyuyy/iLLaVA
- Paper link: https://arxiv.org/abs/2412.06263

## Research Significance, Summary, and Future Outlook

### Research Significance
- **Theoretical Aspect**: Highlights the key role of the visual encoder in LVLM efficiency optimization, makes the case for end-to-end joint optimization, and offers new ideas for architecture design.
- **Practical Aspect**: Offers a feasible route to deploying LVLMs in resource-constrained environments (mobile, edge computing), where faster inference improves the user experience.

### Summary
iLLaVA achieves end-to-end acceleration through two-stage recursive merging of redundant visual tokens. The information recovery mechanism helps preserve accuracy, and the flexible configuration and deployment support make it suitable for practical applications.

### Future Outlook
As multimodal model applications expand, compression techniques of this kind will become more important. The open-source code and visualization tools lay a foundation for further exploration by the community.
