Zing Forum

iLLaVA: Compress Visual Tokens of Multimodal Large Models to Below 1/3, Accepted by ICLR 2026

The Tianjin University team proposes the iLLaVA method, which achieves end-to-end acceleration by recursively merging redundant visual tokens in both the visual encoder and LLM stages. It doubles throughput and reduces prefill time by 4x while maintaining model performance.

Multimodal Large Models · Vision-Language Models · Token Compression · Model Acceleration · ICLR 2026 · Qwen3-VL · Visual Encoder Optimization
Published 2026-03-28 18:44 · Recent activity 2026-03-28 18:49 · Estimated read: 8 min

Section 01

Introduction: iLLaVA, End-to-End Optimization of Multimodal Large Model Efficiency (Accepted by ICLR 2026)

The Tianjin University team proposes the iLLaVA method, which achieves end-to-end acceleration by recursively merging redundant visual tokens in both the visual encoder and LLM stages: it doubles throughput and reduces prefill time by 4x while maintaining model performance. This research has been accepted by ICLR 2026, and the code is open-sourced.

Section 02

Research Background and Motivation

Large Vision-Language Models (LVLMs) have made significant progress, but high redundancy in visual inputs limits their efficiency. Existing acceleration methods mostly focus on reducing image tokens in the LLM stage, but ignore the visual encoder as a computational bottleneck. The visual encoder is the main source of input tokens for the LLM; reducing redundancy in the encoder stage can accelerate the encoder itself and reduce the LLM's load. Based on this, the Tianjin University team proposes the iLLaVA method, aiming to jointly optimize the visual encoder and LLM for end-to-end acceleration.

Section 03

Core Method: Recursive Token Merging and Information Recovery Mechanism

Visual Encoder Stage (ViT Stage)

By default, tokens are merged at layers 5, 6, 7, and 8, with a retention ratio of 0.85 per layer, reducing the number of visual tokens entering the LLM at the source.

LLM Stage

By default, tokens are merged at layers 19, 21, 23, and 25, with a retention ratio of 0.9 per layer, further compressing the visual token sequence.
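A quick back-of-the-envelope check shows how these two default schedules combine into the "below 1/3" figure in the title, assuming the per-layer retention ratio applies multiplicatively at each merge layer:

```python
def cumulative_retention(ratio: float, num_merge_layers: int) -> float:
    """Fraction of tokens surviving after merging at `num_merge_layers` layers,
    assuming the same retention ratio is applied at each of them."""
    return ratio ** num_merge_layers

vit = cumulative_retention(0.85, 4)   # ViT stage: layers 5, 6, 7, 8
llm = cumulative_retention(0.90, 4)   # LLM stage: layers 19, 21, 23, 25
overall = vit * llm

print(f"ViT stage keeps ~{vit:.1%} of tokens")   # ~52.2%
print(f"LLM stage keeps ~{llm:.1%} of those")    # ~65.6%
print(f"End to end:     ~{overall:.1%}")         # ~34.2%, roughly a third
```

So with the defaults, roughly a third of the original visual tokens survive end to end; tightening either ratio pushes the fraction below 1/3.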

Information Recovery Mechanism

When merging, useful information is extracted from discarded tokens and integrated into the retained tokens, so that key visual information is not lost; this recovery step is central to maintaining performance.
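As a minimal sketch of this idea (not the paper's exact algorithm): keep the highest-norm tokens, and instead of discarding the rest, fold each dropped token into its most similar retained token via a running mean. The scoring rule and similarity measure here are illustrative assumptions.

```python
import numpy as np

def merge_with_recovery(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` highest-norm tokens; fold each discarded token into
    its most similar kept token by running-mean averaging (a simple stand-in
    for iLLaVA's information-recovery step)."""
    norms = np.linalg.norm(tokens, axis=1)
    order = np.argsort(-norms)                  # highest-norm tokens first
    kept_idx, dropped_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    counts = np.ones(keep)                      # tokens absorbed per slot
    for i in dropped_idx:
        # cosine similarity between the dropped token and every kept token
        sims = kept @ tokens[i] / (np.linalg.norm(kept, axis=1) * norms[i] + 1e-8)
        j = int(np.argmax(sims))
        # absorb the dropped token into the nearest kept token
        kept[j] = (kept[j] * counts[j] + tokens[i]) / (counts[j] + 1)
        counts[j] += 1
    return kept

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
y = merge_with_recovery(x, keep=12)
print(y.shape)  # (12, 8)
```

The sequence shrinks from 16 to 12 tokens, but every input token still contributes to some surviving token, which is the intuition behind "no loss of key visual information".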

Section 04

Technical Implementation and Parameter Configuration

iLLaVA is implemented based on Qwen3-VL and LLaVA-OneVision, providing flexible configuration options:

  • enable_illava_vit: Whether to enable ViT stage merging (default True)
  • illava_vit_k: ViT merging layers (default "5-6-7-8")
  • illava_vit_r: Retention ratio per ViT layer (default 0.85)
  • illava_vit_mode: ViT merging mode (default 3, clustering based on Pv^i/Pv^c)
  • enable_illava_llm: Whether to enable LLM stage merging (default True)
  • illava_llm_k: LLM merging layers (default "19-21-23-25")
  • illava_llm_r: Retention ratio per LLM layer (default 0.9)
  • illava_llm_mode: LLM merging mode (default 3)

Users can adjust the strategy according to the scenario to balance efficiency and performance.
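For illustration, the options above could be carried in a plain config dict, with the "5-6-7-8"-style layer strings parsed into integer indices. The dict keys mirror the documented option names, but this dict and the `parse_layers` helper are hypothetical, not part of the released code:

```python
def parse_layers(spec: str) -> list[int]:
    """Turn a layer spec like "5-6-7-8" into [5, 6, 7, 8]."""
    return [int(part) for part in spec.split("-")]

# Defaults as documented; how the real code ingests them may differ.
illava_config = {
    "enable_illava_vit": True,
    "illava_vit_k": "5-6-7-8",
    "illava_vit_r": 0.85,
    "illava_vit_mode": 3,
    "enable_illava_llm": True,
    "illava_llm_k": "19-21-23-25",
    "illava_llm_r": 0.9,
    "illava_llm_mode": 3,
}

vit_layers = parse_layers(illava_config["illava_vit_k"])
llm_layers = parse_layers(illava_config["illava_llm_k"])
print(vit_layers, llm_layers)  # [5, 6, 7, 8] [19, 21, 23, 25]
```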

Section 05

Experimental Results: Significant Efficiency Improvement and Performance Preservation

Efficiency Improvement

  • Throughput increased by 2x
  • Prefill time reduced by 4x
  • Memory usage reduced by 1.7x to 2x

Performance Preservation

Token compression maintains accuracy comparable to the original model; larger models (e.g., InternVL-2.5 26B) optimized with iLLaVA outperform smaller models (e.g., InternVL-2.5 8B) in both accuracy and efficiency, breaking the traditional trade-off.

Benchmark Coverage

The evaluation covers multiple image-understanding benchmarks (MMMU, MME, etc.) and video-understanding benchmarks (Video-MME, InternVid, etc.).

Section 06

Comparison with Existing Methods and Visualization Tools

Compared to existing methods like FastV, iLLaVA's two-stage joint optimization strategy comprehensively compresses visual information from source to end, leading to more significant efficiency improvements. Additionally, iLLaVA provides visualization tools that allow intuitive observation of the token merging process, providing insights for future research.

Section 07

Practical Applications and Deployment Support

iLLaVA provides complete deployment support:

  • Fast Inference: run_inference_once_qwen3vl.py supports single/multiple image and video inference
  • Offline Demo: demo_qwen3vl.py provides a Gradio interface, default listening on port 7862
  • Multi-GPU Support: Multi-card parallel inference via torchrun
  • Model Compatibility: The main branch supports Qwen3-VL; there are also Qwen2-VL and LLaVA-OneVision branches

Project Link: https://github.com/hulianyuyy/iLLaVA
Paper Link: https://arxiv.org/abs/2412.06263

Section 08

Research Significance, Summary, and Future Outlook

Research Significance

  • Theoretical Aspect: Reveals the key role of the visual encoder in LVLMs efficiency optimization, proves the necessity of end-to-end joint optimization, and provides new ideas for architecture design.
  • Practical Aspect: Provides a feasible solution for deploying LVLMs in resource-constrained environments (mobile, edge computing), significantly improving user experience.

Summary

iLLaVA achieves end-to-end acceleration through two-stage recursive merging of redundant visual tokens; the information recovery mechanism ensures performance, and flexible configuration and deployment support make it suitable for practical applications.

Future Outlook

As multimodal model applications expand, similar compression technologies will become more important; open-source code and visualization tools lay the foundation for further exploration by the community.