FALCON: Solving Visual Redundancy and Fragmentation in High-Resolution Multimodal Large Models Using Visual Registers

FALCON is a joint work by Harbin Institute of Technology (HIT) and Huawei Noah's Ark Lab accepted by ICCV 2025. It addresses two core issues—visual redundancy and fragmentation—in high-resolution multimodal large language models through an innovative Visual Register technique, achieving a balance between elastic efficiency and robust perception.

Tags: multimodal large models · high-resolution visual encoding · ICCV 2025 · visual question answering · document understanding
Published 2026-04-05 17:32 · Recent activity 2026-04-05 17:48 · Estimated read: 7 min

Section 01

FALCON: Solving Core Issues in High-Resolution Multimodal Large Models Using Visual Registers

FALCON, a joint work by HIT Shenzhen and Huawei Noah's Ark Lab accepted at ICCV 2025, tackles two core issues in high-resolution multimodal large language models, visual redundancy and visual fragmentation, through an innovative Visual Register mechanism that balances elastic efficiency with robust perception. The complete code and pre-trained models have been open-sourced.


Section 02

The Dilemma of High-Resolution Visual Encoding

Current mainstream multimodal large models face two major problems when processing high-resolution images: visual redundancy (information overlap in high-resolution tokens, wasting computing resources and diluting attention) and visual fragmentation (block-based processing splits continuous objects, breaking semantic coherence). Traditional solutions are trade-offs: token compression alleviates redundancy but exacerbates fragmentation, while retaining full tokens leads to high computational costs.


Section 03

Visual Register: An Elastic and Efficient Intermediate Representation

The Visual Register proposed by FALCON is a learnable intermediate representation layer between the visual encoder and the language model, by analogy with hardware registers that stage data for fast access. It consists of a fixed number of learnable tokens; the original visual tokens interact with these register tokens via cross-attention, writing information into the registers. This both bounds computational complexity (addressing redundancy) and adaptively aggregates related information across image blocks (alleviating fragmentation).
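The mechanism above can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not the official FALCON code: a fixed set of `R` learnable tokens cross-attends over a variable number `N` of visual tokens, so the language model always receives `R` tokens regardless of image resolution. Module name, dimensions, and initialization are assumptions for the example.

```python
import torch
import torch.nn as nn

class VisualRegister(nn.Module):
    """Minimal sketch of a register layer (illustrative, not the official code):
    a fixed set of learnable tokens aggregates a variable number of visual
    tokens via cross-attention, so the LM sees R tokens regardless of N."""

    def __init__(self, num_registers: int = 64, dim: int = 1024, heads: int = 8):
        super().__init__()
        # R learnable register tokens, shared across all inputs.
        self.registers = nn.Parameter(torch.randn(num_registers, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D), where N varies with image resolution.
        B = visual_tokens.size(0)
        q = self.registers.unsqueeze(0).expand(B, -1, -1)      # (B, R, D)
        # Registers query the visual tokens; cost is O(N*R), not O(N^2).
        out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return out                                              # (B, R, D)

reg = VisualRegister(num_registers=64, dim=1024)
feats = torch.randn(2, 2880, 1024)  # e.g. a tiled high-res image -> 2880 patch tokens
compact = reg(feats)
print(compact.shape)                # torch.Size([2, 64, 1024])
```

Because the registers attend over all visual tokens jointly, information belonging to one object that was split across image blocks can be fused into the same register token, which is how this design addresses fragmentation as well as redundancy.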


Section 04

Dual-Path Information Flow Architecture Design

FALCON adopts a dual-path information-flow architecture: the original high-resolution image is encoded into a feature map by the visual encoder; the resulting visual tokens are then distilled by the register layer (register tokens act as Query and visual tokens as Key/Value in cross-attention, so information is extracted into the register tokens); finally, the register tokens are concatenated with the text instruction and fed into the language model. The number of registers is adjustable, allowing a flexible trade-off between efficiency and accuracy.
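The efficiency gain from this design is easy to see with back-of-envelope arithmetic. The numbers below are illustrative, not figures from the paper: feeding `R` register tokens instead of `N` raw visual tokens into the language model shrinks the self-attention cost roughly quadratically.

```python
# Illustrative cost comparison (hypothetical token counts, not paper figures).
def attn_cost(seq_len: int) -> int:
    """Self-attention scales roughly O(L^2) in sequence length."""
    return seq_len * seq_len

n_visual = 2880   # hypothetical token count for a tiled high-res image
text_len = 128    # hypothetical instruction length

for n_registers in (64, 144, 256):  # the register count is the elastic knob
    full = attn_cost(n_visual + text_len)
    compact = attn_cost(n_registers + text_len)
    print(f"R={n_registers}: ~{full / compact:.0f}x cheaper LM attention")
# R=64:  ~245x cheaper LM attention
# R=144: ~122x cheaper LM attention
# R=256: ~61x cheaper LM attention
```

This is the sense in which the trade-off is "elastic": raising the register count buys accuracy at a smooth, predictable computational cost rather than forcing an all-or-nothing choice between full tokens and aggressive compression.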


Section 05

Experimental Validation: Win-Win of Efficiency and Accuracy

Experimental validation shows that FALCON leads in accuracy and significantly reduces computational overhead in tasks like visual question answering, image-text retrieval, and document understanding: compared to baseline methods, it maintains or improves performance even when the number of visual tokens is compressed by an order of magnitude. It has a notable advantage especially in document understanding tasks, proving its effectiveness in aggregating fragmented information. The project open-sources the 8B-parameter model Falcon-8B (on HuggingFace) and provides a well-encapsulated inference interface JiutianHDInfer to lower the barrier to use.


Section 06

Engineering Implementation and Usability

FALCON is built on PyTorch, supports Flash Attention acceleration, and has a clear modular design. The installation process is simple (conda environment), and the inference interface is user-friendly: you can create an instance by specifying the model path and dialogue mode; the inference method accepts image paths and text questions and returns answers, hiding preprocessing details. It also provides training scripts and configuration examples, supporting continued training of the base model or domain adaptation.
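The interface described above might look like the following illustrative stub. The class name, constructor arguments, and method signature here are assumptions made for the sketch; the real JiutianHDInfer API may differ, so consult the project repository for the actual usage.

```python
# Illustrative stub mirroring the described call shape (hypothetical;
# not the real JiutianHDInfer API -- see the project README for actuals).
class HDInferStub:
    def __init__(self, model_path: str, conv_mode: str):
        # Create an instance by specifying the model path and dialogue mode.
        self.model_path = model_path
        self.conv_mode = conv_mode

    def infer(self, image_path: str, question: str) -> str:
        # The real implementation would preprocess the image, run FALCON,
        # and decode the answer; preprocessing details stay hidden here.
        return f"[answer to {question!r} about {image_path}]"

engine = HDInferStub("checkpoints/falcon-8b", conv_mode="v1")
answer = engine.infer("doc.png", "What is the invoice total?")
print(answer)
```

The appeal of such an interface is exactly what the section describes: the caller passes an image path and a question, and all resolution tiling, tokenization, and register handling happen out of sight.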


Section 07

Technical Insights and Future Outlook

FALCON's technical route highlights the value of introducing structured intermediate representations into vision-language fusion: the Visual Register is not merely a compression tool but an information-reorganization mechanism. The idea extends naturally to related scenarios such as temporal redundancy in video and spatial fragmentation in 3D scenes, and it suggests that multimodal model optimization should pursue efficiency and accuracy jointly rather than trading one against the other.


Section 08

Summary: Value and Application Scenarios of FALCON

FALCON is an important advancement in the field of high-resolution multimodal large models. It solves both redundancy and fragmentation issues simultaneously through Visual Register, achieving a win-win of efficiency and accuracy. It is suitable for application scenarios requiring high-resolution visual input such as document analysis, medical imaging, and remote sensing image understanding, providing a powerful and practical solution.