Zing Forum

CLEAR Framework: Enabling Multimodal Large Models to 'See Clearly' Even Under Blur, Noise, and Low Light

This article introduces the CLEAR framework, which improves how unified multimodal models understand degraded images by jointly optimizing generation and understanding.

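Tags: multimodal models, image degradation, image restoration, generative models, CLEAR framework, computer vision, artificial intelligence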
Published 2026-04-06 23:54 | Recent activity 2026-04-07 15:58 | Estimated read 7 min

Section 01

[Introduction] CLEAR Framework: Enabling Multimodal Large Models to 'See Clearly' Even in Degraded Images

This article introduces the CLEAR framework, which targets a weakness of unified multimodal models: their understanding ability drops on blurred, noisy, or low-light images. CLEAR addresses this by jointly optimizing generation and understanding, connecting the two in three steps. The reported experiments show a significant improvement on degraded images with no loss of performance on clear images, which suggests broad practical potential.

Section 02

[Background] The Dilemma of Degraded Image Understanding for Multimodal Models

Real-world images often suffer from degradation such as blur, noise, and low light, and the understanding ability of current multimodal large models declines sharply on such inputs. Unified multimodal models combine image understanding and generation, yet they fail to exploit that combination on degraded images for two reasons: a missing training paradigm (the generation capability is never used to support understanding) and an architectural gap (information is lost when a generated image is decoded and then re-encoded).
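
The three degradation types named above can be simulated on a clean image to see what such inputs look like. The following is a minimal NumPy sketch; the function names, kernel size, noise level, and gain/gamma values are illustrative assumptions, not the degradation pipeline used by CLEAR, which the article does not describe.

```python
import numpy as np

def box_blur(img, k=5):
    """Blur an HxW or HxWxC float image in [0, 1] with a k x k mean filter (k odd)."""
    pad = k // 2
    # Pad only the two spatial axes; leave any channel axis untouched.
    pad_width = [(pad, pad), (pad, pad)] + [(0, 0)] * (img.ndim - 2)
    padded = np.pad(img, pad_width, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def add_gaussian_noise(img, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def low_light(img, gain=0.25, gamma=2.0):
    """Darken an image: gamma > 1 crushes shadows, gain < 1 lowers overall brightness."""
    return np.clip(gain * np.power(img, gamma), 0.0, 1.0)
```

Stronger or weaker variants can be produced by varying `k`, `sigma`, and `gain`; how the degradation levels are actually defined is not stated in the article.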

Section 03

[Method] Three Key Steps of the CLEAR Framework

The CLEAR framework jointly optimizes generation and understanding in three steps:

  1. Supervised Fine-tuning: Build a degraded-image dataset and train the model to follow a "repair first, then understand" inference pattern;
  2. Latent Representation Bridge: Use a lightweight bridging module to convert the generation module's latent representation directly into features for the understanding module, avoiding the information loss and overhead of decoding and re-encoding;
  3. Interleaved GRPO Reinforcement Learning: Optimize the visual quality of the generated image and the correctness of the answer at the same time, so that each reinforces the other.
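
Steps 2 and 3 can be sketched in a few lines. The block below is a minimal NumPy illustration under stated assumptions: `LatentBridge` is a hypothetical two-layer MLP (the article only says the bridge is "lightweight"), `combined_reward` and its weights are an assumed way to mix the two reward signals, and `group_relative_advantages` is the standard GRPO normalization of rewards within a group of sampled responses. The interleaving schedule itself is not described in the article and is not modeled here.

```python
import numpy as np

class LatentBridge:
    """Hypothetical bridge: maps generation latents (N, d_gen) to understanding features (N, d_und)."""

    def __init__(self, d_gen, d_und, d_hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, d_gen ** -0.5, (d_gen, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, d_hidden ** -0.5, (d_hidden, d_und))
        self.b2 = np.zeros(d_und)

    def __call__(self, latents):
        # No pixel-space decode/re-encode: project the latents directly.
        hidden = np.maximum(latents @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2

def combined_reward(quality, correct, w_quality=0.5, w_answer=0.5):
    """Assumed mix of a visual-quality score in [0, 1] and a 0/1 answer-correctness flag."""
    quality = np.asarray(quality, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return w_quality * quality + w_answer * correct

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward normalized by its group's mean and std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage with toy values (not results from the article).
bridge = LatentBridge(d_gen=16, d_und=32)
features = bridge(np.zeros((4, 16)))  # 4 latent tokens in, features of shape (4, 32) out
rewards = combined_reward([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0])
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a response is favored only when it scores better than the other samples for the same input, on visual quality and answer correctness together.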

Section 04

[Evidence] MMD-Bench Evaluation and Experimental Results

The research team built the MMD-Bench evaluation benchmark, which covers 3 degradation levels and 6 multimodal tasks. The reported results are:

  • 15-20% accuracy improvement in mild degradation scenarios;
  • 25-35% improvement in moderate degradation;
  • A retained relative advantage in severe degradation.

Performance on clear images is reported as not compromised at all.
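
A per-level breakdown like the one above can be computed from per-sample results. The following is a minimal sketch; the record format `(level, is_correct)` and both function names are assumptions, since the article does not describe MMD-Bench's data format. It reports gains in percentage points; the article does not say whether its figures are percentage points or relative improvements.

```python
from collections import defaultdict

def accuracy_by_level(records):
    """records: iterable of (level, is_correct) pairs -> {level: accuracy in [0, 1]}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in totals}

def gain_in_points(baseline, improved):
    """Accuracy gain per level, in percentage points."""
    return {level: 100.0 * (improved[level] - baseline[level]) for level in baseline}
```

Reporting a separate "clear" level alongside the degraded ones is what makes it possible to check that gains on degraded images do not come at the cost of clean-image accuracy.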

Section 05

[In-depth Analysis] Alignment Between Task-Driven Optimization and Visual Quality

Ablation experiments found that removing pixel-level reconstruction supervision raises the perceived quality of the intermediate visual states the model generates. This suggests that, in degraded image repair, task-driven optimization and visual quality are naturally aligned: the model should generate content that "aids understanding" rather than replicate a reference pixel by pixel.
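
One way to picture this ablation is an objective with an optional pixel-level term that can be switched off. The sketch below is purely illustrative: the function name, the use of mean squared error, and the additive form are assumptions, and the article does not specify how CLEAR's supervision is actually formulated.

```python
import numpy as np

def training_objective(task_loss, restored, reference, pixel_weight=0.0):
    """Task loss plus an optional pixel-level reconstruction penalty (assumed MSE).

    pixel_weight=0.0 corresponds to the ablated setting with no pixel-level supervision.
    """
    restored = np.asarray(restored, dtype=float)
    reference = np.asarray(reference, dtype=float)
    pixel_loss = float(np.mean((restored - reference) ** 2))
    return task_loss + pixel_weight * pixel_loss
```

With `pixel_weight=0.0`, the intermediate image is shaped only by whether it helps the downstream answer, which is the setting the ablation describes.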

Section 06

[Application Prospects] Practical Application Scenarios of the CLEAR Framework

CLEAR can be applied to:

  • Autonomous driving: Improve the reliability of in-vehicle image understanding in rain, in fog, or at night;
  • Medical imaging: Assist diagnostic systems in processing low-quality medical images;
  • Security monitoring: Improve recognition in blurry surveillance images;
  • Digitization of historical archives: Better understand old photos and documents.

Section 07

[Conclusion and Outlook] Future Directions of Generation-Understanding Collaboration

The significance of the CLEAR framework lies in integrating generation and understanding: the AI actively "reconstructs" an image before understanding it, much as human cognition does. Future work could explore more complex degradation types, video scenarios, and cross-modal transfer to further advance multimodal AI.