Zing Forum

ETCHR: Enhancing Visual Reasoning Capabilities of Multimodal Large Models via Image Editing

This article introduces ETCHR, a problem-conditional, reasoning-aware image editing model. Through two-stage training it bridges the gap between language understanding and image editing, improving the reasoning of multimodal large models on tasks such as fine-grained perception, chart understanding, and logical reasoning.

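Tags: multimodal large models, visual reasoning, image editing, chain-of-thought, MLLM, decoupled architecture, fine-grained perception, chart understanding, logical reasoning, AI enhancement
Published 2026-05-23 01:58 · Recent activity 2026-05-25 11:54 · Estimated read 9 min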

Section 01

Core Guide to the ETCHR Framework: Enhancing Visual Reasoning of Multimodal Large Models via Image Editing

Core Guide to the ETCHR Framework

ETCHR is a problem-conditional, reasoning-aware image editing model. It bridges the gap between language understanding and image editing with a decoupled architecture (the understanding model is separate from the editing model) and a two-stage training scheme. This improves multimodal large models on tasks such as fine-grained perception, chart understanding, and logical reasoning.

  • Source: Published on arXiv on May 22, 2026
  • Core Innovations: Decoupled design + two-stage training
  • Effect: Pass@1 gains of 4.61 to 5.47 percentage points on Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5

The sections below cover the background, method, experiments, and applications.

Section 02

Bottlenecks in Visual Reasoning: Limitations of Pure Text Chain-of-Thought and Existing Solutions

Bottlenecks in Visual Reasoning

Limitations of Pure Text Chain-of-Thought

When humans solve complex visual problems, they manipulate images (zoom, rotate, highlight, etc.) to aid thinking. However, current MLLMs can only passively receive fixed images, and this "read-only" mode limits their ability to handle complex tasks.

Shortcomings of Existing Solutions

  • Fixed Toolset Approach: Limited to a predefined set of tools, so it lacks flexibility and cannot generate visual aids customized to the problem
  • Unified Multimodal Approach: In end-to-end models, generation and understanding tasks compete for resources, which leads to noisy results

These limitations motivate ETCHR's decoupled design.

Section 03

Core Concepts of ETCHR: Decoupled Architecture and Key Designs

Core Concepts and Architecture of ETCHR

Decoupled Design

ETCHR separates the understanding task from the editing task:

  • Understanding Model: Focuses on visual understanding and reasoning (compatible with any MLLM)
  • Editing Model: Focuses on problem-conditional image editing (the core of ETCHR)

Architecture Components

  • Input Encoding: Image encoder + text encoder + fusion module to integrate multimodal information
  • Editing Generation: Decoder autoregressively generates operation sequences like cropping/zooming/highlighting
  • Image Rendering: Differentiable rendering module applies editing operations to generate new images
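
The "operation sequence, then rendering" idea above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `EditOp` schema, the box convention, and the nested-list image are hypothetical, and this renderer is plain Python rather than the differentiable module the article describes.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities


@dataclass(frozen=True)
class EditOp:
    """One step in an edit sequence (hypothetical schema, not from the paper)."""
    kind: str                                      # "crop", "zoom", or "highlight"
    box: Tuple[int, int, int, int] = (0, 0, 0, 0)  # (top, left, bottom, right), end-exclusive
    factor: int = 1                                # integer upscale factor, used by "zoom"


def apply_op(img: Image, op: EditOp) -> Image:
    top, left, bottom, right = op.box
    if op.kind == "crop":
        return [row[left:right] for row in img[top:bottom]]
    if op.kind == "zoom":
        # Nearest-neighbour upscaling: repeat each pixel and each row `factor` times.
        return [
            [px for px in row for _ in range(op.factor)]
            for row in img
            for _ in range(op.factor)
        ]
    if op.kind == "highlight":
        # Brighten pixels inside the box, capped at 255.
        return [
            [min(255, px + 100) if top <= r < bottom and left <= c < right else px
             for c, px in enumerate(row)]
            for r, row in enumerate(img)
        ]
    raise ValueError(f"unknown edit kind: {op.kind}")


def render(img: Image, ops: List[EditOp]) -> Image:
    """Apply an edit sequence in order and return the edited image."""
    for op in ops:
        img = apply_op(img, op)
    return img


# Usage: crop the centre of a 4x4 image, enlarge it 2x, then highlight a corner.
source = [[r * 4 + c for c in range(4)] for r in range(4)]
edited = render(source, [
    EditOp("crop", box=(1, 1, 3, 3)),
    EditOp("zoom", factor=2),
    EditOp("highlight", box=(0, 0, 2, 2)),
])
```

Representing edits as a short sequence of discrete operations is what lets a decoder emit them token by token; the renderer then only has to apply them in order.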

Key Features

  • Problem-conditional: Generates customized edits for specific problems
  • Reasoning context-aware: Uses intermediate reasoning results to optimize edits
  • Progressive editing: Supports multi-step coherent operations

This design targets two gaps. On the language side, an abstract problem has to be converted into concrete editing intentions. On the generation side, edit quality tends to degrade over multi-step editing.
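
A minimal sketch of how the decoupled pieces might interact, assuming a simple alternating loop. The `editor` and `reasoner` callables, their signatures, and the stopping rule are hypothetical stand-ins, not the interfaces used by ETCHR; images are represented as strings only to keep the example self-contained.

```python
from typing import Callable, List, Optional, Tuple

Image = str  # placeholder type; a real system would pass pixel data

# Hypothetical interfaces:
#   editor(image, problem, reasoning_so_far)   -> edited image
#   reasoner(image, problem, reasoning_so_far) -> (reasoning step, answer or None)
Editor = Callable[[Image, str, List[str]], Image]
Reasoner = Callable[[Image, str, List[str]], Tuple[str, Optional[str]]]


def solve_with_edits(
    image: Image,
    problem: str,
    editor: Editor,
    reasoner: Reasoner,
    max_steps: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Alternate editing and reasoning until the reasoner commits to an answer."""
    reasoning: List[str] = []
    for _ in range(max_steps):
        # Editing side: the edit is conditioned on the problem and on reasoning so far.
        image = editor(image, problem, reasoning)
        # Understanding side: any MLLM could fill this role; it is not retrained here.
        step, answer = reasoner(image, problem, reasoning)
        reasoning.append(step)
        if answer is not None:
            return answer, reasoning
    return None, reasoning


# Usage with stub models: the reasoner answers once it has seen two edits.
def stub_editor(image: Image, problem: str, reasoning: List[str]) -> Image:
    return image + "+edit"


def stub_reasoner(image: Image, problem: str, reasoning: List[str]):
    answer = "42" if image.count("+edit") >= 2 else None
    return f"inspected {image}", answer


answer, trace = solve_with_edits("img", "What is the value?", stub_editor, stub_reasoner)
```

Because the loop only depends on the two callables, swapping in a different understanding model would not require touching the editor, which is the point of the decoupled design.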

Section 04

Two-Stage Training Scheme of ETCHR

Two-Stage Training Scheme

Stage 1: Reasoning Imitation (Addressing the Language Side Gap)

  • Data: Large-scale edit trajectory dataset (original image + problem + reasoning chain + edit sequence + result image)
  • Training: Supervised fine-tuning to learn mapping from problem + reasoning process to edit operations
  • Goal: Enable the model to understand "why" to edit and "what" to edit
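
To make the Stage 1 data format concrete, here is a sketch of one trajectory record and how it could be turned into a supervised training pair. The field names and the prompt layout are assumptions for illustration; the article only lists which pieces each example contains.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EditTrajectory:
    """One supervised example; field names are illustrative, not the paper's schema."""
    image_path: str             # original image
    problem: str                # the question being asked
    reasoning_chain: List[str]  # intermediate reasoning steps
    edit_sequence: List[str]    # edit operations, serialized as text
    result_image_path: str      # image after applying the edits


def to_sft_pair(t: EditTrajectory) -> Tuple[str, str]:
    """Build the (prompt, target) text for supervised fine-tuning.

    The image would go through the image encoder; only the text side is shown.
    """
    prompt = f"Problem: {t.problem}\nReasoning: " + " ".join(t.reasoning_chain)
    target = "\n".join(t.edit_sequence)
    return prompt, target


example = EditTrajectory(
    image_path="chart.png",
    problem="Which bar is tallest?",
    reasoning_chain=["The bars on the right are hard to compare."],
    edit_sequence=["crop 0 50 100 100", "zoom 2"],
    result_image_path="chart_edited.png",
)
prompt, target = to_sft_pair(example)
```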

Stage 2: Reasoning Enhancement (Addressing the Generation Side Gap)

  • Reward Signal: Dual rewards (edit correctness + downstream reasoning accuracy)
  • Training: Reinforcement learning (PPO/DPO) to optimize the reward combination
  • Goal: Ensure edit quality remains stable as reasoning depth increases
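
The article does not say how the two reward terms are combined. One common choice, shown here purely as an assumption, is a weighted sum; the function name, the `edit_weight` parameter, and the [0, 1] scaling are all hypothetical.

```python
def combined_reward(
    edit_correct: float,
    answer_correct: float,
    edit_weight: float = 0.5,
) -> float:
    """Weighted sum of the two reward terms, each expected in [0, 1].

    edit_correct:   score for whether the edit itself is correct
    answer_correct: score for whether downstream reasoning reaches the right answer
    edit_weight:    assumed mixing weight; the source does not specify one
    """
    if not 0.0 <= edit_weight <= 1.0:
        raise ValueError("edit_weight must lie in [0, 1]")
    return edit_weight * edit_correct + (1.0 - edit_weight) * answer_correct
```

Under this sketch, an edit that looks correct but does not help the downstream model still earns only part of the reward, which matches the stated goal of tying edit quality to reasoning accuracy.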

Both stages are needed: Stage 1 teaches the model what to edit and why, and Stage 2 keeps edit quality stable as reasoning deepens.

Section 05

Experimental Evaluation: ETCHR Delivers Significant Reasoning Improvements

Experimental Results and Analysis

Task Coverage

ETCHR is evaluated on five task types: fine-grained perception, chart understanding, logical reasoning, puzzle restoration, and 3D understanding.

Model Improvement Data

  • Qwen3-VL-8B: Pass@1 55.95 → 60.77 (+4.82)
  • Gemini-3.1-Flash-Lite: Pass@1 65.08 → 70.55 (+5.47)
  • Kimi K2.5: Pass@1 76.55 → 81.16 (+4.61)
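
The gains above are absolute differences in Pass@1, that is, percentage points. The short check below recomputes them from the reported before/after scores and also derives the relative improvement, which is a different number; the variable names are just for illustration.

```python
# (before, after) Pass@1 scores as reported in the list above
results = {
    "Qwen3-VL-8B": (55.95, 60.77),
    "Gemini-3.1-Flash-Lite": (65.08, 70.55),
    "Kimi K2.5": (76.55, 81.16),
}

# Absolute gain in percentage points
gains = {
    name: round(after - before, 2)
    for name, (before, after) in results.items()
}

# Relative gain as a percentage of the baseline score
relative = {
    name: round(100 * (after - before) / before, 1)
    for name, (before, after) in results.items()
}

for name in results:
    print(f"{name}: +{gains[name]} points, {relative[name]}% relative")
```

Stronger baselines leave less headroom, so the same absolute gain corresponds to a smaller relative improvement for Kimi K2.5 than for Qwen3-VL-8B.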

Task-Level Performance

  • Fine-grained perception shows the largest improvement (+6 to 8%)
  • Chart understanding and puzzle restoration show clear improvement (+4 to 7%)
  • Logical reasoning and 3D understanding show steady improvement (+3 to 5%)

Ablation Experiments

  • Stage 1 only: +2-3% improvement
  • Adding Stage 2: Additional +2-3% improvement

These ablations indicate that each stage contributes a gain of similar size, which supports the two-stage design.

Section 06

Application Value and Scenarios of ETCHR

Application Value and Scenarios

Plug-and-Play Features

  • Compatible with any MLLM without retraining
  • Supports open-source/closed-source models without affecting original capabilities

Practical Applications

  • Document Analysis: Process tables/charts/multi-column layouts
  • Medical Imaging: Zoom in on key areas, enhance contrast
  • Industrial Quality Inspection: Highlight defect areas, add measurement markers
  • Educational Assistance: Generate visual problem-solving processes

Together, these properties position ETCHR as a general-purpose tool for enhancing visual reasoning.

Section 07

Future Research Directions and Conclusion

Future Directions and Conclusion

Future Research

  • Interactive editing: Support user feedback to guide editing
  • Video extension: Temporal dimension editing operations
  • Integration of editing and generation: Generate auxiliary diagrams
  • Multimodal editing: Support audio/3D models, etc.

Conclusion

ETCHR demonstrates a practical engineering path for "thinking with images" through its decoupled design and two-stage training. Its results suggest that, on complex tasks, optimizing specialized decoupled components can be more effective than a single end-to-end unified model. Future MLLMs are likely to manipulate visual information more flexibly to solve more complex practical problems.