Zing Forum


UniEditBench: A Unified Benchmark Platform for Image and Video Editing Based on Distilled Multimodal Large Models

This paper proposes the UniEditBench unified benchmark, which supports the evaluation of image and video reconstruction and instruction-driven editing. By distilling a 235B-parameter Multimodal Large Model (MLLM) into lightweight 4B/8B evaluators, it achieves low-cost and high-quality evaluation.

Tags: visual editing · benchmarking · multimodal large models · knowledge distillation · UniEditBench · image/video editing · evaluation metrics
Published 2026-04-17 17:21 · Recent activity 2026-04-20 10:26 · Estimated read 6 min

Section 01

[Introduction] UniEditBench: A Unified Benchmark and Low-Cost Evaluation Solution for Image and Video Editing

This paper proposes the UniEditBench unified benchmark platform, which supports the evaluation of image and video reconstruction and instruction-driven editing. Its core innovations are:

  1. Establishing a unified evaluation protocol to solve the fragmentation problem of existing evaluations;
  2. Converting a 235B-parameter multimodal large model (MLLM) into lightweight 4B/8B evaluators via knowledge distillation, achieving low-cost and high-quality evaluation aligned with human preferences.

Section 02

Background: Four Fragmentation Dilemmas in Visual Editing Evaluation

Visual editing technology is developing rapidly, but evaluation methods are lagging and fragmented:

  1. Method-specific Silos: Different editing paradigms (reconstruction, instruction-driven, etc.) have inconsistent evaluation standards, making cross-method comparisons difficult;
  2. Video Evaluation Gap: Lack of reliable video editing benchmarks that consider temporal consistency;
  3. Misalignment Between Metrics and Human Preferences: Traditional automatic metrics (PSNR, SSIM, etc.) are inconsistent with human judgments;
  4. High Cost of Large Model Evaluation: Directly using 235B-level MLLMs for evaluation is expensive, which most teams cannot afford.

Section 03

UniEditBench: Unified Evaluation Protocol and Task Classification System

UniEditBench designs a unified evaluation protocol to support multiple editing paradigms:

  • Input-Output Standardization: Unify source data, editing instructions/examples, and output formats;
  • Consistent Evaluation Dimensions: Cover general dimensions such as structural fidelity and text alignment;
  • Task Classification System:
    • Image editing (9 categories): Operations like Add/Remove/Replace/Change;
    • Video editing (8 categories): Add time-series related dimensions on top of image tasks, including challenging tasks like Count/Reorder.
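The input-output standardization above can be sketched as a single record type shared by all editing paradigms. This is a minimal illustration, not the paper's actual schema; all field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EditSample:
    """One evaluation record under a unified protocol (illustrative fields)."""
    sample_id: str
    modality: str                         # "image" or "video"
    task: str                             # e.g. "Add", "Remove", "Replace", "Reorder"
    source_path: str                      # path to the source image/video
    instruction: Optional[str] = None     # text instruction, if instruction-driven
    exemplar_path: Optional[str] = None   # reference example, if example-driven
    output_path: Optional[str] = None     # the edited result to be scored

# An instruction-driven image-editing sample in the unified format:
sample = EditSample(
    sample_id="img-0001",
    modality="image",
    task="Replace",
    source_path="data/src/0001.png",
    instruction="Replace the red car with a blue bicycle",
)
print(sample.modality, sample.task)
```

Because reconstruction, instruction-driven, and example-driven samples all flatten into the same record, one scoring pipeline can consume any paradigm without per-method adapters.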

Section 04

Distilled Evaluator: Key to Balancing High Quality and Low Cost

The lightweight evaluators are built via knowledge distillation:

  • Teacher Model: Qwen3-VL-235B-A22B (235 billion parameters, aligned with human preferences);
  • Student Models: 4B/8B parameter versions (friendly to resource-constrained environments, balancing cost and performance);
  • Multi-dimensional Scoring: Structural fidelity, text alignment, background consistency, naturalness (videos include temporal-spatial consistency), helping diagnose model weaknesses.
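The teacher-to-student transfer can be sketched with a standard temperature-scaled KL distillation loss. This is a generic sketch, assuming the evaluator emits logits over discrete score classes (e.g. a 1-5 scale); the paper's exact objective is not specified here:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over score classes, softened by temperature.

    A higher temperature spreads the teacher's probability mass across
    classes, exposing its relative preferences to the student.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

# Hypothetical logits for one sample rated on a 5-class score scale:
teacher = [0.1, 0.3, 1.2, 3.0, 0.5]
student = [0.0, 0.4, 1.0, 2.5, 0.8]
loss = distillation_kl(teacher, student)
print(round(loss, 4))
```

Minimizing this loss per scoring dimension (fidelity, alignment, background, naturalness) would push the 4B/8B students toward the 235B teacher's full score distribution rather than just its argmax.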

Section 05

Experimental Validation: Significant Distillation Effect and Cost Advantages

Experimental results show:

  1. High Consistency: The 4B/8B evaluators correlate closely with both the teacher model and human judgments;
  2. Substantial Cost Reduction: Deployment cost is tens to hundreds of times lower than that of 235B models, supporting large-scale evaluation;
  3. Improved Fairness: Different methods can be compared fairly under the unified protocol, revealing the advantages and disadvantages of each paradigm.
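The consistency claim in point 1 is typically verified with a rank-correlation check between evaluator scores and human judgments. A minimal Spearman correlation sketch, assuming untied scores (all numbers below are invented for illustration):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    argsort-of-argsort yields 0-based ranks; this simple form assumes
    no tied scores (ties would need fractional ranks).
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical per-sample scores from human raters vs. a distilled evaluator:
human   = [4.5, 3.0, 2.0, 5.0, 1.5, 3.5]
student = [4.2, 3.3, 2.4, 4.8, 1.9, 3.1]
print(round(spearman(human, student), 3))
```

A value near 1.0 means the lightweight evaluator ranks outputs almost exactly as humans do, which is what matters for leaderboard use even if absolute scores drift.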

Section 06

Practical Application Value of UniEditBench

Application scenarios of this platform include:

  • Research Tool: Standardized evaluation to avoid incomparable results;
  • Model Development: Multi-dimensional scores guide targeted improvements;
  • Product Selection: Enterprises select appropriate models according to scenario needs;
  • Competition Leaderboard: Provide fair evaluation standards to enhance the credibility of results.
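A leaderboard built on the multi-dimensional scores can be as simple as averaging per-dimension results and sorting. A sketch with invented model names and numbers, using the four image-editing dimensions named above:

```python
# Per-model scores on the rubric's dimensions (all values hypothetical):
scores = {
    "ModelA": {"fidelity": 4.2, "alignment": 4.5, "background": 4.0, "naturalness": 4.1},
    "ModelB": {"fidelity": 4.6, "alignment": 3.9, "background": 4.3, "naturalness": 4.4},
}

def rank(scores):
    """Order models by the unweighted mean over all scored dimensions."""
    means = {model: sum(dims.values()) / len(dims) for model, dims in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

for model, mean in rank(scores):
    print(f"{model}: {mean:.2f}")
```

Keeping the per-dimension scores alongside the aggregate is what enables the "model development" use case: a model can top the leaderboard overall while the breakdown still flags, say, weak background consistency.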

Section 07

Limitations and Future Directions

Current limitations and improvement directions:

  1. Expansion of Evaluation Dimensions: Add dimensions such as creativity, diversity, and cultural sensitivity;
  2. Dynamic Evaluation: Explore interactive evaluation to handle ambiguous cases;
  3. Domain Adaptation: Develop domain-specific versions for e-commerce, medicine, etc.;
  4. Real-time Optimization: Improve the inference speed of lightweight evaluators to support real-time feedback.