OmniVerifier-M1: A Multimodal Universal Verifier Based on Symbolic Meta-Verification and Decoupled Reinforcement Learning

This paper proposes OmniVerifier-M1, a multimodal verifier that uses symbolic outputs (e.g., bounding boxes) as the basis for meta-verification and adopts decoupled reinforcement learning objectives. It delivers robust verification, fine-grained error localization, and dynamic region-level self-correction.

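Tags: multimodal verification, meta-verification, reinforcement learning, symbolic output, visual verifier, error localization, agentic generation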
Published 2026-05-28 01:56 | Recent activity 2026-05-28 12:52 | Estimated read 6 min

Section 01

[Introduction] OmniVerifier-M1: Core Breakthroughs of a Multimodal Universal Verifier

This paper proposes the OmniVerifier-M1 multimodal verifier, which uses symbolic outputs (e.g., bounding boxes) as the basis for meta-verification and adopts decoupled reinforcement learning objectives. It achieves robust verification capabilities, fine-grained error localization, and dynamic region-level self-correction. This verifier supports general visual verification tasks and can also empower generation systems (e.g., M1-TTS) to improve output quality, providing a foundation for the reliable deployment of multimodal models.

Section 02

Research Background: Bottlenecks in Multimodal Verification and Meta-Verification Ideas

As multimodal large language models mature, verifying the reliability of generated content has become a key bottleneck for large-scale deployment. Traditional verification methods return only a binary judgment and offer neither fine-grained localization nor interpretability. Meta-verification examines the basis the verifier itself gives for its judgment, but how to use meta-verification feedback effectively to train better multimodal verifiers remains an open problem.

Section 03

Core Methods: Symbolic Meta-Verification and Decoupled Reinforcement Learning Design

Advantages of Symbolic Outputs: Experiments show that symbolic outputs (e.g., bounding boxes) work better than text explanations as the basis for meta-verification, because they support rule-based rewards, reduce reliance on auxiliary models, and provide precise spatial information.
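To make the "rule-based rewards" point concrete, here is a minimal sketch of how a bounding-box output could be scored against reference boxes with plain geometry and no auxiliary judge model. The function names `iou` and `box_reward`, the `(x_min, y_min, x_max, y_max)` box format, and the mean best-match scoring rule are all assumptions made for illustration; the summary does not state the reward OmniVerifier-M1 actually uses.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_reward(pred: Sequence[Box], ref: Sequence[Box]) -> float:
    """Hypothetical rule-based reward: mean best-match IoU over reference boxes."""
    if not ref:
        # Nothing to localize: reward only an empty prediction.
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    return sum(max(iou(r, p) for p in pred) for r in ref) / len(ref)
```

This scoring rule does not penalize extra predicted boxes; a stricter variant would add a precision term. A text explanation, by contrast, has no comparable closed-form score and would typically need another model to grade it.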

Decoupled RL Objectives: Optimizing the binary judgment and meta-verification objectives separately works better than optimizing them jointly. Decoupling avoids the interference that joint training suffers from: differences in output structure, conflicting learning dynamics, and gradient issues.
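One possible reading of the decoupling described above is sketched below: each objective gets its own reward signal and its own group-normalized advantage, instead of a single advantage computed from the summed reward. This is an illustrative assumption, not the paper's confirmed training recipe; the group-normalization step and the names `group_advantages`, `joint_advantages`, and `decoupled_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(judge: Sequence[float], meta: Sequence[float]) -> List[float]:
    """Joint baseline: one advantage from the summed reward."""
    return group_advantages([j + m for j, m in zip(judge, meta)])


def decoupled_advantages(
    judge: Sequence[float], meta: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Decoupled variant (assumed): one advantage per objective."""
    return group_advantages(judge), group_advantages(meta)


# Four sampled responses: binary judgment reward and box-quality reward.
judge_r = [1.0, 1.0, 0.0, 0.0]
meta_r = [0.2, 0.8, 0.2, 0.8]
adv_judge, adv_meta = decoupled_advantages(judge_r, meta_r)
adv_joint = joint_advantages(judge_r, meta_r)
```

In this toy group, the third and fourth responses differ only in box quality, and the decoupled meta advantage separates them cleanly, while the joint advantage mixes the box signal with the much larger judgment signal.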

Building on these two findings, OmniVerifier-M1 combines three elements: symbolic meta-verification (structured symbolic outputs), decoupled reinforcement learning (the verification judgment head and the meta-verification head are optimized separately), and general visual verification capabilities.

Section 04

Key Features and Application Scenarios: Fine-Grained Localization and Verification-Driven Generation

Fine-Grained Error Localization: Bounding boxes pinpoint the problem areas, which improves interpretability and supports dynamic correction.

M1-TTS System: A verifier-driven generation system that achieves dynamic region-level self-correction, iterative optimization loops, and controllable generation.
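The verifier-driven loop described above can be sketched as follows, under stated assumptions: a verifier returns a pass/fail verdict plus the flagged regions, and an editor repairs only those regions before the next check. The names `Verdict`, `refine`, `toy_verify`, and `toy_edit` are hypothetical, and the "image" is a toy list of integers; none of this is the actual M1-TTS interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max)


@dataclass
class Verdict:
    ok: bool          # binary judgment
    boxes: List[Box]  # regions the verifier flags as wrong


def refine(
    image: Any,
    prompt: Any,
    verify: Callable[[Any, Any], Verdict],
    edit_region: Callable[[Any, Any, Box], Any],
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Verify, then re-edit only the flagged regions, up to max_rounds times."""
    rounds = 0
    for _ in range(max_rounds):
        verdict = verify(image, prompt)
        if verdict.ok:
            break
        for box in verdict.boxes:
            image = edit_region(image, prompt, box)
        rounds += 1
    return image, rounds


# Toy stand-ins: the "image" is a list of ints and the "prompt" is the target list.
def toy_verify(image: List[int], prompt: List[int]) -> Verdict:
    bad = [(i, 0, i + 1, 1) for i, (got, want) in enumerate(zip(image, prompt)) if got != want]
    return Verdict(ok=not bad, boxes=bad)


def toy_edit(image: List[int], prompt: List[int], box: Box) -> List[int]:
    fixed = list(image)
    fixed[box[0]] = prompt[box[0]]  # repair only the flagged cell
    return fixed


final, rounds = refine([1, 9, 3, 9], [1, 2, 3, 4], toy_verify, toy_edit)
```

If the round budget runs out, `refine` returns the last edited image without a final check, so a caller that needs a guarantee should run the verifier once more on the result.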

Application Prospects: Can be used in scenarios such as image generation verification, visual question answering verification, document understanding verification, content moderation, and medical image analysis.

Section 05

Experimental Performance and Summary of Technical Contributions

Experimental Verification: The paper reports strong results on four fronts: accuracy on standard verification tasks, meta-verification quality (high scores in human evaluation), error localization precision (better than text-based methods), and generation quality (a significant improvement with M1-TTS).

Technical Contributions: 1. Symbolic meta-verification paradigm; 2. Decoupled RL objective design; 3. Universal visual verifier OmniVerifier-M1; 4. Demonstration of a verifier-driven generation system.

Section 06

Limitations and Future Research Directions

Limitations: 1. The work currently focuses on visual verification, and extending it to other modalities remains to be explored; 2. Symbolic outputs are limited to bounding boxes, so richer structured representations are needed; 3. Decoupled training adds complexity, and efficient joint training needs further research.

Future Directions: Expand to more modalities and tasks, explore richer symbolic representations, develop efficient end-to-end training methods, and integrate into more generation systems.