OpenVLThinkerV2: A Universal Multimodal Reasoning Model for Multi-Domain Visual Tasks

This article introduces OpenVLThinkerV2, a universal multimodal reasoning model built on the Gaussian GRPO (G²RPO) reinforcement learning objective, which addresses cross-task gradient fairness and the perception-reasoning balance through distribution matching and task-level shaping mechanisms.

Tags: Multimodal Large Models, Reinforcement Learning, GRPO, Visual Reasoning, Open-Source Models, Distribution Matching
Published 2026-04-10 01:59 · Recent activity 2026-04-10 10:44 · Estimated read: 5 min

Section 01

Key Highlights of OpenVLThinkerV2

This article introduces OpenVLThinkerV2, a universal multimodal reasoning model. Its core innovations are a reinforcement learning objective based on Gaussian GRPO (G²RPO) and an accompanying task-level shaping mechanism, which together address cross-task gradient fairness and the perception-reasoning balance. Across 18 benchmarks, the model surpasses leading open-source and closed-source models.

Section 02

Dual Dilemmas of Multimodal Reinforcement Learning

Current multimodal large-model training relies on GRPO, but applying it to open-source generalist models raises two major challenges: 1. Extreme variance in reward topology: reward distributions differ greatly across visual tasks (such as OCR and chart reasoning), causing gradient imbalance and biased models; 2. Perception-reasoning seesaw: fine-grained perception requires attending to local details, while complex reasoning requires long chains of thought, and traditional methods struggle to balance the two.

Section 03

Gaussian GRPO (G²RPO): Nonlinear Optimization for Distribution Matching

G²RPO addresses the shortcomings of traditional GRPO's linear advantage scaling by forcing the advantage distribution of any task to converge to the standard normal distribution N(0, 1). Its theoretical properties include cross-task gradient fairness (each task contributes a balanced share of the gradient), heavy-tailed robustness (the influence of outlier rewards is suppressed), and a symmetric update mechanism (learning from positive and negative samples is balanced).
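The distribution-matching idea can be sketched as follows. This is a minimal illustration, assuming a rank-based inverse-normal transform maps each group's raw rewards onto N(0, 1) quantiles; the paper's exact transform may differ.

```python
from statistics import NormalDist

def gaussian_advantages(rewards):
    """Map a group's raw rewards to standard-normal advantages via a
    rank-based inverse-normal transform (an illustrative reading of
    G²RPO's distribution matching, not the paper's exact formula)."""
    n = len(rewards)
    norm = NormalDist()
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Group tied rewards and give them their average 1-based rank.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Blom-style plotting position keeps quantiles strictly inside (0, 1),
    # then the N(0, 1) quantile function produces the advantage.
    return [norm.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]
```

Whatever a task's raw reward scale, the resulting advantages are symmetric around zero with comparable spread, which is what gives each task a balanced gradient contribution.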

Section 04

Task-Level Shaping Mechanism: Dynamic Balance Between Perception and Reasoning

On top of G²RPO, two shaping mechanisms are designed: 1. Response-length shaping: output length is adjusted dynamically according to task complexity (long chains of thought are encouraged for complex reasoning, concise answers for visual grounding); 2. Entropy shaping: exploration is controlled through entropy constraints, preventing both entropy collapse and entropy explosion so that learning remains effective throughout training.
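One simple way to realize both mechanisms is as additive reward terms. The sketch below is a hedged illustration: the per-task targets, coefficients, and absolute-deviation penalty forms are assumptions for exposition, not the paper's exact formulation.

```python
def shaped_reward(base_reward, resp_len, entropy,
                  target_len, target_entropy,
                  len_coef=0.05, ent_coef=0.05):
    """Illustrative task-level shaping: combine the task reward with
    length and entropy penalties (coefficients and targets hypothetical)."""
    # Length shaping: penalize deviation from a per-task target length
    # (large target for complex reasoning, small for visual grounding).
    len_penalty = len_coef * abs(resp_len - target_len) / max(target_len, 1)
    # Entropy shaping: penalize drift from a target policy entropy to
    # avoid both entropy collapse (no exploration) and explosion.
    ent_penalty = ent_coef * abs(entropy - target_entropy)
    return base_reward - len_penalty - ent_penalty
```

A grounding task would set a small `target_len`, so verbose answers are penalized, while a math-reasoning task would set a large one, rewarding longer deliberation.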

Section 05

OpenVLThinkerV2 Architecture and Training Process

OpenVLThinkerV2 inherits the mainstream MLLM architecture, and its training proceeds in three stages: 1. Standard supervised fine-tuning to build basic capabilities; 2. A G²RPO reinforcement learning stage that applies both response-length and entropy shaping; 3. Targeted fine-tuning for specific task families.
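The three stages above can be expressed as a simple schedule. The stage names, fields, and epoch counts here are illustrative placeholders, not published hyperparameters.

```python
# Hypothetical three-stage schedule mirroring the described pipeline.
TRAINING_STAGES = [
    {"name": "sft", "objective": "cross_entropy", "epochs": 2},
    {"name": "g2rpo_rl", "objective": "g2rpo",
     "shaping": ["response_length", "entropy"], "epochs": 1},
    {"name": "task_family_finetune", "objective": "cross_entropy",
     "task_families": ["ocr", "chart", "grounding"], "epochs": 1},
]

def next_stage(current):
    """Return the stage dict that follows `current`, or None at the end."""
    names = [s["name"] for s in TRAINING_STAGES]
    i = names.index(current)
    return TRAINING_STAGES[i + 1] if i + 1 < len(names) else None
```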

Section 06

Experimental Evaluation: Leading in 18 Benchmark Tests

Across 18 multi-domain benchmarks (document understanding DocVQA, chart reasoning ChartQA, the visual grounding RefCOCO series, general VQA VQAv2, OCR TextVQA, and others), OpenVLThinkerV2 performs strongly: its average score significantly exceeds that of open-source models of the same scale, and on some tasks it outperforms leading closed-source commercial models.

Section 07

Technical Insights and Future Outlook

The success of OpenVLThinkerV2 suggests two lessons: the G²RPO distribution-matching paradigm may offer new ideas for RL training of text-only large models, and task-level shaping shows that careful policy design can coordinate a balance among multiple capabilities. Future directions include maintaining training stability and achieving finer-grained capability control as task types and model scale expand.