Zing Forum

V-tableR1: Process-Supervised Reinforcement Learning Enables Verifiable Multimodal Table Reasoning

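Tags: process-supervised reinforcement learning, multimodal reasoning, visual chain-of-thought, table reasoning, MLLM, Critic model, PGPO algorithm, verifiable reasoning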
Published 2026-04-23 00:44 | Recent activity 2026-04-23 10:50 | Estimated read 5 min

Section 01

V-tableR1: Process-Supervised Reinforcement Learning Enables Verifiable Multimodal Table Reasoning (Introduction)

V-tableR1 uses a process-supervised reinforcement learning framework to move multimodal large language models from black-box pattern matching toward verifiable logical reasoning. The framework pairs a dedicated Critic VLM, which provides step-by-step feedback, with the PGPO optimization algorithm. With only 4B parameters, the model outperforms open-source models 18 times its size and achieves state-of-the-art results among open-source models on complex table reasoning benchmarks.

Section 02

Background: The Black-Box Dilemma of Current Multimodal Visual Reasoning

Current multimodal large language models (MLLMs) tend to rely on superficial pattern matching rather than rigorous multi-step logical reasoning when handling visual tasks. The result is frequent visual hallucination and shortcut guessing, along with a reasoning process that is hard to interpret or verify.

Section 03

Methodology: Process Supervision + Dual-Model Collaboration + PGPO Optimization

  1. Process supervision: Unlike traditional methods that score only the final answer, V-tableR1 requires every reasoning step to be transparent and verifiable. Tables serve as the testbed because their structure makes each step checkable.
  2. Dual-model architecture: The Policy VLM generates an explicit visual chain-of-thought, while the Critic VLM provides fine-grained, step-by-step feedback on it.
  3. PGPO algorithm: Combines process rewards (rewarding correct steps), decoupled policy constraints (balancing exploration and stability), and length-aware dynamic sampling (adjusting the length of reasoning chains).
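
The interplay of the three components above can be illustrated with a toy sketch. This summary does not give PGPO's actual formulas, so everything below is an assumption made for illustration: the blend of step-level and outcome rewards, the `step_weight` value, the GRPO-style group normalization, and the simple length cutoff are hypothetical stand-ins, and `toy_critic` is a rule-based placeholder for the Critic VLM. The decoupled policy constraints are not modeled here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List

# Toy table standing in for the visual input (hypothetical data).
TABLE = {"Q1": 10, "Q2": 15}


@dataclass
class Rollout:
    steps: List[str]  # explicit reasoning steps, here "CELL=VALUE" reads
    answer: str       # final answer produced by the policy


def toy_critic(step: str) -> float:
    """Placeholder for the Critic VLM: 1.0 if the cited cell value matches the table."""
    key, _, value = step.partition("=")
    return 1.0 if str(TABLE.get(key.strip())) == value.strip() else 0.0


def process_reward(rollout: Rollout, critic: Callable[[str], float],
                   gold_answer: str, step_weight: float = 0.5) -> float:
    """Assumed reward shape: blend the mean per-step score with a 0/1 outcome reward."""
    scores = [critic(s) for s in rollout.steps]
    step_term = mean(scores) if scores else 0.0
    outcome_term = 1.0 if rollout.answer.strip() == gold_answer.strip() else 0.0
    return step_weight * step_term + (1.0 - step_weight) * outcome_term


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative normalization (GRPO-style; assumed, not confirmed for PGPO)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def length_filter(rollouts: List[Rollout], max_steps: int) -> List[Rollout]:
    """Crude stand-in for length-aware sampling: drop overlong reasoning chains."""
    return [r for r in rollouts if len(r.steps) <= max_steps]


rollouts = [
    Rollout(["Q1=10", "Q2=15"], "25"),  # correct steps, correct answer
    Rollout(["Q1=10", "Q2=51"], "25"),  # misread cell, answer right by luck
    Rollout(["Q1=10", "Q2=15"], "20"),  # correct steps, wrong answer
]
rewards = [process_reward(r, toy_critic, gold_answer="25") for r in rollouts]
advantages = group_advantages(rewards)
print(rewards)  # [1.0, 0.75, 0.5]
```

In this toy setup the second rollout reaches the right answer through a misread cell and earns less than the fully verified first rollout, which is the behavior an outcome-only reward cannot express.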

Section 04

Experimental Evidence: Major Breakthroughs for Small Models

  • Scale efficiency: The 4B-parameter model outperforms open-source models 18 times its size;
  • Hallucination suppression: Process supervision effectively penalizes visual hallucinations and shortcut guessing;
  • Outperforms SFT: Models optimized with reinforcement learning significantly outperform their supervised fine-tuning (SFT) counterparts;
  • Open-source SOTA: Achieves state-of-the-art results among open-source models on complex table reasoning tasks.

Section 05

Conclusion: Methodological and Technical Contributions

  • Verifiable reasoning framework: Presented as the first systematic implementation of process supervision in the visual domain, offering a reference for other visual reasoning tasks;
  • Table testbed: Demonstrates the value of structured visual information for reasoning verification;
  • New RL paradigm: The PGPO algorithm provides optimization ideas for multimodal reinforcement learning.

Section 06

Application Prospects: Value in Multi-Scenario Deployment

The approach applies to scenarios such as financial statement analysis, scientific research data processing, business intelligence decision support, and educational problem-solving guidance, where a transparent and verifiable reasoning process is valuable.

Section 07

Limitations and Future Directions

Limitations: The method currently targets only structured tables, and natural image scenarios remain highly challenging. Training the Critic is costly, and deploying two models adds complexity.

Future directions: Extend the approach to natural image domains, develop more efficient Critic training methods, and explore process supervision under a single-model architecture.