RC-DPO: Mitigating Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Multimodal large reasoning models exhibit strong capabilities in complex visual-language tasks but still face severe hallucination issues. The RC-DPO method introduced in this article effectively mitigates hallucinations and improves the reliability of multimodal reasoning by optimizing the chain of thought (CoT) as a condition for answer generation.

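Tags: multimodal large models, hallucination, direct preference optimization, chain of thought, Monte Carlo Tree Search, visual-language tasks, reasoning models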
Published 2026-05-27 11:27 | Recent activity 2026-05-28 10:19 | Estimated read 7 min

Section 01

[Introduction] RC-DPO: A New Method to Mitigate Hallucination in Multimodal Large Reasoning Models

This article introduces RC-DPO (Reasoning-Conditioned Preference Optimization), a method published on arXiv that targets the hallucination problem of multimodal large reasoning models. The core idea is to treat the chain of thought (CoT) as a condition for answer generation rather than as part of the output being optimized, thereby improving reasoning reliability. Original paper: "Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization", source platform arXiv, authors as listed on arXiv, link: http://arxiv.org/abs/2605.27906v1, publication time: 2026-05-27T03:27:23Z.

Section 02

Research Background and Problem Definition

Multimodal large reasoning models tackle complex visual-language tasks by reasoning step by step, but they suffer from severe hallucination, that is, they generate conclusions inconsistent with the image content. Existing methods apply response-level direct preference optimization (DPO), which treats the chain of thought and the answer as a single unit during optimization. As a result, the chain of thought receives too little supervision, and the effect is close to optimizing the answer alone.
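
For reference, the standard DPO objective can be written for a response y = (c, a) made of a chain of thought c followed by an answer a. The notation is the usual DPO notation and is an assumption here, not taken from the paper: x is the image plus question, y_w and y_l are the preferred and rejected responses, β is a scaling coefficient, and π_ref is a frozen reference model.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right],
\qquad
\log \pi_\theta(y \mid x) = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c).
```

The preference margin is a single scalar for the whole sequence y, so nothing in this objective isolates how the answer depends on the chain of thought. That is the gap the article attributes to response-level DPO.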

Section 03

Limitations of Existing Methods

Key flaws of traditional response-level DPO methods: 1. The optimization objective focuses too heavily on the correctness of the final answer and ignores the soundness of the reasoning path; 2. The quality of the chain of thought is neither fully evaluated nor used as a training signal; 3. Even when the answer is correct, a flawed reasoning process can still introduce hallucinations, and these are difficult to detect.

Section 04

Detailed Explanation of the RC-DPO Method

Core Innovations: Explicitly model the chain of thought as a condition for answer generation, and compare preference differences of the same correct answer under different CoT conditions.

Method Principles: 1. Conditional modeling (CoT as a condition for generating answers); 2. Contrastive learning (comparing the generation probability of the same correct answer under different CoTs); 3. Reasoning chain alignment (encouraging the generation of reasonable reasoning chains that support the answer).

Preference Data Generation: Positive samples use Monte Carlo Tree Search (MCTS) to find CoTs that are visually grounded and logically consistent; negative samples are reasoning chains with logical flaws, constructed through attention-guided CoT token pruning.
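
The conditional-modeling and contrastive principles above can be illustrated with a minimal sketch. It assumes the loss keeps the usual DPO log-sigmoid form and swaps the compared quantity for the log-probability of the same correct answer conditioned on a good CoT versus a flawed CoT. The exact loss in the paper may differ; the function names, the `beta` default, and the per-token log-probability inputs are all hypothetical.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rc_dpo_loss(policy_good, ref_good, policy_bad, ref_bad, beta=0.1):
    """Assumed reasoning-conditioned preference loss for one example.

    Every argument is a list of per-token log-probabilities of the SAME
    correct answer:
      policy_good / ref_good: answer conditioned on the preferred CoT,
                              scored by the policy / frozen reference model
      policy_bad  / ref_bad:  answer conditioned on the flawed CoT
    The CoT tokens are only context here; they are not scored themselves.
    """
    reward_good = beta * (sequence_logprob(policy_good) - sequence_logprob(ref_good))
    reward_bad = beta * (sequence_logprob(policy_bad) - sequence_logprob(ref_bad))
    # Lower loss when the answer is more likely under the good CoT than the bad one.
    return -log_sigmoid(reward_good - reward_bad)


if __name__ == "__main__":
    # Toy numbers, invented purely for illustration.
    loss = rc_dpo_loss(
        policy_good=[-0.1, -0.2], ref_good=[-0.5, -0.5],
        policy_bad=[-1.0, -1.2], ref_bad=[-0.6, -0.6],
    )
    print(f"loss = {loss:.4f}")
```

When the policy and reference agree on both conditions the margin is zero and the loss equals log 2; it drops below that as the policy makes the answer more likely under the good CoT than under the flawed one. In practice the log-probabilities would come from a forward pass of the multimodal model over image, question, CoT, and answer, which this sketch leaves out.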

Section 05

Experimental Results and Effect Evaluation

The reported results show RC-DPO improving on traditional DPO in four respects: 1. A lower hallucination rate and better consistency between descriptions and image content; 2. Higher reasoning quality, with more logically rigorous chains of thought that are closely tied to the answer; 3. Good generalization across models; 4. Better performance on visual question answering and image understanding benchmarks.

Section 06

Technical Significance and Application Prospects

Theoretical Significance: The work reveals a structural flaw in existing preference optimization methods for complex reasoning tasks and offers a new approach to supervising a model's reasoning process, which gives future research a concrete direction.

Application Prospects: The method can improve the decision-making credibility of AI systems in high-reliability scenarios such as medical image analysis, visual understanding for autonomous driving, and industrial quality inspection.

Section 07

Future Research Directions and Conclusions

Future Directions: 1. Extend to more modalities such as audio and video; 2. Explore synergy with reinforcement learning; 3. Optimize computational efficiency (e.g., reduce MCTS overhead); 4. Enhance interpretability (analyze the impact of RC-DPO on model attention and reasoning patterns).

Conclusions: RC-DPO effectively mitigates the hallucination problem of multimodal large reasoning models through reasoning-conditioned preference optimization. Its core contribution is optimizing the chain of thought as a condition, which achieves fine-grained supervision of the reasoning process and opens a new path for improving the reliability of multimodal AI.