# RC-DPO: Mitigating Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

> Multimodal large reasoning models exhibit strong capabilities in complex visual-language tasks but still face severe hallucination issues. The RC-DPO method introduced in this article effectively mitigates hallucinations and improves the reliability of multimodal reasoning by optimizing the chain of thought (CoT) as a condition for answer generation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-27T03:27:23.000Z
- Last activity: 2026-05-28T02:19:10.331Z
- Popularity: 126.1
- Keywords: multimodal large models, hallucination, direct preference optimization, chain of thought, Monte Carlo tree search, vision-language tasks, reasoning models
- Page link: https://www.zingnex.cn/en/forum/thread/rc-dpo
- Canonical: https://www.zingnex.cn/forum/thread/rc-dpo

---

## [Introduction] RC-DPO: A New Method to Mitigate Hallucination in Multimodal Large Reasoning Models

This article introduces RC-DPO (Reasoning-Conditioned Preference Optimization), a method published on arXiv that targets hallucination in multimodal large reasoning models. The core idea is to treat the chain of thought (CoT) as a condition for answer generation rather than as part of the output being optimized, so that the reasoning itself receives direct supervision.

Original paper: *Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization*, authors as listed on arXiv, link: http://arxiv.org/abs/2605.27906v1, published 2026-05-27T03:27:23Z.

## Research Background and Problem Definition

Multimodal large reasoning models handle complex visual-language tasks by reasoning step by step before answering, but they still hallucinate severely, that is, they generate conclusions that are inconsistent with the image content. Existing methods apply response-level direct preference optimization (DPO), which treats the chain of thought and the answer as a single unit. As a result, the chain of thought receives too little supervision, and the effect is similar to optimizing the answer alone.
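
For reference, the standard response-level DPO objective can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the RC-DPO paper: $x$ is the image plus question, $c$ the chain of thought, $a$ the answer, $y = (c, a)$ the full response, $y_w$ and $y_l$ the preferred and dispreferred responses, $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ a scaling coefficient.

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

$$
\log \pi_\theta(y \mid x) = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c)
$$

Because the chain of thought and the answer are folded into one sequence log-probability, a single preference label covers both. The objective has no separate term that says whether a given chain of thought actually supports its answer.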

## Limitations of Existing Methods

Key flaws of traditional DPO methods:

1. The optimization objective focuses too much on the correctness of the final answer and ignores whether the reasoning path is sound.
2. The quality of the chain of thought is not fully evaluated or used.
3. Even when the answer is correct, a flawed reasoning process can still produce hallucinations, and these are difficult to detect.

## Detailed Explanation of the RC-DPO Method

**Core Innovations**: RC-DPO explicitly models the chain of thought as a condition for answer generation and compares how strongly the model prefers the same correct answer under different CoT conditions.

**Method Principles**:

1. Conditional modeling: the CoT is a condition for generating the answer.
2. Contrastive learning: the generation probability of the same correct answer is compared under different CoTs.
3. Reasoning chain alignment: the model is encouraged to generate reasonable reasoning chains that support the answer.

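
The contrastive principle above can be sketched as a DPO-style loss over one answer conditioned on two different chains of thought. This summary does not give the exact RC-DPO objective, so the logistic form below is an assumption borrowed from standard DPO, and the function name, argument names, and default `beta` are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reasoning_conditioned_dpo_loss(
    policy_logp_good_cot: float,
    policy_logp_bad_cot: float,
    ref_logp_good_cot: float,
    ref_logp_bad_cot: float,
    beta: float = 0.1,
) -> float:
    """Illustrative loss for one example (assumed form, not the paper's).

    Each argument is log p(answer | input, CoT) for the SAME correct answer,
    under either the policy or the frozen reference model, conditioned on a
    preferred ("good") or dispreferred ("bad") chain of thought.
    """
    good_margin = policy_logp_good_cot - ref_logp_good_cot
    bad_margin = policy_logp_bad_cot - ref_logp_bad_cot
    # The loss falls as the answer becomes relatively more likely under the
    # good CoT than under the bad CoT, measured against the reference model.
    return -log_sigmoid(beta * (good_margin - bad_margin))


if __name__ == "__main__":
    # Policy identical to the reference: both margins are zero.
    print(reasoning_conditioned_dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the answer more under the good CoT: lower loss.
    print(reasoning_conditioned_dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch only the answer tokens are scored, and the chain of thought appears solely on the conditioning side, which is one way to read "CoT as a condition rather than part of the output".
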
**Preference Data Generation**: Positive samples are CoTs found with Monte Carlo Tree Search (MCTS) that are visually grounded and logically consistent. Negative samples are reasoning chains with logical flaws, constructed through attention-guided pruning of CoT tokens.
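
The negative-sample step can be illustrated with a minimal sketch of attention-guided token pruning. The source does not say which tokens are pruned or how attention is measured, so this sketch assumes that each CoT token comes with a precomputed scalar attention score and that the highest-scoring tokens are removed. The function name, `prune_ratio`, and the scoring convention are all hypothetical.

```python
from typing import List, Sequence


def prune_cot_tokens(
    tokens: Sequence[str],
    attention_scores: Sequence[float],
    prune_ratio: float = 0.3,
) -> List[str]:
    """Drop the fraction of CoT tokens with the highest attention scores.

    Assumption: a higher score marks a token that matters more to the
    reasoning, so removing those tokens yields a chain with gaps.
    """
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must have equal length")
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be between 0 and 1")

    n_prune = int(len(tokens) * prune_ratio)
    # Rank token positions from highest to lowest attention score.
    ranked = sorted(
        range(len(tokens)), key=lambda i: attention_scores[i], reverse=True
    )
    pruned_positions = set(ranked[:n_prune])
    # Keep the surviving tokens in their original order.
    return [tok for i, tok in enumerate(tokens) if i not in pruned_positions]


if __name__ == "__main__":
    cot = ["the", "sign", "is", "red", "so", "stop"]
    scores = [0.05, 0.40, 0.05, 0.30, 0.05, 0.15]
    print(prune_cot_tokens(cot, scores, prune_ratio=0.34))
```

With real models the scores would have to be derived from attention maps, and the pruning would operate on token IDs rather than words. Both details are outside what this summary specifies.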

## Experimental Results and Effect Evaluation

According to the paper, RC-DPO improves on traditional DPO in four respects:

1. Lower hallucination rate, with descriptions more consistent with the image content.
2. Better reasoning quality, with chains of thought that are more logically rigorous and more tightly linked to the answer.
3. Good generalization across models.
4. Better performance on visual question answering and image understanding benchmarks.

## Technical Significance and Application Prospects

**Theoretical Significance**: The work exposes a structural flaw of existing preference optimization methods in complex reasoning tasks and offers a new way to supervise a model's reasoning process, which gives future research a direction to build on.

**Application Prospects**: The method could make AI systems more trustworthy in high-reliability scenarios such as medical image analysis, visual understanding for autonomous driving, and industrial quality inspection.

## Future Research Directions and Conclusions

**Future Directions**:

1. Extend the method to more modalities, such as audio and video.
2. Explore synergy with reinforcement learning.
3. Improve computational efficiency, for example by reducing MCTS overhead.
4. Enhance interpretability by analyzing how RC-DPO affects model attention and reasoning patterns.

**Conclusions**: RC-DPO mitigates hallucination in multimodal large reasoning models through reasoning-conditioned preference optimization. Its core contribution is to optimize the chain of thought as a condition, which gives fine-grained supervision of the reasoning process and opens a new path toward more reliable multimodal AI.
