Section 01
[Introduction] RC-DPO: A New Method to Mitigate Hallucination in Multimodal Large Reasoning Models
This article introduces RC-DPO (Reasoning-Conditioned Preference Optimization), a method published on arXiv that targets the hallucination problem in multimodal large reasoning models. Its core idea is to optimize the chain of thought (CoT) as a condition for answer generation rather than as part of the output, with the aim of improving reasoning reliability.
Original paper information: Title: Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization. Authors: arXiv authors. Source platform: arXiv. Link: http://arxiv.org/abs/2605.27906v1. Publication time: 2026-05-27T03:27:23Z.
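To make the core idea above more concrete, here is a minimal sketch of the standard DPO preference loss. It is not the paper's objective, which this introduction does not state. The sketch assumes, purely for illustration, that each log-probability passed in is that of an answer conditioned on the input together with a chain of thought, so the CoT appears on the conditioning side rather than in the scored output. The function name `dpo_loss`, its parameter names, and the default `beta` are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Assumption (not from the source): each argument is
    log p(answer | input, chain_of_thought), i.e. the CoT is part of
    the condition, and only the answer tokens are scored.
    """
    # Log-ratio of policy to reference for each answer.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy prefers the chosen answer more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log 2 when the policy favors the preferred answer more strongly than the reference model does, and rises above it in the opposite case. How RC-DPO actually builds its preference pairs and conditions is described in the paper itself.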