MCPO: Multi-Domain Contrastive Policy Optimization — Enabling Knowledge Sharing and Interference Elimination for Large Reasoning Models in Cross-Domain Learning

This article introduces the MCPO (Multi-Domain Contrastive Policy Optimization) method, which transforms cross-domain interactions from harmful competition to beneficial transfer via a contrastive learning mechanism, simultaneously enhancing the reasoning capabilities of large reasoning models across multiple domains such as mathematics, code, and logical reasoning.

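MCPO, multi-domain learning, contrastive learning, reinforcement learning, GRPO, large reasoning models, knowledge sharing, policy optimization, cross-domain transfer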
Published 2026-05-25 13:42 · Recent activity 2026-05-26 14:19 · Estimated read 8 min

Section 01

Introduction: MCPO — Multi-Domain Contrastive Policy Optimization Empowers Large Models with Cross-Domain Knowledge Sharing and Interference Elimination

This article introduces MCPO (Multi-Domain Contrastive Policy Optimization), a method that uses a contrastive learning mechanism to turn cross-domain interactions from harmful competition into beneficial transfer, addressing domain interference in multi-domain learning for large reasoning models. It improves reasoning in mathematics, code, and logical reasoning at the same time, and in some scenarios even outperforms single-domain training. The work is credited to the Maricalce team; the paper was published on arXiv on May 25, 2026, and the code has been open-sourced.

Section 02

Background: The Dilemma of Multi-Domain Learning for Large Reasoning Models

In recent years, post-training techniques such as the GRPO reinforcement learning method have improved the reasoning capabilities of large reasoning models. In multi-domain scenarios, however, a core problem remains: models do not improve consistently across all domains at the same time. The root cause is domain interference during policy optimization. Differences in data and reasoning patterns across domains lead to gradient conflicts and knowledge forgetting. Existing methods focus only on mitigating interference and overlook that knowledge sharing is the key to turning cross-domain interactions into beneficial transfer.
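
To make "gradient conflict" concrete, here is a minimal sketch of one common diagnostic: checking whether the gradients from two domains point in opposing directions (negative cosine similarity). The article does not say how MCPO measures interference, so the function names and the toy gradient values below are illustrative assumptions, not part of the method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gradients_conflict(grad_a, grad_b):
    """Return True when two domain gradients point in opposing directions."""
    return cosine(grad_a, grad_b) < 0.0


# Toy flattened gradients (made-up numbers, not taken from the paper).
grad_math = [1.0, 0.5, -0.2]
grad_code = [-0.8, -0.4, 0.3]
grad_logic = [0.9, 0.6, 0.0]

print(gradients_conflict(grad_math, grad_code))   # opposing directions
print(gradients_conflict(grad_math, grad_logic))  # roughly aligned directions
```

When two domain gradients conflict in this sense, a single shared update step that helps one domain tends to hurt the other, which is the "competition" the article describes.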

Section 03

Core Idea of MCPO: Contrastive Learning-Driven Knowledge Harmony

The core idea of MCPO is to reorganize the multi-domain learning process through a contrastive learning mechanism, treating domain differences not as noise but as clues for building a harmonious representation space. The key insight is that reasoning trajectories from different domains are structurally related. By modeling the transferable general patterns across domains together with the contrast signals from positive and negative samples within a domain, MCPO pursues two goals:

  1. Cross-domain knowledge sharing: spreading transferable reasoning patterns;
  2. Intra-domain knowledge consolidation: strengthening the consistency of correct reasoning.

Section 04

Method Details: Threefold Mechanism of Contrastive Policy Optimization

1. Positive Sample Identification: Cross-Domain Transferable Trajectories

For each trajectory, search other domains for trajectories with a similar reasoning structure and use them as positive samples (for example, inductive reasoning in mathematics and step-by-step debugging in code). Representation learning is used to capture these deep structural similarities.
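
The search for cross-domain positives can be sketched as a similarity lookup over trajectory embeddings. This is a hedged illustration only: the dictionary fields (`id`, `domain`, `embedding`, `correct`), the cosine-similarity criterion, the 0.8 threshold, and the choice to accept only correct trajectories are all assumptions made for the example; the article does not specify how MCPO scores structural similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_cross_domain_positives(anchor, candidates, threshold=0.8):
    """Select correct trajectories from OTHER domains that resemble the anchor.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    positives = []
    for cand in candidates:
        if cand["domain"] == anchor["domain"]:
            continue  # only cross-domain candidates are considered here
        if not cand["correct"]:
            continue  # assumption: incorrect trajectories are never positives
        if cosine(anchor["embedding"], cand["embedding"]) >= threshold:
            positives.append(cand)
    return positives


# Toy 2-D embeddings (made-up values standing in for learned representations).
anchor = {"id": "math-0", "domain": "math", "embedding": [1.0, 0.0], "correct": True}
candidates = [
    {"id": "code-1", "domain": "code", "embedding": [0.9, 0.1], "correct": True},
    {"id": "code-2", "domain": "code", "embedding": [1.0, 0.0], "correct": False},
    {"id": "logic-1", "domain": "logic", "embedding": [0.0, 1.0], "correct": True},
    {"id": "math-1", "domain": "math", "embedding": [1.0, 0.0], "correct": True},
]

selected = find_cross_domain_positives(anchor, candidates)
print([c["id"] for c in selected])
```

In this toy setup only the correct code trajectory that lies close to the math anchor is kept; the incorrect, dissimilar, and same-domain candidates are filtered out.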

2. Negative Sample Construction: Contrast Signals from Incorrect Reasoning

Incorrect trajectories, from the current domain or from other domains, serve as negative samples. The objective pulls positive samples closer and pushes negative samples away, which gives the model clear optimization boundaries and helps it distinguish domain-specific errors from general reasoning flaws.
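
The pull/push behavior described above is what an InfoNCE-style contrastive loss provides. The sketch below uses that standard formulation as a stand-in; the article does not give MCPO's exact loss, so the function name, the cosine similarity, and the temperature value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: low when positives are near the anchor and negatives are far.

    `temperature` is a hypothetical hyperparameter chosen for illustration.
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))


anchor = [1.0, 0.0]

# Well-separated case: the positive matches the anchor, the negative opposes it.
good = contrastive_loss(anchor, positives=[[1.0, 0.0]], negatives=[[-1.0, 0.0]])

# Badly separated case: the roles are swapped, so the loss is large.
bad = contrastive_loss(anchor, positives=[[-1.0, 0.0]], negatives=[[1.0, 0.0]])

print(good < bad)
```

Minimizing a loss of this shape raises similarity to positives relative to negatives, which matches the "closer" and "away" wording in the text. A real training setup would compute it on learned trajectory representations and combine it with the policy objective.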

3. Intra-Domain Alignment: Consolidate the Representation Space

Correct trajectories from the same domain are encouraged to stay close in the representation space, which prevents knowledge fragmentation and keeps each domain's identity recognizable.
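
One simple way to express intra-domain alignment is to penalize the distance between each correct trajectory and the centroid of its own domain. This is an assumed formulation for illustration; the article does not state which alignment term MCPO uses, and the function names and toy embeddings are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def intra_domain_alignment_loss(embeddings_by_domain):
    """Mean squared distance of each correct trajectory to its domain centroid."""
    total, count = 0.0, 0
    for vectors in embeddings_by_domain.values():
        center = centroid(vectors)
        for vec in vectors:
            total += sum((a - b) ** 2 for a, b in zip(vec, center))
            count += 1
    return total / count


# Toy embeddings of correct trajectories, grouped by domain (made-up values).
tight = {"math": [[1.0, 0.0], [1.0, 0.0]], "code": [[0.0, 1.0], [0.0, 1.0]]}
spread = {"math": [[0.0, 0.0], [2.0, 0.0]]}

print(intra_domain_alignment_loss(tight))   # domains already compact
print(intra_domain_alignment_loss(spread))  # trajectories scattered around the centroid
```

A lower value means each domain's correct trajectories form a tighter cluster, which is the consolidation effect the text describes.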

Section 05

Experimental Validation: Cross-Domain Performance Improvement and Outperforming Single-Domain Training

MCPO's performance in benchmark tests for mathematics, code, and logical reasoning:

  1. Consistent cross-domain improvement: Compared with GRPO, every domain improves steadily, without the 'robbing Peter to pay Paul' effect in which one domain gains at another's expense;
  2. Outperforming single-domain training: Multi-domain joint training exceeds specialized single-domain training in some scenarios;
  3. Representation space visualization: The learned space shows a 'harmonious but distinct' structure, in which domain knowledge is differentiated yet partly overlapping, supporting the method's design.

Section 06

Technical Implementation and Open-Source Contribution

The MCPO code has been open-sourced (GitHub: https://github.com/Maricalce/MCPO). The repository includes the core training framework, multi-domain data preprocessing, the contrastive loss calculation module, and experimental scripts. It provides a foundation for future research: the approach can be extended to more domains (such as scientific and common-sense reasoning), combined with other post-training techniques (PPO/DPO), and applied to larger model architectures.

Section 07

Profound Implications for AI Research

  1. Paradigm shift: Moving from 'eliminating interference' to 'promoting sharing' turns negative interactions into positive collaboration, an idea that may also apply to multimodal, transfer, and continual learning;
  2. Value of contrastive learning: Well-designed positive and negative samples help a model learn more robust and transferable representations, and the approach could extend to cognitive tasks such as planning and decision-making;
  3. Direction of large model training: Multi-domain capability is a key requirement, and MCPO offers one technical path toward building general AI assistants.