
MCPO: Multi-Domain Contrastive Policy Optimization — Enabling Knowledge Sharing and Interference Elimination for Large Reasoning Models in Cross-Domain Learning

This article introduces the MCPO (Multi-Domain Contrastive Policy Optimization) method, which transforms cross-domain interactions from harmful competition to beneficial transfer via a contrastive learning mechanism, simultaneously enhancing the reasoning capabilities of large reasoning models across multiple domains such as mathematics, code, and logical reasoning.

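Tags: MCPO, multi-domain learning, contrastive learning, reinforcement learning, GRPO, large reasoning models, knowledge sharing, policy optimization, cross-domain transfer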

Published 2026-05-25 13:42 | Recent activity 2026-05-26 14:19 | Estimated read 8 min


Section 01

Introduction: MCPO — Multi-Domain Contrastive Policy Optimization Empowers Large Models with Cross-Domain Knowledge Sharing and Interference Elimination

This article introduces MCPO (Multi-Domain Contrastive Policy Optimization), a method that uses a contrastive learning mechanism to turn cross-domain interactions from harmful competition into beneficial transfer, addressing domain interference in multi-domain training of large reasoning models. It improves reasoning in mathematics, code, and logical reasoning at the same time, and in some scenarios multi-domain training even outperforms single-domain training. The work is attributed to the Maricalce team; the paper was published on arXiv on May 25, 2026, and the code has been open-sourced.

Section 02

Background: The Dilemma of Multi-Domain Learning for Large Reasoning Models

In recent years, post-training techniques such as the GRPO reinforcement learning method have improved the reasoning capabilities of large reasoning models. In multi-domain scenarios, however, a core problem remains: models do not improve consistently across all domains at once. The root cause is domain interference in policy optimization: differences in data and reasoning patterns across domains lead to gradient conflicts and knowledge forgetting. Existing methods focus only on mitigating interference and overlook knowledge sharing, which is the key to turning cross-domain interactions into beneficial transfer.
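To make "gradient conflict" concrete, here is a minimal sketch of a common diagnostic: the cosine similarity between the gradients that two domains produce for the same parameters. A negative value means an update that helps one domain pushes against the other. The function names and the toy gradient vectors are hypothetical and are not taken from the MCPO paper or its code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally sized gradient vectors."""
    if len(u) != len(v):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero gradient neither helps nor conflicts
    return dot / (norm_u * norm_v)


def gradients_conflict(grad_a, grad_b):
    """True when the two domain gradients point in opposing directions."""
    return cosine_similarity(grad_a, grad_b) < 0.0


# Toy per-domain gradients for the same three parameters (made-up numbers).
math_grad = [1.0, 0.5, -0.2]
code_grad = [-0.8, -0.4, 0.3]

print("conflict between math and code:", gradients_conflict(math_grad, code_grad))
```

In a real training run the vectors would be flattened parameter gradients computed on a batch from each domain; the check itself stays the same.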

Section 03

Core Idea of MCPO: Contrastive Learning-Driven Knowledge Harmony

The core idea of MCPO is to reorganize multi-domain learning around a contrastive learning mechanism, treating domain differences not as noise but as cues for building a harmonious representation space. The key insight is that reasoning trajectories from different domains are structurally related: by modeling transferable general patterns together with the contrast between positive and negative samples within a domain, MCPO pursues two goals:

1. Cross-domain knowledge sharing: spreading transferable reasoning patterns across domains.
2. Intra-domain knowledge consolidation: strengthening the consistency of correct reasoning.

Section 04

Method Details: Threefold Mechanism of Contrastive Policy Optimization

1. Positive Sample Identification: Cross-Domain Transferable Trajectories

Trajectories from other domains that share a similar reasoning structure are retrieved as positive samples (for example, inductive reasoning in mathematics and step-by-step debugging in code), with representation learning used to capture the deeper structural similarity.

2. Negative Sample Construction: Contrast Signals from Incorrect Reasoning

Incorrect trajectories, from the current domain or from other domains, serve as negative samples. The objective pulls positive samples closer and pushes negative samples away, which gives optimization a clear boundary and helps separate domain-specific errors from general reasoning flaws.

3. Intra-Domain Alignment: Consolidate the Representation Space

Correct trajectories from the same domain are encouraged to stay close in the representation space, which prevents knowledge fragmentation and keeps each domain identifiable.
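The three mechanisms above can be sketched as a sample-selection rule followed by an InfoNCE-style contrastive loss. This is only an illustration under stated assumptions: the article does not give MCPO's actual loss, so the selection rule, the similarity threshold, the temperature, and all names below (split_samples, contrastive_loss, the dictionary fields) are hypothetical, and the two-dimensional embeddings are toy values.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two trajectory embeddings."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def split_samples(anchor, candidates, sim_threshold=0.8):
    """Partition candidates into positives and negatives for one anchor.

    Assumed rule (not from the paper): incorrect trajectories are negatives;
    correct same-domain trajectories are positives (intra-domain alignment);
    correct cross-domain trajectories are positives only when their embedding
    is similar enough to the anchor (transferable structure).
    """
    positives, negatives = [], []
    for cand in candidates:
        if not cand["correct"]:
            negatives.append(cand["embedding"])
        elif cand["domain"] == anchor["domain"]:
            positives.append(cand["embedding"])
        elif cosine(anchor["embedding"], cand["embedding"]) >= sim_threshold:
            positives.append(cand["embedding"])
    return positives, negatives


def contrastive_loss(anchor_emb, positives, negatives, temperature=0.1):
    """Mean negative log-probability of each positive among all samples."""
    if not positives:
        raise ValueError("need at least one positive sample")
    pos_logits = [cosine(anchor_emb, p) / temperature for p in positives]
    neg_logits = [cosine(anchor_emb, n) / temperature for n in negatives]
    all_logits = pos_logits + neg_logits
    peak = max(all_logits)  # subtract the max for numerical stability
    log_denom = peak + math.log(sum(math.exp(x - peak) for x in all_logits))
    return sum(log_denom - x for x in pos_logits) / len(pos_logits)


anchor = {"domain": "math", "correct": True, "embedding": [1.0, 0.0]}
candidates = [
    {"domain": "math", "correct": True, "embedding": [0.9, 0.1]},
    {"domain": "code", "correct": True, "embedding": [0.95, 0.2]},
    {"domain": "code", "correct": True, "embedding": [0.0, 1.0]},
    {"domain": "logic", "correct": False, "embedding": [-1.0, 0.1]},
]
positives, negatives = split_samples(anchor, candidates)
loss = contrastive_loss(anchor["embedding"], positives, negatives)
print(len(positives), "positives,", len(negatives), "negatives, loss:", loss)
```

The loss falls as positives move toward the anchor and negatives move away from it. In an actual system this term would presumably be combined with the policy-optimization objective; how MCPO weights the two is not stated in this article.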

Section 05

Experimental Validation: Cross-Domain Performance Improvement and Outperforming Single-Domain Training

On benchmarks for mathematics, code, and logical reasoning, the article reports the following:

Cross-domain consistency improvement: Compared to GRPO, all domains improve steadily, with no 'robbing Peter to pay Paul' trade-off between domains.
Outperforming single-domain training: Multi-domain joint training exceeds specialized single-domain training in some scenarios.
Representation space visualization: The learned space shows a 'harmonious but distinct' structure in which domain knowledge is both differentiated and overlapping, which supports the method's design.

Section 06

Technical Implementation and Open-Source Contribution

The MCPO code has been open-sourced on GitHub (https://github.com/Maricalce/MCPO). The repository includes the core training framework, multi-domain data preprocessing, a contrastive loss calculation module, and experiment scripts, among other components. It offers a foundation for future research: extending to more domains (such as scientific or commonsense reasoning), combining with other post-training techniques (such as PPO or DPO), and applying the method to larger model architectures.

Section 07

Profound Implications for AI Research

Paradigm shift: Moving from 'eliminating interference' to 'promoting sharing' turns negative interactions into positive collaboration, an idea that may also apply to multimodal learning, transfer learning, and continual learning.
Value of contrastive learning: Well-designed positive and negative samples make it possible to learn more robust and transferable representations, and the approach could extend to cognitive tasks such as planning and decision-making.
Direction of large model training: Multi-domain capability is a key requirement, and MCPO offers one technical path toward building general AI assistants.
