Zing Forum

CiPO: Counterfactual Unlearning for Large Reasoning Models via Iterative Preference Optimization

This article proposes the CiPO framework, which performs iterative preference optimization by generating counterfactual reasoning trajectories. It completely removes target knowledge while preserving the model's reasoning ability, solving the challenge of machine unlearning for large reasoning models.

Tags: machine unlearning, large reasoning models, counterfactual reasoning, preference optimization, CiPO, privacy protection, CoT reasoning
Published 2026-04-17 16:56 · Recent activity 2026-04-20 10:20 · Estimated read 7 min

Section 01

Introduction: The CiPO Framework Solves the Unlearning Challenge for Large Reasoning Models

This article proposes the CiPO (Counterfactual Unlearning through Iterative Preference Optimization) framework, which performs iterative preference optimization by generating counterfactual reasoning trajectories. It completely removes target knowledge while preserving the model's reasoning ability, solving the dilemma of machine unlearning for Large Reasoning Models (LRMs).


Section 02

Background of Machine Unlearning and Challenges Faced by LRMs

The Rise of Machine Unlearning

In recent years, machine unlearning has become a hot topic in AI. Its goal is to selectively remove unwanted information (private data, copyrighted material, outdated knowledge, etc.) from a trained model without retraining it from scratch.

Unique Challenges of Unlearning in LRMs

LRMs emphasize Chain-of-Thought (CoT) reasoning, but existing methods face a dilemma:

  1. Superficial Unlearning: Only focuses on final outputs, ignoring CoT—sensitive information still remains in reasoning traces;
  2. Over-Unlearning: Large-scale parameter updates impair general reasoning ability.

Balancing thorough unlearning and preserving reasoning ability is the core challenge.


Section 03

Core of the CiPO Framework: Counterfactual Reasoning and Iterative Preference Optimization

Core Concept: Counterfactual Reasoning Trajectories

For a piece of target knowledge, guide the model to generate logically valid reasoning trajectories that reach different conclusions without invoking that knowledge (e.g., when forgetting "Paris is the capital of France", generate a trajectory that reasons toward an uncertain conclusion instead of stating the fact).

Steps of Iterative Preference Optimization

  1. Generate counterfactual reasoning;
  2. Construct preference pairs (counterfactuals as preferred samples, reasoning containing target knowledge as non-preferred samples);
  3. Use DPO to adjust the model to favor counterfactual reasoning;
  4. Iteratively update preference data to ensure thorough unlearning.
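The steps above can be sketched in code. The snippet below shows pair construction (steps 1–2) and the standard per-pair DPO loss used in step 3; the function names, the β value, and the caller-supplied generators are illustrative stand-ins, not details taken from the paper:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss. pi_* / ref_* are sequence log-probs under the
    policy and a frozen reference model. The loss is small when the policy
    prefers the counterfactual (chosen) trace over the leaky (rejected)
    trace by a larger margin than the reference model does."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def build_pairs(forget_queries, generate_counterfactual, sample_leaky_trace):
    """Steps 1-2 of one CiPO round: the counterfactual trace is the
    preferred ('chosen') sample; a current-model trace that still contains
    the target knowledge is the non-preferred ('rejected') sample."""
    return [
        {
            "prompt": q,
            "chosen": generate_counterfactual(q),
            "rejected": sample_leaky_trace(q),
        }
        for q in forget_queries
    ]
```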

Section 04

Technical Details of CiPO: Counterfactual Generation and Dynamic Preference Update

Counterfactual Reasoning Generation Strategies

  • Knowledge Boundary Prompting: Inform the model that certain information is outside its knowledge scope;
  • Alternative Path Exploration: Encourage solution paths that do not rely on target knowledge;
  • Logical Consistency Constraint: Ensure reasoning is self-consistent.
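The first two strategies can be pictured as prompt templates. The wording below is hypothetical (the paper's exact prompts are not given); it only illustrates how each strategy steers generation away from the target fact:

```python
# Hypothetical prompt templates for two of the strategies; illustrative
# stand-ins, not the paper's actual prompts.
TEMPLATES = {
    "knowledge_boundary": (
        "Treat the following as outside your knowledge. Reason step by step "
        "and conclude that the answer cannot be determined: {query}"
    ),
    "alternative_path": (
        "Solve the following step by step without relying on the fact "
        "'{target_fact}': {query}"
    ),
}

def build_prompt(strategy: str, query: str, target_fact: str = "") -> str:
    """Fill the chosen strategy's template; generated traces would then pass
    through a logical-consistency check before use as preferred samples."""
    return TEMPLATES[strategy].format(query=query, target_fact=target_fact)
```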

Dynamic Preference Data Update

Periodically sample outputs from the current model and refresh the non-preferred samples, which prevents premature convergence and ensures thorough unlearning.
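One dynamic-update pass might look like the following sketch, where `sample_model` and `leaks_target` are caller-supplied stand-ins (the paper's actual sampling and leakage-detection procedures are not specified here):

```python
import random

def refresh_rejected(sample_model, pairs, leaks_target, k=4):
    """Resample rejected traces from the *current* model so the preference
    data tracks what it still emits. Pairs whose freshly sampled traces no
    longer leak the target are set aside as unlearned."""
    refreshed, unlearned = [], []
    for pair in pairs:
        candidates = [sample_model(pair["prompt"]) for _ in range(k)]
        leaky = [c for c in candidates if leaks_target(c)]
        if leaky:
            # keep training against what the model actually still says
            refreshed.append(dict(pair, rejected=random.choice(leaky)))
        else:
            unlearned.append(pair)
    return refreshed, unlearned
```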


Section 05

Experimental Validation: Effectiveness and Advantages of CiPO

Thorough Unlearning Validation

CiPO completely removes target knowledge (neither final answers nor CoT reasoning contain target information), meeting privacy compliance requirements.
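This kind of completeness check can be operationalized as a simple leakage probe over both the final answer and the CoT. The snippet below is a minimal illustration, not the paper's evaluation protocol; a real evaluation would also use paraphrase probes rather than plain substring matching:

```python
def leaks(trace, target_terms):
    """True if a model output (final answer plus its CoT) still mentions
    any surface form of the forgotten fact."""
    text = trace.lower()
    return any(term.lower() in text for term in target_terms)

def forget_rate(traces, target_terms):
    # fraction of sampled traces with no residual target knowledge
    clean = sum(not leaks(t, target_terms) for t in traces)
    return clean / len(traces)
```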

Preservation of Reasoning Ability

On standard reasoning benchmarks, the performance gap between CiPO-processed models and original models is significantly smaller than that of other methods.

Baseline Comparison

  • Gradient Ascent Method: Thorough unlearning but impairs reasoning;
  • Knowledge Distillation Method: Preserves reasoning but incomplete unlearning;
  • CiPO: Achieves the best balance between the two.

Section 06

Application Scenarios and Social Value of CiPO

  1. Privacy Compliance: Honor users' "right to be forgotten" requests without full retraining;
  2. Copyright Protection: Remove specific copyrighted content;
  3. Fact Update: Replace outdated knowledge;
  4. Harmful Content Filtering: Remove inappropriate content.

Section 07

Technical Limitations and Future Directions of CiPO

Limitations

  • High computational cost (multiple training rounds);
  • The quality of counterfactual reasoning for complex knowledge needs improvement;
  • Stability issues in multi-knowledge unlearning;
  • Insufficient interpretability of the unlearning mechanism.

Future Directions

Explore efficient optimization strategies, improve counterfactual quality, solve multi-knowledge unlearning issues, and enhance interpretability.


Section 08

Conclusion: The Significance of CiPO for AI Governance

CiPO solves the dilemma of LRM unlearning through counterfactual reasoning and iterative preference optimization, providing a new path for the controllability, safety, and compliance of AI systems. It is an important advancement in the field of machine unlearning.