# OPSD as a Post-RL Compression Stage for Reasoning Models: A Paradigm Shift from Correction to Simplification

> The true mechanism of OPSD in chain-of-thought reasoning is revealed: it is primarily a compression tool rather than a correction tool. On mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories significantly shortens output length while maintaining accuracy, whereas applying it to incorrect trajectories harms performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-07T13:04:34.000Z
- Last activity: 2026-05-08T04:57:32.784Z
- Popularity: 144.1
- Keywords: OPSD, self-distillation, chain of thought, reasoning models, model compression, reinforcement learning, post-training, mathematical reasoning
- Page link: https://www.zingnex.cn/en/forum/thread/opsd-rl
- Canonical: https://www.zingnex.cn/forum/thread/opsd-rl
- Markdown source: floors_fallback

---

## [Introduction] The True Role of OPSD in Reasoning Models: Compression Tool Rather Than Correction Tool

This article reveals the core role of OPSD (On-Policy Self-Distillation) in chain-of-thought reasoning: it is primarily a compression tool rather than a correction tool. In mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories maintains accuracy while significantly shortening output length, whereas applying it to incorrect trajectories harms performance. Based on this, the paper proposes a new post-training pipeline, SFT → RLVR → OPSD, in which each stage handles a distinct job to achieve efficient reasoning.

## Background and Traditional Paths of Post-Training for Reasoning Models

Large Reasoning Models (LRMs) improve performance on complex tasks by generating detailed Chains of Thought (CoT), but the verbosity of CoT drives up token consumption and latency. There are two traditional post-training paths:

1. Reinforcement Learning with Verifiable Rewards (RLVR): trains the policy directly with verifiable rewards, but training is complex and prone to over-optimization.
2. Knowledge Distillation: relies on a teacher model to generate trajectories for training the student; simple and effective, but capped by the teacher's ability.

As a compromise, OPSD requires no external teacher: it learns from the model's own experience through post-hoc supervision, and was once expected to simultaneously improve accuracy and shorten responses.
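
To make "verifiable rewards" concrete, here is a minimal sketch of the kind of outcome check RLVR relies on for math problems. The `#### <answer>` delimiter, the `extract_final_answer`/`verifiable_reward` names, and the exact-match comparison are illustrative assumptions, not the reward used in the paper.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the last '#### <answer>' span out of a generated response (assumed format)."""
    matches = re.findall(r"####\s*(.+)", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted final answer exactly matches the reference, else 0.0."""
    predicted = extract_final_answer(response)
    return 1.0 if predicted == reference_answer.strip() else 0.0

# A correct and an incorrect trajectory under this hypothetical answer format:
print(verifiable_reward("... so the total is 42.\n#### 42", "42"))  # 1.0
print(verifiable_reward("... so the total is 41.\n#### 41", "42"))  # 0.0
```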

## Working Principle of OPSD and Early Successful Scenarios

The core of OPSD is "post-hoc supervision": generate reasoning trajectories → evaluate answer correctness → assign credit (identify redundancy in correct trajectories or pivotal mistakes in incorrect ones) → train the model toward the better token choices. It combines the advantages of RL (learning from the model's own experience) with those of distillation (fine-grained, token-level supervision). In the "thinking-disabled" setting (answers generated directly, without a chain of thought), OPSD both improves accuracy and eliminates redundant steps, with good results.
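
As a rough illustration of that loop, the sketch below mirrors the four steps in plain Python. The `Trajectory` container and the `generate`, `extract_answer`, `credit_assign`, and `distill_step` names are hypothetical placeholders standing in for the paper's actual components, not its API.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt: str
    tokens: list[str]            # the model's own chain-of-thought plus answer tokens
    correct: bool                # outcome of the post-hoc answer check
    token_weights: list[float]   # per-token credit from the credit-assignment step

def credit_assign(tokens: list[str], correct: bool) -> list[float]:
    """Placeholder credit assignment: a real system would down-weight redundant spans
    in correct trajectories and flag pivotal decisions in incorrect ones."""
    return [1.0] * len(tokens)

def opsd_round(model, problems, reference_answers):
    """One round of on-policy self-distillation: sample -> check -> assign credit -> train."""
    batch = []
    for prompt, ref in zip(problems, reference_answers):
        tokens = model.generate(prompt)                 # 1) sample a trajectory on-policy
        correct = model.extract_answer(tokens) == ref   # 2) post-hoc correctness check
        weights = credit_assign(tokens, correct)        # 3) identify redundancy / key issues
        batch.append(Trajectory(prompt, tokens, correct, weights))
    model.distill_step(batch)                           # 4) token-level supervised update
    return batch
```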

## Unexpected Findings in Chain-of-Thought Reasoning

When OPSD is applied to "thinking-enabled" mathematical reasoning tasks, the accuracy improvement shrinks sharply or even turns negative. The hypothesized explanation: post-hoc supervision can effectively pinpoint better token replacements in short reasoning, but in long chains of thought it is easier to identify redundancy than to supply better alternatives. Errors in short reasoning trace back to a few key decisions, errors in long reasoning are hard to attribute, and correct long reasoning is already relatively optimized.

## Experimental Design and Result Verification

The experiment disentangles compression from correction: reasoning trajectories are split into correct and incorrect groups, and OPSD is applied to each group separately. Results: the correct-only OPSD group keeps accuracy essentially unchanged while producing markedly shorter outputs; the incorrect-only OPSD group loses accuracy while output length changes little. This indicates that, in CoT reasoning, OPSD mainly acts as a compressor and cannot reliably correct faulty trajectories.
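
A minimal sketch of that split, reusing the hypothetical `Trajectory` container and `distill_step` interface from the earlier sketch (the `clone` method is likewise an assumption). It only illustrates the experimental partition, not the paper's training code.

```python
def split_by_correctness(trajectories):
    """Partition self-generated trajectories into correct-only and incorrect-only pools."""
    correct = [t for t in trajectories if t.correct]
    incorrect = [t for t in trajectories if not t.correct]
    return correct, incorrect

def run_ablation(model, trajectories):
    correct, incorrect = split_by_correctness(trajectories)

    # Condition A: distill on correct trajectories only
    # (reported outcome: output length drops, accuracy roughly unchanged).
    model_correct_only = model.clone()
    model_correct_only.distill_step(correct)

    # Condition B: distill on incorrect trajectories only
    # (reported outcome: accuracy drops, length changes little).
    model_incorrect_only = model.clone()
    model_incorrect_only.distill_step(incorrect)

    return model_correct_only, model_incorrect_only
```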

## Deep Reasons Why OPSD Struggles to Correct Long Reasoning

1. Difficult error attribution: errors in long chains stem from the accumulation of many decisions, making them hard to localize precisely.
2. Limited optimization space in correct trajectories: correct long chains have already gone through self-correction, leaving little room for improvement beyond compression.
3. Scarce alternatives: correct alternative paths for long chains diverge widely, so token-level replacement rarely produces a fix.
4. Compression is safer: deleting redundancy carries low risk, whereas correction easily introduces new errors.

## Suggestions for Revised Post-Training Process

The proposed three-stage process (sketched below):

1. SFT (Supervised Fine-Tuning): teach the basic reasoning format with high-quality data.
2. RLVR: explore efficient policies through verifiable rewards.
3. OPSD compression: apply OPSD only to correct trajectories generated by the RLVR-trained model, purely for simplification; error correction is left to RLVR.

The division of labor is the point: RLVR handles exploration and correction, OPSD handles simplification, avoiding OPSD's weakness at correcting errors.
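
A minimal sketch of that division of labor, with each stage as a stub. The stage functions and the `sample_trajectories` helper are hypothetical names chosen to mirror this section, and `verifiable_reward` refers to the outcome check sketched earlier; none of this is the paper's implementation.

```python
def sft_finetune(model, sft_data):
    """Stage 1 stub: supervised fine-tuning on curated traces to teach the reasoning format."""
    return model

def rlvr_train(model, problems, answers, reward_fn):
    """Stage 2 stub: RL with a verifiable 0/1 reward handles exploration and correction."""
    return model

def sample_trajectories(model, problems, answers):
    """Stub: sample on-policy trajectories from the RLVR-trained model and mark correctness."""
    return []

def opsd_compress(model, correct_trajectories):
    """Stage 3 stub: OPSD distills on correct trajectories only, to shorten them."""
    return model

def post_train(base_model, sft_data, problems, answers):
    model = sft_finetune(base_model, sft_data)
    model = rlvr_train(model, problems, answers, reward_fn=verifiable_reward)
    # OPSD is given only the trajectories the RLVR-trained model already solves;
    # incorrect trajectories are left for further RLVR, not for OPSD to "fix".
    trajectories = sample_trajectories(model, problems, answers)
    correct_only = [t for t in trajectories if t.correct]
    return opsd_compress(model, correct_only)
```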

## Research Implications and Conclusions

Implications:

1. Method choice should depend on task characteristics.
2. Compression and correction should be handled separately.
3. Multi-stage training, with each stage specialized, works better than asking one method to do everything.
4. Post-hoc supervision has inherent limitations.

Conclusion: OPSD is a powerful compression tool but not a reliable correction tool. Positioning it as the compression stage after RLVR yields efficient reasoning. Practitioners should let OPSD focus on "shorter" and leave "better" to tools like RLVR.
