Zing Forum

OPSD: Post-RL Compression Stage for Reasoning Models—Paradigm Shift from Correction to Simplification

Reveal the true mechanism of OPSD in chain-of-thought reasoning: it is primarily a compression tool rather than a correction tool. In mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories can significantly shorten output length while maintaining accuracy, whereas applying it to incorrect trajectories harms performance.

Tags: OPSD, self-distillation, chain-of-thought, reasoning models, model compression, reinforcement learning, post-training, mathematical reasoning
Published 2026-05-07 21:04 · Recent activity 2026-05-08 12:57 · Estimated read 8 min

Section 01

[Introduction] The True Role of OPSD in Reasoning Models: Compression Tool Rather Than Correction Tool

This article reveals the core role of OPSD (On-Policy Self-Distillation) in chain-of-thought reasoning—it is primarily a compression tool rather than a correction tool. In mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories can maintain accuracy while significantly shortening output length, whereas applying it to incorrect trajectories harms performance. Based on this, the paper proposes a new post-training process: SFT→RLVR→OPSD, where each stage performs its own function to achieve efficient reasoning.

Section 02

Background and Traditional Paths of Post-Training for Reasoning Models

Large Reasoning Models (LRMs) improve performance on complex tasks by generating detailed Chain-of-Thought (CoT), but verbose CoT drives up token consumption and latency. There are two traditional post-training paths: 1. Reinforcement Learning with Verifiable Rewards (RLVR), which trains efficient policies against verifiable rewards but is complex to run and prone to over-optimization; 2. Knowledge Distillation, which relies on a teacher model to generate trajectories for training a student model, simple and effective but capped by the teacher's ability. OPSD sits between the two: it needs no external teacher, learning from the model's own experience through post-hoc supervision, and was once expected to improve accuracy and shorten responses at the same time.
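The "verifiable reward" in RLVR typically reduces to an exact check of the final answer. A minimal sketch of such a checker, assuming the common convention that the final answer is wrapped in `\boxed{...}` (the function name and answer format are illustrative, not from the article):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the boxed final answer in the
    model's response matches the gold answer exactly, else 0.0.

    Assumes answers are written as \\boxed{...}, a common convention
    in math-reasoning benchmarks.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no final answer found: treat as incorrect
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is computed from the output alone, no reward model is needed, which is what makes over-optimization a training-dynamics problem rather than a reward-hacking one.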

Section 03

Working Principle of OPSD and Early Successful Scenarios

The core of OPSD is "post-hoc supervision": generate reasoning trajectories → evaluate the correctness of the final answers → assign credit (flag redundancy in correct trajectories or pivotal mistakes in incorrect ones) → train the model toward the better token choices. It combines the strengths of RL (learning from the model's own experience) and distillation (fine-grained token-level supervision). In the "thinking-disabled" setting (generating answers directly, without a reasoning chain), OPSD both improves accuracy and removes redundant steps, with good results.
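The loop above can be sketched in a few lines. This is a schematic, not the paper's actual training code: `policy`, `grade`, `assign_credit`, and `distill_step` are hypothetical callables standing in for rollout sampling, answer verification, token-level credit assignment, and the distillation update:

```python
def opsd_iteration(policy, prompts, grade, assign_credit, distill_step):
    """One OPSD pass: on-policy rollout -> answer check -> token-level
    credit assignment -> distillation update. Returns the number of
    trajectories used for the update."""
    updates = 0
    for prompt in prompts:
        trajectory = policy(prompt)            # on-policy rollout
        is_correct = grade(trajectory)         # verify the final answer
        # Credit assignment: flag redundant tokens in correct trajectories,
        # or pivotal mistakes in incorrect ones.
        weights = assign_credit(trajectory, is_correct)
        distill_step(trajectory, weights)      # fine-grained token supervision
        updates += 1
    return updates
```

The key structural point is that supervision is computed *after* the rollout is graded, which is what distinguishes OPSD from ordinary teacher-forced distillation.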

Section 04

Unexpected Findings in Chain-of-Thought Reasoning

When OPSD is applied to "thinking-enabled" mathematical reasoning tasks, the accuracy gain shrinks sharply or even turns negative. A hypothesized explanation: post-hoc supervision can pinpoint better token replacements in short reasoning, but in long chains of thought it is far easier to identify redundancy than to supply better alternatives. Errors in short reasoning trace back cleanly to a few key decisions; errors in long reasoning are hard to attribute; and correct long reasoning is already relatively optimized.

Section 05

Experimental Design and Result Verification

The experiment separates compression and correction effects: divide reasoning trajectories into correct and incorrect groups, and apply OPSD to each group separately. Results: The accuracy of the correct-only OPSD group remains basically unchanged, and the output is significantly shortened; the accuracy of the incorrect-only OPSD group decreases, and the output length changes little. This proves that OPSD mainly plays a compression role in CoT reasoning and cannot effectively correct incorrect trajectories.
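The correct-only / incorrect-only split can be expressed as a simple filter over graded rollouts. A sketch of the ablation's data-selection step, assuming each trajectory is a dict with an `is_correct` field (an assumed schema, not the paper's):

```python
def split_and_filter(trajectories, mode="correct_only"):
    """Partition graded trajectories by answer correctness and keep only
    one group for OPSD training, mirroring the article's ablation.

    mode="correct_only"   -> train OPSD only on correct trajectories
    mode="incorrect_only" -> train OPSD only on incorrect trajectories
    """
    correct = [t for t in trajectories if t["is_correct"]]
    incorrect = [t for t in trajectories if not t["is_correct"]]
    return correct if mode == "correct_only" else incorrect
```

Running OPSD on each group in isolation is what lets the experiment attribute the length reduction to the correct group and the accuracy drop to the incorrect group.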

Section 06

Deep Reasons Why OPSD Struggles to Correct Long Reasoning

The reasons: 1. Error attribution is hard: errors in long chains stem from many accumulated decisions, so pinpointing the pivotal one is difficult; 2. Correct trajectories leave little headroom: a correct long chain has already self-corrected, leaving little room for further correction; 3. Alternative solutions are scarce: correct alternative paths through a long chain diverge widely, so token-level replacement struggles to repair errors; 4. Compression is safer: deleting redundancy is low-risk, while correction easily introduces new errors.

Section 07

Suggestions for Revised Post-Training Process

The article proposes a three-stage process: 1. SFT (Supervised Fine-Tuning): teach basic reasoning formats using high-quality data; 2. RLVR: explore efficient strategies through verifiable rewards; 3. OPSD compression: apply OPSD only to the correct trajectories RLVR produces, simplifying them without attempting to correct errors (which remain RLVR's job). The division of labor, RLVR for exploration and OPSD for simplification, sidesteps OPSD's weakness at correction.
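The three stages compose cleanly as a pipeline. A minimal sketch, where `sft_step`, `rlvr_step`, `opsd_compress`, and `grade` are hypothetical stand-ins for the stages described above (not an actual API from the article):

```python
def post_training_pipeline(model, sft_step, rlvr_step, opsd_compress, grade):
    """SFT -> RLVR -> OPSD-on-correct-only, as the article proposes."""
    model = sft_step(model)                    # 1. SFT: learn the reasoning format
    model, rollouts = rlvr_step(model)         # 2. RLVR: explore with verifiable rewards
    correct_only = [r for r in rollouts if grade(r)]  # keep only correct traces
    model = opsd_compress(model, correct_only)  # 3. OPSD: shorten, do not correct
    return model
```

The design choice worth noting is the filter between stages 2 and 3: OPSD never sees incorrect trajectories, so it can only compress, never (mis)correct.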

Section 08

Research Implications and Conclusions

Implications: 1. Method choice should match task characteristics; 2. Compression and correction should be handled separately; 3. Multi-stage training works better; 4. Post-hoc supervision has limits. Conclusion: OPSD is a powerful compression tool but not a reliable correction tool; positioning it as the compression stage after RLVR yields efficient reasoning. Practitioners should let OPSD focus on "shorter" and leave "better" to tools like RLVR.