Zing Forum

OPSD: Post-RL Compression Stage for Reasoning Models—Paradigm Shift from Correction to Simplification

Reveal the true mechanism of OPSD in chain-of-thought reasoning: it is primarily a compression tool rather than a correction tool. In mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories can significantly shorten output length while maintaining accuracy, whereas applying it to incorrect trajectories harms performance.

Tags: OPSD, self-distillation, chain-of-thought, reasoning models, model compression, reinforcement learning, post-training, mathematical reasoning
Published 2026-05-07 21:04 · Recent activity 2026-05-08 12:57 · Estimated read 8 min

Section 01

[Introduction] The True Role of OPSD in Reasoning Models: Compression Tool Rather Than Correction Tool

This article reveals the core role of OPSD (On-Policy Self-Distillation) in chain-of-thought reasoning—it is primarily a compression tool rather than a correction tool. In mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories can maintain accuracy while significantly shortening output length, whereas applying it to incorrect trajectories harms performance. Based on this, the paper proposes a new post-training process: SFT→RLVR→OPSD, where each stage performs its own function to achieve efficient reasoning.

Section 02

Background and Traditional Paths of Post-Training for Reasoning Models

Large Reasoning Models (LRMs) improve performance on complex tasks by generating detailed Chain-of-Thought (CoT), but verbose CoT drives up token consumption and latency. There are two traditional post-training paths: 1. Reinforcement Learning with Verifiable Rewards (RLVR), which trains efficient policies against verifiable rewards but is complex to run and prone to over-optimization; 2. Knowledge Distillation, which relies on a teacher model to generate trajectories for training a student model, simple and effective but capped by the teacher's ability. OPSD sits between the two: it needs no external teacher, learning from the model's own experience through post-hoc supervision, and was once expected to improve accuracy and shorten responses at the same time.
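The "verifiable reward" in RLVR typically reduces to an exact check of the final answer. A minimal sketch of such a checker, assuming the common convention that the final answer is wrapped in `\boxed{...}` (the function name and answer format are illustrative, not from the article):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the boxed final answer in the
    model's response matches the gold answer exactly, else 0.0.

    Assumes answers are written as \\boxed{...}, a common convention
    in math-reasoning benchmarks.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no final answer found: treat as incorrect
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is computed from the output alone, no reward model is needed, which is what makes over-optimization a training-dynamics problem rather than a reward-hacking one.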

Section 03

Working Principle of OPSD and Early Successful Scenarios

The core of OPSD is "post-hoc supervision": generate reasoning trajectories → evaluate the correctness of the final answers → assign credit (flag redundancy in correct trajectories or pivotal mistakes in incorrect ones) → train the model toward the better token choices. It combines the strengths of RL (learning from the model's own experience) and distillation (fine-grained token-level supervision). In the "thinking-disabled" setting (generating answers directly, without a reasoning chain), OPSD both improves accuracy and removes redundant steps, with good results.
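The loop above can be sketched in a few lines. This is a schematic, not the paper's actual training code: `policy`, `grade`, `assign_credit`, and `distill_step` are hypothetical callables standing in for rollout sampling, answer verification, token-level credit assignment, and the distillation update:

```python
def opsd_iteration(policy, prompts, grade, assign_credit, distill_step):
    """One OPSD pass: on-policy rollout -> answer check -> token-level
    credit assignment -> distillation update. Returns the number of
    trajectories used for the update."""
    updates = 0
    for prompt in prompts:
        trajectory = policy(prompt)            # on-policy rollout
        is_correct = grade(trajectory)         # verify the final answer
        # Credit assignment: flag redundant tokens in correct trajectories,
        # or pivotal mistakes in incorrect ones.
        weights = assign_credit(trajectory, is_correct)
        distill_step(trajectory, weights)      # fine-grained token supervision
        updates += 1
    return updates
```

The key structural point is that supervision is computed *after* the rollout is graded, which is what distinguishes OPSD from ordinary teacher-forced distillation.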

Section 04

Unexpected Findings in Chain-of-Thought Reasoning

When OPSD is applied to "thinking-enabled" mathematical reasoning tasks, the accuracy gain shrinks sharply or even turns negative. A hypothesized explanation: post-hoc supervision can pinpoint better token replacements in short reasoning, but in long chains of thought it is far easier to identify redundancy than to supply better alternatives. Errors in short reasoning trace back cleanly to a few key decisions; errors in long reasoning are hard to attribute; and correct long reasoning is already relatively optimized.

Section 05

Experimental Design and Result Verification

The experiment separates compression and correction effects: divide reasoning trajectories into correct and incorrect groups, and apply OPSD to each group separately. Results: The accuracy of the correct-only OPSD group remains basically unchanged, and the output is significantly shortened; the accuracy of the incorrect-only OPSD group decreases, and the output length changes little. This proves that OPSD mainly plays a compression role in CoT reasoning and cannot effectively correct incorrect trajectories.
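The correct-only / incorrect-only split can be expressed as a simple filter over graded rollouts. A sketch of the ablation's data-selection step, assuming each trajectory is a dict with an `is_correct` field (an assumed schema, not the paper's):

```python
def split_and_filter(trajectories, mode="correct_only"):
    """Partition graded trajectories by answer correctness and keep only
    one group for OPSD training, mirroring the article's ablation.

    mode="correct_only"   -> train OPSD only on correct trajectories
    mode="incorrect_only" -> train OPSD only on incorrect trajectories
    """
    correct = [t for t in trajectories if t["is_correct"]]
    incorrect = [t for t in trajectories if not t["is_correct"]]
    return correct if mode == "correct_only" else incorrect
```

Running OPSD on each group in isolation is what lets the experiment attribute the length reduction to the correct group and the accuracy drop to the incorrect group.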

Section 06

Deep Reasons Why OPSD Struggles to Correct Long Reasoning

The reasons: 1. Error attribution is hard: errors in long chains stem from many accumulated decisions, so pinpointing the pivotal one is difficult; 2. Correct trajectories leave little headroom: a correct long chain has already self-corrected, leaving little room for further correction; 3. Alternative solutions are scarce: correct alternative paths through a long chain diverge widely, so token-level replacement struggles to repair errors; 4. Compression is safer: deleting redundancy is low-risk, while correction easily introduces new errors.

Section 07

Suggestions for Revised Post-Training Process

The article proposes a three-stage process: 1. SFT (Supervised Fine-Tuning): teach basic reasoning formats using high-quality data; 2. RLVR: explore efficient strategies through verifiable rewards; 3. OPSD compression: apply OPSD only to the correct trajectories RLVR produces, simplifying them without attempting to correct errors (which remain RLVR's job). The division of labor, RLVR for exploration and OPSD for simplification, sidesteps OPSD's weakness at correction.
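The three stages compose cleanly as a pipeline. A minimal sketch, where `sft_step`, `rlvr_step`, `opsd_compress`, and `grade` are hypothetical stand-ins for the stages described above (not an actual API from the article):

```python
def post_training_pipeline(model, sft_step, rlvr_step, opsd_compress, grade):
    """SFT -> RLVR -> OPSD-on-correct-only, as the article proposes."""
    model = sft_step(model)                    # 1. SFT: learn the reasoning format
    model, rollouts = rlvr_step(model)         # 2. RLVR: explore with verifiable rewards
    correct_only = [r for r in rollouts if grade(r)]  # keep only correct traces
    model = opsd_compress(model, correct_only)  # 3. OPSD: shorten, do not correct
    return model
```

The design choice worth noting is the filter between stages 2 and 3: OPSD never sees incorrect trajectories, so it can only compress, never (mis)correct.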

Section 08

Research Implications and Conclusions

Implications: 1. Method choice should match task characteristics; 2. Compression and correction should be handled separately; 3. Multi-stage training works better; 4. Post-hoc supervision has limits. Conclusion: OPSD is a powerful compression tool but not a reliable correction tool; positioning it as the compression stage after RLVR yields efficient reasoning. Practitioners should let OPSD focus on "shorter" and leave "better" to tools like RLVR.