Zing Forum


StableOPD: Addressing Length Inflation in Online Policy Distillation for Large Models

The research team reveals the length-inflation and truncation-collapse issues in OPD training and proposes the StableOPD framework, which combines reference-divergence constraints with mixed-rollout distillation and achieves an average performance improvement of 7.2%.

Tags: Model Distillation, OPD, Online Policy Distillation, Training Stability, StableOPD, Length Inflation, Large Language Models
Published 2026-04-10 01:58 · Recent activity 2026-04-10 12:50 · Estimated read 6 min

Section 01

Introduction: StableOPD, a New Framework to Address Length Inflation in Online Policy Distillation for Large Models

The research team reveals the length-inflation and truncation-collapse failure modes in Online Policy Distillation (OPD) training and proposes the StableOPD framework, which combines reference-divergence constraints with mixed-rollout distillation, improving training stability and delivering an average performance gain of 7.2% across multiple datasets.


Section 02

Background: The Rise of Model Distillation and Online Policy Distillation

The scaling of Large Language Models (LLMs) improves capability but also raises deployment cost and inference latency, spurring the development of model distillation techniques. As an emerging paradigm, Online Policy Distillation (OPD) trains the student model on its own generated responses, which in theory lets it learn on the distribution it will actually encounter at inference time, but in practice suffers from training instability and collapse.


Section 03

Key Failure Modes of OPD: Length Inflation and Truncation Collapse

The study first documents the length-inflation phenomenon in OPD training: student responses suddenly grow much longer and fill with repetition and redundancy. Because of the sequence-length limit, overly long responses are truncated, so the training data becomes dominated by truncated trajectories (truncation collapse). This is closely tied to repetition saturation and further destabilizes training.
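The collapse described above can be detected with simple batch statistics. A minimal sketch, assuming token-length bookkeeping is available per rollout; the function name and the 512-token cap are illustrative, not from the paper:

```python
# Hypothetical monitor for truncation collapse: the failure mode is a
# batch dominated by responses that hit the sequence-length cap.

def truncation_stats(lengths, max_len):
    """Return (mean response length, fraction of responses hitting the cap)."""
    n = len(lengths)
    truncated = sum(1 for length in lengths if length >= max_len)
    return sum(lengths) / n, truncated / n

# Example batch: most responses have inflated up to a 512-token cap.
mean_len, trunc_frac = truncation_stats([512, 512, 480, 512, 300], max_len=512)
# trunc_frac == 0.6: a rising value warns of truncation collapse.
```

A truncation fraction trending toward 1.0 is the signal that truncated trajectories are taking over the training data.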


Section 04

Root Cause of Length Inflation: Feedback Loop Between Objective Function and Data Collection

The OPD objective implicitly favors long, repetitive responses (more token-overlap opportunities, more stable gradients), forming a feedback loop: long responses → high reward → the long-response strategy is reinforced → even longer responses, until length spirals out of control. Certain samples trigger inflation especially easily, distorting the training-data distribution.
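The incentive behind the loop can be illustrated with a toy calculation. This is a deliberately simplified sketch, not the paper's actual objective: if the per-sequence "reward" is a sum of per-token agreement scores, padding a response with easy, repetitive tokens raises the total even though each token adds little.

```python
# Toy illustration of the length-inflation incentive (values are made up):
# a summed per-token objective accumulates reward with length.

def sequence_reward(per_token_scores):
    """Sum of per-token agreement scores for one response."""
    return sum(per_token_scores)

short = [0.9, 0.9, 0.9]        # concise response, high-quality tokens
long_repetitive = [0.5] * 12   # inflated response, easy repetitive tokens

# The summed objective prefers the long, repetitive response,
# even though its per-token quality is lower.
```

Length-normalizing the objective removes this particular incentive, which is one reason monitoring and constraining length matters in OPD.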


Section 05

StableOPD Framework: Two Core Components for Stable Training

The StableOPD framework has two components: 1. Reference Divergence Constraint: limits the KL divergence between the student's output distribution and a reference model (the initial student or a baseline model) to prevent excessive policy drift; 2. Mixed Rollout Distillation: trains on a mixture of the student's online rollouts, teacher outputs, and human annotations to increase data diversity and smooth the reward signal.
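The two components above can be sketched as follows. This is a hedged sketch under assumptions, not the paper's implementation: distributions are plain probability lists over a small vocabulary, and the names `beta` and `mix_ratio` are illustrative hyperparameters.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stableopd_loss(student, teacher, reference, beta=0.1):
    """Teacher-distillation KL plus a reference-divergence penalty.

    The beta term bounds drift from the reference model (e.g. the
    initial student), the constraint StableOPD adds to plain OPD.
    """
    return kl(teacher, student) + beta * kl(student, reference)

def mixed_rollout(prompt, student_gen, teacher_gen, mix_ratio=0.5):
    """Mix student online rollouts with teacher responses for diversity."""
    if random.random() < mix_ratio:
        return student_gen(prompt)
    return teacher_gen(prompt)
```

With `beta = 0` and `mix_ratio = 1` this reduces to plain on-policy distillation; the two extra knobs are what (in this sketch) trade off fidelity to the teacher against stability.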


Section 06

Experimental Validation: Performance and Stability Improvements of StableOPD

Validated on mathematical-reasoning datasets such as GSM8K and MATH, StableOPD: 1. prevents truncation collapse, keeps response lengths reasonable, and yields smooth training curves; 2. improves performance by 7.2% on average; 3. cuts the repeated n-gram ratio by roughly 40%; 4. generalizes across models from 7B to 70B parameters.


Section 07

Practical Training Insights: The Importance of Monitoring and Constraints

StableOPD yields the following practical insights: 1. Monitoring response-length changes is a key indicator of training health; 2. training on a model's self-generated data demands vigilance for distribution shift and explicit constraint mechanisms; 3. reward design must account for feedback loops and incentive distortion; 4. mixing multiple signal sources yields more robust training signals.
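One of the monitoring signals named above, the repeated n-gram ratio, can be computed as in the sketch below. The exact definition used in the paper may differ; this is one common formulation (fraction of n-grams that already appeared earlier in the same response):

```python
# Repetition monitor: a rising repeated n-gram ratio accompanies the
# length-inflation / repetition-saturation failure mode.

def repeated_ngram_ratio(tokens, n=3):
    """Fraction of n-grams that appeared earlier in the same response."""
    if len(tokens) < n:
        return 0.0
    seen, repeats, total = set(), 0, 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeats += 1
        seen.add(gram)
        total += 1
    return repeats / total

# A heavily repetitive response scores high:
ratio = repeated_ngram_ratio("a b c a b c a b c".split())  # 4 of 7 trigrams repeat
```

Tracked alongside mean response length, this gives a cheap early-warning dashboard for the failure modes the paper describes.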


Section 08

Limitations and Future Directions

StableOPD currently focuses on mathematical-reasoning tasks, and its effectiveness in other domains remains to be verified. The choice of reference model affects performance, so automatic selection or dynamic adjustment is worth exploring. Future work could pursue content-aware dynamic length control or explicit length-optimization objectives.