Zing Forum

RLCSD: Contrastive Self-Distillation Addresses "Privilege-Induced Style Drift" in Reasoning Models

Researchers found that on-policy self-distillation (OPSD) suffers from the "privilege-induced style drift" problem, where learning signals are concentrated on style tokens rather than task tokens. The proposed RLCSD method addresses this issue by contrasting teacher-student gaps under correct and incorrect prompts, achieving consistent improvements across multiple models.

Published 2026-06-10 14:31 | Recent activity 2026-06-11 12:24 | Estimated read 7 min

Section 01

RLCSD: A New Method to Address Privilege-Induced Style Drift in Reasoning Models (Introduction)

Researchers found that on-policy self-distillation (OPSD) suffers from "privilege-induced style drift": its learning signals concentrate on style tokens rather than task tokens. The proposed RLCSD method addresses this by contrasting the teacher-student gaps obtained under correct and incorrect privileged prompts, and it achieves consistent improvements across multiple models.

Keywords: RLCSD, Reinforcement Learning, Self-Distillation, Reasoning Models, Contrastive Learning, Style Drift, GRPO, Machine Learning

Section 02

A New Challenge in Reasoning Model Training: The Style Drift Problem in OPSD

Large reasoning models (such as DeepSeek-R1 and the OpenAI o-series) have achieved significant results in mathematical and logical reasoning tasks through reinforcement learning. One important technique in this training setting is on-policy self-distillation (OPSD), which provides dense token-level supervision by aligning the model's distribution with the distribution it produces under privileged context (verified solutions). However, studies reveal that OPSD's learning signals are seriously biased: they concentrate on style tokens rather than task tokens.
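
To make the token-level supervision concrete, here is a minimal sketch, not the authors' implementation. It assumes the supervision is a per-token KL divergence between the model's distribution with privileged context (teacher) and without it (student). The KL direction, the function names, and the random toy logits are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_level_kl(student_logits, teacher_logits):
    """Per-token KL(teacher || student); returns shape (sequence_length,)."""
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_teacher = np.exp(log_p_teacher)
    return (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)

rng = np.random.default_rng(0)
seq_len, vocab_size = 5, 8  # toy sizes, chosen only for illustration

# Hypothetical logits from the same model on the same sampled response:
# once given only the question (student), once with a verified solution
# added to the context (teacher). Random values stand in for a real model.
student_logits = rng.normal(size=(seq_len, vocab_size))
teacher_logits = student_logits + rng.normal(scale=0.5, size=(seq_len, vocab_size))

per_token_gap = token_level_kl(student_logits, teacher_logits)  # one value per token
opsd_loss = per_token_gap.mean()  # dense signal averaged over the response
print(per_token_gap, opsd_loss)
```

Every token position receives its own gap value, which is what makes the supervision dense. It also means that any position where the privileged context changes the wording, not just the reasoning, contributes to the loss.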

Section 03

Root Causes and Consequences of Privilege-Induced Style Drift

Root Causes of the Problem

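When the model generates outputs under privileged prompts (which contain the correct answer or reasoning), it tends to give more direct and concise responses, because it no longer needs to explore. Without privileged prompts, it needs longer reasoning chains. Much of the teacher-student gap therefore reflects this difference in style rather than a difference in task knowledge.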

Consequences

  1. Training instability: The model oscillates between its behavior with privileged prompts and its behavior without them
  2. Shorter response length: The model imitates the concise style and sacrifices deep reasoning
  3. Signal dilution: Task-related tokens do not receive enough of the learning signal

In short, the model learns "how to say it" rather than "how to think".

Section 04

RLCSD Method: Contrastive Learning Separates Style and Task Signals

The core idea of RLCSD (Reinforcement Learning with Contrastive On-Policy Self-Distillation) is to separate style and task signals through contrastive learning.

Core Mechanism

RLCSD considers two types of privileged prompts at the same time:

  1. Correct prompt: Provides the correct answer or reasoning
  2. Incorrect prompt: Provides a wrong answer or misleading reasoning

Contrasting the teacher-student distribution gaps in these two cases makes it possible to:

  • Identify style shifts (similar style changes appear in both cases)
  • Suppress style drift (the common style component cancels out)
  • Focus on task signals (the task-related differences are retained)

Mathematical Intuition

Effective signal = (Gap under correct prompt) - (Gap under incorrect prompt)

Style drift exists in both cases, so subtraction cancels it out; task signals only exist in the correct prompt, so they are retained.
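
The subtraction can be illustrated with a toy numerical sketch. It is not the paper's objective: it assumes each per-token gap is an additive sum of a style component and a task component, and the arrays, values, and the function name `contrastive_signal` are hypothetical.

```python
import numpy as np

# Toy per-token components (hypothetical values, for illustration only).
# Positions 0-2 are treated as "style" tokens, positions 3-4 as "task" tokens.
style_shift = np.array([0.8, 0.6, 0.7, 0.1, 0.1])  # appears under any privileged prompt
task_signal = np.array([0.0, 0.0, 0.0, 0.5, 0.9])  # appears only under the correct prompt

# Assumed additive model of the teacher-student gap at each token.
gap_correct = style_shift + task_signal  # gap under the correct prompt
gap_incorrect = style_shift              # gap under the incorrect prompt

def contrastive_signal(gap_correct, gap_incorrect):
    """Effective signal = gap under correct prompt - gap under incorrect prompt."""
    return gap_correct - gap_incorrect

effective = contrastive_signal(gap_correct, gap_incorrect)

# Share of the total signal that falls on the style positions, before and after.
style_share_before = gap_correct[:3].sum() / gap_correct.sum()
style_share_after = effective[:3].sum() / effective.sum()
print(effective, style_share_before, style_share_after)
```

In this idealized setting the style component cancels exactly. With a real model, the style shift under the two prompts would only be approximately equal, so the cancellation would be partial.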

Section 05

Experimental Validation: Consistent Improvements of RLCSD Across Multiple Models and Tasks

Test Models

The experiments cover models of different scales: Qwen3 1.7B (lightweight), Qwen3 4B (medium), Qwen3 8B (larger), and Olmo-3-7B-Think (an open-source reasoning model).

Test Tasks

The tasks include mathematical problem solving (GSM8K, MATH, etc.), logical reasoning tasks, and multi-step reasoning challenges.

Main Results

  1. Consistently outperforms GRPO: Better than standard GRPO in all settings
  2. Outperforms existing OPSD methods: Stable improvements
  3. Scale independence: Improvements are maintained across different model scales

These results suggest that the method generalizes across the tested models and tasks.

Section 06

Generality of RLCSD and Training Insights

Generality of the Contrast Principle

  • Enhance existing OPSD: Can be inserted into existing methods to improve performance
  • Extend to cross-model distillation: Applicable to scenarios where teacher models guide student models

Training Insights

  1. Signal quality is more important than quantity: OPSD's dense supervision is only useful if the signal itself is of good quality
  2. Be alert to implicit bias: Style drift is not obvious on the surface
  3. Power of contrastive learning: Contrast separates the important signals and can be generalized to other scenarios

Section 07

Limitations of RLCSD and Future Research Directions

Limitations

  • Incorrect-prompt design: It remains open how to construct incorrect prompts (for example, with random or systematic errors) to maximize the contrast effect
  • Computational overhead: The contrast requires generating and evaluating two sets of outputs, which increases cost

Future Directions

  • Optimize incorrect-prompt design
  • Reduce computational overhead
  • Combine with other techniques (such as process reward model PRM, multi-agent methods)