Zing Forum

Compress-Distill: Reasoning Trace Compression Enables Efficient Knowledge Distillation

The research team studies post-processing compression of the chain-of-thought (CoT) traces produced by reasoning models. Compressed traces reduce training tokens to 12-30% of the original, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x. Original traces still yield the highest accuracy, but compressed traces offer a strong accuracy-efficiency trade-off: small student models retain 96% of the original accuracy while improving token efficiency by 18x.

Tags: knowledge distillation, reasoning models, chain-of-thought compression, model compression, knowledge transfer, Chain-of-Thought, model efficiency
Published 2026-06-04 18:30 | Recent activity 2026-06-05 16:27 | Estimated read 8 min

Section 01

Introduction: Compress-Distill Uses Reasoning Trace Compression to Boost Knowledge Distillation Efficiency

The research team proposes Compress-Distill, which tackles the cost of knowledge distillation by applying post-processing compression to the chain-of-thought (CoT) traces of reasoning models. Key findings: compressed traces reduce training tokens to 12-30% of the original, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x; small student models retain 96% of the original accuracy while gaining an 18x improvement in token efficiency. The result is a favorable balance between accuracy and efficiency.

Original Paper Info: arXiv preprint (June 4, 2026), title "Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation", link http://arxiv.org/abs/2606.05988v1

Section 02

Research Background: Dilemmas in Knowledge Distillation for Reasoning Models

Dual Characteristics of Reasoning Models

  • Advantages: Explicit chain-of-thought (CoT) provides strong interpretability, facilitating error diagnosis and knowledge distillation supervision.
  • Disadvantages: Verbose CoT (thousands to tens of thousands of tokens) leads to high computational overhead in training/inference, and student models tend to mimic the verbose style.

Knowledge Distillation Challenges

  • High Training Cost: Training time and memory grow with sequence length, making large-scale distillation on long traces impractical.
  • Student Behavior Bias: Small models mimic the teacher’s verbose outputs, which conflicts with the goal of efficient reasoning.
  • Efficiency-Quality Trade-off: Simple truncation loses key information, while full traces are too costly, so intelligent compression strategies are needed.
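
To make the truncation problem concrete, here is a minimal sketch of a fixed-length truncation baseline. It assumes whitespace tokenization and a trace whose conclusion sits at the end; the function `truncate_trace` and the sample trace are hypothetical illustrations, not taken from the paper.

```python
def truncate_trace(trace: str, keep_ratio: float) -> str:
    """Keep only the first `keep_ratio` fraction of whitespace tokens."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    tokens = trace.split()
    keep = max(1, int(len(tokens) * keep_ratio))
    return " ".join(tokens[:keep])


# Hypothetical trace: the conclusion appears only at the very end.
trace = (
    "Let x be the unknown. We have 2x + 3 = 11. "
    "Subtract 3 from both sides to get 2x = 8. "
    "Let me double-check the subtraction: 11 - 3 = 8, yes. "
    "Divide by 2. Final answer: x = 4"
)

truncated = truncate_trace(trace, keep_ratio=0.2)
print(truncated)
# The cut falls before the conclusion, so the answer is lost.
print("Final answer" in truncated)
```

Truncation keeps the opening setup and discards everything after the cut, including the answer, whereas a semantic compressor could drop the redundant double-check and keep the conclusion.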

Section 03

Core Method: Post-Processing Compression of CoT in Compress-Distill

Core Idea

Apply post-processing compression to CoT before distillation to retain key reasoning steps and remove redundancies (repeated verification, unnecessary expansions, verbose expressions).

Method Flow

  1. Teacher generates full CoT;
  2. Instruction-tuned model performs semantic compression;
  3. Train student using compressed traces;
  4. Student learns concise reasoning.
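
The first two steps of this flow can be sketched as a small data-preparation loop. This is only an illustration under stated assumptions: `teacher` and `compressor` are hypothetical callables standing in for real model calls, `DistillExample` is a made-up container, and the toy stand-ins below exist only so the sketch runs. Student training (steps 3 and 4) is out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DistillExample:
    prompt: str
    target: str  # compressed reasoning trace followed by the final answer


def build_compressed_dataset(
    problems: Iterable[str],
    teacher: Callable[[str], tuple[str, str]],  # returns (trace, answer)
    compressor: Callable[[str], str],
) -> list[DistillExample]:
    examples = []
    for problem in problems:
        trace, answer = teacher(problem)     # step 1: full CoT from the teacher
        short_trace = compressor(trace)      # step 2: semantic compression
        examples.append(
            DistillExample(problem, f"{short_trace}\nAnswer: {answer}")
        )
    return examples


# Toy stand-ins (assumptions): a real setup would call actual models here.
def toy_teacher(problem: str) -> tuple[str, str]:
    trace = (
        "Subtract 3 to get 2x = 8. Let me double-check that. "
        "Divide by 2 to get x = 4."
    )
    return trace, "4"


def toy_compressor(trace: str) -> str:
    sentences = [s.strip() for s in trace.split(". ") if s.strip()]
    kept = [s for s in sentences if "double-check" not in s]
    return ". ".join(kept)


dataset = build_compressed_dataset(
    ["Solve 2x + 3 = 11."], toy_teacher, toy_compressor
)
print(dataset[0].target)
```

The resulting examples would then serve as supervised fine-tuning targets for the student.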

Compression Effect

Compressed traces are only 8.6-21.0% of the original length, significantly reducing training tokens while maintaining reasoning integrity.
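
One way to measure such a length ratio over a corpus is sketched below. It assumes whitespace tokens as a stand-in for real tokenizer tokens and uses made-up trace pairs; `corpus_compression_ratio` is a hypothetical helper, and the paper may compute its figure differently.

```python
from typing import Iterable


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for real tokenizer tokens.
    return len(text.split())


def corpus_compression_ratio(pairs: Iterable[tuple[str, str]]) -> float:
    """Token-weighted ratio: total compressed tokens / total original tokens."""
    total_original = 0
    total_compressed = 0
    for original, compressed in pairs:
        total_original += count_tokens(original)
        total_compressed += count_tokens(compressed)
    if total_original == 0:
        raise ValueError("no original tokens to compare against")
    return total_compressed / total_original


# Made-up (original, compressed) pairs for illustration only.
pairs = [
    ("a b c d e f g h i j", "a j"),                          # 10 -> 2 tokens
    ("k l m n o p q r s t u v w x y z " * 5, "k z " * 5),    # 80 -> 10 tokens
]
ratio = corpus_compression_ratio(pairs)
print(f"{ratio:.1%} of original length")
```

A token-weighted ratio lets long traces dominate the figure; averaging per-example ratios instead would weight every trace equally and can give a different number.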

Section 04

Experimental Evidence: Efficiency and Accuracy Performance of Compressed Distillation

Experimental Setup

  • Teacher Models: Qwen3.5-397B-A17B and gpt-oss-120B, which generated 283,000 correct traces;
  • Compression Model: Instruction-tuned model;
  • Student Models: 0.8B to large-scale, trained with full-parameter or LoRA fine-tuning;
  • Evaluation Tasks: Math and logic reasoning (48 main experiments plus 7 ablation studies).

Key Results

  • Training Efficiency: Training tokens reduced to 12-30% of the original, training speed increased by 2.0-7.6x, memory usage decreased;
  • Inference Efficiency: Outputs shortened by 3-19x, generation speed improved;
  • Accuracy: Original traces yield the highest accuracy; compressed traces retain 96% of it;
  • Ablation Studies: Intelligent compression outperforms fixed-length truncation, and small student models benefit more;
  • LoRA Setup: A 0.8B model trained on compressed traces performs close to the same model trained on original traces.
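
Reading the 12-30% figure as the fraction of training tokens that remains (as the summary at the top of the post states), the implied token reduction factor is simple arithmetic. The sketch below only restates the post's numbers; `reduction_factor` is a hypothetical helper.

```python
def reduction_factor(remaining_fraction: float) -> float:
    """Convert 'fraction of tokens remaining' into an 'N times fewer' factor."""
    if not 0.0 < remaining_fraction <= 1.0:
        raise ValueError("remaining_fraction must be in (0, 1]")
    return 1.0 / remaining_fraction


for frac in (0.12, 0.30):
    factor = reduction_factor(frac)
    print(f"{frac:.0%} of tokens remain -> {factor:.1f}x fewer tokens")
# 12% of tokens remain -> 8.3x fewer tokens
# 30% of tokens remain -> 3.3x fewer tokens
```

The reported 2.0-7.6x training speedup is somewhat below this 3.3-8.3x token reduction, which would be consistent with training time not scaling purely with token count, though the post does not break this down.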

Section 05

Conclusion: Analysis of Accuracy-Efficiency Trade-off

Nature of the Trade-off

Compression offers an accuracy-efficiency trade-off, not a free improvement:

  • Ultimate Performance: Choose original traces when resources are sufficient (e.g., critical tasks like medical applications);
  • Efficiency Priority: Choose compressed traces when resources are limited (e.g., production/real-time applications);
  • Balanced Solution: Compressed traces (96% of the original accuracy with 2.0-7.6x faster training) are suitable for most scenarios.

Per-Token Efficiency

Measured as accuracy per consumed token, small student models trained on compressed traces are 18x more efficient than those trained on original traces, which means better resource utilization.
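
A small worked example shows how a figure of this size can arise. The accuracies and token counts below are made up for illustration (they are not the paper's measurements), and `per_token_efficiency` is a hypothetical helper that implements the "accuracy per consumed token" definition given above.

```python
def per_token_efficiency(accuracy: float, avg_output_tokens: float) -> float:
    """Accuracy obtained per consumed output token."""
    if avg_output_tokens <= 0:
        raise ValueError("avg_output_tokens must be positive")
    return accuracy / avg_output_tokens


# Hypothetical numbers, chosen only to illustrate the arithmetic.
orig_acc, orig_tokens = 0.50, 6000   # student trained on original traces
comp_acc, comp_tokens = 0.48, 320    # student trained on compressed traces

retained = comp_acc / orig_acc
gain = per_token_efficiency(comp_acc, comp_tokens) / per_token_efficiency(
    orig_acc, orig_tokens
)
print(f"accuracy retained: {retained:.0%}")
print(f"per-token efficiency gain: {gain:.1f}x")  # about 18x for these numbers
```

Under these assumed values, keeping 96% of the accuracy while emitting 18.75x fewer tokens yields roughly an 18x gain, since the gain is the product of the two ratios.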

Section 06

Practical Recommendations: Application Scenarios and Implementation Guide for Compress-Distill

Recommended Scenarios

  • Resource-limited environments requiring efficient training;
  • Fast iteration experiments;
  • Inference latency-sensitive production environments;
  • Student models smaller than 7B parameters.

Cautionary Scenarios

  • Tasks requiring extreme accuracy;
  • Cases where teacher outputs are already concise;
  • Settings with ample computing resources.

Implementation Tips

  • Choose lightweight instruction models for compression (e.g., Qwen2.5-7B-Instruct);
  • Start with moderate compression (retain 15-20% of original length) for tuning;
  • Multi-stage strategy: Train a fast baseline on compressed traces, then fine-tune on original traces.
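
As a starting point for the compression step, the sketch below builds an instruction for the compression model with a length budget derived from the retain ratio. The prompt wording, the word-based budget, and the function `build_compression_prompt` are all assumptions for illustration; the paper's actual compression prompt is not given in this post.

```python
def build_compression_prompt(trace: str, keep_ratio: float = 0.2) -> str:
    """Build a hypothetical compression instruction with a word budget."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_words = len(trace.split())
    budget = max(1, round(n_words * keep_ratio))
    return (
        f"Compress the reasoning below to at most {budget} words. "
        "Keep every step needed to reach the final answer; "
        "drop repeated verification, digressions, and filler.\n\n"
        f"Reasoning:\n{trace}"
    )


# Toy 100-word trace; a real trace would come from the teacher model.
toy_trace = " ".join(f"step{i}" for i in range(100))
prompt = build_compression_prompt(toy_trace, keep_ratio=0.2)
print(prompt.splitlines()[0])
```

The `keep_ratio` can then be tuned within the suggested 15-20% range while checking the measured length of the compressor's outputs, since an instruction-tuned model will not necessarily respect the budget exactly.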

Section 07

Limitations and Future Directions: Improvement Opportunities for Compress-Distill

Current Limitations

  • Compression inevitably loses some information;
  • Effectiveness depends on the quality of the compression model;
  • Results depend strongly on the task and the teacher model.

Future Directions

  • Adaptive compression (dynamically adjust the compression level based on problem difficulty);
  • Explainable compression (justify why each part is deleted);
  • Multi-teacher fusion and end-to-end joint training;
  • Theoretical analysis from an information theory perspective.