Quality-Utility Paradox: Why High-Reward Data Harms Small Models' Mathematical Reasoning Ability

A paper accepted at ICML 2026 reports a counterintuitive finding: data refined by strong models (Oracles), despite earning higher reward scores, trains small models less effectively than data the small models generate and filter themselves. The study proposes a style-aligned refinement method that preserves the small model's native reasoning distribution while retaining the logical fixes.

Tags: knowledge distillation, mathematical reasoning, small language models, reward model, distribution drift, style alignment
Published 2026-06-15 11:13 | Recent activity 2026-06-16 12:22 | Estimated read 6 min

Section 01

Introduction: The Quality-Utility Paradox Challenges Traditional Understanding of Knowledge Distillation

A paper accepted at ICML 2026 reports a counterintuitive finding: training small models on high-reward data refined by strong models (Oracles) yields worse mathematical reasoning than training on data the small models generate and filter themselves. The authors call this the 'Quality-Utility Paradox'. Its core cause is that Oracle refinement pulls the training data away from the small model's native reasoning distribution. The study proposes a style-aligned refinement method to address the issue.

Section 02

Research Background: Common Assumptions of Knowledge Distillation

Knowledge distillation is a common technique for improving small language models (SLMs). In mathematical reasoning, the mainstream approach uses strong models (Oracles) to generate high-quality reasoning trajectories on which student models are then trained. The core assumption is that the higher a trajectory's reward-model score, the better its quality and the more effective the distillation. This study challenges that assumption.
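The conventional selection rule described above can be sketched in a few lines. This is only an illustration of the assumption being challenged, not the paper's pipeline; `select_by_reward` and `reward_fn` are hypothetical names, and `reward_fn` stands in for whatever reward model scores a trajectory.

```python
from typing import Callable, Sequence


def select_by_reward(
    trajectories: Sequence[str],
    reward_fn: Callable[[str], float],
) -> str:
    """Return the candidate trajectory with the highest reward score.

    Hypothetical helper: it encodes the assumption that a higher
    reward-model score means better distillation data.
    """
    if not trajectories:
        raise ValueError("need at least one candidate trajectory")
    # max() keeps the first candidate among ties.
    return max(trajectories, key=reward_fn)
```

Note that nothing in this rule looks at the student model, which is exactly the gap the study points to.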

Section 03

Core Finding: The Quality-Utility Paradox

Experiments confirm the 'Quality-Utility Paradox': training on high-reward data refined by Oracles is consistently less effective than training on data that the small models generate themselves and filter with rejection sampling. The effect appears across the Qwen2.5, LLaMA-3, and DeepSeek model series, which suggests it is a general phenomenon rather than an exception.
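For readers unfamiliar with the self-generated baseline, here is a minimal sketch of rejection sampling. It assumes the filter is an exact match between the sampled final answer and a reference answer; the source does not state the acceptance criterion, and `generate` is a hypothetical callable standing in for the small model.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    generate: Callable[[str], Tuple[str, str]],
    problem: str,
    reference_answer: str,
    num_samples: int = 8,
) -> List[str]:
    """Sample trajectories from the small model and keep the correct ones.

    `generate(problem)` is assumed to return (trajectory, final_answer).
    Acceptance by exact answer match is an assumption of this sketch.
    """
    kept: List[str] = []
    for _ in range(num_samples):
        trajectory, answer = generate(problem)
        if answer.strip() == reference_answer.strip():
            kept.append(trajectory)
    return kept
```

Because every kept trajectory was written by the student itself, the resulting data stays inside the student's native distribution by construction.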

Section 04

Mechanism Analysis: Trade-off Between Distribution Drift and Adaptation Cost

Oracle refinement has two effects. Logical repair corrects errors and is positive. Distribution drift changes the reasoning style so that it deviates from the small model's native distribution, and is negative. During learning the small model therefore faces a trade-off between the benefit of logical repair and the cost of adapting to a new distribution. When the drift is large enough, the adaptation cost exceeds the benefit and performance degrades.
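The trade-off can be made concrete with a toy calculation. The sketch below is not the paper's measure: it assumes drift can be approximated by how much less likely the refined trajectory is under the student model than the native one (mean per-token negative log-likelihood), and `adaptation_weight` is a hypothetical scaling factor.

```python
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token under the student model."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return -sum(token_logprobs) / len(token_logprobs)


def drift_proxy(
    native_logprobs: Sequence[float],
    refined_logprobs: Sequence[float],
) -> float:
    """Assumed drift proxy: how much less native the refined text looks."""
    return mean_nll(refined_logprobs) - mean_nll(native_logprobs)


def net_utility(
    repair_benefit: float,
    drift: float,
    adaptation_weight: float = 1.0,
) -> float:
    """Benefit of logical repair minus the (assumed linear) adaptation cost."""
    # Negative drift is clipped to zero: it is not treated as a bonus.
    return repair_benefit - adaptation_weight * max(drift, 0.0)
```

Under these assumptions, a refinement is worth using only when `net_utility` is positive, which mirrors the claim that large drift can outweigh the logical fix.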

Section 05

Solution: Style-Aligned Refinement Method

Core idea: logical correctness and reasoning style can be separated.

Implementation steps:
1. Preserve the small model's native trajectory.
2. Use an Oracle or validator to locate errors.
3. Modify only the erroneous steps, keeping all other steps in their native expression.
4. Run a style consistency check.

Effect: reduces adaptation cost, preserves the logical benefits, and outperforms the baselines.
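The four steps above can be sketched as a small function. This is an illustration under stated assumptions, not the authors' implementation: `find_error_steps`, `rewrite_step`, and `style_ok` are hypothetical callables for the validator, the Oracle edit, and the style check, and discarding a trajectory when a fix fails the style check is a choice made for this sketch.

```python
from typing import Callable, List, Optional


def style_aligned_refine(
    native_steps: List[str],
    find_error_steps: Callable[[List[str]], List[int]],
    rewrite_step: Callable[[List[str], int], str],
    style_ok: Optional[Callable[[str, str], bool]] = None,
) -> Optional[List[str]]:
    """Fix only the erroneous steps of a native trajectory.

    Returns the refined steps, or None if any fix fails the style check
    (assumption of this sketch: such trajectories are dropped).
    """
    # Step 1: start from a copy of the small model's own trajectory.
    refined = list(native_steps)
    # Step 2: ask the validator (or Oracle) which step indices are wrong.
    for idx in find_error_steps(native_steps):
        # Step 3: rewrite only that step; all other steps stay untouched.
        candidate = rewrite_step(refined, idx)
        # Step 4: accept the fix only if it stays close to the native style.
        if style_ok is not None and not style_ok(native_steps[idx], candidate):
            return None
        refined[idx] = candidate
    return refined
```

Passing the current `refined` list to `rewrite_step` lets a later fix see earlier fixes, which matters when one wrong step propagates into the next.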

Section 06

Experimental Results: Validation Across Multiple Model Families

Experimental setup:
Model families: Qwen2.5, LLaMA-3, DeepSeek.
Data conditions compared: Oracle refinement, self-generated data with rejection sampling, and style-aligned refinement.
Evaluation metric: mathematical reasoning accuracy.

Key findings: the paradox holds, the measured distribution drift is significant, and style-aligned refinement performs best.
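A comparison of this kind reduces to computing accuracy per data condition and ranking the conditions. The sketch below assumes exact-match scoring of final answers, which the source does not specify; the function names are hypothetical and no numbers from the paper are reproduced.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of final answers that exactly match the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def rank_data_conditions(
    predictions_by_condition: Dict[str, Sequence[str]],
    references: Sequence[str],
) -> List[Tuple[str, float]]:
    """Score each training-data condition and sort from best to worst."""
    scores = {
        name: accuracy(preds, references)
        for name, preds in predictions_by_condition.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Here each key would be a data condition (for example Oracle refinement) and each value the answers produced by the student trained on that data.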

Section 07

Theoretical Implications and Practical Recommendations

Theoretical implications: data quality needs to be redefined as perceived quality plus learner compatibility, and the two should be optimized jointly:

Total utility = benefit of logical correctness - cost of distribution adaptation

Practical recommendations:
1. Use Oracle refinement cautiously.
2. Pay attention to how well the data distribution matches the student model.
3. Try style-aligned refinement.
4. Treat final model performance as the gold standard for data quality.

Section 08

Limitations and Future Directions

Limitations: the findings are validated only on mathematical reasoning tasks, style quantification is heuristic, and the method requires supervision. Future directions: validate on other tasks (e.g., code generation), develop precise style quantification, and build automated style-aligned methods.