Quality-Utility Paradox: Why High-Reward Data Harms Small Models' Mathematical Reasoning Ability

A paper accepted at ICML 2026 reports a counterintuitive finding: data refined by strong models (Oracles), despite earning higher reward scores, trains small models less effectively than data the small models generate and filter themselves. The study proposes a style-aligned refinement method that preserves the small model's native reasoning distribution while retaining the logical fixes.

Tags: knowledge distillation, mathematical reasoning, small language models, reward models, distribution drift, style alignment

Published 2026-06-15 11:13 | Recent activity 2026-06-16 12:22 | Estimated read 6 min

Section 01

Introduction: The Quality-Utility Paradox Challenges Traditional Understanding of Knowledge Distillation

The paper, accepted at ICML 2026, finds that high-reward data refined by strong models (Oracles) yields weaker mathematical reasoning in small models than data the small models generate and filter themselves. The authors call this the 'Quality-Utility Paradox'. Its core cause is that Oracle refinement shifts the training data away from the small model's native reasoning distribution. The study proposes a style-aligned refinement method to address this issue.

Section 02

Research Background: Common Assumptions of Knowledge Distillation

Knowledge distillation is a common technique for improving small language models (SLMs). In mathematical reasoning tasks, the mainstream approach is to have Oracles generate high-quality reasoning trajectories and train student models on them. The core assumption is that the higher a trajectory's reward model score, the better its quality and the more it helps the student. This study challenges that assumption.
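The conventional assumption can be made concrete with a minimal sketch of reward-based filtering. The `Trajectory` class, the `select_by_reward` helper, and the scores below are hypothetical illustrations, not the paper's pipeline; they only show what "keep whatever the reward model scores highest" looks like in code.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str      # full reasoning trace
    reward: float  # score from a reward model (placeholder values here)


def select_by_reward(candidates: list[Trajectory], top_k: int = 1) -> list[Trajectory]:
    """Keep the top_k highest-reward trajectories, ignoring who wrote them."""
    return sorted(candidates, key=lambda t: t.reward, reverse=True)[:top_k]


# Placeholder candidates for one problem; the scores are made up.
candidates = [
    Trajectory("student-style solution", 0.71),
    Trajectory("oracle-refined solution", 0.93),
    Trajectory("flawed solution", 0.32),
]
best = select_by_reward(candidates)
print(best[0].text)
```

Note that this filter looks only at the reward score and never asks how far a trajectory sits from what the student would naturally write, which is the blind spot the study examines.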

Section 03

Core Finding: The Quality-Utility Paradox

Experiments confirm the 'Quality-Utility Paradox': training on high-reward data refined by Oracles is consistently less effective than training on data the small models generate themselves and filter with rejection sampling. The effect appears across the Qwen2.5, LLaMA-3, and DeepSeek model families, which indicates it is a general phenomenon rather than an isolated case.
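For readers unfamiliar with the baseline, here is a minimal sketch of self-generation plus rejection sampling, assuming a filter that keeps a sampled trace only when its final answer matches a reference answer. The `generate` and `extract_answer` callables and the toy student are hypothetical stand-ins; the source does not specify the exact acceptance rule.

```python
from typing import Callable


def rejection_sample(
    problem: str,
    reference_answer: str,
    generate: Callable[[str], str],
    extract_answer: Callable[[str], str],
    n_samples: int = 8,
) -> list[str]:
    """Sample n_samples traces from the student and keep the correct ones."""
    kept = []
    for _ in range(n_samples):
        trace = generate(problem)
        if extract_answer(trace) == reference_answer:
            kept.append(trace)
    return kept


def make_toy_student(outputs: list[str]) -> Callable[[str], str]:
    """Hypothetical student that replays canned outputs instead of sampling."""
    it = iter(outputs)
    return lambda _problem: next(it)


def last_token(trace: str) -> str:
    """Toy answer extractor: treat the final whitespace token as the answer."""
    return trace.strip().split()[-1]


student = make_toy_student(["2 + 2 = 4", "2 + 2 = 5", "two plus two is 4"])
kept = rejection_sample("What is 2 + 2?", "4", student, last_token, n_samples=3)
print(kept)
```

Every kept trace is written in the student's own words, so the filter removes wrong answers without changing the style of the data.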

Section 04

Mechanism Analysis: Trade-off Between Distribution Drift and Adaptation Cost

Oracle refinement has dual effects: logical repair (correcting errors, positive) and distribution drift (changing reasoning style, deviating from the small model's native distribution, negative). Small models face a trade-off during learning: the benefit of logical repair vs. the cost of distribution adaptation. When the drift is large enough, the adaptation cost exceeds the benefit, leading to performance degradation.
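One way to make distribution drift measurable is sketched below, assuming the student model can score each token of a trajectory with a log-probability. The gap in average negative log-likelihood between a refined trace and the native trace is used as a drift proxy. This proxy and the numbers are assumptions for illustration; the source does not state how the paper quantifies drift.

```python
def mean_nll(token_logprobs: list[float]) -> float:
    """Average negative log-likelihood per token under the student model."""
    if not token_logprobs:
        raise ValueError("trajectory has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


def drift_proxy(native_logprobs: list[float], refined_logprobs: list[float]) -> float:
    """Positive values mean the refined trace is less likely under the student."""
    return mean_nll(refined_logprobs) - mean_nll(native_logprobs)


# Hypothetical per-token log-probabilities assigned by the student model.
native = [-0.2, -0.4, -0.3, -0.3]
refined = [-1.1, -0.9, -1.4, -1.0]

gap = drift_proxy(native, refined)
print(f"drift proxy: {gap:.2f} nats per token")
```

A larger gap means the student has to move further from its own way of writing to imitate the refined trace, which corresponds to the adaptation cost described above.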

Section 05

Solution: Style-Aligned Refinement Method

Core idea: logical correctness and reasoning style can be separated.

Implementation steps:
1. Preserve the small model's native trajectory.
2. Use an Oracle or validator to locate errors.
3. Modify only the erroneous steps while keeping the other steps in their native expression.
4. Run a style consistency check.

Effect: reduces adaptation cost, preserves the logical benefits, and outperforms the baselines.
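The four steps can be sketched as a small loop. Everything here is a hypothetical illustration rather than the paper's implementation: the validator and rewriter are toy arithmetic checkers standing in for an Oracle, and the style check is a crude token-overlap threshold chosen for this example.

```python
import re
from typing import Callable, Optional

ARITH = re.compile(r"(\d+) ([+*]) (\d+) = (\d+)")


def _evaluate(a: int, op: str, b: int) -> int:
    return a + b if op == "+" else a * b


def toy_validator(previous_steps: list[str], step: str) -> bool:
    """Toy stand-in for an Oracle/validator: checks 'a op b = c' arithmetic."""
    m = ARITH.search(step)
    return m is None or _evaluate(int(m[1]), m[2], int(m[3])) == int(m[4])


def toy_rewriter(previous_steps: list[str], step: str) -> str:
    """Toy stand-in for an Oracle fix: corrects the result, keeps the wording."""
    return ARITH.sub(
        lambda m: f"{m[1]} {m[2]} {m[3]} = {_evaluate(int(m[1]), m[2], int(m[3]))}",
        step,
    )


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens, a crude style-consistency proxy."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def style_aligned_refine(
    steps: list[str],
    is_step_valid: Callable[[list[str], str], bool],
    rewrite_step: Callable[[list[str], str], str],
    min_overlap: float = 0.3,
    max_fixes: int = 10,
) -> Optional[list[str]]:
    refined = list(steps)  # step 1: start from the native trajectory
    for _ in range(max_fixes):
        # Step 2: locate the first erroneous step, if any.
        bad = next(
            (i for i, s in enumerate(refined) if not is_step_valid(refined[:i], s)),
            None,
        )
        if bad is None:
            return refined
        # Step 3: rewrite only that step; all other steps stay untouched.
        candidate = rewrite_step(refined[:bad], refined[bad])
        # Step 4: reject fixes that stray too far from the native wording.
        if candidate == refined[bad] or token_overlap(refined[bad], candidate) < min_overlap:
            return None
        refined[bad] = candidate
    return None  # still invalid after max_fixes repairs


native_steps = ["so 3 * 4 = 12", "then 12 + 5 = 18"]
print(style_aligned_refine(native_steps, toy_validator, toy_rewriter))
```

Returning `None` when a fix fails the style check lets the caller drop the sample or fall back to another data source instead of training on a trace that has drifted.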

Section 06

Experimental Results: Validation Across Multiple Model Families

Experimental setup: the model families are Qwen2.5, LLaMA-3, and DeepSeek; the data conditions compared are Oracle refinement, self-generation plus rejection sampling, and style-aligned refinement; the evaluation metric is mathematical reasoning accuracy. Key findings: the paradox holds, the measured distribution drift is significant, and the style-aligned method performs best.
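As a rough illustration of the evaluation metric, the sketch below computes exact-match accuracy on final answers and compares two data recipes. The recipe names, predictions, and references are placeholders invented for this example; they are not results from the paper, and the paper's answer-matching rule may differ.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose predicted final answer equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("no problems to evaluate")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Placeholder final answers on a four-problem toy benchmark.
references = ["4", "17", "9", "30"]
predictions_by_recipe = {
    "recipe_a": ["4", "18", "9", "31"],
    "recipe_b": ["4", "17", "9", "31"],
}

scores = {
    name: exact_match_accuracy(preds, references)
    for name, preds in predictions_by_recipe.items()
}
print(scores)
```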

Section 07

Theoretical Implications and Practical Recommendations

Theoretical implications: data quality should be redefined to combine perceived quality with learner compatibility, and a joint optimization framework should be adopted:

Total utility = benefit of logical correctness - cost of distribution adaptation

Practical recommendations:
1. Use Oracle refinement cautiously.
2. Pay attention to the distribution match between the data and the student model.
3. Try style-aligned refinement.
4. Take final model performance as the gold standard for data quality.
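The utility formula can be turned into a simple ranking rule. In this sketch, each candidate data source is described by an estimated benefit and an estimated adaptation cost on a shared scale. The source names and numbers are placeholders, and how to estimate the two terms in practice is left open by the summary above.

```python
def total_utility(logic_benefit: float, adaptation_cost: float) -> float:
    """Total utility = benefit of logical correctness - cost of distribution adaptation."""
    return logic_benefit - adaptation_cost


def rank_by_utility(candidates: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort data sources from highest to lowest total utility."""
    scored = [(name, total_utility(b, c)) for name, (b, c) in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder (benefit, cost) estimates; not measurements from the paper.
candidates = {
    "source_x": (0.5, 0.7),  # large logical benefit, larger adaptation cost
    "source_y": (0.3, 0.1),  # smaller benefit, small adaptation cost
}

ranked = rank_by_utility(candidates)
for name, utility in ranked:
    print(f"{name}: {utility:+.2f}")
```

Under this rule a source with the larger logical benefit can still rank last once its adaptation cost is subtracted, which is the situation the paradox describes.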

Section 08

Limitations and Future Directions

Limitations: the method is validated only on mathematical reasoning tasks, style quantification is heuristic, and supervision is required. Future directions: validating on other tasks (e.g., code generation), quantifying style more precisely, and developing automated style-alignment methods.
