Zing Forum

Beyond Distribution Sharpening: The Critical Role of Task Rewards in Reinforcement Learning

This article, through theoretical analysis and experimental validation, reveals the inherent limitations of distribution sharpening methods and demonstrates that task reward-based reinforcement learning can achieve more robust performance improvements and a stable learning process.

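Tags: Reinforcement Learning, Distribution Sharpening, Task Rewards, Large Language Models, Reasoning Ability, GRPO, PPO, Mathematical Reasoning, Machine Learning Theory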
Published 2026-04-18 01:17 | Recent activity 2026-04-20 11:21 | Estimated read 8 min

Section 01

[Introduction] Task Reward-Driven RL: Key Findings Beyond Distribution Sharpening

This article combines theoretical analysis with experiments to expose the inherent limitations of distribution sharpening. It argues that task reward-based reinforcement learning (RL) is not merely distribution sharpening that "activates" capabilities the model already has, but a genuine learning process: it can instill new reasoning patterns and problem-solving strategies, yielding more robust performance gains and a more stable training trajectory.

Section 02

Background: Core Differences Between Two RL Paradigms

Distribution Sharpening

Core Idea: The pre-trained model already holds rich knowledge; RL merely uses preference optimization to favor its high-quality outputs and introduces no new capabilities (analogy: helping a student perform pieces they already know more reliably).

Task Reward Learning

Core Idea: The model is optimized against the actual outcome of the task (e.g., mathematical correctness); by exploring new strategies through interaction, it can acquire genuinely new capabilities.

Section 03

Theoretical Analysis: Three Inherent Limitations of Distribution Sharpening

  1. Suboptimal Equilibrium Point: The solution that sharpening converges to may itself be a suboptimal strategy, because sharpening only selects among outputs in the existing distribution and cannot reach better solutions outside it.
  2. Instability: Minor parameter changes during training can cause drastic oscillations in the output distribution.
  3. Local Optimum Trap: Because exploration is confined to the pre-trained distribution, training easily gets stuck in local optima.

Mathematical Intuition: Distribution sharpening optimizes within the support of the pre-trained distribution. If the optimal strategy lies outside that support, the global optimum cannot be reached (analogy: searching one valley for its highest point when the true peak rises somewhere else).
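
The support argument can be written out under one common formalization. The article does not state which form of sharpening it analyzes, so the power (low-temperature) form below, the exponent \alpha, and the symbols p, q_\alpha, and y^* are assumptions made for illustration only.

```latex
% Assumption: sharpening reweights the pre-trained policy p with an exponent \alpha > 1.
q_\alpha(y \mid x) = \frac{p(y \mid x)^{\alpha}}{\sum_{y'} p(y' \mid x)^{\alpha}},
\qquad \alpha > 1

% Outputs with zero probability under p keep zero probability, so the support cannot grow:
p(y \mid x) = 0 \;\Longrightarrow\; q_\alpha(y \mid x) = 0,
\qquad \mathrm{supp}(q_\alpha) \subseteq \mathrm{supp}(p)

% If the best output y^* lies outside the support, no amount of sharpening recovers it:
y^* \notin \mathrm{supp}(p) \;\Longrightarrow\; q_\alpha(y^* \mid x) = 0 \quad \text{for every } \alpha > 1
```

Under the same assumption, letting \alpha grow concentrates q_\alpha on the most likely output under p, which need not be the correct one. This is one way to read the "suboptimal equilibrium point" above.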

Section 04

Experimental Design: A Framework for Fair Comparison of the Two Paradigms

Model Selection

  • Llama-3.2-3B-Instruct
  • Qwen2.5-3B-Instruct
  • Qwen3-4B-Instruct-2507

Task Domains

  • GSM8K (elementary school math word problems)
  • MATH dataset (high school and competition math problems)

Paradigm Implementation

  • Distribution Sharpening: Rewards are based on how similar an output is to a high-quality reference distribution, without regard to answer correctness.
  • Task Reward Learning: Correct answers receive a positive reward and incorrect ones a negative or zero reward; the policy is optimized with PPO or GRPO.
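
The task-reward setup can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the exact-match check, the function names `task_reward` and `group_relative_advantages`, and the use of the population standard deviation are all assumptions. The group normalization follows the general idea of GRPO, which scores each sampled answer relative to the other answers drawn for the same prompt.

```python
from statistics import mean, pstdev


def task_reward(predicted: str, reference: str) -> float:
    """Binary outcome reward (hypothetical): 1.0 on exact match after trimming, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: centre and scale rewards within one group of samples.

    Assumption: population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to one prompt whose reference answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [task_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Here the two correct samples receive advantages close to +1 and the two incorrect ones close to -1. When all samples in a group get the same reward, every advantage is zero and the group contributes no learning signal.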

Section 05

Experimental Results: Significant Advantages of Task Reward RL

  1. Performance Improvement: Distribution sharpening improves performance by only a few percentage points, while task reward learning improves it by over 20%.
  2. Learning Stability: Training with distribution sharpening oscillates, while the task reward learning curve rises steadily.
  3. Cross-Model Consistency: Across all tested models (the Llama and Qwen series, at 3B and 4B parameters), task reward learning comes out ahead.

Section 06

In-Depth Analysis: Three Reasons for Task Reward's Higher Effectiveness

  1. Exploration vs. Exploitation: Distribution sharpening purely exploits the existing distribution, while task reward learning allows exploration of strategies outside the distribution.
  2. Feedback Granularity: Distribution sharpening provides only coarse feedback (an output is judged good or bad), while task reward learning provides unambiguous feedback (the answer is correct or incorrect).
  3. Generalization Ability: Task reward learning forces the model to understand the problem structure, leading to more generalizable and transferable strategies.

Section 07

Practical Insights: Key Directions for Optimizing RL Training

  1. Reward Design: Prefer verifiable outcomes (e.g., code execution, mathematical correctness) as rewards; if a learned reward model is used instead, it must capture the real task objective.
  2. Exploration Mechanism: Introduce explicit exploration (e.g., GRPO comparing several candidate answers) so that optimization is not confined to the pre-trained distribution.
  3. Training Stability: Use small learning rates, KL divergence constraints, and stable algorithms (PPO/GRPO).
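
The stability advice in item 3 can be illustrated with a per-sample objective that combines PPO's clipped probability ratio with a KL penalty toward a frozen reference model. This is a sketch under stated assumptions: the function names, the single-sample KL estimate, and the values `clip_eps=0.2` and `beta=0.04` are hypothetical and not taken from the article.

```python
import math


def clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio far past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalised_objective(logp_new, logp_old, logp_ref, advantage,
                           clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty toward a frozen reference policy.

    Assumption: log pi_new - log pi_ref on the sampled output serves as a
    crude single-sample estimate of the KL divergence.
    """
    kl_estimate = logp_new - logp_ref
    return clipped_objective(logp_new, logp_old, advantage, clip_eps) - beta * kl_estimate


# Hypothetical log-probabilities of one sampled answer under three policies.
logp_old = math.log(0.10)   # policy that generated the sample
logp_new = math.log(0.20)   # policy after a gradient step (ratio = 2.0)
logp_ref = math.log(0.10)   # frozen pre-trained reference
objective = kl_penalised_objective(logp_new, logp_old, logp_ref, advantage=1.0)
```

With a positive advantage and a ratio of 2.0, the surrogate is capped at 1 + clip_eps, and the KL term then subtracts a small amount for drifting away from the reference. Both mechanisms limit how far a single update can move the policy.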

Section 08

Limitations and Future Research Directions

Current Limitations

  • Task Scope: The experiments cover only mathematical reasoning; other domains still need verification.
  • Model Scale: The largest model used has 4B parameters; the behavior of much larger models (70B+) remains to be studied.
  • Reward Sparsity: The mathematical tasks use binary rewards; tasks with sparse rewards may need adjustments.

Future Directions

  • Hybrid Methods: Use distribution sharpening for initialization, followed by task-reward fine-tuning.
  • Curriculum Learning: Design task difficulty curricula to guide exploration.
  • Theoretical Deepening: Quantify the distance between the pre-trained distribution and the optimal strategy.
  • Cross-Domain Verification: Extend to code generation, scientific reasoning, and other domains.