Zing Forum


Feedback Distillation: Enabling More Efficient Reasoning Training for Large Language Models in Lean Theorem Proving

Researchers propose 'Feedback Distillation', a training method that addresses three weaknesses of the GRPO algorithm: sparse rewards, limited exploration, and mode collapse. The model is trained to match its own output distribution conditioned on privileged feedback. On Lean4 theorem proving tasks, the method shows better trajectory diversity and better pass@k performance.

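Tags: Feedback Distillation, GRPO, Lean4, theorem proving, reinforcement learning, sparse rewards, mode collapse, reasoning training, token-level supervision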
Published 2026-05-29 13:35 | Recent activity 2026-06-01 11:25 | Estimated read 6 min

Section 01

Introduction: Feedback Distillation, a New Breakthrough in Reasoning Training for Lean Theorem Proving

This article is based on the paper 'Distilling LLM Feedback for Lean Theorem Proving', published on arXiv in May 2026 (link: http://arxiv.org/abs/2605.30861v1). The researchers propose 'Feedback Distillation', a training method that addresses the sparse rewards, limited exploration, and mode collapse that affect the GRPO algorithm in Lean4 theorem proving. The method shows better trajectory diversity and pass@k performance, and it complements GRPO rather than replacing it.


Section 02

Research Background: Three Core Dilemmas of the GRPO Algorithm

Post-training of mainstream theorem proving models often combines supervised fine-tuning with GRPO reinforcement learning, but GRPO has three core problems:

1. Sparse rewards: A positive reward is given only when a full proof is completed, so most sampled attempts provide no learning signal.
2. Limited exploration: With so few rewarded attempts, the model struggles to explore the vast space of possible proofs and tends to settle into local optima.
3. Mode collapse: The model keeps repeating a few successful patterns, which reduces output diversity.
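The sparse-reward problem can be made concrete with a minimal sketch of GRPO's group-relative advantage, in which each reward is normalized against the mean and standard deviation of its sampling group. The binary reward (1.0 for a verified proof, 0.0 otherwise), the group size of 8, and the function name are assumptions chosen for illustration; the paper's actual training setup is not described in this article.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical hard theorem: none of the 8 sampled proofs verifies.
# Every advantage is zero, so this group contributes no gradient.
all_failed = group_relative_advantages([0.0] * 8)

# Hypothetical easier theorem: 1 of 8 sampled proofs verifies.
# Only now does the group carry a learning signal.
one_success = group_relative_advantages([1.0] + [0.0] * 7)

print(all_failed)
print(one_success)
```

When every attempt in a group fails, the advantages are all zero, which is one way to see why sparse rewards starve training on hard theorems.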


Section 03

Core Method: Innovative Principles of Feedback Distillation

The core of Feedback Distillation is to train the model, at the token level, to match its own distribution conditioned on privileged feedback. It has three components:

1. Privileged feedback generation: Stronger models or optimized conditions are used to generate high-quality feedback.
2. Conditional distribution learning: The model is trained to match the output distribution it produces when that feedback is available.
3. Token-level supervision: Every token receives a fine-grained learning signal, in contrast to GRPO's sequence-level rewards.
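A minimal sketch of what a token-level matching objective of this kind could look like is given below. The article does not state the exact loss, so several things here are assumptions: the use of a forward KL divergence, the roles labeled "teacher" (the model with feedback in its context) and "student" (the same model without it), and the toy logits over a three-token vocabulary.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_kl(teacher_logits, student_logits):
    """Return KL(teacher || student) separately for every token position."""
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # assumed: model conditioned on prompt + feedback
        q = softmax(s_row)  # assumed: same model conditioned on prompt only
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses

# Toy example: two token positions, vocabulary of three tokens.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student = [[2.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

per_token = token_level_kl(teacher, student)
print(per_token)
```

The first position already agrees with the feedback-conditioned distribution and contributes nothing, while the second is penalized. Unlike a single pass/fail reward for the whole proof, the signal says which tokens to change.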


Section 04

Empirical Evidence: Performance Improvement on Lean4 Tasks

On Lean4 theorem proving tasks, Feedback Distillation shows three advantages:

1. Higher trajectory diversity: The model avoids fixed problem-solving patterns.
2. Higher policy entropy: The model maintains a rich output distribution.
3. Better pass@k scaling: The gains are largest at large k, where the model generates more high-quality candidate solutions.
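To illustrate why diversity matters for pass@k at large k, the sketch below uses the widely used unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that are correct. The article does not say which estimator the paper uses, and the two per-problem success profiles are hypothetical numbers, not results from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 whenever fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 64  # hypothetical number of samples per problem

# Hypothetical collapsed policy: always solves 2 of 4 problems, never the rest.
collapsed = [64, 64, 0, 0]
# Hypothetical diverse policy: solves each of the 4 problems a quarter of the time.
diverse = [16, 16, 16, 16]

for k in (1, 16):
    print(k, mean_pass_at_k(collapsed, N, k), mean_pass_at_k(diverse, N, k))
```

In this toy setting the collapsed policy wins at k = 1 but gains nothing from more samples, while the diverse policy keeps improving as k grows. That is the kind of scaling behavior the pass@k comparison is meant to capture.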


Section 05

Method Synergy: Complementary Effect Between Feedback Distillation and GRPO

Feedback Distillation and GRPO are complementary: initializing GRPO training from a Feedback Distillation checkpoint achieves better performance than using either method alone. Feedback Distillation excels at broad exploration, which builds a diverse policy foundation, while GRPO excels at deep optimization, which converges on high-quality solutions. Together they form a 'breadth exploration + deep optimization' paradigm.


Section 06

Technical Details: Privileged Feedback and Token-level Supervision

  • Privileged feedback design: Three methods are used to improve feedback quality: generating reference solutions with strong models, multi-sample aggregation, and validator assistance.
  • Advantages of token-level supervision: More precise credit assignment (identifying key steps), more stable learning (avoiding high variance), and faster convergence (fine-grained signals accelerate learning).


Section 07

Broad Impact and Future Directions

  • Significance for automated theorem proving: Reduces reliance on manual strategies and improves the ability to handle complex multi-step proofs.
  • Implications for general reasoning tasks: Applicable to sparse reward tasks such as code generation, mathematical problem solving, and scientific verification.
  • Open issues: Trade-off between feedback quality and cost, cross-domain generalization ability, and integration with techniques like chain-of-thought.


Section 08

Conclusion: An Important Advance in Reasoning Training

Feedback Distillation addresses the limitations of traditional reinforcement learning through external knowledge injection and fine-grained supervision, and it shows that different training paradigms can work together. Beyond improving the performance of current models, it suggests new directions for developing AI reasoning capabilities.