# Feedback Distillation: Enabling More Efficient Reasoning Training for Large Language Models in Lean Theorem Proving

> Researchers propose 'Feedback Distillation', a training method that targets the sparse reward, limited exploration, and mode collapse problems of the GRPO algorithm by training a model to match its own distribution conditioned on privileged feedback. On Lean4 theorem proving tasks, it yields greater trajectory diversity and better pass@k performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-29T05:35:00.000Z
- Last activity: 2026-06-01T03:25:22.445Z
- Popularity: 92.2
- Keywords: feedback distillation, GRPO, Lean4, theorem proving, reinforcement learning, sparse rewards, mode collapse, reasoning training, token-level supervision
- Page link: https://www.zingnex.cn/en/forum/thread/lean
- Canonical: https://www.zingnex.cn/forum/thread/lean
- Markdown source: floors_fallback

---

## Introduction: Feedback Distillation, a New Approach to Reasoning Training for Lean Theorem Proving

This article summarizes the paper 'Distilling LLM Feedback for Lean Theorem Proving', posted to arXiv in May 2026 (link: http://arxiv.org/abs/2605.30861v1). The authors propose Feedback Distillation, a training method aimed at three weaknesses of the GRPO algorithm in Lean4 theorem proving: sparse rewards, limited exploration, and mode collapse. They report greater trajectory diversity and better pass@k performance, and find that the method complements GRPO rather than replacing it.

## Research Background: Three Core Dilemmas of the GRPO Algorithm

Post-training of mainstream theorem proving models often combines supervised fine-tuning with GRPO (Group Relative Policy Optimization) reinforcement learning. In this setting GRPO has three core problems:

1. Sparse rewards: a positive reward is given only when a full proof is completed, so most samples carry little or no learning signal.
2. Limited exploration: with so few rewarded samples, the policy struggles to explore the vast space of possible proofs and easily settles into local optima.
3. Mode collapse: the policy keeps repeating the few patterns that have succeeded, which reduces output diversity.
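The sparse-reward problem can be made concrete with a minimal sketch of a GRPO-style group-relative advantage. It assumes a binary reward (1 if Lean accepts the proof, 0 otherwise) and normalization by the group's population standard deviation; the function name, the group size of 8, and the `eps` value are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard theorem: none of the 8 sampled proofs is accepted, so every reward is 0.
all_failed = group_relative_advantages([0.0] * 8)

# Easier theorem: exactly one of the 8 sampled proofs is accepted.
one_success = group_relative_advantages([1.0] + [0.0] * 7)

print(all_failed)   # every advantage is zero: this group gives no gradient signal
print(one_success)  # one positive advantage, seven negative ones
```

When every sample in a group fails, all advantages are zero and the group contributes nothing to the update, which is the situation the first list item describes.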

## Core Method: Innovative Principles of Feedback Distillation

The core idea of Feedback Distillation is to train a model, at the token level, to match its own output distribution conditioned on privileged feedback. It has three components:

1. Privileged feedback generation: high-quality feedback is produced using stronger models or more favorable conditions.
2. Conditional distribution learning: the model is trained to match the output distribution it produces when that feedback is available.
3. Token-level supervision: the learning signal is provided per token, in contrast to GRPO's sequence-level rewards.
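As a rough illustration of what token-level matching could look like, the sketch below computes a per-token KL divergence between a feedback-conditioned distribution (the target) and the model's distribution without feedback. The direction of the KL term, the averaging over positions, and the toy three-token vocabulary are all assumptions made for illustration; the article does not specify the exact objective.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_level_distillation_loss(target_dists, student_dists):
    """Return (mean per-token KL, list of per-token KL values).

    target_dists:  next-token distributions with privileged feedback in context
    student_dists: next-token distributions for the same positions without it
    """
    per_token = [kl_divergence(t, s) for t, s in zip(target_dists, student_dists)]
    return sum(per_token) / len(per_token), per_token

# Toy example: two token positions over a hypothetical 3-token vocabulary.
with_feedback = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
without_feedback = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

loss, per_token = token_level_distillation_loss(with_feedback, without_feedback)
print(per_token)  # first position already matches; second position carries the signal
print(loss)
```

Unlike a single sequence-level reward, the per-token values show where the two distributions disagree, so the update can concentrate on those positions.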

## Empirical Evidence: Performance Improvement on Lean4 Tasks

On Lean4 theorem proving tasks, Feedback Distillation shows three advantages:

1. Higher trajectory diversity: the model does not lock into fixed proof patterns.
2. Higher policy entropy: the output distribution stays rich during training.
3. Better pass@k scaling: the gain is most visible at large k, where sampling more candidates yields more correct proofs.
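To see why diversity matters for pass@k, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that are correct. Whether the paper uses exactly this estimator is an assumption, and the per-problem success counts are made-up numbers chosen so that both policies have the same pass@1.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws from n samples (c correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 16  # samples per problem (hypothetical)

# Hypothetical collapsed policy: always solves two problems, never the other two.
collapsed = [16, 16, 0, 0]
# Hypothetical diverse policy: solves every problem in half of its samples.
diverse = [8, 8, 8, 8]

for k in (1, 8):
    print(k, mean_pass_at_k(collapsed, N, k), mean_pass_at_k(diverse, N, k))
```

Both toy policies score 0.5 at k = 1, but only the diverse one improves as k grows, which is the pattern the third list item refers to.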

## Method Synergy: Complementary Effect Between Feedback Distillation and GRPO

Feedback Distillation and GRPO complement each other: initializing GRPO training from a Feedback Distillation checkpoint achieves better performance than using either method alone. Feedback Distillation is good at broad exploration and builds a diverse policy as a starting point, while GRPO is good at deep optimization and converges toward high-quality solutions. The article describes this combination as a 'breadth exploration + deep optimization' paradigm.

## Technical Details: Privileged Feedback and Token-level Supervision

- Privileged feedback design: three methods are used to improve feedback quality: generating reference solutions with strong models, multi-sample aggregation, and validator assistance.
- Advantages of token-level supervision: more precise credit assignment (identifying key steps), more stable learning (avoiding high variance), and faster convergence (fine-grained signals accelerate learning).

## Broad Impact and Future Directions

- Significance for automated theorem proving: reduces reliance on manual strategies and improves the ability to handle complex multi-step proofs.
- Implications for general reasoning tasks: applicable to sparse-reward tasks such as code generation, mathematical problem solving, and scientific verification.
- Open issues: the trade-off between feedback quality and cost, cross-domain generalization, and integration with techniques such as chain-of-thought.

## Conclusion: An Important Advance in Reasoning Training

Feedback Distillation addresses key limitations of traditional reinforcement learning through external knowledge injection and fine-grained, token-level supervision, and it shows that different training paradigms can work together. Beyond improving current theorem proving models, it suggests new directions for developing reasoning capabilities in AI systems.
