Zing Forum

Online Feedback Distillation: Enabling Small Models to Provide Reasoning Feedback Like Large Models

An innovative knowledge distillation framework that allows lightweight models to mimic the expert feedback capabilities of large models through online training, achieving a self-improvement loop in reasoning tasks.

Tags: Knowledge Distillation, Feedback Loop, Reasoning Models, Large Language Models, Self-Improvement, Model Training, GSM8K, Chain-of-Thought
Published 2026-06-10 01:33 | Recent activity 2026-06-10 01:48 | Estimated read 8 min

Section 01

Online Feedback Distillation: Enabling Small Models to Provide Reasoning Feedback Like Large Models (Introduction)

This article introduces Online Feedback Distillation, a knowledge distillation framework aimed at the feedback dilemma in reasoning models. Through online training, a lightweight model learns to mimic the expert feedback of a large model, closing a self-improvement loop on reasoning tasks. The core change is to replace the fixed amateur model with an adaptively learnable student model, supported by three designs: a unified model with dual roles, adaptive knowledge distillation gating, and multi-objective Pareto analysis. The aim is to reduce inference costs while improving the feedback quality of small models. The project is open-sourced on GitHub, supports multiple model configurations, and runs on Apple Silicon.

Section 02

Feedback Dilemma of Reasoning Models

In research on improving LLM reasoning capabilities, traditional Chain-of-Thought (CoT) methods struggle to let models discover and correct their own reasoning errors. Expert-amateur feedback loops (e.g., the CLEAR method) have made progress, but the amateur model in such loops is fixed and cannot improve, which limits feedback quality. This limitation motivates the Online Feedback Distillation framework.

Section 03

Core Innovations of the Online Feedback Distillation Framework

The core innovation of this project is the Online Feedback Distillation framework, which replaces fixed amateur models with adaptively learnable student models. Key design highlights include:

1. Unified model with dual roles: The large model acts as both the base model (generating initial answers) and the expert feedback provider (offering improvement suggestions), enhancing process efficiency.
2. Adaptive knowledge distillation gating: An EMA-based weighting strategy that triggers KD training only when the student model lags behind, avoiding unnecessary computations.
3. Multi-objective Pareto frontier analysis: Determines the KD stopping threshold through multi-dimensional metrics (language model loss, hidden layer alignment, etc.).
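
The Pareto frontier idea in the third highlight can be illustrated with a minimal sketch. It assumes two objectives that are both minimized (language model loss and a hidden-layer alignment loss) and a set of candidate KD stopping thresholds. The `Candidate` type, its field names, and all numbers are made up for illustration and are not taken from the project.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    threshold: float   # hypothetical candidate KD stopping threshold
    lm_loss: float     # language model loss, lower is better
    align_loss: float  # hidden-layer alignment loss, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.lm_loss <= b.lm_loss and a.align_loss <= b.align_loss
    strictly_better = a.lm_loss < b.lm_loss or a.align_loss < b.align_loss
    return no_worse and strictly_better


def pareto_front(points: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative values only, not measured results.
candidates = [
    Candidate(0.5, 2.10, 0.40),
    Candidate(0.6, 1.95, 0.45),
    Candidate(0.7, 1.90, 0.60),
    Candidate(0.8, 2.00, 0.65),  # dominated by threshold 0.6
    Candidate(0.9, 2.20, 0.42),  # dominated by threshold 0.5
]

front = pareto_front(candidates)
print([c.threshold for c in front])
```

Any threshold on the resulting front is a defensible trade-off; choosing one among them still requires a preference between the objectives, which the source does not specify.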

Section 04

Detailed Technical Architecture

The Online Feedback Distillation process steps are as follows:

1. Initial Answer Generation: The expert model generates initial answers.
2. Bidirectional Feedback Generation: The expert and student models generate feedback and scores respectively.
3. KD Trigger: If the student's score does not reach the threshold, start the KD network.
4. Adaptive Training: Train the student model using an EMA-weighted KD strategy with four loss functions.
5. Feedback Merging: Merge feedback from both models with priority given to expert feedback.
6. Answer Revision and Self-Criticism: Apply merged feedback to revise answers and generate final results.
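
The control flow of these six steps can be sketched roughly as follows. This is a hedged illustration, not the project's code: `StubModel`, `EmaGate`, `merge_feedback`, `run_step`, the threshold of 0.7, and the decay of 0.9 are all hypothetical. The four loss functions are not described in the source, so training is represented by a single callback that receives the EMA-based weight.

```python
from typing import Callable, List, Tuple


class EmaGate:
    """Triggers KD when the student score is below a threshold and
    tracks an exponential moving average (EMA) of the shortfall."""

    def __init__(self, threshold: float = 0.7, decay: float = 0.9) -> None:
        self.threshold = threshold  # hypothetical target score
        self.decay = decay          # hypothetical EMA smoothing factor
        self.ema_gap = 0.0

    def step(self, student_score: float) -> Tuple[bool, float]:
        gap = max(0.0, self.threshold - student_score)
        self.ema_gap = self.decay * self.ema_gap + (1 - self.decay) * gap
        return student_score < self.threshold, self.ema_gap


class StubModel:
    """Stand-in for a real LLM wrapper; returns canned text and a fixed score."""

    def __init__(self, name: str, score: float) -> None:
        self.name, self.score = name, score

    def answer(self, question: str) -> str:
        return f"draft answer to: {question}"

    def feedback(self, question: str, answer: str) -> Tuple[List[str], float]:
        return [f"{self.name}: check the arithmetic"], self.score

    def revise(self, question: str, answer: str, feedback: List[str]) -> str:
        return f"{answer} (revised using {len(feedback)} feedback items)"


def merge_feedback(expert_fb: List[str], student_fb: List[str]) -> List[str]:
    """Expert items come first; student items are added only if new."""
    merged = list(expert_fb)
    merged += [item for item in student_fb if item not in merged]
    return merged


def run_step(question: str, expert: StubModel, student: StubModel,
             gate: EmaGate, train_student: Callable[[float], None]) -> str:
    answer = expert.answer(question)                      # 1. initial answer
    expert_fb, _ = expert.feedback(question, answer)      # 2. feedback from both models
    student_fb, student_score = student.feedback(question, answer)
    triggered, weight = gate.step(student_score)          # 3. KD trigger
    if triggered:
        train_student(weight)                             # 4. EMA-weighted training
    merged = merge_feedback(expert_fb, student_fb)        # 5. expert-first merge
    return expert.revise(question, answer, merged)        # 6. revision


kd_weights: List[float] = []
final = run_step("What is 12 * 7?", StubModel("expert", 0.95),
                 StubModel("student", 0.40), EmaGate(), kd_weights.append)
print(final)
print(kd_weights)
```

In this sketch the student score of 0.40 is below the threshold, so the training callback fires once; a student scoring at or above the threshold would skip step 4 entirely, which is the computation the gating is meant to save.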

Section 05

Model Configuration and Hardware Requirements

The project supports flexible model selection:

Role                  | Default Model         | Alternative Model
Expert/Base Model     | Qwen2.5-7B-Instruct   | Llama-3.1-8B-Instruct
Student/Amateur Model | Qwen2.5-1.5B-Instruct | Llama-3.2-1B-Instruct

The default configuration uses the Qwen2.5 series, which requires no HuggingFace login, and supports Apple Silicon MPS acceleration. Hardware requirements: Apple Silicon needs 16GB+ memory; a CUDA GPU needs 16GB+ VRAM (e.g., A100, 3090); CPU-only execution is feasible but slower.
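
One way to picture this configuration is the sketch below. It is not the project's actual config file or command-line interface: `ModelPair`, `MODEL_PAIRS`, and `pick_device` are hypothetical names, and the preference order (CUDA, then MPS, then CPU) is an assumption. With PyTorch, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; they are passed in as plain booleans here so the example has no dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPair:
    expert: str   # expert/base model name as listed in the article
    student: str  # student/amateur model name as listed in the article


# Hypothetical registry mirroring the model options described above.
MODEL_PAIRS = {
    "qwen": ModelPair("Qwen2.5-7B-Instruct", "Qwen2.5-1.5B-Instruct"),
    "llama": ModelPair("Llama-3.1-8B-Instruct", "Llama-3.2-1B-Instruct"),
}


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon MPS, then fall back to CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


pair = MODEL_PAIRS["qwen"]  # the default series named in the article
device = pick_device(cuda_available=False, mps_available=True)
print(pair.expert, pair.student, device)
```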

Section 06

Experiments and Evaluation

The project supports experiments on datasets such as the GSM8K mathematical reasoning benchmark. Baseline comparisons include CLEAR, CoT, and CoD. Evaluation metrics include BERTScore, ROUGE, BLEU, toxicity detection, and cosine similarity, so that outputs are assessed along several dimensions. A quick single-benchmark test script and a full test suite are provided.
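
Of the metrics listed, cosine similarity is the simplest to show in a few lines. The sketch below compares two texts using token-count vectors, which is a dependency-free stand-in chosen for this example; the source does not say which text representation the project uses, and an embedding-based variant is equally plausible. The function names and sample sentences are illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the token-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


reference = "The total cost is 84 dollars"
candidate = "So the total cost is 84 dollars"
print(round(cosine_similarity(reference, candidate), 3))
```

A score near 1 means the two texts share most of their tokens; it says nothing about whether the final numeric answer is correct, which is why such a metric is normally used alongside others.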

Section 07

Practical Significance and Application Prospects

This research suggests a path toward more efficient training of reasoning models:

1. Reduced inference costs: Small models can self-improve without frequent calls to large models.
2. Model capability transfer: The reasoning and feedback capabilities of large models can be transferred to small models, supporting edge device deployment.
3. Continuous learning: The online learning feature allows models to keep improving their capabilities during use.

Section 08

Summary and Reflections

The Online Feedback Distillation framework combines the efficiency of knowledge distillation with the quality of feedback loops, and its adaptive mechanisms avoid over-training. It points toward reasoning models that can reflect on and improve their own outputs. For developers, the open-source project offers a practical reference for building cost-effective AI reasoning systems.