
Online Feedback Distillation: Enabling Small Models to Provide Reasoning Feedback Like Large Models

An innovative knowledge distillation framework that allows lightweight models to mimic the expert feedback capabilities of large models through online training, achieving a self-improvement loop in reasoning tasks.

Tags: Knowledge Distillation, Feedback Loop, Reasoning Models, Large Language Models, Self-Improvement, Model Training, GSM8K, Chain-of-Thought

Published 2026-06-10 01:33 | Recent activity 2026-06-10 01:48 | Estimated read 8 min

Section 01

Introduction

This article introduces Online Feedback Distillation, a knowledge distillation framework that addresses the feedback dilemma in reasoning models. Through online training, it lets a lightweight model learn to give feedback comparable to a large model's expert feedback, closing a self-improvement loop on reasoning tasks. Its core idea is to replace the fixed amateur model with a student model that keeps learning, supported by three designs: a unified model that plays two roles, adaptive gating of knowledge distillation (KD), and multi-objective Pareto analysis. This lowers inference cost while improving the quality of the small model's feedback. The project is open source on GitHub, supports multiple model configurations, and runs on Apple Silicon.

Section 02

Feedback Dilemma of Reasoning Models

In research on improving LLM reasoning, traditional Chain-of-Thought (CoT) prompting gives a model little ability to discover and correct its own reasoning errors. Expert-amateur feedback loops such as the CLEAR method have made progress, but their amateur model is fixed and cannot improve, which caps feedback quality. This limitation motivates the Online Feedback Distillation framework.

Section 03

Core Innovations of the Online Feedback Distillation Framework

The core innovation of this project is replacing the fixed amateur model with a student model that learns adaptively. Key design highlights:

1. Unified model with dual roles: the large model acts as both the base model (generating initial answers) and the expert feedback provider (suggesting improvements), so a single model serves both stages.
2. Adaptive knowledge distillation (KD) gating: an EMA-based (exponential moving average) weighting strategy triggers KD training only when the student model lags behind, avoiding unnecessary computation.
3. Multi-objective Pareto frontier analysis: the KD stopping threshold is chosen from several metrics at once (language model loss, hidden layer alignment, and others).
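To make the Pareto frontier idea concrete, here is a minimal sketch that filters candidate KD stopping thresholds down to the non-dominated ones. It assumes two metrics where lower is better; the `Candidate` type, the field names, and all numbers are hypothetical and not taken from the project.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    threshold: float   # hypothetical KD stopping threshold
    lm_loss: float     # language model loss (lower is better)
    align_loss: float  # hidden layer alignment loss (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = a.lm_loss <= b.lm_loss and a.align_loss <= b.align_loss
    strictly_better = a.lm_loss < b.lm_loss or a.align_loss < b.align_loss
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up measurements, for illustration only.
candidates = [
    Candidate(threshold=0.5, lm_loss=2.10, align_loss=0.40),
    Candidate(threshold=0.6, lm_loss=1.95, align_loss=0.45),
    Candidate(threshold=0.7, lm_loss=2.00, align_loss=0.50),  # dominated by 0.6
    Candidate(threshold=0.8, lm_loss=1.90, align_loss=0.60),
]
front = pareto_front(candidates)
```

Any threshold left in `front` represents a trade-off that cannot be improved on one metric without giving up the other. How the project picks a single threshold from that set is not described here.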

Section 04

Detailed Technical Architecture

Online Feedback Distillation proceeds in six steps:

1. Initial answer generation: the expert model generates an initial answer.
2. Bidirectional feedback generation: the expert and student models each produce feedback and a score.
3. KD trigger: if the student's score falls below the threshold, KD training starts.
4. Adaptive training: the student model is trained with an EMA-weighted KD strategy that combines four loss functions.
5. Feedback merging: feedback from both models is merged, with expert feedback taking priority.
6. Answer revision and self-criticism: the merged feedback is applied to revise the answer and produce the final result.
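The KD trigger and feedback merging steps can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes scores lie in [0, 1], that the gate compares an exponential moving average of student scores against a fixed threshold, and that merging simply lists expert feedback first. `Feedback`, `EmaGate`, and `merge_feedback` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    text: str
    score: float  # assumed to lie in [0, 1]; higher is better


class EmaGate:
    """Tracks an exponential moving average of student scores (hypothetical helper)."""

    def __init__(self, threshold: float = 0.7, decay: float = 0.9):
        self.threshold = threshold
        self.decay = decay
        self.ema = None  # no scores observed yet

    def update(self, score: float) -> bool:
        """Fold in a new score and return True if KD training should run."""
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.decay * self.ema + (1 - self.decay) * score
        return self.ema < self.threshold


def merge_feedback(expert: Feedback, student: Feedback) -> str:
    """Put expert feedback first; append student feedback only if it differs."""
    parts = [expert.text]
    if student.text and student.text != expert.text:
        parts.append(student.text)
    return "\n".join(parts)


gate = EmaGate(threshold=0.7, decay=0.9)
expert_fb = Feedback("Step 2 drops a carry digit.", score=0.9)
student_fb = Feedback("Check the arithmetic.", score=0.4)

run_kd = gate.update(student_fb.score)  # first score 0.4 is below 0.7
merged = merge_feedback(expert_fb, student_fb)
```

Smoothing the score with a moving average means one unusually good or bad student response does not switch training on or off by itself. The four KD loss functions are not shown because their definitions are not given here.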

Section 05

Model Configuration and Hardware Requirements

The project supports flexible model selection:

Role	Default Model	Alternative Model
Expert/Base Model	Qwen2.5-7B-Instruct	Llama-3.1-8B-Instruct
Student/Amateur Model	Qwen2.5-1.5B-Instruct	Llama-3.2-1B-Instruct

The defaults use the Qwen2.5 series, which requires no HuggingFace login, and Apple Silicon MPS acceleration is supported. Hardware requirements: Apple Silicon with 16GB+ memory, or a CUDA GPU with 16GB+ VRAM (e.g., A100, 3090). CPU-only runs are feasible but slower.
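As a rough illustration of how such a configuration might be represented, the sketch below pairs the model names from the table and picks a device. The `ModelPair` type, the `PRESETS` keys, `pick_device`, and the CUDA-before-MPS preference order are all assumptions for illustration, not the project's actual configuration interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPair:
    expert: str   # also serves as the base model
    student: str  # the amateur model that receives distillation


# Model names copied from the table; the preset keys are hypothetical.
PRESETS = {
    "qwen": ModelPair("Qwen2.5-7B-Instruct", "Qwen2.5-1.5B-Instruct"),
    "llama": ModelPair("Llama-3.1-8B-Instruct", "Llama-3.2-1B-Instruct"),
}


def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Prefer CUDA, then Apple Silicon MPS, then fall back to CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


pair = PRESETS["qwen"]
device = pick_device(has_cuda=False, has_mps=True)
```

With PyTorch installed, the two flags could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.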

Section 06

Experiments and Evaluation

The project supports experiments on datasets such as the GSM8K mathematical reasoning benchmark. Baselines include CLEAR, CoT, and CoD. Evaluation metrics include BERTScore, ROUGE, BLEU, toxicity detection, and cosine similarity. The project provides a script for a quick single-benchmark run and one for the complete test suite.
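Of the metrics listed, cosine similarity is the simplest to show in a few lines. The sketch below computes it over bag-of-words count vectors using only the standard library. This is an assumption made to keep the example self-contained; the project may well compute it over embedding vectors instead.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


reference = "the answer is 42"
identical = cosine_similarity(reference, "the answer is 42")
disjoint = cosine_similarity(reference, "no idea")
```

Identical texts score 1 and texts with no shared tokens score 0. A count-based version like this ignores word order and synonyms, which is one reason it is usually reported alongside metrics such as BERTScore.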

Section 07

Practical Significance and Application Prospects

This research suggests a path toward more efficient training of reasoning models:

1. Lower inference cost: small models can self-improve without frequent calls to large models.
2. Capability transfer: the reasoning and feedback capabilities of large models can be transferred to small models, which supports deployment on edge devices.
3. Continuous learning: online learning lets models keep improving during use.

Section 08

Summary and Reflections

The Online Feedback Distillation framework combines the efficiency of knowledge distillation with the quality of feedback loops, and its adaptive mechanisms avoid over-training. It points toward an important direction for reasoning model training: building self-reflective, self-improving systems. For developers, it is a noteworthy open-source project with useful ideas for building cost-effective AI reasoning systems.
