Counterintuitive Finding in Chain-of-Thought Training: Why Do Models with Lower Training Loss Have Worse Generalization?

Recent research reveals a paradox in chain-of-thought supervised fine-tuning of large models: models with lower training loss perform worse on reasoning benchmarks. The root cause lies in a difference in reasoning modes: branching exploration versus convergent deduction.

Chain-of-Thought, Supervised Fine-Tuning, DeepSeek-R1, gpt-oss, Reasoning Modes, Generalization Performance, Training Loss, Data Filtering

Published 2026-04-02 15:00 | Recent activity 2026-04-03 12:48 | Estimated read 5 min

Section 01

Introduction: Counterintuitive Paradox in Chain-of-Thought Training

Recent research on chain-of-thought supervised fine-tuning of large models reports a counterintuitive result: models that reach lower training loss generalize worse. The paradox traces back to a difference in reasoning modes, branching exploration versus convergent deduction. The sections below cover the research background, experimental design, core findings, and proposed solution.

Section 02

Research Background: Current State of Chain-of-Thought Supervised Fine-Tuning

Chain-of-thought (CoT) reasoning has a model generate intermediate reasoning steps, which improves its reasoning ability. In the SFT phase, CoT trajectories from stronger models are commonly used as supervision signals, and the prevailing view in industry is that longer, more detailed trajectories improve performance. But do CoT data from different sources differ in any essential way? That question has not been studied systematically, and this study sets out to answer it: how does the source of CoT data affect a model's generalization performance?

Section 03

Experimental Design: Controlled Comparative Study

The research team selected two models with comparable performance, DeepSeek-R1-0528 and gpt-oss-120b, as data sources. The problem set, hyperparameters, and base model were held identical, so the only variable was the source of the CoT data. Any difference in results can therefore be attributed to the characteristics of the data itself.
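The controlled setup can be pictured as two run configurations that differ in exactly one field. The sketch below is only an illustration: the `SFTRun` class, the placeholder base model and problem set names, and the hyperparameter values are assumptions, not details from the study.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class SFTRun:
    base_model: str       # placeholder name, not from the study
    problem_set: str      # identical problems for both runs
    learning_rate: float  # illustrative value only
    epochs: int           # illustrative value only
    cot_source: str       # the single variable under study


run_r1 = SFTRun("base-model", "shared-problems", 1e-5, 3, "DeepSeek-R1-0528")
# Derive the second run by changing only the CoT source.
run_oss = replace(run_r1, cot_source="gpt-oss-120b")


def differing_fields(a: SFTRun, b: SFTRun) -> list[str]:
    """Return the names of fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return [k for k in da if da[k] != db[k]]
```

Checking `differing_fields(run_r1, run_oss)` before launching both runs is a cheap way to confirm that nothing besides the data source has drifted between them.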

Section 04

Core Finding: Divergence Between Training Loss and Generalization Performance

The results show that models trained on DeepSeek-R1 data reach significantly lower training loss yet perform much worse on reasoning benchmarks such as AIME25 and BeyondAIME. Models trained on gpt-oss-120b data generalize better. Training loss and generalization performance thus diverge sharply.

Section 05

Differences in Reasoning Modes: Branching Exploration vs. Convergent Deduction

DeepSeek-R1 exhibits divergent exploration: its CoT is full of branching attempts and redundant exploration. gpt-oss-120b, by contrast, reasons by convergent deduction, following direct, linear paths that quickly lock onto a solution direction. The difference is attributed to training objectives: DeepSeek emphasizes exploration through reinforcement learning, while gpt-oss benefits from human feedback that steers it toward efficient reasoning.

Section 06

Solution: Filtering CoT with Frequent Branches

The study proposes filtering out CoT trajectories with frequent branching, using rules such as detecting backtracking signals and counting the number of branches to eliminate inefficient trajectories. Models trained on the filtered data improved by 5.1% in AIME25 accuracy and 5.5% on BeyondAIME, with an average improvement of 3.6%, and training time fell by about 20%.
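A rule-based filter of this kind could be sketched as follows. This is a minimal illustration, not the study's implementation: the marker phrases in `BACKTRACK_MARKERS`, the `max_branches` threshold, and the function names are all assumptions chosen for the example.

```python
import re

# Hypothetical backtracking cues; the study's actual rules are not given here.
BACKTRACK_MARKERS = ("wait", "alternatively", "let me try", "on second thought")

# Match any marker as a whole word or phrase, case-insensitively.
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in BACKTRACK_MARKERS) + r")\b",
    re.IGNORECASE,
)


def count_branches(cot: str) -> int:
    """Count backtracking cues in one chain-of-thought trajectory."""
    return len(_MARKER_RE.findall(cot))


def filter_trajectories(cots: list[str], max_branches: int = 2) -> list[str]:
    """Keep trajectories with at most `max_branches` backtracking cues."""
    return [c for c in cots if count_branches(c) <= max_branches]


samples = [
    "Let x = 3. Then 2x = 6, so the answer is 6.",
    "Try x = 2. Wait, that fails. Alternatively, x = 3. Wait, let me try x = 4.",
]
kept = filter_trajectories(samples, max_branches=2)
```

Here the first sample contains no cues and is kept, while the second contains four and is dropped. In practice the marker list and threshold would need tuning against the actual data, since keyword counts are only a rough proxy for branching.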

Section 07

Implications for the Industry: New Dimensions of Data Quality

1. Training loss is no longer a reliable indicator: an excessively low loss may mean the model is overfitting to inefficient patterns.
2. The style of CoT (divergent or convergent) is as important as its content.
3. Data filtering is more effective than blindly increasing data volume, which opens new directions for data curriculum learning and distillation.
