Zing Forum

Counterintuitive Finding in Chain-of-Thought Training: Why Do Models with Lower Training Loss Have Worse Generalization?

Recent research reports a paradox in chain-of-thought supervised fine-tuning of large models: models with lower training loss perform worse on reasoning benchmarks. The root cause lies in the difference between two reasoning modes, branching exploration and convergent deduction.

Tags: Chain-of-Thought, Supervised Fine-Tuning, DeepSeek-R1, gpt-oss, Reasoning Modes, Generalization Performance, Training Loss, Data Filtering
Published 2026-04-02 15:00 | Recent activity 2026-04-03 12:48 | Estimated read 5 min

Section 01

Introduction: Counterintuitive Paradox in Chain-of-Thought Training

Recent research reports a counterintuitive finding in chain-of-thought supervised fine-tuning of large models: models that reach lower training loss can generalize worse. The study traces this paradox to a difference in reasoning modes, branching exploration versus convergent deduction. The posts below cover the research background, experimental design, core findings, and proposed solution in turn.

Section 02

Research Background: Current State of Chain-of-Thought Supervised Fine-Tuning

Chain-of-thought (CoT) reasoning has a model generate intermediate steps before its final answer, which improves its reasoning ability. In current SFT practice, CoT trajectories from stronger models are often used as supervision signals, and it is widely assumed that longer, more detailed trajectories improve performance. But do CoT traces from different sources differ in any essential way? That question has not been studied systematically, and this study sets out to answer it: how does the source of CoT data affect a model's generalization performance?

Section 03

Experimental Design: Controlled Comparative Study

The research team selected two models with comparable performance, DeepSeek-R1-0528 and gpt-oss-120b, as data sources. They kept the problem set, the hyperparameters, and the base model identical, so the only variable was the source of the CoT data. Any difference in results can therefore be attributed to the characteristics of the data itself.
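To make the "only one variable" idea concrete, here is a minimal sketch of how such a controlled comparison could be written down and checked. The `SFTRun` class, the `differing_fields` helper, and every field value except the two source model names are hypothetical placeholders; the post does not give the actual base model, problem set, or hyperparameters.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class SFTRun:
    """Hypothetical description of one fine-tuning run."""
    base_model: str       # placeholder; the post does not name the base model
    problem_set: str      # placeholder; same problems for both runs
    learning_rate: float  # placeholder value, not from the study
    epochs: int           # placeholder value, not from the study
    cot_source: str       # the one variable that is allowed to change


def differing_fields(a: SFTRun, b: SFTRun) -> list[str]:
    """Return the names of the fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


run_r1 = SFTRun(
    base_model="<base-model>",
    problem_set="<shared-problem-set>",
    learning_rate=1e-5,
    epochs=3,
    cot_source="DeepSeek-R1-0528",
)
# Copy the first run and change only the CoT source.
run_oss = replace(run_r1, cot_source="gpt-oss-120b")

# Sanity check before launching anything: the runs must differ in one field only.
assert differing_fields(run_r1, run_oss) == ["cot_source"]
```

Building the second run with `replace` rather than writing it out by hand makes it harder to change a second setting by accident.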

Section 04

Core Finding: Divergence Between Training Loss and Generalization Performance

The experimental results show that models trained on DeepSeek-R1 data reach significantly lower training loss but perform much worse on reasoning benchmarks such as AIME25 and BeyondAIME, while models trained on gpt-oss-120b data generalize better. In other words, training loss and generalization performance point in opposite directions.

Section 05

Differences in Reasoning Modes: Branching Exploration vs. Convergent Deduction

DeepSeek-R1 reasons by divergent exploration: its CoT is full of branching attempts and redundant detours. gpt-oss-120b reasons by convergent deduction: its paths are direct and linear, and they settle on a solution direction quickly. The post attributes the difference to the models' training objectives: DeepSeek emphasizes exploration through reinforcement learning, while gpt-oss benefits from human feedback that steers it toward efficient reasoning.

Section 06

Solution: Filtering CoT with Frequent Branches

The study proposes filtering out CoT trajectories that branch frequently, using rules such as detecting backtracking signals and counting the number of branches to remove inefficient trajectories. Models trained on the filtered data saw a 5.1% increase in AIME25 accuracy and a 5.5% increase on BeyondAIME, with an average increase of 3.6% across benchmarks, and training time was reduced by about 20%.
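A rule-based filter of this kind could look roughly like the sketch below. It is only an illustration: the cue phrases, the density threshold, and the function names (`count_branches`, `keep_trajectory`, `filter_dataset`) are all assumptions, since the post does not list the study's actual rules.

```python
import re

# Hypothetical backtracking cues; the study's real signal list is not given here.
BACKTRACK_CUES = ("wait", "alternatively", "let me try", "on second thought", "hmm")

# One case-insensitive pattern that matches any cue as a whole word or phrase.
_CUE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in BACKTRACK_CUES) + r")\b",
    re.IGNORECASE,
)


def count_branches(cot: str) -> int:
    """Count backtracking cues as a rough proxy for the number of branches."""
    return len(_CUE_RE.findall(cot))


def keep_trajectory(cot: str, max_branches_per_100_words: float = 2.0) -> bool:
    """Keep a trajectory only if its branch density is at or below the threshold.

    The default threshold is an arbitrary placeholder, not a value from the study.
    """
    n_words = len(cot.split())
    if n_words == 0:
        return False  # an empty trace carries no supervision signal
    density = 100.0 * count_branches(cot) / n_words
    return density <= max_branches_per_100_words


def filter_dataset(samples: list[dict], max_branches_per_100_words: float = 2.0) -> list[dict]:
    """Drop samples whose 'cot' field branches too often."""
    return [
        s for s in samples
        if keep_trajectory(s["cot"], max_branches_per_100_words)
    ]


convergent = "Let x be the unknown. Then 2x + 3 = 11, so 2x = 8 and x = 4."
branching = (
    "Wait, that is wrong. Alternatively, try factoring. "
    "Hmm, wait, let me try substitution instead."
)
kept = filter_dataset([{"cot": convergent}, {"cot": branching}])
```

Here `kept` holds only the convergent sample. Normalizing by length, rather than using a raw branch count, keeps a long but mostly linear trace from being penalized for one or two corrections. Whether the study normalizes this way is not stated.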

Section 07

Implications for the Industry: New Dimensions of Data Quality

  1. Training loss is not a reliable indicator on its own. An unusually low loss may mean the model has overfit to inefficient reasoning patterns.
  2. The style of CoT (divergent or convergent) matters as much as its content.
  3. Filtering data is more effective than blindly increasing data volume, which opens new directions for data curriculum learning and distillation.