Zing Forum

Reading

Rethinking Generalization in Reasoning SFT: Conditional Analysis of Optimization, Data, and Model Capabilities

This study conducts a conditional analysis of the generalization problem in reasoning supervised fine-tuning (SFT) from three dimensions—optimization, data, and model capabilities—revealing the key factors affecting SFT generalization performance and their interaction mechanisms.

Supervised Fine-Tuning (SFT) · Generalization · Reasoning Models · Conditional Analysis · Model Optimization · Data Diversity
Published 2026-04-16 11:34 · Recent activity 2026-04-16 11:55 · Estimated read 8 min

Section 01

[Introduction] Rethinking Generalization in Reasoning SFT: Conditional Analysis of Optimization, Data, and Model Capabilities

This study systematically analyzes the generalization problem of reasoning supervised fine-tuning (SFT) from three dimensions: optimization, data, and model capabilities, revealing the key factors affecting generalization performance and their interaction mechanisms. The study points out that generalization is a complex phenomenon involving multi-factor interactions, and traditional single-factor analysis is insufficient. It provides a conditional analysis framework and practical guidance for improving the generalization ability of reasoning models.


Section 02

Background and Challenges: Current Status and Generalization Dilemmas of Reasoning SFT

In recent years, SFT has achieved significant results in improving model reasoning capabilities (e.g., chain-of-thought learning), as seen in models like OpenAI o1 and DeepSeek-R1. However, training data mostly comes from specific domains/difficulty levels, and the generalization ability of models on out-of-distribution data is questionable. Traditional analysis focuses on single factors (e.g., data scale), while this study proposes that conditional analysis should be conducted from three dimensions: optimization, data, and model capabilities.


Section 03

Optimization Dimension: Impact of Hyperparameters and Training Strategies on Generalization

Optimization is the core of SFT, and hyperparameter selection directly affects generalization:

  1. Learning Rate and Training Steps: A learning rate that is too high encourages overfitting to surface patterns, while one that is too low underfits; the interaction between learning rate and step count can also produce "spurious generalization".
  2. Optimizer Selection: Implicit regularization is key; optimizers such as AdamW that tend toward flat loss basins generalize better.
  3. Batch Size: Large batches yield stable gradients but are prone to sharp local minima; small batches inject beneficial gradient noise but train less stably. This trade-off significantly affects generalization.
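To make the learning-rate/training-step interaction concrete, the sketch below implements a standard warmup-plus-cosine schedule of the kind commonly used in SFT. The function name and all numeric values are illustrative assumptions, not taken from the paper:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero.

    A conservative run (low peak_lr, early stop) and an aggressive run
    (high peak_lr, full schedule) differ only in these parameters.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: cosine from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Shortening `total_steps` at a fixed `peak_lr` steepens the decay, which is one way the two hyperparameters interact rather than acting independently.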

Section 04

Data Dimension: Conditional Analysis of Diversity, Quality, and Scale

Data determines the upper limit of SFT learning:

  1. Diversity: Needs diversity along both dimensions of domain (different reasoning tasks) and difficulty (simple to complex). Training on difficult data alone invites overfitting; a proportion of simple data helps build basic reasoning abilities.
  2. Quality and Noise: Noise like incorrect annotations and inconsistent formats interferes with learning; the stronger the model capability, the higher its robustness to noise.
  3. Scale and Saturation: Data scale improves performance but has diminishing marginal returns (saturation effect); the saturation point is related to model capability and optimization configuration.
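The dual-dimensional diversity point can be sketched as a stratified-sampling pass that caps every (domain, difficulty) bucket so no single slice dominates the mix. This is a hypothetical helper for illustration, not the paper's actual data pipeline:

```python
import random
from collections import defaultdict

def stratified_mix(examples, per_bucket, seed=0):
    """Cap each (domain, difficulty) bucket at per_bucket examples.

    `examples` is a list of dicts with "domain" and "difficulty" keys.
    Sorting bucket keys before sampling keeps the output deterministic
    for a given seed.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[(ex["domain"], ex["difficulty"])].append(ex)
    mix = []
    for key in sorted(buckets):
        items = buckets[key]
        rng.shuffle(items)          # random sample within the bucket
        mix.extend(items[:per_bucket])
    rng.shuffle(mix)                # interleave buckets in the final order
    return mix
```

A capped mix like this trades raw scale for coverage, which is consistent with the saturation observation above: past the saturation point, extra examples from an already-full bucket add little.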

Section 05

Model Capability Dimension: Role of Pre-training, Scale, and Architecture

Model capability is the foundation of generalization:

  1. Pre-training Quality: The general capabilities of pre-training (world knowledge, reasoning priors) affect generalization more than scale; a small model with high-quality pre-training may outperform a large model with low-quality pre-training.
  2. Scale and Emergence: When model scale reaches a threshold, reasoning/generalization abilities "emerge"; below the threshold, model generalization is limited, and above it, SFT can unlock potential.
  3. Architecture Design: Architectural choices (dense vs. MoE, attention mechanisms, etc.) affect generalization; deeper networks and better positional encodings correlate positively with it.

Section 06

Three-Factor Interaction Effect: Synergistic Impact of Optimization, Data, and Model

The interaction of the three factors is complex:

  1. Optimization-Data: High-noise data requires conservative learning rates/early stopping; large-scale data benefits from large batches/long training; high-difficulty data needs fine-grained learning rate scheduling.
  2. Optimization-Model: Strong models can use aggressive optimization (large learning rate/long training); weak models need cautious strategies.
  3. Data-Model: Strong models need less but higher-quality data and tolerate noise better; weak models need more data and are more sensitive to noise.
  4. Three-Factor Combination: A matched combination (strong model + high-quality data + optimal configuration) has far better generalization than the sum of individual parts; mismatched combinations waste resources.
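A toy version of this condition matching can be written as a simple lookup from (model strength, data noise) to an optimization profile. All thresholds and values here are illustrative assumptions, not recommendations from the paper:

```python
def recommend_config(model_strength, data_noise):
    """Map coarse (model, data) conditions to an optimization profile.

    Strong model + clean data -> aggressive profile (higher LR, longer
    training); anything weaker or noisier -> conservative profile, with
    early stopping enabled when the data is noisy.
    """
    aggressive = model_strength == "strong" and data_noise == "low"
    return {
        "peak_lr": 2e-5 if aggressive else 5e-6,
        "epochs": 3 if aggressive else 1,
        "early_stopping": data_noise == "high",
    }
```

Even this two-condition table captures the mismatch cost described above: applying the aggressive profile to a weak model or noisy data wastes compute and hurts generalization, while applying the conservative profile to a strong model with clean data leaves capability on the table.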

Section 07

Practical Recommendations and Future Research Directions

Practical Guidance:

  • Data: Prioritize diversity (domain + difficulty) and quality over quantity; cover the distribution of target scenarios.
  • Optimization: Adjust parameters based on model capability/data; use aggressive configurations for strong models and conservative ones for weak models.
  • Model: Choose architectures with high pre-training quality and relevant priors.
  • Configuration: Use the conditional framework to guide hyperparameter search.

Limitations and Future:

  • Limitations: Only targets specific reasoning tasks and does not involve subsequent stages like RLHF.
  • Directions: Expand to more domains; study generalization in multi-stage training; develop automated configuration recommendations; explore the impact of model merging.