# Rethinking Generalization in Reasoning SFT: Conditional Analysis of Optimization, Data, and Model Capabilities

> This study conducts a conditional analysis of the generalization problem in reasoning supervised fine-tuning (SFT) from three dimensions—optimization, data, and model capabilities—revealing the key factors affecting SFT generalization performance and their interaction mechanisms.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-16T03:34:13.000Z
- Last activity: 2026-04-16T03:55:37.950Z
- Popularity: 148.6
- Keywords: supervised fine-tuning, SFT, generalization, reasoning models, conditional analysis, model optimization, data diversity
- Page link: https://www.zingnex.cn/en/forum/thread/sft
- Canonical: https://www.zingnex.cn/forum/thread/sft
- Markdown source: floors_fallback

---

## [Introduction] Rethinking Generalization in Reasoning SFT: Conditional Analysis of Optimization, Data, and Model Capabilities

This study systematically analyzes the generalization problem of reasoning supervised fine-tuning (SFT) along three dimensions: optimization, data, and model capabilities. It identifies the key factors that affect generalization performance and how they interact. The central claim is that generalization arises from multi-factor interactions, so traditional single-factor analysis is insufficient. The study offers a conditional analysis framework and practical guidance for improving the generalization ability of reasoning models.

## Background and Challenges: Current Status and Generalization Dilemmas of Reasoning SFT

In recent years, SFT has achieved significant results in improving model reasoning capabilities (e.g., chain-of-thought learning), as seen in models like OpenAI o1 and DeepSeek-R1. However, training data mostly comes from specific domains or difficulty levels, so it is unclear how well these models generalize to out-of-distribution data. Traditional analysis focuses on single factors (e.g., data scale), whereas this study argues for a conditional analysis along three dimensions: optimization, data, and model capabilities.

## Optimization Dimension: Impact of Hyperparameters and Training Strategies on Generalization

Optimization is the core of SFT, and hyperparameter selection directly affects generalization:
1. **Learning Rate and Training Steps**: Setting either too high tends to overfit surface patterns, while setting either too low leaves the model undertrained; the interaction between the two may produce "spurious generalization".
2. **Optimizer Selection**: The optimizer's implicit regularization effect is key; optimizers such as AdamW that tend to settle in flat loss basins generalize better.
3. **Batch Size**: Large batches give stable gradients but tend to converge to sharp local optima; small batches add beneficial gradient noise but make training less stable. This trade-off significantly affects generalization.
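One way to make the learning rate and training steps interaction concrete is to compare in-distribution and out-of-distribution accuracy across a small sweep. The sketch below assumes that "spurious generalization" shows up as a large gap between the two; that reading, the `max_gap` threshold, and every number in `SWEEP` are illustrative assumptions, not results from the study.

```python
from typing import Dict, List, Tuple

Config = Tuple[float, int]            # (learning_rate, train_steps)
Scores = Tuple[float, float]          # (in_distribution_acc, ood_acc)

# Hypothetical sweep results, made up purely for illustration.
SWEEP: Dict[Config, Scores] = {
    (1e-5, 1000): (0.62, 0.55),
    (1e-5, 4000): (0.71, 0.63),
    (1e-4, 1000): (0.78, 0.60),
    (1e-4, 4000): (0.90, 0.48),
}


def flag_spurious(results: Dict[Config, Scores], max_gap: float = 0.15) -> List[Config]:
    """Return configs whose in-distribution accuracy exceeds OOD accuracy by more than max_gap."""
    flagged = [
        config
        for config, (in_dist, ood) in results.items()
        if in_dist - ood > max_gap
    ]
    return sorted(flagged)


def best_by_ood(results: Dict[Config, Scores]) -> Config:
    """Pick the config with the highest OOD accuracy instead of the highest in-distribution accuracy."""
    return max(results, key=lambda config: results[config][1])
```

In this toy sweep the configuration with the best in-distribution score, `(1e-4, 4000)`, is also the one with the widest gap, which is why selection is done on the held-out OOD score.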

## Data Dimension: Conditional Analysis of Diversity, Quality, and Scale

Data determines the upper limit of SFT learning:
1. **Diversity**: Training data needs diversity along two axes: domain (different reasoning tasks) and difficulty (simple to complex). Training only on difficult data tends to cause overfitting; an appropriate share of simple data helps build basic reasoning abilities.
2. **Quality and Noise**: Noise such as incorrect annotations and inconsistent formats interferes with learning; the stronger the model, the more robust it is to noise.
3. **Scale and Saturation**: Increasing data scale improves performance but with diminishing marginal returns (a saturation effect); the saturation point depends on model capability and optimization configuration.
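The two-axis diversity requirement can be sketched as stratified sampling over (domain, difficulty) cells. This is a minimal illustration, not the study's data pipeline: the `build_mix` helper, the `domain` and `difficulty` field names, and the toy pool are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def build_mix(examples: List[Dict[str, str]], per_cell: int, seed: int = 0) -> List[Dict[str, str]]:
    """Sample up to per_cell examples from every (domain, difficulty) cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for example in examples:
        cells[(example["domain"], example["difficulty"])].append(example)

    mix = []
    for key in sorted(cells):                 # fixed order keeps the sample reproducible
        pool = cells[key]
        mix.extend(rng.sample(pool, min(per_cell, len(pool))))
    rng.shuffle(mix)
    return mix


# Toy pool skewed toward hard math problems (hypothetical data).
pool = (
    [{"domain": "math", "difficulty": "hard"}] * 50
    + [{"domain": "math", "difficulty": "easy"}] * 5
    + [{"domain": "code", "difficulty": "easy"}] * 20
    + [{"domain": "code", "difficulty": "hard"}] * 3
)
mix = build_mix(pool, per_cell=4)
```

Capping each cell keeps the dominant cell (hard math here) from crowding out the others, while cells with fewer than `per_cell` examples contribute everything they have.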

## Model Capability Dimension: Role of Pre-training, Scale, and Architecture

Model capability is the foundation of generalization:
1. **Pre-training Quality**: The general capabilities acquired in pre-training (world knowledge, reasoning priors) affect generalization more than scale does; a small model with high-quality pre-training may outperform a large model with low-quality pre-training.
2. **Scale and Emergence**: Once model scale passes a threshold, reasoning and generalization abilities "emerge"; below the threshold generalization is limited, and above it SFT can unlock latent potential.
3. **Architecture Design**: Architectural choices such as dense versus MoE layouts and the attention mechanism affect generalization; deeper networks and better positional encoding are positively correlated with it.

## Three-Factor Interaction Effect: Synergistic Impact of Optimization, Data, and Model

The interaction of the three factors is complex:
1. **Optimization-Data**: High-noise data calls for conservative learning rates and early stopping; large-scale data benefits from large batches and longer training; high-difficulty data needs fine-grained learning rate scheduling.
2. **Optimization-Model**: Strong models can tolerate aggressive optimization (large learning rates, long training); weak models need cautious strategies.
3. **Data-Model**: Strong models need less data, provided it is high quality; weak models need more data but are robust to noise.
4. **Three-Factor Combination**: A matched combination (strong model + high-quality data + optimal configuration) generalizes far better than the sum of the individual contributions would suggest; mismatched combinations waste resources.
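The claim that a matched combination beats the sum of its parts is a statement about interaction effects. A minimal sketch, assuming a 2x2 design over model strength and data quality, computes the standard interaction contrast; the scores below are invented percentage points used only to show the arithmetic.

```python
from typing import Dict, Tuple

# Hypothetical OOD accuracy in percentage points for each (model, data) pairing.
SCORES: Dict[Tuple[str, str], int] = {
    ("weak", "noisy"): 40,
    ("weak", "clean"): 45,
    ("strong", "noisy"): 50,
    ("strong", "clean"): 70,
}


def interaction_effect(scores: Dict[Tuple[str, str], int]) -> int:
    """2x2 interaction contrast: extra benefit of clean data for a strong model versus a weak one."""
    gain_for_strong = scores[("strong", "clean")] - scores[("strong", "noisy")]
    gain_for_weak = scores[("weak", "clean")] - scores[("weak", "noisy")]
    return gain_for_strong - gain_for_weak


def additive_prediction(scores: Dict[Tuple[str, str], int]) -> int:
    """Score expected for (strong, clean) if the two factors contributed independently."""
    baseline = scores[("weak", "noisy")]
    model_gain = scores[("strong", "noisy")] - baseline
    data_gain = scores[("weak", "clean")] - baseline
    return baseline + model_gain + data_gain
```

With these toy numbers the additive prediction is 55 points while the matched pairing scores 70, so the interaction contrast is 15 points. A contrast near zero would mean single-factor analysis is adequate.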

## Practical Recommendations and Future Research Directions

**Practical Guidance**:
- Data: Prioritize diversity (domain + difficulty) and quality over quantity; cover the distribution of target scenarios.
- Optimization: Tune hyperparameters according to model capability and data characteristics; use aggressive configurations for strong models and conservative ones for weak models.
- Model: Choose base models with high-quality pre-training and relevant priors.
- Configuration: Use the conditional framework to guide hyperparameter search.
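The recommendations above can be read as rules that narrow a hyperparameter search space before any tuning starts. The sketch below is one hypothetical encoding of them: the `Conditions` fields, the `search_space` function, and all numeric values are assumptions chosen for illustration, not settings reported by the study.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Conditions:
    strong_model: bool   # e.g. a base model with high-quality pre-training
    noisy_data: bool     # annotation errors or inconsistent formats are common
    large_data: bool     # dataset is large relative to typical SFT sets


def search_space(cond: Conditions) -> Dict[str, Any]:
    """Narrow a default search space using the conditional rules of thumb."""
    learning_rates = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4]   # illustrative candidates
    space: Dict[str, Any] = {
        "learning_rate": learning_rates,
        "batch_size": [32, 64],
        "early_stopping": False,
    }
    if cond.strong_model and not cond.noisy_data:
        space["learning_rate"] = learning_rates[2:]    # aggressive end of the range
    else:
        space["learning_rate"] = learning_rates[:3]    # conservative end of the range
    if cond.noisy_data:
        space["early_stopping"] = True                 # stop before noise is memorized
    if cond.large_data:
        space["batch_size"] = [128, 256]               # larger batches for larger datasets
    return space
```

For example, `search_space(Conditions(strong_model=False, noisy_data=True, large_data=False))` keeps only the three smallest learning rates and turns on early stopping, so the search budget is spent where the framework suggests good configurations are likely to be.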

**Limitations and Future**:
- Limitations: The analysis covers only specific reasoning tasks and does not address subsequent stages such as RLHF.
- Directions: Expand to more domains; study generalization in multi-stage training; develop automated configuration recommendations; explore the impact of model merging.