# Demystifying the Art of Data Organization for Large Model Training: Four Principles and STR/SAW Sorting Methods

> This post summarizes a study that systematically analyzes how data ordering affects large model training. It proposes four principles (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity) and introduces two data sorting methods: STR and SAW.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-28T17:58:53.000Z
- Last activity: 2026-05-29T04:27:13.399Z
- Popularity: 144.5
- Keywords: data organization, data sorting, LLM training, large language models, curriculum learning, STR, SAW, data curation, training efficiency, arXiv
- Page link: https://www.zingnex.cn/en/forum/thread/str-saw
- Canonical: https://www.zingnex.cn/forum/thread/str-saw

---

## [Introduction] Demystifying the Art of Data Organization for Large Model Training: Core Insights and Method Overview

**Original Paper Information**
- Author: Microsoft Research Team
- Source: arXiv
- Title: Demystifying Data Organization for Enhanced LLM Training
- Link: http://arxiv.org/abs/2605.30334v1
- Code: https://github.com/microsoft/data-efficacy/
- Publication Date: May 28, 2026

**Core Insights**
Data organization, meaning how training data is sorted and the order in which it is presented, has long been overlooked in large model training, yet it is crucial in single-epoch training. The study proposes four data organization principles (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity) and develops two methods built on them: STR (Stratified Sorting) and SAW (Sawtooth Sorting). The reported experiments show perplexity reductions of 2-5%, downstream task accuracy gains of 1-3%, and improved training stability and convergence speed.

## Research Background: Why Does Data Order Matter for Large Model Training?

### Specificity of Single-Epoch Training
- **No Repeat Learning Opportunity**: Each sample appears only once, so the model gets no second chance to learn from it.
- **Amplified Order Dependency**: Early samples strongly shape the initial learning direction, and this path dependency persists through training.
- **Sensitivity to Learning Dynamics**: Samples seen early in training, when the learning rate is high, have a greater impact on the model.

### Cognitive Science Inspiration
Curriculum learning, which borrows from how humans learn, holds that progressing from simple to complex material is more effective than a random order. The study argues that the same idea applies to LLM training.

### Existing Research Gaps
- **Scale Challenge**: Lack of efficient sorting strategies for trillion-token-level data.
- **Diversity Challenge**: Text data is diverse, making it hard to measure difficulty with a single dimension.
- **Evaluation Challenge**: LLM multi-capability evaluation requires comprehensive metrics, making it difficult to measure sorting effects with a single indicator.

## Core Principles: Four Guidelines for Data Organization

1. **Boundary Sharpening**: Gradually focus on high-quality data—use loose quality thresholds in the early training stage and raise them later, similar to "sharpening" data boundaries.
2. **Cyclic Scheduling**: Periodically repeat data patterns (not identical samples), combine with curriculum learning to achieve spiral improvement and strengthen memory.
3. **Curriculum Continuity**: Maintain difficulty/topic continuity between adjacent samples to reduce context switching costs and improve learning efficiency.
4. **Local Diversity**: Ensure data diversity within small windows to balance continuity and generalization ability and avoid over-adaptation.
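
As an illustration of Boundary Sharpening, the sketch below raises a minimum quality score linearly over training. The linear ramp, the `start`/`end` values, and the function names are assumptions made for this example; the source only says that thresholds are loose early and stricter later.

```python
def quality_threshold(step: int, total_steps: int,
                      start: float = 0.2, end: float = 0.8) -> float:
    """Minimum quality score at `step`, ramping linearly from start to end.

    The linear shape and the default bounds are illustrative assumptions.
    """
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def admissible(scores: list[float], step: int, total_steps: int) -> list[int]:
    """Indices of samples whose quality score passes the cutoff at `step`."""
    cutoff = quality_threshold(step, total_steps)
    return [i for i, s in enumerate(scores) if s >= cutoff]


scores = [0.1, 0.5, 0.9]           # hypothetical quality scores in [0, 1]
early = admissible(scores, 0, 11)  # loose boundary at the first step
late = admissible(scores, 10, 11)  # sharpened boundary at the last step
```

With these toy scores, the early step admits the two higher-scoring samples and the final step admits only the best one.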

## Innovative Methods: Detailed Explanation of STR Stratified Sorting and SAW Sawtooth Sorting

### STR (Stratified Sorting)
- **Steps**: Quality scoring → Stratification → Intra-layer continuous sorting → Progressive introduction → Cyclic scheduling.
- **Advantages**: Clear stratification, progressive approach aligns with cognitive rules, cyclic reinforcement, and intra-layer continuity improves efficiency.
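
The STR steps above can be sketched roughly as follows. This is one possible reading of the pipeline, not the paper's implementation: the equal-size layers, the use of a separate difficulty score for intra-layer sorting, and the name `str_order` are all assumptions.

```python
def str_order(scores, difficulty, n_layers=3, n_cycles=2):
    """Return sample indices in a stratified, cyclic order (illustrative).

    scores:     per-sample quality scores (higher is better)
    difficulty: per-sample difficulty scores (higher is harder)
    """
    if not scores:
        return []
    # Stratification: rank by quality, lowest first, and cut into layers.
    by_quality = sorted(range(len(scores)), key=lambda i: scores[i])
    size = -(-len(by_quality) // n_layers)  # ceiling division
    layers = [by_quality[k:k + size] for k in range(0, len(by_quality), size)]
    # Intra-layer continuity: neighbours within a layer have similar difficulty.
    layers = [sorted(layer, key=lambda i: difficulty[i]) for layer in layers]
    order = []
    for c in range(n_cycles):
        # Progressive introduction: each cycle walks from the lowest-quality
        # layer to the highest. Cyclic scheduling: cycle c takes every
        # n_cycles-th sample of a layer, so the pattern repeats while no
        # sample is repeated.
        for layer in layers:
            order.extend(layer[c::n_cycles])
    return order


scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
difficulty = [3, 1, 2, 5, 4, 6]
order = str_order(scores, difficulty)  # [1, 2, 0, 5, 3, 4]
```

Every index appears exactly once, which keeps the schedule compatible with single-epoch training.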

### SAW (Sawtooth Sorting)
- **Steps**: Difficulty assessment → Sawtooth pattern generation (rise-fall within a cycle) → Diversity injection → Dynamic adjustment.
- **Advantages**: Sawtooth pattern provides review opportunities, fluctuations prevent over-adaptation, and dynamic adjustment enhances robustness.
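
A minimal sketch of the sawtooth idea, under stated assumptions: here each cycle ramps from easy to hard and the next cycle drops back to easy samples, which is one way to read "rise-fall within a cycle". The names `saw_order` and `local_shuffle` are hypothetical, the windowed shuffle stands in for diversity injection, and the dynamic adjustment step is not covered.

```python
import random


def saw_order(difficulty, n_cycles=3):
    """Return sample indices whose difficulty follows a sawtooth curve."""
    # Rank samples from easiest to hardest.
    ranked = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    order = []
    for c in range(n_cycles):
        # Taking every n_cycles-th sample gives each cycle the full
        # difficulty range in ascending order; the next cycle restarts low.
        order.extend(ranked[c::n_cycles])
    return order


def local_shuffle(order, window=4, seed=0):
    """Shuffle within fixed-size windows so the global shape is preserved."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(order), window):
        chunk = order[start:start + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


difficulty = [5, 1, 4, 2, 6, 3]
order = saw_order(difficulty, n_cycles=2)  # [1, 5, 0, 3, 2, 4]
mixed = local_shuffle(order, window=3)
```

In this toy run the difficulties along `order` read 1, 3, 5, 2, 4, 6: two rising ramps with a drop between them. The window size trades curriculum continuity against local diversity.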

### Method Selection
- STR: Suitable for scenarios with obvious data quality differences and a need for interpretable processes.
- SAW: Suitable for scenarios with large difficulty differences and a need for natural curriculum curves.

## Experimental Validation: Robust Results Across Scales and Stages

### Experimental Design
- **Model Scale**: 1B → 70B parameters.
- **Data Scale**: Billions → trillions of tokens.
- **Stages**: Pre-training + Supervised Fine-Tuning (SFT).
- **Baselines**: Random shuffle, simple curriculum learning, existing state-of-the-art methods.
- **Metrics**: Perplexity, downstream accuracy, training stability, convergence speed.

### Main Results
- **Performance Improvement**: Perplexity reduced by 2-5%, downstream tasks improved by 1-3%.
- **Stability**: Smoother loss curves and more stable gradients.
- **Convergence Speed**: 10-20% fewer steps.
- **Cross-Scale/Stage**: Smaller models show more obvious improvements; effective in both pre-training and SFT.
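
To relate the perplexity figures above to training loss: perplexity is conventionally the exponential of the mean per-token cross-entropy, so a relative perplexity reduction `r` corresponds to a loss decrease of `-ln(1 - r)`. The snippet assumes the loss is measured in nats (natural log); the function name is made up for this example.

```python
import math


def loss_drop_for_ppl_reduction(rel_reduction: float) -> float:
    """Loss decrease (in nats) implied by a relative perplexity reduction."""
    return -math.log(1.0 - rel_reduction)


for r in (0.02, 0.05):
    drop = loss_drop_for_ppl_reduction(r)
    print(f"{r:.0%} lower perplexity ~ {drop:.4f} nats lower loss")
# 2% lower perplexity ~ 0.0202 nats lower loss
# 5% lower perplexity ~ 0.0513 nats lower loss
```

So the reported 2-5% range corresponds to a validation-loss gap of roughly 0.02 to 0.05 nats, which is the scale of difference to look for when comparing loss curves against a random-shuffle baseline.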

### Principle Validation
Ablation experiments confirm each of the four principles contributes independently, and their combination produces synergistic effects.

## Practical Guide: How to Apply Data Organization Principles and Methods?

### Implementation Steps
1. **Quality Assessment**: Score each sample, either by computing its perplexity under a pre-trained model or by using a dedicated scoring model.
2. **Difficulty Assessment**: Define difficulty indicators (length, complexity, etc.).
3. **Strategy Selection**: Choose STR for large quality differences; choose SAW for large difficulty differences.
4. **Implement Sorting**: Compute the order offline and save it as an order file, so that sorting stays efficient on large datasets.
5. **Training Monitoring**: Compare with baselines and monitor loss and validation performance.
6. **Iterative Optimization**: Adjust parameters and customize task strategies.
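
A small sketch of the difficulty and offline-sorting steps above, using only the standard library. Whitespace token count as the difficulty indicator, the one-index-per-line text format, and all function names are assumptions for illustration; a real pipeline at trillion-token scale would more likely use a binary index format.

```python
import tempfile
from pathlib import Path


def length_difficulty(text: str) -> int:
    """Toy difficulty indicator: number of whitespace-separated tokens."""
    return len(text.split())


def write_order(order, path):
    """Persist the sample order, one dataset index per line."""
    Path(path).write_text("\n".join(str(i) for i in order) + "\n")


def iterate_in_order(dataset, path):
    """Yield samples following the saved order; the file is read lazily."""
    with open(path) as f:
        for line in f:
            yield dataset[int(line)]


docs = ["a b c d", "a", "a b c", "a b"]  # stand-in corpus
order = sorted(range(len(docs)), key=lambda i: length_difficulty(docs[i]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "order.txt"
    write_order(order, path)
    replayed = list(iterate_in_order(docs, path))
```

Because the order lives in a file, the training loop only needs a data loader that reads indices in sequence, which fits the claim that no changes to training itself are required.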

### Cost Considerations
Extra computation is minimal: Scores are reused from preprocessing, sorting is an offline operation, and no training modifications are needed.

### Combination with Other Technologies
Data organization can be combined with data selection, data augmentation, and curriculum learning to further improve results.

## Limitations and Future Directions: Next Steps in Data Organization Research

### Current Limitations
- Dependence on precomputed scores, which may have biases.
- Domain specificity: Effective for general text but needs adjustment for specific domains.
- Static order: Lack of real-time dynamic adjustment.
- Insufficient theoretical understanding: The relationship between data order and model learning dynamics has not been explored in depth.

### Future Directions
- **Online Data Organization**: Adjust order in real time.
- **Multi-Objective Optimization**: Balance performance, efficiency, and fairness.
- **Personalized Strategies**: Customize for different models/tasks.
- **Cross-Modal Extension**: Apply to multi-modal training.
- **Theoretical Analysis**: Establish a strict theoretical framework.
