Zing Forum

Demystifying the Art of Data Organization for Large Model Training: Four Principles and STR/SAW Sorting Methods

This article systematically analyzes how data ordering affects large model training, proposes four principles (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity), and introduces two data sorting methods: STR and SAW.

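Tags: data organization, data sorting, LLM training, large language models, curriculum learning, STR, SAW, data curation, training efficiency, arXiv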
Published 2026-05-29 01:58 | Recent activity 2026-05-29 12:27 | Estimated read 10 min

Section 01

[Introduction] Demystifying the Art of Data Organization for Large Model Training: Core Insights and Method Overview

Original Paper Information

Core Insights

Data organization (sorting and presentation order) has long been overlooked in large model training, but it is crucial in single-epoch training scenarios. This study proposes four data organization principles: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity, and develops two innovative methods: STR (Stratified Sorting) and SAW (Sawtooth Sorting). Experiments show these methods can reduce perplexity by 2-5%, improve downstream task accuracy by 1-3%, and enhance training stability and convergence speed.

Section 02

Research Background: Why Does Data Order Matter for Large Model Training?

Specificity of Single-Epoch Training

  • No Repeat Learning Opportunity: Each sample appears only once; once missed, it is permanently lost.
  • Amplified Order Dependency: Early samples deeply influence the initial learning direction, and path dependency effects persist.
  • Sensitivity to Learning Dynamics: Samples seen early in training, while the learning rate is still high, have a larger effect on the model.

Cognitive Science Inspiration

Curriculum learning holds that progressing from simple to complex material is more effective than unordered exposure, and the study treats this idea as applicable to LLM training.

Existing Research Gaps

  • Scale Challenge: Lack of efficient sorting strategies for trillion-token-level data.
  • Diversity Challenge: Text data is diverse, making it hard to measure difficulty with a single dimension.
  • Evaluation Challenge: LLMs have many capabilities, so the effect of sorting cannot be captured by a single indicator and requires a comprehensive set of metrics.

Section 03

Core Principles: Four Guidelines for Data Organization

  1. Boundary Sharpening: Gradually focus on high-quality data by using loose quality thresholds early in training and raising them later, which "sharpens" the data boundary.
  2. Cyclic Scheduling: Periodically repeat data patterns (not identical samples) and combine them with curriculum learning to achieve spiral improvement and strengthen memory.
  3. Curriculum Continuity: Maintain difficulty/topic continuity between adjacent samples to reduce context switching costs and improve learning efficiency.
  4. Local Diversity: Keep the data diverse within small windows, so that continuity does not come at the cost of generalization and the model does not over-adapt to a narrow slice of the data.
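
To make two of these principles concrete, here is a minimal Python sketch. It is not taken from the paper: the linear threshold schedule, the default values 0.2 and 0.8, and the function names `quality_threshold` and `window_diversity` are all assumptions chosen for illustration. The first function shows one way a quality threshold could rise over training (Boundary Sharpening); the second measures how many distinct topics appear in the least diverse window of a given order (Local Diversity).

```python
def quality_threshold(step, total_steps, start=0.2, end=0.8):
    """Assumed schedule: raise the minimum quality score linearly from
    `start` to `end` over training. Requires total_steps > 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def window_diversity(topics, window):
    """Smallest number of distinct topics in any run of `window`
    consecutive samples. Higher means more local diversity."""
    if len(topics) < window:
        return len(set(topics))
    return min(
        len(set(topics[i:i + window]))
        for i in range(len(topics) - window + 1)
    )


# A sample passes the filter at a given step only if its (hypothetical)
# quality score is at or above the current threshold.
keep_early = 0.3 >= quality_threshold(0, 1000)     # loose early threshold
keep_late = 0.3 >= quality_threshold(1000, 1000)   # strict late threshold
```

In this sketch a sample with quality 0.3 is kept at the start of training and rejected at the end, which is the behavior the Boundary Sharpening principle describes.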

Section 04

Innovative Methods: Detailed Explanation of STR Stratified Sorting and SAW Sawtooth Sorting

STR (Stratified Sorting)

  • Steps: Quality scoring → Stratification → Intra-layer continuous sorting → Progressive introduction → Cyclic scheduling.
  • Advantages: The stratification is clear, the progressive approach aligns with how learning proceeds, cycles reinforce earlier material, and intra-layer continuity improves efficiency.
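
The STR steps can be sketched roughly as follows. This is one possible reading of the step list, not the paper's implementation: the summary does not specify how strata are formed or how cycles are scheduled, so the equal-size strata, the `quality` and `difficulty` fields, the parameters `num_strata` and `num_cycles`, and the function name `str_order` are all assumptions.

```python
def str_order(samples, num_strata=3, num_cycles=2):
    """Illustrative stratified ordering (an assumed reading of STR).

    samples: list of dicts with precomputed "quality" and "difficulty".
    Returns the samples in training order; each sample appears once.
    """
    if not samples:
        return []
    # Quality scoring is assumed done; rank from lowest to highest quality.
    ranked = sorted(samples, key=lambda s: s["quality"])
    # Stratification: split the ranked list into roughly equal layers.
    size = -(-len(ranked) // num_strata)  # ceiling division
    strata = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    # Intra-layer continuity: neighbours in a layer have similar difficulty.
    strata = [sorted(layer, key=lambda s: s["difficulty"]) for layer in strata]
    order = []
    # Cyclic scheduling: every cycle walks the layers from low to high
    # quality (progressive introduction), taking a fresh slice each time,
    # so the pattern repeats but no sample does.
    for cycle in range(num_cycles):
        for layer in strata:
            chunk = -(-len(layer) // num_cycles)
            order.extend(layer[cycle * chunk:(cycle + 1) * chunk])
    return order


# Hypothetical toy data: ids 0..5 with increasing quality.
toy = [
    {"id": i, "quality": i / 10, "difficulty": d}
    for i, d in enumerate([3, 1, 2, 5, 4, 0])
]
toy_ids = [s["id"] for s in str_order(toy)]
```

Because each layer is sorted by difficulty before being sliced, later cycles in this sketch draw harder samples from every layer, which is one way to get the "spiral" effect described under Cyclic Scheduling.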

SAW (Sawtooth Sorting)

  • Steps: Difficulty assessment → Sawtooth pattern generation (rise-fall within a cycle) → Diversity injection → Dynamic adjustment.
  • Advantages: Sawtooth pattern provides review opportunities, fluctuations prevent over-adaptation, and dynamic adjustment enhances robustness.
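
The sawtooth pattern itself can be illustrated with a short sketch. It covers only the first two steps (difficulty assessment is assumed already done, and diversity injection and dynamic adjustment are left out), and the way samples are dealt into cycles is an assumption, as are the names `saw_order` and `num_cycles`.

```python
def saw_order(difficulties, num_cycles=2):
    """Illustrative sawtooth ordering (an assumed reading of SAW).

    difficulties: one precomputed difficulty score per sample.
    Returns sample indices so that difficulty rises and then falls
    within each cycle.
    """
    # Rank sample indices from easiest to hardest.
    ranked = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    order = []
    for c in range(num_cycles):
        # Deal samples round-robin so each cycle spans the full range.
        bucket = ranked[c::num_cycles]
        rising = bucket[0::2]          # climb through every other sample
        falling = bucket[1::2][::-1]   # come back down through the rest
        order.extend(rising + falling)
    return order


# Hypothetical difficulty scores for eight samples.
scores = [5, 1, 4, 2, 8, 7, 3, 6]
pattern = [scores[i] for i in saw_order(scores, num_cycles=2)]
# pattern == [1, 5, 7, 3, 2, 6, 8, 4]: two rise-fall teeth
```

The falling edge of each tooth revisits easier material after the peak, which corresponds to the "review opportunities" listed among the advantages.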

Method Selection

  • STR: Suitable for scenarios with obvious data quality differences and a need for interpretable processes.
  • SAW: Suitable for scenarios with large difficulty differences and a need for natural curriculum curves.

Section 05

Experimental Validation: Robust Results Across Scales and Stages

Experimental Design

  • Model Scale: 1B → 70B parameters.
  • Data Scale: Billions → trillions of tokens.
  • Stages: Pre-training + Supervised Fine-Tuning (SFT).
  • Baselines: Random shuffle, simple curriculum learning, existing state-of-the-art methods.
  • Metrics: Perplexity, downstream accuracy, training stability, convergence speed.

Main Results

  • Performance Improvement: Perplexity reduced by 2-5%, downstream task accuracy improved by 1-3%.
  • Stability: Smoother loss curves and more stable gradients.
  • Convergence Speed: 10-20% fewer training steps to converge.
  • Cross-Scale/Stage: Smaller models show larger improvements; the methods are effective in both pre-training and SFT.

Principle Validation

Ablation experiments confirm each of the four principles contributes independently, and their combination produces synergistic effects.

Section 06

Practical Guide: How to Apply Data Organization Principles and Methods?

Implementation Steps

  1. Quality Assessment: Use a pre-trained model to compute perplexity, or use a dedicated scoring model.
  2. Difficulty Assessment: Define difficulty indicators (length, complexity, etc.).
  3. Strategy Selection: Choose STR for large quality differences; choose SAW for large difficulty differences.
  4. Implement Sorting: Generate the order file offline so that sorting scales to large datasets.
  5. Training Monitoring: Compare with baselines and monitor loss and validation performance.
  6. Iterative Optimization: Adjust parameters and tailor the strategy to the task.
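
A minimal sketch of steps 1, 2, and 4 is shown below, using only the Python standard library. The perplexity formula is the standard definition (the exponential of the negative mean token log-probability); everything else is an assumption for illustration: the whitespace token count as a difficulty proxy, the one-index-per-line file format, the file name `order.txt`, and the helper names. A real pipeline would obtain the log-probabilities from a pre-trained model, which is not shown here.

```python
import math
import os
import tempfile


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def length_difficulty(text):
    """Crude difficulty proxy (assumed): whitespace-separated token count."""
    return len(text.split())


def write_order_file(order, path):
    """Write one sample index per line (assumed file format)."""
    with open(path, "w", encoding="utf-8") as f:
        for idx in order:
            f.write(f"{idx}\n")


def read_order_file(path):
    """Read the sample indices back in training order."""
    with open(path, encoding="utf-8") as f:
        return [int(line) for line in f if line.strip()]


# Hypothetical toy corpus; sort sample indices from shortest to longest.
corpus = ["a b c d", "a", "a b c", "a b"]
order = sorted(range(len(corpus)), key=lambda i: length_difficulty(corpus[i]))

# Persist the order offline, then load it as a data loader would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "order.txt")
    write_order_file(order, path)
    restored = read_order_file(path)
```

Storing indices rather than the samples themselves keeps the order file small and leaves the training code unchanged apart from the order in which it reads samples.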

Cost Considerations

Extra computation is minimal: Scores are reused from preprocessing, sorting is an offline operation, and no training modifications are needed.

Combination with Other Technologies

The methods can be combined with data selection, augmentation, and curriculum learning to enhance results.

Section 07

Limitations and Future Directions: Next Steps in Data Organization Research

Current Limitations

  • Dependence on precomputed scores, which may be biased.
  • Domain specificity: Effective for general text but needs adjustment for specific domains.
  • Static order: Lack of real-time dynamic adjustment.
  • Insufficient theoretical understanding: The relationship between data order and model learning dynamics has not been explored in depth.

Future Directions

  • Online Data Organization: Adjust order in real time.
  • Multi-Objective Optimization: Balance performance, efficiency, and fairness.
  • Personalized Strategies: Customize for different models/tasks.
  • Cross-Modal Extension: Apply to multi-modal training.
  • Theoretical Analysis: Establish a rigorous theoretical framework.