SU-01: A Simple and Unified Scalable Approach to Achieve Gold Medal-level Reasoning in Olympiads

The research team trained SU-01 with reverse perplexity curriculum learning, two-stage reinforcement learning, and test-time expansion, using only a 30B-A3B backbone and 340K reasoning trajectories, and achieved gold medal-level performance in the IMO and IPhO competitions.

Olympiad Reasoning · Reinforcement Learning · Curriculum Learning · SU-01 · Math Reasoning · Physics Reasoning · Test-Time Expansion
Published 2026-05-13 18:13 · Recent activity 2026-05-14 12:52 · Estimated read 7 min

Section 01

Introduction: Core Breakthroughs of SU-01 in Achieving Gold Medal-level Reasoning in Olympiads

SU-01 is a reasoning model developed by the research team. It is trained on a 30B-A3B backbone (a mixture-of-experts architecture) and 340K reasoning trajectories through three core strategies: reverse perplexity curriculum learning, two-stage reinforcement learning, and test-time expansion. The model achieves gold medal-level performance in the International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO), demonstrating that medium-sized models can master complex scientific reasoning and opening new possibilities for the democratization of reasoning models.


Section 02

Background: AI Challenges in Olympiad Reasoning and Limitations of Existing Methods

The International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO) represent the highest level of human logical thinking. Their problems demand deep domain knowledge, creative problem decomposition, rigorous reasoning, and precise calculation, and were long regarded as insurmountable for AI. In recent years AI has made breakthroughs on Olympiad problems, but existing methods rely on complex pipelines, massive data, and large models, driving training costs extremely high. The research team asked: is there a simpler, unified method that achieves gold medal-level performance with a reasonable resource investment?


Section 03

Core Training Method of SU-01: A Three-Stage Formula

SU-01's training formula includes three core stages:

  1. Reverse Perplexity Curriculum Learning: Order the training data from high to low perplexity so the model confronts the hardest reasoning patterns first, building robust proof-search and self-checking abilities instead of settling into simple pattern matching (see the sketch after this list);
  2. Two-Stage Reinforcement Learning: First optimize verifiable rewards (such as answer correctness and proof completeness) to consolidate basic abilities, then perform fine-grained proof-level optimization focusing on elegance, conciseness, and logical rigor;
  3. Test-Time Expansion: Generate longer reasoning chains (over 100,000 tokens) at inference time, explore and verify multiple solution paths, and dynamically allocate compute to the most promising directions.
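
The post does not detail how the curriculum is implemented, but the heart of reverse perplexity curriculum learning is a score-then-sort step. Below is a minimal Python sketch assuming a Hugging Face-style causal LM and tokenizer; the function names and the use of mean per-token perplexity are illustrative assumptions, not details from the paper.

```python
import math

import torch
from torch.nn.functional import cross_entropy

def trajectory_perplexity(model, tokenizer, text):
    """Mean per-token perplexity of one reasoning trajectory under a
    causal LM (Hugging Face-style model and tokenizer assumed)."""
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits
    # Shift by one position so each token is predicted from its prefix.
    loss = cross_entropy(logits[0, :-1], ids[0, 1:])
    return math.exp(loss.item())

def reverse_perplexity_order(trajectories, model, tokenizer):
    """Order trajectories from HIGH to LOW perplexity, so the model
    sees the hardest reasoning patterns first."""
    return sorted(trajectories,
                  key=lambda t: trajectory_perplexity(model, tokenizer, t),
                  reverse=True)  # hardest (highest perplexity) first
```

Once the data is ordered this way, training simply consumes batches front to back, so the earliest gradient updates come from the highest-perplexity (hardest) trajectories.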

Section 04

Experimental Evidence: Gold Medal-level Performance and Key Ability Demonstrations

SU-01 uses a mixture-of-experts (MoE) backbone with 30B total parameters, of which 3B are active per token (hence 30B-A3B). The training data consists of 340,000 reasoning trajectories, each shorter than 8K tokens, and reinforcement learning involves only 200 update steps. Its results:

  • Math competitions: Achieves gold medal level in IMO 2025 and USAMO 2026;
  • Physics competitions: Achieves gold medal level in IPhO 2024 and 2025;
  • Long-range reasoning: Reliably generates reasoning chains of over 100,000 tokens (a sketch of the expansion loop follows this list);
  • Cross-domain generalization: Can handle scientific reasoning problems outside the math and physics training distribution.
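
The post does not specify SU-01's test-time expansion procedure, but the "multi-path exploration with dynamic allocation" described in Section 03 can be sketched as a best-first loop. `generate` and `verify` below are hypothetical callables standing in for the model's sampler and a solution scorer; the budget parameters are arbitrary.

```python
import heapq

def test_time_expand(generate, verify, problem,
                     n_paths=8, rounds=3, top_k=2, branch=4):
    """Best-first sketch of test-time expansion: sample several partial
    reasoning paths, keep the top-k under a verifier score, and spend
    the remaining budget extending only those."""
    # Round 0: broad exploration with short reasoning prefixes.
    paths = [generate(problem, prefix="") for _ in range(n_paths)]
    for _ in range(rounds):
        # Score every candidate path (higher = more promising).
        scored = [(verify(problem, p), p) for p in paths]
        best = heapq.nlargest(top_k, scored, key=lambda s: s[0])
        # Reallocate compute: branch each promising path into several
        # longer continuations instead of extending weak ones.
        paths = [generate(problem, prefix=p)
                 for _, p in best for _ in range(branch)]
    # Return the highest-scoring completed solution.
    return max(paths, key=lambda p: verify(problem, p))
```

The point of the sketch is the compute schedule: the total number of generation calls is bounded up front (n_paths + rounds * top_k * branch), but later rounds spend it only on the paths the verifier currently ranks highest.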

Section 05

Methodological Insights: Key Takeaways from SU-01's Success

SU-01's success brings the following insights:

  1. Data quality over quantity: 340,000 carefully selected short trajectories are more effective than millions of low-quality long trajectories;
  2. Curriculum design is critical: The "hard to easy" training order forces the model to learn essential reasoning strategies and avoid overfitting to simple patterns;
  3. Progressive reinforcement learning: The two-stage design, moving from basic ability to fine-grained optimization, matches the gradual way abilities are built (see the reward sketch after this list);
  4. Value of test-time computation: Scaling computation at inference time improves performance; the reasoning bottleneck lies not only in model size but also in how compute is used.
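
As a concrete (and deliberately hedged) illustration of insight 3, the two-stage design can be read as a reward function that switches on fine-grained terms partway through training. The stage boundary, the weights, and the sample fields below are invented for illustration; the post says only that RL used 200 update steps in total.

```python
def two_stage_reward(sample, step, stage_boundary=100):
    """Hedged sketch of SU-01-style two-stage reward shaping.
    The 100-step boundary, weights, and field names are assumptions
    made for this example, not values from the paper."""
    # Stage 1: verifiable rewards only.
    reward = 1.0 if sample["answer_correct"] else 0.0
    reward += 0.5 if sample["proof_complete"] else 0.0
    if step >= stage_boundary:
        # Stage 2: add fine-grained proof-level terms.
        reward += 0.2 * sample["rigor_score"]        # e.g. fraction of justified steps
        reward -= 0.1 * sample["verbosity_penalty"]  # favor concise, elegant proofs
    return reward
```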

Section 06

Limitations and Future Directions: Shortcomings of SU-01 and Follow-up Research

Limitations: SU-01's performance on some geometry problems needs improvement (possibly related to how geometric proofs are represented in the training data), and its performance on creative, open-ended problems requires further evaluation. Future directions: expand to more scientific fields such as chemistry and biology; explore larger-scale models; further reduce training-data requirements.


Section 07

Conclusion: Core Contributions and Significance of SU-01

SU-01 achieves gold medal-level performance in Olympiads with a simple, unified training formula and a modest resource investment. Its core contribution is showing that medium-sized models can master complex scientific reasoning through well-designed curriculum learning, progressive reinforcement learning, and test-time expansion. This opens new possibilities for the democratization of reasoning models: high-performance reasoning is no longer the exclusive preserve of tech giants.