SU-01: A Simple and Unified Scalable Approach to Achieve Gold Medal-level Reasoning in Olympiads

The research team trained SU-01 with reverse perplexity curriculum learning, two-stage reinforcement learning, and test-time expansion, using only a 30B-A3B backbone and 340K reasoning trajectories, and achieved gold medal-level performance in the IMO and IPhO competitions.

Olympiad Reasoning · Reinforcement Learning · Curriculum Learning · SU-01 · Math Reasoning · Physics Reasoning · Test-Time Expansion
Published 2026-05-13 18:13 · Recent activity 2026-05-14 12:52 · Estimated read 7 min

Section 01

Introduction: Core Breakthroughs of SU-01 in Achieving Gold Medal-level Reasoning in Olympiads

SU-01 is a reasoning model developed by the research team. It is trained on a 30B-A3B backbone (a mixture-of-experts architecture) and 340K reasoning trajectories through three core strategies: reverse perplexity curriculum learning, two-stage reinforcement learning, and test-time expansion. The model achieves gold medal-level performance in the International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO), demonstrating that medium-sized models can master complex scientific reasoning and opening new possibilities for the democratization of reasoning models.


Section 02

Background: AI Challenges in Olympiad Reasoning and Limitations of Existing Methods

The International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO) represent the highest level of human logical thinking. Their problems demand deep domain knowledge, creative problem decomposition, rigorous reasoning, and precise calculation, and were long regarded as insurmountable for AI. In recent years AI has made breakthroughs on Olympiad problems, but existing methods rely on complex pipelines, massive data, and large models, driving training costs extremely high. The research team asked: is there a simpler, unified method that achieves gold medal-level performance with a reasonable resource investment?


Section 03

Core Training Method of SU-01: A Three-Stage Formula

SU-01's training formula includes three core stages:

  1. Reverse Perplexity Curriculum Learning: Order the training data from high to low perplexity so the model confronts the hardest reasoning patterns first, building robust proof-search and self-checking abilities instead of settling into simple pattern matching (see the sketch after this list);
  2. Two-Stage Reinforcement Learning: First optimize verifiable rewards (such as answer correctness and proof completeness) to consolidate basic abilities, then perform fine-grained proof-level optimization focusing on elegance, conciseness, and logical rigor;
  3. Test-Time Expansion: Generate longer reasoning chains (over 100,000 tokens) at inference time, explore and verify multiple solution paths, and dynamically allocate compute to the most promising directions.
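
The post does not detail how the curriculum is implemented, but the heart of reverse perplexity curriculum learning is a score-then-sort step. Below is a minimal Python sketch assuming a Hugging Face-style causal LM and tokenizer; the function names and the use of mean per-token perplexity are illustrative assumptions, not details from the paper.

```python
import math

import torch
from torch.nn.functional import cross_entropy

def trajectory_perplexity(model, tokenizer, text):
    """Mean per-token perplexity of one reasoning trajectory under a
    causal LM (Hugging Face-style model and tokenizer assumed)."""
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits
    # Shift by one position so each token is predicted from its prefix.
    loss = cross_entropy(logits[0, :-1], ids[0, 1:])
    return math.exp(loss.item())

def reverse_perplexity_order(trajectories, model, tokenizer):
    """Order trajectories from HIGH to LOW perplexity, so the model
    sees the hardest reasoning patterns first."""
    return sorted(trajectories,
                  key=lambda t: trajectory_perplexity(model, tokenizer, t),
                  reverse=True)  # hardest (highest perplexity) first
```

Once the data is ordered this way, training simply consumes batches front to back, so the earliest gradient updates come from the highest-perplexity (hardest) trajectories.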

Section 04

Experimental Evidence: Gold Medal-level Performance and Key Ability Demonstrations

SU-01 uses a mixture-of-experts (MoE) backbone with 30B total parameters, of which 3B are active per token (hence 30B-A3B). The training data consists of 340,000 reasoning trajectories, each shorter than 8K tokens, and reinforcement learning involves only 200 update steps. Its results:

  • Math competitions: Achieves gold medal level in IMO 2025 and USAMO 2026;
  • Physics competitions: Achieves gold medal level in IPhO 2024 and 2025;
  • Long-range reasoning: Reliably generates reasoning chains of over 100,000 tokens (a sketch of the expansion loop follows this list);
  • Cross-domain generalization: Can handle scientific reasoning problems outside the math and physics training distribution.
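
The post does not specify SU-01's test-time expansion procedure, but the "multi-path exploration with dynamic allocation" described in Section 03 can be sketched as a best-first loop. `generate` and `verify` below are hypothetical callables standing in for the model's sampler and a solution scorer; the budget parameters are arbitrary.

```python
import heapq

def test_time_expand(generate, verify, problem,
                     n_paths=8, rounds=3, top_k=2, branch=4):
    """Best-first sketch of test-time expansion: sample several partial
    reasoning paths, keep the top-k under a verifier score, and spend
    the remaining budget extending only those."""
    # Round 0: broad exploration with short reasoning prefixes.
    paths = [generate(problem, prefix="") for _ in range(n_paths)]
    for _ in range(rounds):
        # Score every candidate path (higher = more promising).
        scored = [(verify(problem, p), p) for p in paths]
        best = heapq.nlargest(top_k, scored, key=lambda s: s[0])
        # Reallocate compute: branch each promising path into several
        # longer continuations instead of extending weak ones.
        paths = [generate(problem, prefix=p)
                 for _, p in best for _ in range(branch)]
    # Return the highest-scoring completed solution.
    return max(paths, key=lambda p: verify(problem, p))
```

The point of the sketch is the compute schedule: the total number of generation calls is bounded up front (n_paths + rounds * top_k * branch), but later rounds spend it only on the paths the verifier currently ranks highest.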

Section 05

Methodological Insights: Key Takeaways from SU-01's Success

SU-01's success brings the following insights:

  1. Data quality over quantity: 340,000 carefully selected short trajectories are more effective than millions of low-quality long trajectories;
  2. Curriculum design is critical: The "hard to easy" training order forces the model to learn essential reasoning strategies and avoid overfitting to simple patterns;
  3. Progressive reinforcement learning: The two-stage design, moving from basic ability to fine-grained optimization, matches the gradual way abilities are built (see the reward sketch after this list);
  4. Value of test-time computation: Scaling computation at inference time improves performance; the reasoning bottleneck lies not only in model size but also in how compute is used.
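
As a concrete (and deliberately hedged) illustration of insight 3, the two-stage design can be read as a reward function that switches on fine-grained terms partway through training. The stage boundary, the weights, and the sample fields below are invented for illustration; the post says only that RL used 200 update steps in total.

```python
def two_stage_reward(sample, step, stage_boundary=100):
    """Hedged sketch of SU-01-style two-stage reward shaping.
    The 100-step boundary, weights, and field names are assumptions
    made for this example, not values from the paper."""
    # Stage 1: verifiable rewards only.
    reward = 1.0 if sample["answer_correct"] else 0.0
    reward += 0.5 if sample["proof_complete"] else 0.0
    if step >= stage_boundary:
        # Stage 2: add fine-grained proof-level terms.
        reward += 0.2 * sample["rigor_score"]        # e.g. fraction of justified steps
        reward -= 0.1 * sample["verbosity_penalty"]  # favor concise, elegant proofs
    return reward
```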

Section 06

Limitations and Future Directions: Shortcomings of SU-01 and Follow-up Research

Limitations: SU-01's performance on some geometry problems needs improvement (possibly related to how geometric proofs are represented in the training data), and its performance on creative, open-ended problems requires further evaluation. Future directions: expand to more scientific fields such as chemistry and biology; explore larger-scale models; further reduce training-data requirements.


Section 07

Conclusion: Core Contributions and Significance of SU-01

SU-01 achieves gold medal-level performance in Olympiads with a simple, unified training formula and a modest resource investment. Its core contribution is showing that medium-sized models can master complex scientific reasoning through well-designed curriculum learning, progressive reinforcement learning, and test-time expansion. This opens new possibilities for the democratization of reasoning models: high-performance reasoning is no longer the exclusive preserve of tech giants.