Section 01
[Introduction] SRPO: A New Reinforcement Learning Framework Combining the Advantages of GRPO and SDPO
Researchers propose Sample Routing Policy Optimization (SRPO) to address two shortcomings of existing reinforcement-learning post-training methods: the coarse-grained credit assignment of GRPO and the long-horizon stability issues of SDPO. By intelligently routing correct and failed samples, SRPO combines the stability of GRPO with the fine-grained supervision of SDPO. Experiments show that SRPO improves average performance on Qwen3-8B by 3.4%-6.3% while cutting computational cost by 17.2%, offering an efficient new option for large-model post-training.
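To make the routing idea concrete, here is a minimal sketch of what a sample router might look like. The source does not specify the routing rule, so this assumes (hypothetically) that correct rollouts receive a GRPO-style group-relative advantage while failed rollouts are sent to an SDPO-style path with per-step supervision signals; the `Rollout` structure and `route_samples` function are illustrative names, not from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rollout:
    reward: float             # scalar outcome reward (e.g. 1.0 = correct, 0.0 = failed)
    step_scores: List[float]  # hypothetical per-step scores for fine-grained credit

def route_samples(group: List[Rollout]) -> Tuple[list, list]:
    """Split a rollout group into a GRPO-style batch (correct samples,
    group-relative advantage) and an SDPO-style batch (failed samples,
    per-step supervision). Illustrative only; not the paper's algorithm."""
    mean_r = sum(r.reward for r in group) / len(group)
    std_r = (sum((r.reward - mean_r) ** 2 for r in group) / len(group)) ** 0.5 or 1.0
    grpo_batch, sdpo_batch = [], []
    for r in group:
        if r.reward > 0:
            # Correct sample: coarse, group-relative credit assignment
            grpo_batch.append((r, (r.reward - mean_r) / std_r))
        else:
            # Failed sample: fine-grained, step-level credit assignment
            sdpo_batch.append((r, r.step_scores))
    return grpo_batch, sdpo_batch
```

Under this reading, the stable group-normalized signal supervises trajectories that already succeed, while the denser step-level signal is reserved for failures, where coarse credit assignment is least informative.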