# SRPO: A New Reinforcement Learning Framework Combining the Advantages of GRPO and SDPO

> Researchers propose Sample Routing Policy Optimization (SRPO), which routes correct samples to a GRPO branch and failed samples to an SDPO branch, combining the stability of GRPO with the fine-grained supervision of SDPO. On Qwen3-8B, its average benchmark score is 3.4% higher than GRPO and 6.3% higher than SDPO, while computational cost is reduced by 17.2%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-02T17:29:18.000Z
- Last activity: 2026-04-03T04:49:50.936Z
- Popularity: 130.7
- Keywords: SRPO, GRPO, SDPO, reinforcement learning, large language models, post-training, sample routing, policy optimization
- Page link: https://www.zingnex.cn/en/forum/thread/srpo-grposdpo
- Canonical: https://www.zingnex.cn/forum/thread/srpo-grposdpo
- Markdown source: floors_fallback

---

## [Introduction] SRPO: A New Reinforcement Learning Framework Combining the Advantages of GRPO and SDPO

Researchers propose Sample Routing Policy Optimization (SRPO) to address two pain points of existing reinforcement learning post-training methods: the coarse-grained credit assignment of GRPO and the long-term instability of SDPO. SRPO routes correct samples to a GRPO branch and failed samples to an SDPO branch, combining the stability of GRPO with the fine-grained supervision of SDPO. Experiments on Qwen3-8B show an average score 3.4% higher than GRPO and 6.3% higher than SDPO, with computational cost reduced by 17.2%, making SRPO an efficient option for large model post-training.

## Dilemmas of Existing Reinforcement Learning Post-Training Methods

GRPO, the current mainstream method, assigns credit coarsely: it penalizes a failed sample uniformly and cannot locate the specific tokens responsible for the error. SDPO, a newer method, converges quickly through logit-level supervision but becomes unstable over long training runs. This instability is rooted in the optimization ambiguity of self-distillation on correct samples and in the degradation of the self-teacher signal over time.
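The coarse credit assignment described above can be illustrated with a minimal sketch of group-relative advantages as GRPO is commonly formulated. The function names, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; they are not taken from the SRPO work.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Sequence-level advantages, normalized within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Every token of a response receives the same sequence-level advantage."""
    return [advantage] * num_tokens

# Binary verifier outcomes for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A failed response of five tokens: all five tokens get the same negative value,
# so the update cannot tell which token actually caused the failure.
token_advantages = broadcast_to_tokens(advantages[1], num_tokens=5)
```

Because the advantage is computed per response and then copied to each token, a single wrong step and the correct steps around it are penalized equally.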

## Core Mechanisms and Technical Implementation of SRPO

The core of SRPO is sample routing: correct samples go to the GRPO branch, which reinforces correct behaviors, and failed samples go to the SDPO branch, which corrects errors at a fine-grained level. SRPO also introduces entropy-aware dynamic weighting, which assigns high weights to low-entropy (high-confidence) self-distillation signals in failed samples and suppresses high-entropy, unreliable signals. In the implementation, the GRPO loss and the weighted SDPO loss are optimized jointly while the on-policy training property is preserved.
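The routing and weighting steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weight formula `1 - H / log(V)`, the use of a per-token KL divergence as the distillation term, and all function and field names are hypothetical, since the text does not specify the exact forms.

```python
import math

def route_samples(samples):
    """Split a sampled group by verifier outcome: correct -> GRPO, failed -> SDPO."""
    grpo_branch = [s for s in samples if s["correct"]]
    sdpo_branch = [s for s in samples if not s["correct"]]
    return grpo_branch, sdpo_branch

def entropy(probs):
    """Shannon entropy (in nats) of one token-level distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weight(teacher_probs):
    """Assumed weighting: 1 - H / log(V), so confident teacher signals weigh more."""
    max_entropy = math.log(len(teacher_probs))
    return 1.0 - entropy(teacher_probs) / max_entropy

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) at one token position (assumed distillation term)."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0.0)

def weighted_distill_loss(teacher_dists, student_dists):
    """Entropy-weighted mean of per-token KL terms for one failed sample."""
    terms = [entropy_weight(t) * kl_divergence(t, s)
             for t, s in zip(teacher_dists, student_dists)]
    return sum(terms) / len(terms)

# Toy group: one correct and two failed responses (hypothetical record layout).
samples = [{"id": 0, "correct": True},
           {"id": 1, "correct": False},
           {"id": 2, "correct": False}]
grpo_branch, sdpo_branch = route_samples(samples)

# Two teacher distributions over a 4-token vocabulary: confident vs. uniform.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
w_confident = entropy_weight(confident)
w_uncertain = entropy_weight(uncertain)

student = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
loss = weighted_distill_loss([confident, uncertain], student)
```

In this sketch the uniform teacher distribution receives a weight of approximately zero, so its position contributes almost nothing to the distillation loss, while the confident position dominates. A joint objective would then add this weighted term for failed samples to the GRPO loss computed on the correct samples.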

## Experimental Verification Results of SRPO

On the Qwen3-8B model, SRPO's average score across 5 benchmarks (covering mathematical reasoning, code generation, and other tasks) is 3.4% higher than GRPO and 6.3% higher than SDPO, and computational cost is reduced by 17.2%. The advantages are consistent on the Qwen3-32B model, indicating that the method remains effective across model scales.

## Impact and Insights of SRPO on the Industry

SRPO advances post-training paradigms by improving model capabilities without additional resources. It also deepens the understanding of how sample quality affects training and may inspire more targeted training strategies. Its open-source implementation is expected to become a mainstream next-generation post-training algorithm and to drive improvements in open-source large models.

## Limitations and Future Research Directions of SRPO

SRPO currently applies mainly to tasks with verifiable rewards (such as mathematics and code), and its applicability to open-ended generation tasks remains to be verified. The entropy weight calculation also adds overhead. Future directions include extending SRPO to more task types, exploring finer-grained sample classification, and combining it with offline RL and other techniques to further improve performance.
