Section 01
REFT: First-Token Diversification Boosts Exploration Efficiency of RLVR Reasoning Models (Guide)
Title: REFT: Exploring Efficient Reinforcement Learning for Reasoning Models via First-Token Diversification
Core idea: This paper proposes REFT, a method that targets the sampling-diversity bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR). It diversifies sampling only at the first token position after the reasoning prompt, a lightweight change that significantly increases rollout diversity. REFT outperforms the DAPO and GRPO baselines across multiple models (0.5B-7B) and difficulty settings.
Source Information: Original authors: arXiv authors; Source platform: arXiv; Original title: Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR; Link: http://arxiv.org/abs/2605.28295v1; Publication time: 2026-05-27T10:46:01Z.
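Illustrative sketch: the summary above does not spell out how the first token is diversified, so the following is only one plausible reading, not the authors' implementation. It assumes that each rollout in a group is seeded with a different first token, drawn without replacement from the top-k candidates of the model's first-token distribution, after which decoding would continue as usual. The function name `diversified_first_tokens`, the `logits` dictionary, and the `top_k` parameter are all hypothetical.

```python
import math
import random


def diversified_first_tokens(logits, group_size, top_k, rng):
    """Choose one first token per rollout, avoiding repeats where possible.

    logits: dict mapping candidate token -> unnormalized log-probability
            (a hypothetical stand-in for the model's first-token distribution).
    Returns a list of `group_size` tokens.
    """
    # Restrict to the top_k most likely first tokens (assumed design choice).
    candidates = sorted(logits, key=logits.get, reverse=True)[:top_k]

    # Softmax-style weights over the candidates, shifted for numerical stability.
    max_logit = max(logits[tok] for tok in candidates)
    weights = {tok: math.exp(logits[tok] - max_logit) for tok in candidates}

    chosen = []
    pool = list(candidates)
    while len(chosen) < group_size:
        if not pool:
            # More rollouts than candidates: start a fresh pass over the candidates.
            pool = list(candidates)
        tok = rng.choices(pool, weights=[weights[t] for t in pool], k=1)[0]
        chosen.append(tok)
        pool.remove(tok)  # sample without replacement within a pass
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = {"First": 2.0, "Let": 1.5, "We": 1.0, "To": 0.5, "The": -1.0}
    # Each returned token would seed one rollout in the group.
    print(diversified_first_tokens(toy_logits, group_size=4, top_k=4, rng=rng))
```

Under these assumptions, a group no larger than `top_k` always starts from distinct first tokens, whereas plain temperature sampling may repeat the most likely opening across many rollouts.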