Section 01
Introduction: SPS—A New Paradigm for Enhancing the Exploration Capability of Large Model Reasoning
RL training often improves single-sample performance while leaving diverse exploration limited. To address this, we propose SPS (Steering Probability Squeezing), a training paradigm that alternates between conventional RL and inverse reinforcement learning (IRL) to reshape the trajectory distribution. SPS improves Pass@k on five reasoning benchmarks and reveals an inherent upper limit on exploration.
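
Because the results are stated in terms of Pass@k, a minimal sketch of how that metric is commonly estimated may help. It assumes the widely used unbiased estimator, computed from n sampled answers per problem of which c are correct; this section does not say which estimator SPS uses, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples of which c are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k),
    not necessarily the exact evaluation protocol used for SPS.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset has no correct sample, complemented.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for one problem: 10 samples, 3 correct.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

With these hypothetical counts, Pass@1 is the fraction of correct samples, while larger k rewards having at least one correct trajectory among several. That is why Pass@k at larger k is used as a proxy for exploration.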