
SPS: Enhancing the Exploration Capability of Large Model Reasoning via Probability Squeezing Guidance

To address the problem in RL training where single-sample performance improves while diverse exploration is curtailed, this work proposes the SPS paradigm. By alternating between traditional RL and inverse reinforcement learning to reshape the trajectory distribution, SPS improves Pass@k performance on five reasoning benchmarks and reveals an inherent upper limit on exploration.

Tags: Reinforcement Learning, Inverse RL, Exploration, Pass@k, Reasoning Models, Probability Squeezing, Mathematical Reasoning, LLM Training
Published 2026-04-18 21:49 | Recent activity 2026-04-21 09:53 | Estimated read 6 min

Section 01

Introduction: SPS—A New Paradigm for Enhancing the Exploration Capability of Large Model Reasoning

To address the problem in RL training where single-sample performance improves while diverse exploration is curtailed, we propose the SPS (Steering Probability Squeezing) training paradigm. By alternating between traditional RL and inverse reinforcement learning (IRL) to reshape the trajectory distribution, SPS improves Pass@k performance on five reasoning benchmarks and reveals an inherent upper limit on exploration capability.


Section 02

Background: Exploration Dilemma in RL Training

Reinforcement Learning (RL) is a promising paradigm for training reasoning-oriented large language models, but there is a tension between single-sample performance (Pass@1) and diverse exploration (Pass@k). Traditional RL training often improves Pass@1 while restricting the exploration of diverse reasoning trajectories, producing the probability squeezing effect: probability mass becomes excessively concentrated on a few high-reward trajectories, suppressing promising alternative paths and narrowing the exploration space.
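The Pass@k metric referenced throughout is conventionally computed with the standard unbiased estimator (this is the usual formula for the metric, not something specific to SPS): draw n samples per problem, count the c correct ones, and estimate 1 - C(n-c, k)/C(n, k). A minimal sketch using the numerically stable product form:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n samples per problem of which c are correct."""
    if n - c < k:
        # fewer than k incorrect samples: some correct one always survives
        return 1.0
    # product form avoids computing large binomial coefficients directly
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail
```

For example, with n = 2 samples and c = 1 correct, `pass_at_k(2, 1, 1)` recovers the expected 0.5.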


Section 03

Core Method: Alternating Training Strategy of the SPS Paradigm

The SPS paradigm reshapes the trajectory distribution by alternately using traditional RL and inverse reinforcement learning (IRL):

  1. RL phase: optimize the policy with verifiable rewards, increasing the probability of high-value trajectories;
  2. IRL phase: treat samples from the current policy as demonstrations (no external supervision is needed) and raise the probability of undervalued trajectories, countering the squeezing effect;
  3. Alternating iteration: repeat the two phases to maintain a dynamic balance between exploitation and exploration.
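The alternation above can be driven by a simple step schedule. The phase lengths below are illustrative placeholders, not values from the paper, and the two update bodies are left as stubs:

```python
def sps_phase(step: int, rl_steps: int = 100, irl_steps: int = 20) -> str:
    """Which phase of an SPS-style alternating schedule a global step
    falls in. Phase lengths here are hypothetical, not from the paper."""
    cycle = rl_steps + irl_steps
    return "RL" if step % cycle < rl_steps else "IRL"


def train(total_steps: int) -> None:
    for step in range(total_steps):
        if sps_phase(step) == "RL":
            # RL phase: policy-gradient update on verifiable rewards
            pass
        else:
            # IRL phase: raise the probability of undervalued
            # trajectories sampled from the current policy
            pass
```

With the default lengths, each cycle runs 100 RL steps followed by 20 IRL steps; tuning this ratio is exactly the alternation-frequency trade-off discussed later in the post.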

Section 04

Experimental Evidence: Performance Validation on Five Reasoning Benchmarks

Evaluations on five benchmarks—GSM8K (elementary school math), MATH (competition-level math), SVAMP (math word problems), StrategyQA (commonsense reasoning), and CommonsenseQA (commonsense question answering)—show that SPS consistently outperforms baseline methods, improves Pass@k performance, maintains Pass@1 competitiveness, and enhances solution diversity.


Section 05

In-depth Analysis: Inherent Upper Limit of Exploration Capability

The study identifies an empirical Pass@k upper limit, revealing the inherent constraints of the exploration capability of RL-based reasoning models and providing reference boundaries for model design. The causes of the upper limit may include limitations in the expressive power of the policy network, sparsity of reward signals, coverage of training data, and convergence characteristics of optimization algorithms.


Section 06

Design Insights and Training Recommendations

Design insights of SPS: the alternation frequency must be balanced (too frequent causes instability; too sparse fails to counter the squeezing effect; adaptive adjustment is recommended). Compared with alternative methods, SPS requires no additional data, keeps computational overhead controllable, and has a clear theoretical motivation. Training recommendations: monitor changes in policy entropy, introduce regularization when the squeezing effect is detected, and adopt multi-stage training that alternates between different objectives.
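Monitoring policy entropy, as recommended above, can be as simple as averaging the Shannon entropy of the policy's per-token softmax distribution and flagging a drop. The alert threshold below is a hypothetical illustration, not a value from the paper:

```python
import math


def entropy_from_logits(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits,
    computed with the max-subtraction trick for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def squeezing_alert(entropies: list[float], threshold: float = 0.5) -> bool:
    """Flag possible probability squeezing when the mean per-token policy
    entropy falls below a (hypothetical) threshold."""
    return sum(entropies) / len(entropies) < threshold
```

A uniform distribution over V tokens has entropy ln(V), so a collapse toward a handful of trajectories shows up as the mean entropy sinking well below that ceiling, at which point the regularization mentioned above would kick in.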


Section 07

Limitations and Future Research Directions

Current limitations: hyperparameter sensitivity (the alternation frequency and IRL intensity need careful tuning), increased computational overhead, and a lack of theoretical convergence analysis. Future directions: develop adaptive SPS mechanisms, establish theoretical guarantees, extend the approach to domains such as code generation and scientific reasoning, and explore synergies with other exploration techniques.