HSIR: Making Self-Improvement of Large Reasoning Models Truly Effective

HSIR addresses the issues of data imbalance and overthinking in self-improvement training through the "Verify-Exit" sampling strategy and intrinsic diversity scoring, significantly improving reasoning performance while reducing inference overhead.

Tags: HSIR, large reasoning models, self-improvement, GRPO, data imbalance, overthinking, reinforcement learning
Published 2026-05-24 18:54 | Recent activity 2026-05-26 13:27 | Estimated read 8 min

Section 01

[Introduction] HSIR: Making Self-Improvement of Large Reasoning Models Both Efficient and Effective

Core Information

  • Source: The paper "Better, Faster: Harnessing Self-Improvement in Large Reasoning Models", published on arXiv on May 24, 2026 (Link: http://arxiv.org/abs/2605.24998v1)
  • Core Problems: Self-improvement of large reasoning models runs into two obstacles, data imbalance (many simple samples, few difficult ones) and overthinking (redundant reasoning steps)
  • Solution: HSIR combines a "Verify-Exit" sampling strategy with intrinsic diversity scoring
  • Effects: Reasoning performance improves by 10.9% on average, relative inference overhead falls by up to 42.4%, and the method applies to multiple post-training paradigms

Section 02

Background: The Ideal and Real-World Dilemmas of Large Model Self-Improvement

The Ideal of Self-Improvement

Large Reasoning Models (LRMs) are expected to improve continuously, without external supervision, by training on their own generated reasoning trajectories. This looks like a shortcut to intelligence.

Real-World Dilemmas

In practice, self-improvement performs poorly or even fails on complex tasks. The causes trace back to two key issues:

  1. Data Imbalance: Self-generated data is dominated by simple samples, while critical difficult samples are scarce, leading the model to stay in its comfort zone and struggle to break through its capability boundaries.
  2. Overthinking: Self-generated trajectories contain many redundant reasoning steps. Training on them teaches the model to produce verbose, inefficient solutions, which lowers efficiency and makes errors more likely.

Section 03

Core Methods of HSIR: Two-Pronged Approach to Solve the Two Major Problems

Method 1: Verify-Exit Sampling Strategy

To address data imbalance, the model verifies intermediate results while generating reasoning trajectories. If a path cannot lead to the correct answer, the model exits it and tries a new one, so that enough high-quality samples are collected for difficult problems.
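
The sampling loop described above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the callables `propose_step`, `is_solved`, and `is_dead_end`, along with the `max_steps` and `max_attempts` budgets, are hypothetical names introduced here, and the toy task (pick numbers until they sum to exactly 6) stands in for a real reasoning problem.

```python
from typing import Callable, List, Optional


def verify_exit_sample(
    propose_step: Callable[[List[int]], int],
    is_solved: Callable[[List[int]], bool],
    is_dead_end: Callable[[List[int]], bool],
    max_steps: int = 8,
    max_attempts: int = 16,
) -> Optional[List[int]]:
    """Return the first verified trajectory, or None if the budget runs out.

    All three callables are hypothetical stand-ins: a real system would call
    the model to propose a step and a verifier to check intermediate results.
    """
    for _ in range(max_attempts):
        steps: List[int] = []
        for _ in range(max_steps):
            steps.append(propose_step(steps))
            if is_solved(steps):
                return steps  # verified: keep this trajectory
            if is_dead_end(steps):
                break  # exit this path early and start a new one
    return None


# Toy task (assumption): pick numbers until they sum to exactly 6.
scripted = iter([5, 4, 2, 2, 2])  # deterministic "model" outputs for the demo
trajectory = verify_exit_sample(
    propose_step=lambda steps: next(scripted),
    is_solved=lambda steps: sum(steps) == 6,
    is_dead_end=lambda steps: sum(steps) > 6,
)
print(trajectory)  # [2, 2, 2]: the first path (5, 4) overshoots and is abandoned
```

The attempt budget is where the extra sampling cost shows up: harder problems consume more attempts before a verified trajectory is found.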

Method 2: Intrinsic Diversity Scoring

To address overthinking, HSIR quantifies the diversity and necessity of reasoning steps, filters out redundant and verbose samples, and retains concise, efficient solutions.
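
The summary does not give the exact definition of the intrinsic diversity score, so the sketch below uses a simple stand-in: the fraction of reasoning steps that are not repeats of another step. Both `diversity_score` and `filter_redundant`, and the `threshold` of 0.8, are assumptions made for illustration and should not be read as the paper's formula.

```python
from typing import List


def diversity_score(steps: List[str]) -> float:
    """Proxy score (assumption): share of distinct steps after normalization."""
    if not steps:
        return 0.0
    # Lowercase and collapse whitespace so trivial variations count as repeats.
    normalized = [" ".join(step.lower().split()) for step in steps]
    return len(set(normalized)) / len(normalized)


def filter_redundant(samples: List[List[str]], threshold: float = 0.8) -> List[List[str]]:
    """Keep only trajectories whose proxy score reaches the threshold."""
    return [steps for steps in samples if diversity_score(steps) >= threshold]


concise = ["x = 2", "y = x + 3", "answer: 5"]
verbose = ["x = 2", "wait, check x", "x = 2", "Wait,  check x", "y = x + 3", "answer: 5"]

print(diversity_score(concise))            # 1.0
print(round(diversity_score(verbose), 2))  # 0.67
print(filter_redundant([concise, verbose]) == [concise])  # True
```

A real score would need to judge whether a step is necessary, not only whether it is repeated; exact-match deduplication is only the simplest possible proxy.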

H-GRPO Enhancement Algorithm

H-GRPO treats the intrinsic diversity score as an additional reward, which yields a dual reward mechanism: one reward for solving the problem correctly and one for a concise, diverse reasoning process.
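
One way to picture the dual reward is to add a weighted diversity term to a correctness reward and then normalize rewards within each sampled group, as GRPO-style methods do. The additive form, the `diversity_weight` parameter, and the function names below are assumptions for illustration; the summary does not specify how H-GRPO actually combines the two signals.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(correct: bool, diversity: float, diversity_weight: float = 0.5) -> float:
    """Correctness reward plus a weighted diversity bonus (assumed additive form)."""
    return (1.0 if correct else 0.0) + diversity_weight * diversity


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt: (is_correct, diversity_score).
group = [(True, 1.0), (True, 0.5), (False, 1.0), (False, 0.0)]
rewards = [combined_reward(correct, score) for correct, score in group]
advantages = group_relative_advantages(rewards)

print(rewards)  # [1.5, 1.25, 0.5, 0.0]
print([round(a, 2) for a in advantages])
```

With this weighting, a correct and concise sample receives the largest advantage, and a correct but redundant sample still ranks above every incorrect one.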


Section 04

Experimental Evidence: Double Win in Performance and Efficiency

Performance Improvement

Across multiple benchmarks, HSIR improved reasoning performance by 10.9% on average.

Efficiency Optimization

Relative inference overhead was reduced by up to 42.4%, so the model becomes both more accurate and faster.
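
For readers unfamiliar with relative figures, the small helper below shows how such a percentage is computed. The summary does not say how overhead is measured, so the sketch assumes average generated tokens per problem; the values 1000 and 576 are made-up numbers chosen only to reproduce a 42.4% reduction, not results from the paper.

```python
def relative_reduction(baseline_cost: float, new_cost: float) -> float:
    """Fraction of the baseline cost that was saved."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - new_cost) / baseline_cost


# Hypothetical token counts per problem (illustrative only).
reduction = relative_reduction(baseline_cost=1000, new_cost=576)
print(f"{reduction:.1%}")  # 42.4%
```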

Cross-Paradigm Universality

HSIR achieved positive results when applied to multiple post-training paradigms, such as supervised fine-tuning and reinforcement learning, which indicates that it generalizes across training setups.


Section 05

In-Depth Analysis: Three Reasons for HSIR's Effectiveness

  1. Data Quality Improvement: The Verify-Exit strategy collects high-quality difficult samples, which avoids overfitting to low-difficulty samples.
  2. Regularization Effect: Intrinsic diversity scoring penalizes verbose reasoning and encourages more concise and generalizable solutions.
  3. Balance Between Exploration and Exploitation: The dual reward mechanism of H-GRPO uses conciseness rewards to exploit known efficient strategies and diversity rewards to explore new paths.

Section 06

Implications for Reasoning Model Training

  1. Data Curation is Crucial: Even self-generated data requires careful selection and balancing; blind use may lead to training failure.
  2. Efficiency and Performance Go Hand in Hand: Prior research has focused mainly on accuracy. HSIR shows that efficiency matters as well, and practical models need to balance both.
  3. Value of Multi-Objective Optimization: H-GRPO optimizes accuracy and efficiency simultaneously, which suggests that the multi-objective perspective can be extended to other scenarios.

Section 07

Limitations and Future Directions

Limitations

  • The Verify-Exit strategy increases sampling costs, requiring a trade-off between cost and performance.

Future Directions

  1. Refine intrinsic diversity scoring to better capture reasoning quality.
  2. Validate how well HSIR transfers across different domains, and tune its parameters for specific tasks.

Section 08

Conclusion: HSIR Paves the Way for Large Model Self-Improvement

By solving the two core issues of data imbalance and overthinking, HSIR makes the self-improvement of large reasoning models truly effective: it boosts reasoning ability while significantly reducing overhead. The work is a reminder that self-improvement is not a "free lunch"; it requires carefully designed data curation and training strategies. HSIR's ideas offer a useful reference for building stronger and more efficient reasoning models that think both better and more efficiently.