Zing Forum

PUMA: Stop When Reasoning Converges—A Semantic-Preserving Early Exit Mechanism for Reasoning Models

PUMA detects semantic redundancy in the reasoning chain to decide when reasoning has converged. It reduces generated tokens by an average of 26.2% while maintaining answer accuracy and reasoning chain integrity, making reasoning models more efficient.

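reasoning models, early exit mechanism, chain of thought, semantic redundancy, overthinking, reasoning efficiency, CoT optimization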
Published 2026-05-18 06:04 | Recent activity 2026-05-19 10:56 | Estimated read 6 min

Section 01

Introduction: PUMA—A Semantic-Preserving Early Exit Mechanism for Reasoning Models

PUMA is a semantic-preserving early exit mechanism for reasoning models. It treats semantic redundancy in the reasoning chain as the signal that reasoning has converged and stops generation at that point. It reduces generated tokens by an average of 26.2% while maintaining answer accuracy and reasoning chain integrity. The mechanism targets the "overthinking" problem of Large Reasoning Models (LRMs) and offers a new perspective on efficient reasoning.

Section 02

Background: Overthinking in Reasoning Models and Limitations of Existing Methods

The Overthinking Problem

Large reasoning models rely on long Chain of Thought (CoT) for complex reasoning, but they often keep generating redundant steps after the solution has stabilized. This wastes compute, increases latency, and produces unnecessarily long reasoning chains.

Limitations of Existing Methods

Existing early exit methods rely on answer-level signals such as confidence and answer consistency. These signals reflect whether an answer is ready, not whether the reasoning has converged, so they can trigger a premature exit that hurts accuracy or leave a semantically incomplete reasoning chain.

Section 03

PUMA Framework: Dual-Safeguard Design of Redundancy Detection and Answer Verification

Core Insight

Reasoning-level semantic redundancy is a convergence signal: when consecutive steps only restate conclusions already reached, the reasoning trajectory has converged. This is analogous to a person who stops deliberating once they notice they are going in circles.

Key Components

  1. Lightweight Redundancy Detector: Encodes each reasoning step as a semantic vector, computes the similarity between consecutive steps, and flags a step as redundant when the similarity exceeds a threshold. The detector is kept lightweight so that it adds little overhead.
  2. Answer-Level Verification: Checks answer stability, confidence, and reasoning chain integrity.
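
The redundancy detector described above can be sketched roughly as follows. This is a minimal illustration, not PUMA's actual implementation: it assumes each reasoning step has already been encoded into a vector by some sentence encoder, it uses cosine similarity as the similarity measure, and the function names and the 0.9 threshold are hypothetical choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def flag_redundant_steps(step_vectors, threshold=0.9):
    """Return one boolean per step.

    A step is flagged when its similarity to the previous step exceeds
    `threshold` (an assumed value). The first step has no predecessor,
    so it is never flagged.
    """
    flags = [False] * len(step_vectors)
    for i in range(1, len(step_vectors)):
        sim = cosine_similarity(step_vectors[i - 1], step_vectors[i])
        flags[i] = sim > threshold
    return flags


# Toy 2-D "embeddings": the last two steps point the same way as step 2.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.1, 1.0]]
flags = flag_redundant_steps(vectors)
```

Because only consecutive steps are compared, the cost per new step is one similarity computation, which is consistent with the low-overhead goal stated above.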

Dual-Safeguard Mechanism

Early exit is allowed only when both redundancy detection and answer verification pass, which balances safety and efficiency.
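
A sketch of how such a two-condition gate might look is given below. All names and parameters (`window`, `min_confidence`, `patience`) are assumptions made for illustration, and the answer check covers only stability and confidence; the reasoning chain integrity check mentioned above is omitted because the article does not describe how it is computed.

```python
def answer_is_verified(candidate_answers, confidence, window=3, min_confidence=0.8):
    """Answer-level check (simplified): the last `window` candidate answers
    agree and the reported confidence is at least `min_confidence`."""
    if len(candidate_answers) < window:
        return False
    recent = candidate_answers[-window:]
    stable = all(answer == recent[0] for answer in recent)
    return stable and confidence >= min_confidence


def should_exit_early(redundancy_flags, candidate_answers, confidence, patience=2):
    """Allow early exit only if BOTH safeguards pass:
    1. the last `patience` reasoning steps were all flagged redundant, and
    2. the answer-level verification succeeds.
    """
    if len(redundancy_flags) < patience:
        return False
    converged = all(redundancy_flags[-patience:])
    return converged and answer_is_verified(candidate_answers, confidence)
```

Requiring both conditions means a stable answer alone does not stop generation while the reasoning is still producing new content, and repeated steps alone do not stop it while the answer is still changing.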

Section 04

Experimental Results: Significant Efficiency Improvement and Cross-Task Generalization

Evaluations on 5 LRMs and 5 reasoning benchmarks show:

  • Token Reduction: Reduces generated tokens by an average of 26.2% while maintaining answer accuracy and CoT quality.
  • Cross-Task Generalization: Remains effective in code generation, in zero-shot vision-language reasoning, and when the stopping strategy is learned and internalized by the model, which suggests that reasoning-level redundancy signals are robust, transferable, and learnable.

Section 05

Technical Depth: Key Principles of Semantic-Preserving Early Exit

  1. Semantic-Level vs Token-Level Redundancy: Identifies conceptual repetition (even with different wording) to avoid missing semantically equivalent redundancy.
  2. Reasoning Chain Integrity: Ensures the retained reasoning prefix is a semantically complete argument, not a truncated fragment.
  3. Plug-and-Play Design: Can be applied to various reasoning models without retraining, enhancing practicality.
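
The plug-and-play and integrity principles above can be illustrated with a small wrapper that sits outside the model. This is a hypothetical sketch, not PUMA's interface: `run_with_early_exit` and `repeats_last_step` are names invented here, the model is represented by any iterable of reasoning steps, and the stop check is a plain string comparison standing in for a real semantic detector.

```python
from typing import Callable, Iterable, List


def run_with_early_exit(
    steps: Iterable[str],
    should_stop: Callable[[List[str]], bool],
) -> List[str]:
    """Consume reasoning steps from any source and stop once `should_stop`
    returns True. Steps are kept whole, so the retained prefix always ends
    on a step boundary and never in the middle of a step."""
    kept: List[str] = []
    for step in steps:
        kept.append(step)
        if should_stop(kept):
            break
    return kept


def repeats_last_step(kept: List[str]) -> bool:
    """Placeholder stop rule: the newest step repeats the previous one.
    A semantic detector would compare meanings, not normalized strings."""
    return len(kept) >= 2 and kept[-1].strip().lower() == kept[-2].strip().lower()


stream = iter(["Let x = 3.", "So 2x = 6.", "so 2x = 6.", "Double-check: 2x = 6."])
prefix = run_with_early_exit(stream, repeats_last_step)
```

Since the wrapper only reads the step stream, the underlying model needs no retraining, and any steps after the stop point are simply never requested from the generator.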

Section 06

Practical Application Value: Cost Reduction and Experience Enhancement

  • Reduce Service Costs: The average 26.2% token reduction directly lowers API call costs, increases throughput, and reduces GPU demand.
  • Improve User Experience: Faster response times, more understandable reasoning processes, and clearer answers.
  • Maintain Reasoning Quality: Does not sacrifice answer accuracy, reasoning chain coherence, or self-correction ability.
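
As a rough worked example of the cost point above, the snippet below applies the reported 26.2% average reduction to an output-token bill. The token volume and the price are made-up inputs for illustration only, and the calculation assumes cost scales linearly with generated tokens, which may not hold for every pricing scheme.

```python
def estimate_savings(baseline_tokens, price_per_million, reduction=0.262):
    """Return (baseline_cost, reduced_cost, saved) for output tokens.

    `reduction` defaults to the 26.2% average reported for PUMA;
    `baseline_tokens` and `price_per_million` are caller-supplied
    hypothetical values.
    """
    baseline_cost = baseline_tokens / 1_000_000 * price_per_million
    reduced_cost = baseline_cost * (1.0 - reduction)
    return baseline_cost, reduced_cost, baseline_cost - reduced_cost


# Hypothetical workload: 1 million output tokens at 10 currency units per million.
baseline, reduced, saved = estimate_savings(1_000_000, 10.0)
```

Under these assumed inputs the saving is 26.2% of the baseline bill; actual savings depend on the workload, since 26.2% is an average across models and benchmarks.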

Section 07

Conclusion and Future Directions: New Exploration of Efficient Reasoning

Conclusion

PUMA achieves semantic-preserving early exit by using reasoning-level semantic redundancy as its stopping signal. Beyond the efficiency gain, it offers a new perspective: effective reasoning requires knowing when to stop thinking. The open-source code gives the community a practical tool.

Future Directions

  • Dynamically adjust redundancy thresholds (based on task complexity and domain characteristics).
  • Cross-language semantic redundancy identification.
  • Internalize early exit strategies during training to enable models to learn efficient reasoning patterns.