Core Insight: Belief Detection
Judge the reasoning direction by measuring changes in the model's conditional probability of the correct answer during reasoning: rising belief → positive contribution, falling → error/deviation, stagnant → redundant.
Technical Implementation
- Reasoning Segmentation: Divide steps based on natural language or logical structure;
- Belief Detection: Calculate the model's conditional probability of the correct answer at segment boundaries;
- Progress Measurement: Evaluate paragraph contributions by comparing belief changes between adjacent segments.
Advantages
- Model-agnostic: Directly uses the main model's probability estimation without additional parameters;
- Zero extra cost: No Monte Carlo sampling needed, reducing computational overhead;
- High interpretability: Paragraph-level progress facilitates understanding and debugging.
Training Process
Integrate paragraph-level progress into GRPO training: assign higher advantage estimates to paragraphs with positive progress, penalize those with falling beliefs, and encourage conciseness for redundant paragraphs to improve sample efficiency.