# EvalQReason: Step-Level Reasoning Evaluation for Large Language Models via Probability Distribution Analysis

> A three-stage framework for evaluating LLM reasoning quality without manual annotation. It introduces two divergence-based algorithms, CSD and SFC, and reports up to F1 = 0.98 for correctness prediction across math and medical benchmarks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-06T12:01:20.000Z
- Last activity: 2026-06-06T12:22:01.668Z
- Popularity: 159.7
- Keywords: LLM, reasoning evaluation, step-level analysis, divergence metrics, CSD, SFC, AI safety, model evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/evalqreason
- Canonical: https://www.zingnex.cn/forum/thread/evalqreason
- Markdown source: floors_fallback

---

## EvalQReason: Step-Level LLM Reasoning Evaluation via Probability Distribution Analysis (Main Guide)

### Core Overview
EvalQReason is a three-stage framework that evaluates the reasoning of Large Language Models (LLMs) step by step by analyzing their probability distributions. It requires no manual annotation and reports up to F1 = 0.98 in correctness prediction on math and medical tasks.

### Basic Information
- **Author**: Shaima Ahmad Freja (University of Stavanger)
- **Source**: GitHub
- **Release Time**: June 2026
- **Link**: https://github.com/Shaima4127/EvalQReason
- **Contact**: shaima.a.freja@uis.no

## Background: The Need for Step-Level Evaluation

Traditional outcome-only LLM evaluation has critical flaws:
- A correct answer may come from a wrong reasoning path.
- Correct reasoning can still end in a wrong answer because of a calculation error.
- Hallucinations and logical jumps are invisible when only the final answer is assessed.

Manual step-level evaluation is costly and does not scale, so an automated, interpretable method is needed.

## Framework: Stages & Key Metrics

#### Three-Stage Architecture
1. **Reasoning Generation & Logit Extraction**: Generate step-by-step reasoning chains and extract the token-level logits, saved as `.pkl` files. This stage needs access to logits, so closed-source API models such as GPT-4 are excluded.
2. **Reasoning Dynamics Quantification**: Compute CSD and SFC using KL divergence, JS divergence, Hellinger distance, cosine similarity, and entropy difference.
3. **Pattern Analysis & Prediction**: Visualize the resulting trajectories and predict correctness with classic ML (XGBoost) and sequence models (GRU).
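The five measures named in stage 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's released code: the function names, the `EPS` floor, and the toy logits are assumptions made for the example, and the source does not say how EvalQReason aggregates token-level distributions before comparing them.

```python
import numpy as np

EPS = 1e-12  # illustrative floor that keeps log() away from zero


def softmax(logits):
    """Turn a 1-D vector of logits into a probability distribution."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()  # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def entropy(p):
    """Shannon entropy in nats."""
    p = np.clip(p, EPS, 1.0)
    return float(-np.sum(p * np.log(p)))


def kl_divergence(p, q):
    """KL(p || q) in nats; asymmetric in its arguments."""
    p = np.clip(p, EPS, 1.0)
    q = np.clip(q, EPS, 1.0)
    return float(np.sum(p * np.log(p / q)))


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = 0.5 * (np.asarray(p) + np.asarray(q))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger(p, q):
    """Hellinger distance, which lies in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))


def compare(p, q):
    """Collect all five measures for one pair of distributions."""
    return {
        "kl": kl_divergence(p, q),
        "js": js_divergence(p, q),
        "hellinger": hellinger(p, q),
        "cosine": cosine_similarity(p, q),
        "entropy_diff": entropy(q) - entropy(p),
    }


# Toy logits over a 3-token vocabulary (hypothetical values).
p = softmax([2.0, 1.0, 0.0])
q = softmax([0.0, 1.0, 2.0])
print(compare(p, q))
```

KL divergence is asymmetric, so the order of the two steps matters. JS divergence and Hellinger distance are symmetric and bounded, which can make them easier to compare across reasoning chains of different lengths.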

#### Key Metrics
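- **CSD (Consecutive Step Divergence)**: Local consistency between adjacent reasoning steps. Low values indicate a smooth progression; high values indicate drift.
- **SFC (Step-to-Final Convergence)**: Global alignment between each reasoning step and the final answer.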
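As a rough sketch of how the two metrics differ, the code below computes a CSD trajectory (each step against the next) and an SFC trajectory (each step against the final answer). It assumes every step has already been reduced to a single probability distribution and uses JS divergence as the comparison; both choices, along with the function names and toy numbers, are assumptions for illustration rather than details confirmed by the source.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions, in nats."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=np.float64), eps, 1.0)
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))


def csd_trajectory(step_dists, divergence=js_divergence):
    """Local view: divergence between each step and the step after it."""
    return [
        divergence(step_dists[i], step_dists[i + 1])
        for i in range(len(step_dists) - 1)
    ]


def sfc_trajectory(step_dists, final_dist, divergence=js_divergence):
    """Global view: divergence between each step and the final answer."""
    return [divergence(p, final_dist) for p in step_dists]


# Hypothetical per-step distributions over a 3-token vocabulary.
steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],  # close to step 1: smooth transition
    [0.1, 0.2, 0.7],  # far from step 2: drift
]
final = [0.1, 0.1, 0.8]

csd = csd_trajectory(steps)
sfc = sfc_trajectory(steps, final)
print("CSD:", csd)
print("SFC:", sfc)
```

A chain with `n` steps yields `n - 1` CSD values but `n` SFC values, so the two trajectories have different lengths. In this toy example the second CSD value is much larger than the first, and the SFC values shrink as the steps approach the final distribution.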

## Experimental Design: Datasets & Models

#### Datasets
| Dataset | Domain | Scale | Difficulty |
|---------|--------|-------|------------|
| AIME | Math | 240 | 3 levels |
| Math-500 | Math | 500 | 5 levels |
| MedQA | Medical | 1273 | 2 levels |

#### Models Tested
- Qwen2.5-7B-Instruct
- MathStral-7B
- Qwen-Medicine-7B
- Qwen3-4B
- Qwen3-8B

The cross-domain, multi-scale design is intended to test how well the findings generalize.

## Core Results & Key Findings

#### Best Performance
| Algorithm | Model Type | Classifier | Dataset | LLM | F1 | ROC-AUC |
|-----------|------------|------------|---------|-----|----|---------|
| CSD | Classic ML | XGBoost | AIME | Qwen3-4B | 0.91 | 0.90 |
| CSD | Sequence | GRU | Math-500 | Qwen3-8B | 0.98 | 0.90 |
| SFC | Sequence | NN | Math-500 | Qwen3-8B | 0.98 | 0.96 |
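For readers less familiar with the two reported metrics, here is a small dependency-free sketch of how F1 and ROC-AUC are computed for a binary correct/incorrect prediction task. The labels, scores, and the 0.5 threshold are made-up illustration values, not data from the experiments.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = reasoning chain led to a correct answer.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold

print("F1:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc(y_true, scores))
```

The two metrics answer different questions. F1 depends on the chosen threshold and on class balance, while ROC-AUC depends only on how well the scores rank correct chains above incorrect ones, so a very high F1 can coexist with a lower ROC-AUC.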

#### Findings
1. CSD outperforms SFC in most settings.
2. Sequence models (GRU) outperform classic ML classifiers.
3. Results are stable across model scales from 4B to 8B parameters.
4. Math tasks show clearer patterns than medical tasks.
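One plausible reason for the gap between sequence models and classic ML is the input representation: a tree-based classifier needs a fixed-length feature vector, whereas a recurrent model can read the whole trajectory in order. The sketch below contrasts the two preparations. The specific summary statistics, the padding scheme, and the function names are assumptions for illustration; the source does not describe the actual feature set.

```python
import numpy as np


def summary_features(trajectory):
    """Fixed-length summary of one divergence trajectory (classic ML input)."""
    t = np.asarray(trajectory, dtype=np.float64)
    # Slope of a least-squares line through the trajectory; 0 for one point.
    slope = np.polyfit(np.arange(len(t)), t, 1)[0] if len(t) > 1 else 0.0
    return np.array([t.mean(), t.std(), t.max(), t.min(), t[-1], slope])


def pad_trajectories(trajectories, pad_value=0.0):
    """Right-pad variable-length trajectories (sequence model input).

    Returns the padded batch and a boolean mask marking the real values.
    """
    max_len = max(len(t) for t in trajectories)
    batch = np.full((len(trajectories), max_len), pad_value, dtype=np.float64)
    mask = np.zeros((len(trajectories), max_len), dtype=bool)
    for i, t in enumerate(trajectories):
        batch[i, : len(t)] = t
        mask[i, : len(t)] = True
    return batch, mask


# Hypothetical CSD trajectories of different lengths.
trajs = [[0.1, 0.2, 0.3], [0.5, 0.1]]

features = np.stack([summary_features(t) for t in trajs])
batch, mask = pad_trajectories(trajs)
print(features.shape)  # one row of summary statistics per chain
print(batch)
print(mask)
```

The summary vector discards the order of the steps except through the slope and last value, while the padded batch keeps the full ordering for a model such as a GRU to consume.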

## Technical Details & Code Release

#### Hardware
| Stage | Requirements | Notes |
|-------|--------------|-------|
| 1 | GPU (A100) | Logit extraction |
| 2 | CPU (≥64 GB) | Large `.pkl` files |
| 3 | CPU | ML training |
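A back-of-the-envelope calculation shows why stage 2 is memory-hungry: storing a full-vocabulary logit vector for every generated token adds up quickly. All numbers below (sample count, tokens per sample, vocabulary size, and float16 storage) are hypothetical round figures chosen for the example, not measurements from the project.

```python
GIB = 1024 ** 3


def logit_storage_bytes(num_samples, tokens_per_sample, vocab_size, bytes_per_value):
    """Raw size of dense logits: one value per (sample, token, vocab entry)."""
    return num_samples * tokens_per_sample * vocab_size * bytes_per_value


# Hypothetical workload: 500 problems, 400 generated tokens each,
# a 150,000-entry vocabulary, and float16 (2 bytes) per logit.
size = logit_storage_bytes(
    num_samples=500,
    tokens_per_sample=400,
    vocab_size=150_000,
    bytes_per_value=2,
)
print(f"{size / GIB:.1f} GiB")
```

Under these assumptions the raw logits come to roughly 56 GiB before any pickle overhead, the same order of magnitude as the 64 GB listed for stage 2. Longer chains or float32 storage would push the figure higher.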

#### Code Plan
After the paper is accepted, the author plans to open-source the prompt scripts, reasoning generators, divergence tools, ML notebooks, and example CSV files.

## Significance & Applications

EvalQReason enables:
1. **Interpretable Diagnosis**: Identify reasoning drift via CSD/SFC trajectories.
2. **Boundary Detection**: Find model weaknesses across difficulty levels.
3. **Domain Strategies**: Adapt evaluation for math/medical tasks.
4. **Data Quality**: Filter incoherent reasoning samples.
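The data-quality use case can be illustrated with a simple filter that drops reasoning samples whose average CSD exceeds a cutoff. The record layout, the `max_mean_csd` threshold of 0.3, and the function name are hypothetical; a real cutoff would have to be tuned on validation data.

```python
from statistics import mean


def split_by_coherence(samples, max_mean_csd=0.3):
    """Split samples into (kept, dropped) by their mean CSD value.

    Samples with an empty trajectory (single-step chains) are kept,
    since there are no adjacent steps to compare.
    """
    kept, dropped = [], []
    for sample in samples:
        trajectory = sample["csd"]
        score = mean(trajectory) if trajectory else 0.0
        if score <= max_mean_csd:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


# Hypothetical records: an id plus a precomputed CSD trajectory.
samples = [
    {"id": "a", "csd": [0.05, 0.10]},  # smooth chain
    {"id": "b", "csd": [0.20, 0.90]},  # large jump between steps
    {"id": "c", "csd": []},            # single-step chain
]

kept, dropped = split_by_coherence(samples)
print("kept:", [s["id"] for s in kept])
print("dropped:", [s["id"] for s in dropped])
```

A mean is only one possible aggregate; a maximum would instead flag any single abrupt jump, which may suit the drift-detection use case better.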

## Conclusion: EvalQReason's Value

EvalQReason offers a step-level evaluation approach that needs no manual annotation and relies only on the model's own probability distributions. The reported F1 of 0.98 on math tasks suggests it could help improve LLM reliability, which makes it relevant to both researchers and developers.
