Zing Forum


EvalQReason: Step-Level Reasoning Evaluation for Large Language Models via Probability Distribution Analysis

A three-stage framework for evaluating LLM reasoning quality without manual annotation. It introduces two divergence-based algorithms, CSD and SFC, and reports a correctness-prediction F1 of up to 0.98 (on Math-500) in experiments covering math and medical domains.

Tags: LLM, reasoning evaluation, step-level analysis, divergence metrics, CSD, SFC, AI safety, model evaluation
Published 2026-06-06 20:01 | Recent activity 2026-06-06 20:22 | Estimated read 6 min

Section 01

EvalQReason: Step-Level LLM Reasoning Evaluation via Probability Distribution Analysis (Main Guide)

Core Overview

EvalQReason is a three-stage framework that evaluates the reasoning of Large Language Models (LLMs) step by step by analyzing their output probability distributions. It needs no manual annotation and reaches up to F1=0.98 in correctness prediction; the top score is on the Math-500 math benchmark, and medical tasks (MedQA) are evaluated as well.



Section 02

Background: The Need for Step-Level Evaluation

Traditional result-based LLM evaluation has critical flaws:

  • Correct answers may come from wrong reasoning paths.
  • Correct reasoning can lead to wrong answers due to calculation errors.
  • Hallucinations/logical jumps are invisible in result-only assessment.

Manual step-level evaluation is costly and does not scale, so an automated, interpretable method is needed.


Section 03

Framework: Stages & Key Metrics

Three-Stage Architecture

  1. Reasoning Generation & Logit Extraction: Generate step-by-step reasoning chains and extract the token logits, saved as .pkl files. Closed-source API models such as GPT-4 are excluded.
  2. Reasoning Dynamics Quantification: Compute CSD/SFC using KL/JS divergence, Hellinger distance, cosine similarity, entropy difference.
  3. Pattern Analysis & Prediction: Visualize trajectories, use classic ML (XGBoost) and sequence models (GRU) for correctness prediction.
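
The measures named in stage 2 can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the softmax step, the epsilon smoothing, and the sign convention of the entropy difference are all assumptions, and the toy logits are made up.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps (an assumption) guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def hellinger(p, q):
    """Hellinger distance, which lies in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_difference(p, q):
    """H(p) - H(q); the direction of the difference is an assumption."""
    return entropy(p) - entropy(q)

# Toy logits over a three-token vocabulary (illustrative values only).
p = softmax([2.0, 1.0, 0.1])
q = softmax([0.5, 2.5, 0.3])
```

KL divergence is asymmetric, so the order of the two distributions matters there; JS divergence and Hellinger distance are symmetric and bounded, which makes them easier to compare across steps.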

Key Metrics

  • CSD: Local consistency between adjacent steps (low=smooth, high=drift).
  • SFC: Global alignment between steps and final answers.
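
A rough sketch of how the two metrics could be turned into trajectories. It assumes each reasoning step has already been reduced to a single probability distribution and uses JS divergence as the distance; the source does not say how token logits are aggregated per step or which divergence is preferred, so `csd_trajectory`, `sfc_trajectory`, and the toy numbers are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (nats)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # y is the mixture m here, so y > 0 wherever x > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def csd_trajectory(step_dists):
    """Local view: divergence between each pair of adjacent steps."""
    return [js_divergence(step_dists[i], step_dists[i + 1])
            for i in range(len(step_dists) - 1)]

def sfc_trajectory(step_dists, final_dist):
    """Global view: divergence between every step and the final answer."""
    return [js_divergence(step, final_dist) for step in step_dists]

# Four toy steps over a three-token vocabulary (made-up values).
steps = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],  # abrupt shift between step 2 and step 3
    [0.10, 0.15, 0.75],
]
final_answer = [0.05, 0.15, 0.80]

csd = csd_trajectory(steps)                 # one value per adjacent pair
sfc = sfc_trajectory(steps, final_answer)   # one value per step
```

In this toy case the second CSD value spikes at the abrupt shift, while the SFC values fall as the steps move toward the final-answer distribution. With a divergence as the distance, lower SFC means closer alignment.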

Section 04

Experimental Design: Datasets & Models

Datasets

Dataset    Domain    Scale (problems)    Difficulty
AIME       Math      240                 3 levels
Math-500   Math      500                 5 levels
MedQA      Medical   1273                2 levels

Models Tested

  • Qwen2.5-7B-Instruct
  • MathStral-7B
  • Qwen-Medicine-7B
  • Qwen3-4B
  • Qwen3-8B

Covering two domains and several model scales is meant to test how well the method generalizes.


Section 05

Core Results & Key Findings

Best Performance

Algorithm   Model Type   Classifier   Dataset    LLM        F1     ROC-AUC
CSD         Classic ML   XGBoost      AIME       Qwen3-4B   0.91   0.90
CSD         Sequence     GRU          Math-500   Qwen3-8B   0.98   0.90
SFC         Sequence     NN           Math-500   Qwen3-8B   0.98   0.96
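
For readers less familiar with the two reported metrics, here is a small self-contained sketch of how F1 and ROC-AUC are computed for a binary correct/incorrect predictor. The labels, scores, and 0.5 threshold are made up for illustration and have nothing to do with the paper's data.

```python
def f1_score(y_true, y_pred):
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = reasoning chain led to a correct answer, 0 = incorrect (toy labels).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

F1 depends on the chosen decision threshold, while ROC-AUC only depends on how the scores rank the examples, which is why the two columns in the table can move independently.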

Findings

  1. CSD outperforms SFC in most cases.
  2. Sequence models (GRU) beat classic ML.
  3. Results are stable across the 4B-8B model scales tested.
  4. Math tasks have clearer patterns than medical tasks.
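
One practical difference between the two classifier families: reasoning chains differ in length, so a tabular model such as XGBoost needs a fixed-length summary of each divergence trajectory, whereas a GRU can read the sequence directly. The sketch below shows one possible summary. The feature set and the `trajectory_features` name are assumptions, not the features used in the paper.

```python
import statistics

def trajectory_features(traj):
    """Summarize a variable-length divergence trajectory as fixed features."""
    n = len(traj)
    if n == 0:
        raise ValueError("trajectory must contain at least one value")
    mean = statistics.fmean(traj)
    if n > 1:
        # Least-squares slope of divergence against step index.
        x_mean = (n - 1) / 2
        num = sum((i - x_mean) * (v - mean) for i, v in enumerate(traj))
        den = sum((i - x_mean) ** 2 for i in range(n))
        slope = num / den
    else:
        slope = 0.0
    return {
        "mean": mean,
        "max": max(traj),
        "std": statistics.pstdev(traj),
        "last": traj[-1],
        "slope": slope,
        # Relative position of the largest jump within the chain.
        "argmax_frac": traj.index(max(traj)) / n,
    }

# Two toy trajectories of different lengths map to the same feature keys.
short = trajectory_features([0.1, 0.2, 0.3])
long = trajectory_features([0.05, 0.04, 0.60, 0.05, 0.03])
```

Summaries like these discard the ordering of steps apart from the slope and peak position, which may be one reason a sequence model can do better on the same trajectories.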

Section 06

Technical Details & Code Release

Hardware

Stage   Requirements   Notes
1       GPU (A100)     Logit extraction
2       CPU (≥64GB)    Large .pkl files
3       CPU            ML training

Code Plan

The code is planned for open-source release after paper acceptance, including prompt scripts, reasoning generators, divergence tools, ML notebooks, and example CSV files.


Section 07

Significance & Applications

EvalQReason enables:

  1. Interpretable Diagnosis: Identify reasoning drift via CSD/SFC trajectories.
  2. Boundary Detection: Find model weaknesses across difficulty levels.
  3. Domain Strategies: Adapt evaluation for math/medical tasks.
  4. Data Quality: Filter incoherent reasoning samples.
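
As an illustration of the diagnosis and data-quality uses, the sketch below flags the first step where a CSD trajectory exceeds a threshold and drops samples that drift. The threshold of 0.3, the sample layout, and the helper names are hypothetical; a real cutoff would have to be tuned on labeled data.

```python
def first_drift_step(csd_traj, threshold=0.3):
    """Return the index of the first adjacent-step jump above the threshold,
    or None if the chain stays smooth. The threshold is an assumed value."""
    for i, value in enumerate(csd_traj):
        if value > threshold:
            return i
    return None

def filter_samples(samples, threshold=0.3):
    """Split samples into coherent ones and ones showing reasoning drift."""
    kept, dropped = [], []
    for sample in samples:
        drift_at = first_drift_step(sample["csd"], threshold)
        if drift_at is None:
            kept.append(sample)
        else:
            # Record where the drift happened so it can be inspected later.
            dropped.append({**sample, "drift_at": drift_at})
    return kept, dropped

# Toy samples: each has an id and a precomputed CSD trajectory.
samples = [
    {"id": "a", "csd": [0.02, 0.05, 0.03]},
    {"id": "b", "csd": [0.04, 0.45, 0.06]},  # jump between steps 2 and 3
    {"id": "c", "csd": [0.10, 0.08]},
]

kept, dropped = filter_samples(samples)
```

Keeping the drift position alongside each dropped sample points directly at the step to inspect, which is the kind of interpretable diagnosis item 1 describes.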

Section 08

Conclusion: EvalQReason's Value

EvalQReason offers a step-level evaluation approach that needs no manual annotation and relies only on the model's own probability distributions. Its F1 of 0.98 on math tasks suggests that distribution-based signals can help improve LLM reliability, which makes the framework relevant to both researchers and developers.