EvalQReason: Step-Level Reasoning Evaluation for Large Language Models via Probability Distribution Analysis

A three-stage framework for evaluating LLM reasoning quality without manual annotation, introducing two divergence algorithms (CSD and SFC), achieving up to F1=0.98 in correctness prediction in math and medical domains

Tags: LLM, reasoning evaluation, step-level analysis, divergence metrics, CSD, SFC, AI safety, model evaluation

Published 2026-06-06 20:01 | Recent activity 2026-06-06 20:22 | Estimated read 6 min

Section 01

EvalQReason: Step-Level LLM Reasoning Evaluation via Probability Distribution Analysis (Main Guide)

Core Overview

EvalQReason is a three-stage framework for step-level reasoning evaluation of Large Language Models (LLMs) using probability distribution analysis. It eliminates manual annotation and achieves up to F1=0.98 in correctness prediction for math and medical tasks.

Basic Information

Author: Shaima Ahmad Freja (University of Stavanger)
Source: GitHub
Release Time: June 2026
Link: https://github.com/Shaima4127/EvalQReason
Contact: shaima.a.freja@uis.no

Section 02

Background: The Need for Step-Level Evaluation

Traditional result-based LLM evaluation has critical flaws:

Correct answers may come from wrong reasoning paths.
Correct reasoning can lead to wrong answers due to calculation errors.
Hallucinations/logical jumps are invisible in result-only assessment.

Manual step-level evaluation is costly and does not scale, so an automated, interpretable method is urgently needed.

Section 03

Framework: Stages & Key Metrics

Three-Stage Architecture

Reasoning Generation & Logit Extraction: Generate step-by-step reasoning chains and extract token-level logits, saved as .pkl files. Closed-source API models such as GPT-4 are not used, since this stage requires access to token logits.
Reasoning Dynamics Quantification: Compute CSD/SFC using KL/JS divergence, Hellinger distance, cosine similarity, entropy difference.
Pattern Analysis & Prediction: Visualize trajectories, use classic ML (XGBoost) and sequence models (GRU) for correctness prediction.
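The distribution measures named in stage 2 can be illustrated with a minimal sketch. It assumes each reasoning position is represented by a vector of logits over the same vocabulary; the function names, the `EPS` smoothing constant, and the toy logits are assumptions made for illustration and are not taken from the EvalQReason code.

```python
import math

EPS = 1e-12  # smoothing constant to avoid log(0); an assumption of this sketch


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; asymmetric and unbounded."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS))
               for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; symmetric, at most ln 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def hellinger(p, q):
    """Hellinger distance, in the range [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def compare(p, q):
    """Collect all five measures for one pair of distributions."""
    return {
        "kl": kl_divergence(p, q),
        "js": js_divergence(p, q),
        "hellinger": hellinger(p, q),
        "cosine": cosine_similarity(p, q),
        "entropy_diff": entropy(q) - entropy(p),
    }


# Toy logits over a 3-token vocabulary (illustrative values only).
p = softmax([2.0, 1.0, 0.1])
q = softmax([0.1, 1.0, 2.0])
print(compare(p, q))
```

KL divergence is asymmetric and unbounded, whereas JS divergence and Hellinger distance are symmetric and bounded, which may make them easier to compare across steps of different lengths.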

Key Metrics

CSD: Local consistency between adjacent steps (low=smooth, high=drift).
SFC: Global alignment between steps and final answers.
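The difference between a local and a global metric can be sketched as follows. The source does not give the exact CSD and SFC formulas, so this is only an approximation of the idea: it assumes every step has already been summarized as one probability distribution and uses Jensen-Shannon divergence as the distance. The names `adjacent_step_trajectory` and `step_to_final_trajectory` and the toy distributions are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (symmetric, at most ln 2)."""
    def kl(a, b):
        # b is the midpoint mixture, so b[i] > 0 wherever a[i] > 0.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def adjacent_step_trajectory(step_dists):
    """CSD-style local view: divergence between each step and the next one."""
    return [js_divergence(a, b) for a, b in zip(step_dists, step_dists[1:])]


def step_to_final_trajectory(step_dists, final_dist):
    """SFC-style global view: divergence between each step and the final answer."""
    return [js_divergence(s, final_dist) for s in step_dists]


# Hypothetical per-step distributions over a 3-token vocabulary.
steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],  # close to the previous step: smooth transition
    [0.1, 0.1, 0.8],  # abrupt shift: possible drift
]
final_answer = [0.1, 0.1, 0.8]

local = adjacent_step_trajectory(steps)
global_ = step_to_final_trajectory(steps, final_answer)
print("adjacent-step divergences:", local)
print("step-to-final divergences:", global_)
```

In this toy input the second adjacent-step value is much larger than the first, which is the kind of jump the local metric is meant to surface, while the step-to-final values fall to zero once a step matches the final answer distribution.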

Section 04

Experimental Design: Datasets & Models

Datasets

Dataset	Domain	Scale (problems)	Difficulty
AIME	Math	240	3 levels
Math-500	Math	500	5 levels
MedQA	Medical	1273	2 levels

Models Tested

Qwen2.5-7B-Instruct
MathStral-7B
Qwen-Medicine-7B
Qwen3-4B
Qwen3-8B

The cross-domain, multi-scale design is intended to test generalizability.

Section 05

Core Results & Key Findings

Best Performance

Algorithm	Model Type	Classifier	Dataset	LLM	F1	ROC-AUC
CSD	Classic ML	XGBoost	AIME	Qwen3-4B	0.91	0.90
CSD	Sequence	GRU	Math-500	Qwen3-8B	0.98	0.90
SFC	Sequence	NN	Math-500	Qwen3-8B	0.98	0.96
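The table reports two scores that measure different things, which is why one row can show F1=0.98 alongside ROC-AUC=0.90. A minimal sketch of both, computed from scratch on made-up labels and scores (not data from the paper), with 1 standing for a correct reasoning chain; the 0.5 decision threshold is an assumption:

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = reasoning chain judged correct, 0 = incorrect.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2]  # classifier confidence for label 1
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("F1:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc(y_true, scores))
```

F1 depends on the chosen threshold and on class balance, while ROC-AUC summarizes ranking quality over all thresholds, so the two need not move together.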

Findings

CSD outperforms SFC in most cases.
Sequence models (GRU) beat classic ML.
Performance is stable across model scales from 4B to 8B parameters.
Math tasks have clearer patterns than medical tasks.

Section 06

Technical Details & Code Release

Hardware

Stage	Requirements	Notes
1	GPU (A100)	Logit extraction
2	CPU (≥64 GB RAM)	Large .pkl files
3	CPU	ML training

Code Plan

After paper acceptance, the author plans to open-source the prompt scripts, reasoning generators, divergence tools, ML notebooks, and example CSV files.

Section 07

Significance & Applications

EvalQReason enables:

Interpretable Diagnosis: Identify reasoning drift via CSD/SFC trajectories.
Boundary Detection: Find model weaknesses across difficulty levels.
Domain Strategies: Adapt evaluation for math/medical tasks.
Data Quality: Filter incoherent reasoning samples.

Section 08

Conclusion: EvalQReason's Value

EvalQReason offers a step-level evaluation approach that needs no manual annotation and relies only on the model's own probability distributions. Its F1 of up to 0.98 on math tasks suggests it could help improve LLM reliability, which makes it relevant to both researchers and developers.
