# Reasoning Structure of Large Language Models: A New Evaluation Paradigm Beyond Accuracy and Token Count

> The study proposes an evaluation method that converts reasoning processes into verifiable reasoning graphs. Structural metrics computed on these graphs separate reasoning behaviors that traditional metrics (accuracy, token count) cannot tell apart, providing a new tool for diagnosing failure modes and comparing reasoning scalability.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-02T16:49:19.000Z
- Last activity: 2026-06-03T04:25:14.723Z
- Popularity: 133.4
- Keywords: large language models, reasoning evaluation, logical reasoning, interpretability, benchmarking
- Page link: https://www.zingnex.cn/en/forum/thread/token-b0c6ba8a
- Canonical: https://www.zingnex.cn/forum/thread/token-b0c6ba8a
- Markdown source: floors_fallback

---

## [Introduction] A New Evaluation Paradigm for the Reasoning Structure of Large Language Models

### Core Insights
The study proposes an evaluation method that converts the reasoning process of Large Language Models (LLMs) into verifiable reasoning graphs. Structural metrics computed on these graphs (e.g., reasoning efficiency, topological features) separate reasoning behaviors that traditional metrics (accuracy, token count) cannot tell apart, providing a new tool for diagnosing failure modes and comparing reasoning scalability.

### Original Authors and Sources
- Original Authors: Paper author team (arXiv:2606.03883v1)
- Source Platform: arXiv
- Original Title: Reasoning Structure of Large Language Models
- Original Link: http://arxiv.org/abs/2606.03883v1
- Publication Time: June 2, 2026

## Evaluation Dilemma: Blind Spots of Traditional Metrics

Evaluation of Large Reasoning Models (LRMs) has long relied on final-answer accuracy and token consumption. However, identical accuracy and token counts can mask fundamentally different reasoning structures:
- Two models may achieve the same score, yet one derives its conclusions through a rigorous logical chain while the other guesses correctly or relies on shortcut heuristics;
- Accuracy and token count alone cannot tell these fundamentally different reasoning processes apart.

## Method: Reasoning Graph Construction and Topological Analysis

### Construction of Reasoning Graphs
Unstructured reasoning trajectories are transformed into verifiable reasoning graphs, which contain two types of elements:
- **Claims**: Propositions, assumptions, or intermediate conclusions in the reasoning process;
- **Dependencies**: Logical support or derivation relationships between claims.
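
A minimal sketch of how claims and dependencies could be held in memory, using only the Python standard library. The names `Claim`, `ReasoningGraph`, `add_claim`, `add_dependency`, and `supporters` are hypothetical and chosen for illustration; the paper's actual representation is not described in this summary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    """A proposition, assumption, or intermediate conclusion (hypothetical type)."""
    claim_id: str
    text: str


@dataclass
class ReasoningGraph:
    """Directed graph: an edge (src, dst) means claim `src` supports claim `dst`."""
    claims: dict[str, Claim] = field(default_factory=dict)
    dependencies: set[tuple[str, str]] = field(default_factory=set)

    def add_claim(self, claim_id: str, text: str) -> None:
        self.claims[claim_id] = Claim(claim_id, text)

    def add_dependency(self, src: str, dst: str) -> None:
        # Both endpoints must already be registered as claims.
        if src not in self.claims or dst not in self.claims:
            raise KeyError(f"unknown claim in dependency {src!r} -> {dst!r}")
        self.dependencies.add((src, dst))

    def supporters(self, claim_id: str) -> list[str]:
        """Return the claims that directly support `claim_id`, sorted by id."""
        return sorted(src for src, dst in self.dependencies if dst == claim_id)


# Toy example: a two-premise deduction.
graph = ReasoningGraph()
graph.add_claim("c1", "All squares are rectangles.")
graph.add_claim("c2", "Shape S is a square.")
graph.add_claim("c3", "Shape S is a rectangle.")
graph.add_dependency("c1", "c3")
graph.add_dependency("c2", "c3")
print(graph.supporters("c3"))  # ['c1', 'c2']
```

Storing dependencies as directed edges keeps the structure checkable: every edge can be inspected individually to see whether the source claim really supports the target.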

### Topological Analysis Tools
Graph-theoretic tools are then applied to analyze the features of reasoning graphs:
- Path length: The depth of reasoning from initial assumptions to final conclusions;
- Branching factor: The degree of parallel exploration in the reasoning process;
- Connectivity: The completeness and redundancy of reasoning chains;
- Key nodes: Core claims on which the conclusion decisively depends.
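
The four features above can be sketched on a toy graph as follows. The concrete definitions used here are assumptions made for illustration, not the paper's: path length is the longest dependency chain ending at the conclusion, branching factor is the mean out-degree of claims that support something, connectivity is the number of weakly connected components, and a key node is an intermediate claim whose removal cuts the conclusion off from every premise. The graph and all names are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical toy graph: an edge (a, b) means claim a supports claim b.
edges = [("p1", "a"), ("p2", "a"), ("a", "b"), ("a", "c"), ("b", "goal"), ("c", "goal")]
conclusion = "goal"

succ, pred, nodes = defaultdict(set), defaultdict(set), set()
for a, b in edges:
    succ[a].add(b)
    pred[b].add(a)
    nodes.update((a, b))
premises = sorted(n for n in nodes if not pred[n])  # claims with no support


@lru_cache(maxsize=None)
def depth(node):
    """Longest chain of dependencies ending at `node` (assumes no cycles)."""
    return 0 if not pred[node] else 1 + max(depth(p) for p in pred[node])


def reachable(starts, blocked=frozenset()):
    """Claims reachable along support edges from `starts`, skipping `blocked`."""
    seen, stack = set(), [s for s in starts if s not in blocked]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(m for m in succ[n] if m not in blocked)
    return seen


def weak_components():
    """Number of connected components when edge direction is ignored."""
    seen, count = set(), 0
    for start in sorted(nodes):
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n] | pred[n])
    return count


path_length = depth(conclusion)
out_degrees = [len(succ[n]) for n in nodes if succ[n]]
branching_factor = sum(out_degrees) / len(out_degrees)
components = weak_components()
key_nodes = sorted(
    n for n in nodes
    if n not in premises and n != conclusion
    and conclusion not in reachable(premises, blocked={n})
)

print(path_length)       # 3
print(branching_factor)  # 1.2
print(components)        # 1
print(key_nodes)         # ['a']
```

In this toy graph, claims `b` and `c` are redundant routes to the conclusion, so neither is a key node, while `a` is the single claim every route passes through.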

## Technical Implementation: Key Steps from Trajectory to Graph

Implementing the new evaluation paradigm requires solving three technical challenges:
1. **Trajectory Parsing**: Extract structured claims and dependencies from chain-of-thought outputs (combining natural language understanding and logical parsing);
2. **Graph Validation**: Ensure the reasoning graph is logically consistent and semantically aligned with the original trajectory;
3. **Scalability**: The benchmark must cover diverse puzzle types and difficulty levels so that the results generalize.
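
The graph validation step can be illustrated with a deliberately crude sketch. It assumes that claims arrive as an id-to-text mapping and checks only two things: that every dependency refers to known claims, and that each claim's text occurs verbatim (ignoring case and whitespace) in the original trajectory. The verbatim match is a stand-in for semantic alignment, which in practice would need a far stronger check; the function name `validate_graph` and the example data are hypothetical.

```python
def validate_graph(claims, dependencies, trajectory):
    """Return a list of problems found; an empty list means all checks passed."""
    issues = []
    for src, dst in dependencies:
        # Structural check: both endpoints must be known claims.
        for endpoint in (src, dst):
            if endpoint not in claims:
                issues.append(
                    f"dependency {src}->{dst} references unknown claim {endpoint}"
                )
        if src == dst:
            issues.append(f"claim {src} supports itself")
    # Alignment check (crude assumption): claim text must appear in the trajectory.
    normalized = " ".join(trajectory.lower().split())
    for claim_id, text in claims.items():
        if " ".join(text.lower().split()) not in normalized:
            issues.append(f"claim {claim_id} has no matching text in the trajectory")
    return issues


trajectory = "The box is red or blue. It is not red. So the box is blue."
claims = {
    "c1": "The box is red or blue",
    "c2": "It is not red",
    "c3": "the box is blue",
}
good = validate_graph(claims, [("c1", "c3"), ("c2", "c3")], trajectory)
print(good)  # []

# A graph with an invented claim and a dangling dependency fails both checks.
bad_claims = dict(claims, c4="The box is heavy")
bad = validate_graph(bad_claims, [("c9", "c3")], trajectory)
print(len(bad))  # 2
```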

## Experimental Findings: Unique Value of Structural Metrics

Analysis of open-source models shows three ways in which structural metrics add value:
1. **Distinguish Similar-Looking Behaviors**: At the same accuracy and token count, tell systematic reasoning apart from intuitive leaps, and compact structures apart from scattered, redundant ones;
2. **Diagnose Failure Modes**: Locate problems through broken-chain analysis (missing logical steps), cycle detection (repeated or circular arguments), and isolated-claim detection (claims with no valid connections);
3. **Analyze Reasoning Scalability**: Compare reasoning graph features across puzzles of different difficulty levels to evaluate how model capability scales with complexity (e.g., whether the structure remains stable).
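
The three failure modes in item 2 can be sketched as simple graph checks. The definitions are assumptions made for illustration: a broken chain is a claim that is used to support something but is neither a given premise nor supported itself, a cyclic claim is one that can reach itself along support edges, and an isolated claim has no edges at all. The function name `diagnose` and the example graph are hypothetical.

```python
from collections import defaultdict


def diagnose(claims, premises, dependencies):
    """Classify structural failures in a reasoning graph (illustrative definitions)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in dependencies:
        succ[src].add(dst)
        pred[dst].add(src)

    def reaches_itself(start):
        # Depth-first search along support edges, looking for a way back to `start`.
        seen, stack = set(), list(succ[start])
        while stack:
            n = stack.pop()
            if n == start:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        return False

    return {
        # Used as support, but neither given nor derived from anything.
        "broken_chain": sorted(
            c for c in claims if c not in premises and not pred[c] and succ[c]
        ),
        # Claims that (indirectly) support themselves.
        "cycle": sorted(c for c in claims if reaches_itself(c)),
        # Claims connected to nothing.
        "isolated": sorted(c for c in claims if not succ[c] and not pred[c]),
    }


claims = ["p1", "a", "b", "c", "d"]
premises = {"p1"}
dependencies = [("p1", "a"), ("a", "b"), ("b", "a"), ("c", "b")]
report = diagnose(claims, premises, dependencies)
print(report)
# {'broken_chain': ['c'], 'cycle': ['a', 'b'], 'isolated': ['d']}
```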

## Research Significance: Shift in Evaluation Paradigm and Model Improvement

### Evolution of Evaluation Paradigm
The shift is from "result-oriented" to "process-oriented" evaluation: future evaluation needs to focus on how the model reaches the right answer, not only on whether the answer is right.

### Guidance for Model Improvement
Reasoning efficiency can serve as a new optimization target, encouraging models to reason concisely and systematically.
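
This summary does not give the paper's formula for reasoning efficiency, so the following is only one plausible proxy, stated as an assumption: the fraction of claims that the final conclusion actually depends on, directly or indirectly. The function name `reasoning_efficiency` and both example graphs are hypothetical.

```python
def reasoning_efficiency(claims, dependencies, conclusion):
    """Share of claims that lie upstream of the conclusion (assumed proxy metric)."""
    pred = {c: set() for c in claims}
    for src, dst in dependencies:
        pred[dst].add(src)
    # Walk backwards from the conclusion to collect every claim it relies on.
    useful, stack = set(), [conclusion]
    while stack:
        n = stack.pop()
        if n not in useful:
            useful.add(n)
            stack.extend(pred[n])
    return len(useful) / len(claims)


# Compact reasoning: every claim contributes to the conclusion.
compact = reasoning_efficiency(
    ["p1", "p2", "a", "goal"],
    [("p1", "a"), ("p2", "a"), ("a", "goal")],
    "goal",
)

# Same derivation plus a side branch (x, y) that never feeds the conclusion.
scattered = reasoning_efficiency(
    ["p1", "p2", "a", "goal", "x", "y"],
    [("p1", "a"), ("p2", "a"), ("a", "goal"), ("p1", "x"), ("x", "y")],
    "goal",
)

print(compact)               # 1.0
print(round(scattered, 3))   # 0.667
```

Both toy traces reach the same conclusion, so accuracy would not separate them, while this proxy penalizes the unused side branch.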

### Enhanced Interpretability
Reasoning graphs help humans understand the model's reasoning process and identify biases or error patterns.

### New Dimension for Cross-Model Comparison
Structural metrics reveal differences between models that traditional metrics cannot detect (e.g., the effects of architecture and training method).

## Summary: Value of the New Paradigm and Future Outlook

This study proposes a new evaluation paradigm for LLMs based on transforming reasoning into graphs. Structural metrics (reasoning efficiency, topological analysis) can distinguish reasoning behaviors that traditional metrics cannot, providing practical tools for diagnosing failure modes and comparing scalability. As LLMs are increasingly applied in critical decision-making scenarios, understanding and evaluating the quality of their reasoning structures will become more important.
