# Beyond Accuracy: A New Multi-Dimensional Framework for Evaluating Reasoning Quality of Large Language Models

> This article introduces a multi-dimensional behavioral framework for evaluating the reasoning quality of large language models, which includes 6 core metrics covering dimensions such as reasoning depth, consistency, and efficiency, and has been validated on 7 mainstream models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-05T15:48:12.000Z
- Last activity: 2026-06-05T15:52:10.809Z
- Popularity: 150.9
- Keywords: large language models, reasoning evaluation, multi-dimensional metrics, model evaluation, logical consistency, reasoning depth, machine learning, natural language processing
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-senolali-llm-reasoning-quality-evaluation-metrics
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-senolali-llm-reasoning-quality-evaluation-metrics

---

## Introduction

This article introduces a multi-dimensional behavioral framework for evaluating the reasoning quality of large language models (LLMs). It comprises 6 core metrics: reasoning depth, logical consistency, factual accuracy, reasoning efficiency, exploration breadth, and conclusion stability. The framework targets a gap in current practice: single-dimensional measures such as accuracy cannot fully capture a model's complex reasoning capabilities. It has been validated on 7 mainstream models and is intended as a more comprehensive tool for evaluating, selecting, and improving LLMs.

## Background and Motivation

Current LLM evaluations mainly rely on single-dimensional metrics such as accuracy, BLEU scores, or human preference rankings. However, these metrics struggle to fully reflect the real performance of models in complex reasoning tasks, especially in scenarios involving multi-step reasoning, logical coherence, and factual consistency. As LLMs are increasingly applied in high-risk fields like medical diagnosis and legal analysis, the industry urgently needs a multi-dimensional evaluation framework that not only focuses on the correctness of the final answer but also examines the completeness, consistency, and interpretability of the reasoning process.

## Core Dimensions of the Framework (Methodology)

The framework includes 6 core dimensions:
1. **Reasoning Depth**: Measures how far the reasoning goes, focusing on the length and complexity of the reasoning chain;
2. **Logical Consistency**: Detects self-contradictions in the reasoning process, including coherence between premises and conclusions, and among intermediate steps;
3. **Factual Accuracy**: Evaluates the correctness of external knowledge and facts cited in reasoning;
4. **Reasoning Efficiency**: Examines the number of steps and resource consumption required to reach a correct conclusion;
5. **Exploration Breadth**: Measures divergent thinking on open-ended problems, i.e., the ability to explore multiple distinct approaches;
6. **Conclusion Stability**: Checks whether outputs stay consistent across similar problems, evaluating robustness through minor variations of the problem.
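
Two of these dimensions lend themselves to simple proxies. The sketch below is illustrative only and is not taken from the framework's code: `reasoning_depth` and `conclusion_stability` are hypothetical names, depth is approximated by counting non-empty lines in a reasoning chain, and stability by the share of answers (to minor variants of one problem) that match the most common answer.

```python
from collections import Counter


def reasoning_depth(chain: str) -> int:
    """Crude depth proxy (assumption): one reasoning step per non-empty line."""
    return sum(1 for line in chain.splitlines() if line.strip())


def conclusion_stability(answers: list[str]) -> float:
    """Share of answers that agree with the most common answer.

    `answers` holds the model's final answers to minor variants of the
    same problem. Answers are compared after trimming whitespace and
    lowercasing, which is a simplifying assumption.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


if __name__ == "__main__":
    chain = "Let x be the unknown.\n\n2x = 84\nx = 42"
    print(reasoning_depth(chain))
    print(conclusion_stability(["42", " 42 ", "41", "42"]))
```

Exact string matching only suits tasks with short, canonical answers; free-form conclusions would need a semantic comparison, which is presumably where manual review comes in.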

## Experimental Design and Validation (Evidence)

The framework was validated on 7 mainstream models (including open-source and closed-source API models):
- **Dataset**: Covers benchmark test sets in fields such as mathematical reasoning, commonsense reasoning, symbolic reasoning, and code generation;
- **Evaluation Protocol**: Automated evaluation for quantifiable metrics such as depth and efficiency, plus manual review for semantic metrics such as consistency and stability;
- **Aggregation Strategy**: Supports deployment-aware weighted aggregation, allowing users to adjust weights of each dimension according to their needs to generate a comprehensive score.
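
The weighted aggregation described above can be sketched as a normalized weighted mean. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the dimension keys, the `aggregate` function, and the convention that every score lies in [0, 1] are all hypothetical.

```python
# Hypothetical keys for the six dimensions; the source does not specify identifiers.
DIMENSIONS = (
    "depth",
    "consistency",
    "factual_accuracy",
    "efficiency",
    "breadth",
    "stability",
)


def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores.

    Dimensions absent from `weights` get weight 0, so a deployment can
    ignore dimensions it does not care about. Weights are normalized by
    their sum, so they need not add up to 1.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(weights.get(d, 0.0) for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    weighted_sum = sum(scores[d] * weights.get(d, 0.0) for d in DIMENSIONS)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up scores and weights, for illustration only.
    scores = {d: 0.5 for d in DIMENSIONS}
    scores["consistency"] = 1.0
    scores["factual_accuracy"] = 0.0
    print(aggregate(scores, {"consistency": 3.0, "factual_accuracy": 1.0}))
```

Normalizing by the weight sum keeps the composite score on the same scale as the individual dimensions, so scores computed under different weight profiles remain comparable in range.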

## Key Findings (Conclusions)

The experiments revealed:
1. Accuracy and reasoning quality are not perfectly correlated; some high-accuracy models show only mediocre depth and consistency;
2. Different model families have distinct styles: some tend to be depth-first (detailed step-by-step reasoning), while others adopt breadth-first (quickly exploring multiple possibilities);
3. There is a trade-off between reasoning efficiency and quality: pushing too hard for efficiency tends to produce overly short reasoning chains, while excessive detail may introduce irrelevant information and reduce consistency.

## Practical Application Value

The framework provides developers and users with:
- **Model Selection**: Focus on corresponding dimensions according to the scenario (e.g., prioritize consistency and factual accuracy for medical applications, and emphasize exploration breadth for creative writing);
- **Improvement Directions**: Identify weak points through fine-grained analysis (e.g., "improving logical consistency" is more specific than the general "improving accuracy");
- **Risk Warning**: A low stability score signals potentially unpredictable behavior in production, which calls for additional safeguards.
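
As a rough sketch of how per-dimension scores might drive these uses, the snippet below picks out the weakest dimensions and raises a stability warning. The function names, the 0.7 threshold, and the example scores are all assumptions made for illustration; the source gives no concrete thresholds or results.

```python
def weakest_dimensions(scores: dict[str, float], k: int = 2) -> list[str]:
    """Return the k lowest-scoring dimensions, weakest first."""
    return sorted(scores, key=lambda d: scores[d])[:k]


def needs_safeguards(scores: dict[str, float], stability_threshold: float = 0.7) -> bool:
    """Flag a model whose stability score falls below a chosen threshold.

    The default threshold is an arbitrary placeholder, not a recommended value.
    """
    return scores["stability"] < stability_threshold


if __name__ == "__main__":
    # Made-up scores in [0, 1], not results from the article.
    example = {
        "depth": 0.82,
        "consistency": 0.55,
        "factual_accuracy": 0.74,
        "efficiency": 0.68,
        "breadth": 0.61,
        "stability": 0.64,
    }
    print(weakest_dimensions(example))
    print(needs_safeguards(example))
```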

## Limitations and Future Directions

Current limitations: the framework depends on English datasets, and some dimensions (such as exploration breadth) are difficult to evaluate automatically and costly to assess manually. Future directions: expand to multi-modal reasoning scenarios, develop more efficient automated evaluation tools, and adapt to new model architectures and paradigms (e.g., inference-time compute scaling).
