Section 01
[Introduction] Beyond Accuracy: A New Multi-Dimensional Framework for Evaluating LLM Reasoning Quality
This article introduces a multi-dimensional behavioral framework for evaluating the reasoning quality of large language models (LLMs). The framework comprises six core metrics: reasoning depth, logical consistency, factual accuracy, reasoning efficiency, exploration breadth, and conclusion stability. It addresses a limitation of current single-dimensional evaluations such as accuracy, which cannot fully reflect a model's complex reasoning capabilities. The framework has been validated on seven mainstream models and offers a more comprehensive tool for evaluating, selecting, and improving LLMs.
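To make the idea of a multi-dimensional score concrete, the six metrics can be pictured as one record per model. The following is a minimal sketch, not the article's implementation: the class name `ReasoningProfile`, the `weakest` helper, the assumption that each metric is normalized to [0, 1], and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReasoningProfile:
    """Hypothetical container: one score per metric, each assumed to lie in [0, 1]."""
    reasoning_depth: float
    logical_consistency: float
    factual_accuracy: float
    reasoning_efficiency: float
    exploration_breadth: float
    conclusion_stability: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weakest(self) -> str:
        """Return the name of the lowest-scoring dimension."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


# Placeholder numbers for illustration only; they are not results from the article.
model_a = ReasoningProfile(0.9, 0.8, 0.85, 0.4, 0.7, 0.9)
model_b = ReasoningProfile(0.5, 0.6, 0.85, 0.9, 0.3, 0.6)

print(model_a.weakest())  # reasoning_efficiency
print(model_b.weakest())  # exploration_breadth
```

In this sketch both placeholder models share the same factual accuracy score yet differ on every other dimension, which is the kind of difference a single accuracy number would hide.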