Section 01
Chain-of-Thought Reasoning Evaluation Framework: Systematically Testing the Reasoning Capabilities of Large Language Models (Introduction)
This article introduces llm-evaluation-with-CoT, an open-source framework for systematically evaluating the Chain-of-Thought (CoT) reasoning capabilities of large language models. Traditional evaluations judge only the final answer; this framework fills that gap by analyzing the quality of the reasoning process itself across multiple dimensions. It is suited to scenarios such as model development, model selection, and research, and the article closes with a discussion of future directions.
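To make the contrast between answer-only and process-aware evaluation concrete, here is a minimal sketch in Python. All names in it (ReasoningTrace, score_answer_only, score_with_process, arithmetic_checker) are illustrative assumptions for this article, not the actual API of llm-evaluation-with-CoT; the per-step checker stands in for whatever verifier a real framework would plug in, such as an arithmetic validator or an LLM-as-judge call.

```python
# Hypothetical sketch: answer-only vs. process-aware CoT scoring.
# Names are illustrative and not taken from llm-evaluation-with-CoT.
import re
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    steps: list[str]   # intermediate CoT steps extracted from the model output
    final_answer: str  # the model's final answer


def score_answer_only(trace: ReasoningTrace, gold_answer: str) -> float:
    """Traditional evaluation: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if trace.final_answer.strip() == gold_answer.strip() else 0.0


def score_with_process(trace: ReasoningTrace, gold_answer: str,
                       step_checker) -> dict[str, float]:
    """Process-aware evaluation: score the final answer AND each step.

    step_checker(step) -> bool is any per-step validity check, e.g. an
    arithmetic verifier or an LLM-as-judge call.
    """
    answer_score = score_answer_only(trace, gold_answer)
    if trace.steps:
        step_score = sum(step_checker(s) for s in trace.steps) / len(trace.steps)
    else:
        step_score = 0.0
    return {"answer": answer_score, "process": step_score}


def arithmetic_checker(step: str) -> bool:
    """Toy verifier for steps of the form 'a * b = c' or 'a + b = c'."""
    m = re.fullmatch(r"(\d+) ([*+]) (\d+) = (\d+)", step.strip())
    if not m:
        return False
    a, op, b, result = int(m[1]), m[2], int(m[3]), int(m[4])
    return (a * b if op == "*" else a + b) == result


# Example: the model reaches the correct answer (3 * 4 + 5 = 17) through a
# flawed first step. Answer-only scoring gives a perfect 1.0 and hides the
# error; process-aware scoring exposes it.
trace = ReasoningTrace(
    steps=["3 * 4 = 13", "12 + 5 = 17"],  # first step is arithmetically wrong
    final_answer="17",
)
print(score_with_process(trace, "17", arithmetic_checker))
# {'answer': 1.0, 'process': 0.5}
```

This is exactly the failure mode the framework targets: a model can guess, shortcut, or self-correct its way to the right answer through invalid intermediate steps, and only process-level scoring reveals the difference.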