Section 01
[Introduction] A New Benchmark for Evaluating LLM Reasoning Consistency Based on Chinese Chess
This article introduces an evaluation framework for large language models built on Chinese Chess (Xiangqi), designed to test the reasoning consistency of LLMs in sequential decision-making environments. By drawing on a game with deep roots in Chinese culture, the framework offers a distinctive perspective and a practical tool for assessing AI capabilities. Traditional static question-and-answer benchmarks struggle to measure whether a model's reasoning remains stable across a sequence of dependent decisions; the turn-by-turn nature of Xiangqi makes it well suited to testing exactly this dimension.
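To make the idea of measuring reasoning consistency concrete, here is a minimal sketch of one way such a check could work: query the model repeatedly on each board position in a game and measure how often it repeats its modal move. The `model_query` interface and the FEN-like position encoding are assumptions for illustration, not details from the article.

```python
from collections import Counter

def consistency_score(model_query, positions, samples=5):
    """Estimate reasoning consistency over a sequence of positions.

    model_query: callable(position: str) -> str; returns the model's chosen
                 move for a Xiangqi position (hypothetical interface).
    positions:   iterable of position strings (e.g., a FEN-like notation).
    samples:     number of repeated queries per position.
    """
    per_position = []
    for pos in positions:
        # Ask the model for a move several times on the same position.
        moves = [model_query(pos) for _ in range(samples)]
        # Agreement rate with the most frequent (modal) move.
        modal_count = Counter(moves).most_common(1)[0][1]
        per_position.append(modal_count / samples)
    # Average agreement across all positions in the game.
    return sum(per_position) / len(per_position)
```

A score near 1.0 would indicate that the model commits to the same move whenever it sees the same position, while lower scores reveal unstable reasoning across repeated decisions.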