Zing Forum

Evaluating Reasoning Consistency of Large Language Models Using Chinese Chess: A New Benchmark for Sequential Decision-Making Scenarios

This article introduces an evaluation framework for large language models based on Chinese Chess, focusing on testing the reasoning consistency of LLMs in sequential decision-making environments, providing a unique cultural perspective and practical tool for AI capability assessment.

Tags: Large Language Models · Chinese Chess · Reasoning Consistency · Evaluation Framework · Sequential Decision-Making · LLM Benchmark · Java · Maven
Published 2026-03-31 05:43 · Recent activity 2026-03-31 05:54 · Estimated read: 6 min

Section 01

Introduction: A New Benchmark for Evaluating LLM Reasoning Consistency Based on Chinese Chess

This article presents an evaluation framework for large language models built around Chinese Chess (Xiangqi), focused on testing reasoning consistency in sequential decision-making environments. The game's cultural grounding gives the benchmark a distinctive perspective, and the framework offers a practical tool for AI capability assessment. Traditional static question-and-answer evaluations struggle to measure reasoning stability across a sequence of decisions; because every move in a chess game depends on the moves before it, the game is well suited to testing exactly this dimension.

Section 02

Background: Why Do We Need a New Evaluation Framework?

As the capabilities of large language models improve, static question-and-answer benchmarks increasingly fail to capture their real reasoning ability; in particular, reasoning consistency in sequential decision-making scenarios is often overlooked. Chinese Chess has intuitive, easy-to-learn rules and distinctive cultural characteristics, and it demands an optimal decision from the current position at every step. This sequential decision-making structure closely mirrors real-world applications, making the game an ideal testing platform.

Section 03

Project Overview: Xiangqi-LLMs-reasoning-consistency

This project is an evaluation framework written in Java and built with Maven, designed for extensibility. Its core idea is to turn Chinese Chess games into a standardized testing environment: an LLM acts as a player, its decision patterns are observed over many rounds of play, and both its playing strength and its reasoning consistency (whether it makes contradictory decisions in similar situations) are evaluated.
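
As a sketch of this LLM-as-player idea, a model could sit behind a small player interface that receives a textual board description and a list of legal moves. The names below (`Player`, `FirstLegalMovePlayer`, `chooseMove`) are illustrative assumptions, not the project's actual API; the baseline implementation simply picks the first legal move, which is useful for testing the harness without calling any model.

```java
import java.util.List;

// Illustrative sketch only: wrapping a decision-maker (eventually an LLM)
// behind a player interface. These names are hypothetical, not the
// project's real API.
public class PlayerSketch {

    interface Player {
        // Given a textual board description and the legal moves,
        // return the chosen move in the same notation.
        String chooseMove(String boardText, List<String> legalMoves);
    }

    // Trivial baseline for harness testing: always takes the first
    // legal move instead of querying a model.
    static class FirstLegalMovePlayer implements Player {
        public String chooseMove(String boardText, List<String> legalMoves) {
            return legalMoves.get(0);
        }
    }

    public static void main(String[] args) {
        Player p = new FirstLegalMovePlayer();
        String move = p.chooseMove("initial position", List.of("C2.5", "H2+3"));
        System.out.println(move); // prints C2.5
    }
}
```

An LLM-backed implementation of the same interface would build a prompt from `boardText` and `legalMoves` and parse the model's reply back into a move string.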

Section 04

Technical Architecture and Implementation Details

The project's technical architecture is divided into three layers:

1. Chessboard State Representation Layer: encodes piece positions, side to move, move history, and related context into a format an LLM can read.
2. Interface Adaptation Layer: provides unified access to different LLM providers so that models can be switched seamlessly.
3. Evaluation Engine: drives the game loop, records decisions, computes evaluation metrics, and supports single-game analysis, batch games, and consistency-specific tests.
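
The first layer's job can be illustrated with a minimal encoder that renders a sparse piece map as a plain-text grid plus the side to move. `BoardEncoder` and its `encode` method are hypothetical names used here for illustration, not the project's real code.

```java
import java.util.Map;

// Illustrative sketch of a chessboard state representation layer:
// turning piece positions into a plain-text grid an LLM can read.
// Class and method names are assumptions, not the project's real code.
public class BoardEncoder {

    // Encode a sparse map of "row,col" -> piece symbol into a
    // 10x9 Xiangqi board grid, with empty points shown as '.'.
    static String encode(Map<String, Character> pieces, boolean redToMove) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < 10; r++) {
            for (int c = 0; c < 9; c++) {
                sb.append(pieces.getOrDefault(r + "," + c, '.'));
            }
            sb.append('\n');
        }
        sb.append(redToMove ? "Red to move" : "Black to move");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Place a single red general ('K') on its starting point
        // (row 9, column 4) and print the encoded board.
        System.out.println(encode(Map.of("9,4", 'K'), true));
    }
}
```

A real encoder would also serialize the move history and any provider-specific formatting, but the core idea, a deterministic text rendering of the position, is the same.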

Section 05

Evaluation Dimensions of Reasoning Consistency

The project proposes three innovative evaluation dimensions:

1. Situation Stability: whether the magnitude of decision changes is proportionate when the position changes only slightly.
2. Temporal Consistency: whether the strategy stays coherent over the course of a long game.
3. Explanation Consistency: whether the stated rationale for a decision matches the actual move played.
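
One simple way such a consistency score could be computed, sketched here as an assumption rather than the project's exact formula, is the fraction of paired near-identical positions in which the model's chosen move stayed the same.

```java
import java.util.List;

// Illustrative sketch of a situation-stability style metric: given the
// moves a model chose in pairs of nearly identical positions, compute
// the fraction of pairs where the decision agreed. The name and formula
// are illustrative assumptions, not the project's exact metric.
public class ConsistencyMetric {

    static double agreementRate(List<String> movesA, List<String> movesB) {
        if (movesA.size() != movesB.size() || movesA.isEmpty()) {
            throw new IllegalArgumentException("need equal-length, non-empty lists");
        }
        int same = 0;
        for (int i = 0; i < movesA.size(); i++) {
            if (movesA.get(i).equals(movesB.get(i))) same++;
        }
        return (double) same / movesA.size();
    }

    public static void main(String[] args) {
        // 3 of 4 paired decisions agree -> 0.75
        double r = agreementRate(
                List.of("C2.5", "H2+3", "R1.2", "P3+1"),
                List.of("C2.5", "H2+3", "R1.2", "C8.5"));
        System.out.println(r); // prints 0.75
    }
}
```

A score near 1.0 would indicate stable decisions under small perturbations; a low score would flag the contradictory behavior the benchmark is designed to surface.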

Section 06

Application Scenarios and Practical Value

For model developers: discover and fix reasoning defects. For researchers: a rigorous benchmark with distinctive cultural characteristics. For practical applications: the approach can transfer to fields that require consistent long-horizon decisions, such as autonomous driving, medical diagnosis, and financial trading, improving model reliability.

Section 07

Limitations and Future Outlook

Limitations: the framework currently supports only single-model evaluation, and the computation of the evaluation metrics needs refinement. Future directions: introduce chess variants to test generalization, develop visualization tools, establish public leaderboards, and explore multimodal processing of chessboard images.

Section 08

Conclusion

The Xiangqi-LLMs-reasoning-consistency project combines traditional Chinese culture with modern AI evaluation needs, opening up a new path for LLM capability assessment. As AI develops, the reliability and consistency of models in complex decision-making scenarios deserve close attention, and this project is an important step in that direction.