# Evaluating Reasoning Consistency of Large Language Models Using Chinese Chess: A New Benchmark for Sequential Decision-Making Scenarios

> This article introduces an evaluation framework for large language models built on Chinese Chess (Xiangqi). It tests the reasoning consistency of LLMs in sequential decision-making environments and offers a culturally distinctive perspective and a practical tool for AI capability assessment.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-30T21:43:21.000Z
- Last activity: 2026-03-30T21:54:30.784Z
- Popularity: 159.8
- Keywords: large language models, Chinese Chess, reasoning consistency, evaluation framework, sequential decision-making, LLM benchmarking, Java, Maven
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-bxiao42-xiangqi-llms-reasoning-consistency
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-bxiao42-xiangqi-llms-reasoning-consistency
- Markdown source: floors_fallback

---

## [Introduction] A New Benchmark for Evaluating LLM Reasoning Consistency Based on Chinese Chess

The framework uses Chinese Chess games to test whether an LLM reasons consistently across a sequence of decisions. Traditional static question-and-answer evaluations score one isolated answer at a time, so they struggle to measure how stable a model's reasoning is from one step to the next. A chess game is a chain of dependent decisions, which makes it well suited to testing this dimension, and the choice of Chinese Chess adds a distinct cultural perspective to AI capability assessment.

## Background: Why Do We Need a New Evaluation Framework?

As large language models become more capable, traditional static question-and-answer evaluations struggle to capture their real reasoning ability. Reasoning consistency in sequential decision-making scenarios is especially easy to overlook. Chinese Chess has intuitive rules and a distinct cultural character, and every move requires the best available decision for the current position. This step-by-step structure resembles many real-world applications, which makes the game a strong candidate for a testing platform.

## Project Overview: Xiangqi-LLMs-reasoning-consistency

The project is an evaluation framework written in Java and built with Maven, and it is designed to be extensible. Its core idea is to turn Chinese Chess games into a standardized test environment in which LLMs act as players. The framework observes each model's decision patterns over multiple games and evaluates both playing strength and reasoning consistency, that is, whether the model makes contradictory decisions in similar positions.
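The notion of "contradictory decisions in similar positions" can be made concrete with a small sketch. The code below is an illustration only, not the project's implementation: `ContradictionFinder`, `Observation`, and `positionKey` are hypothetical names, and the sketch assumes that similar positions have already been normalized to the same key.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not taken from the project.
final class ContradictionFinder {

    /** One recorded decision: which position the model saw and which move it chose. */
    record Observation(String positionKey, String move) {}

    /**
     * Groups observations by position and keeps only the positions where
     * the model chose more than one distinct move.
     * Assumption: "similar" positions share the same positionKey.
     */
    static Map<String, Set<String>> contradictions(List<Observation> observations) {
        Map<String, Set<String>> movesByPosition = new LinkedHashMap<>();
        for (Observation o : observations) {
            movesByPosition
                .computeIfAbsent(o.positionKey(), k -> new LinkedHashSet<>())
                .add(o.move());
        }
        // A position with a single distinct move is consistent; drop it.
        movesByPosition.values().removeIf(moves -> moves.size() < 2);
        return movesByPosition;
    }
}
```

Calling `ContradictionFinder.contradictions(...)` on a game log returns a map from position key to the set of conflicting moves. An empty map means no contradiction was observed under this (deliberately strict) definition.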

## Technical Architecture and Implementation Details

The project's technical architecture is divided into three layers:

1. Chessboard State Representation Layer: Encodes piece positions, the side to move, the move history, and related state into a format that LLMs can understand.
2. Interface Adaptation Layer: Unifies access to different LLM providers so that models can be switched seamlessly.
3. Evaluation Engine: Drives the game process, records decisions, and calculates evaluation metrics. It supports single-game analysis, batch games, and consistency-specific tests.
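A minimal sketch of how these three layers could fit together is shown below. All type names (`BoardState`, `ModelClient`, `FixedMoveClient`, `EvaluationEngine`) are hypothetical and chosen for illustration. The prompt wording, the FEN-style position string, and the coordinate move format such as `h2e2` are also assumptions, since the article does not specify the project's actual encoding.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; the project's real classes may differ.

/** Layer 1: board state rendered as text an LLM can read. */
record BoardState(String fen, boolean redToMove, List<String> history) {
    String toPrompt() {
        String moves = history.isEmpty() ? "(none)" : String.join(" ", history);
        return "Position (FEN-style): " + fen + "\n"
             + "Side to move: " + (redToMove ? "Red" : "Black") + "\n"
             + "Moves so far: " + moves + "\n"
             + "Reply with one move in coordinate notation, e.g. h2e2.";
    }
}

/** Layer 2: a single interface that every provider adapter implements. */
interface ModelClient {
    String name();
    String complete(String prompt);
}

/** Stub standing in for a real provider so the sketch runs offline. */
final class FixedMoveClient implements ModelClient {
    private final String reply;
    FixedMoveClient(String reply) { this.reply = reply; }
    public String name() { return "fixed-move-stub"; }
    public String complete(String prompt) { return reply; }
}

/** Layer 3: asks a model for a move and records the decision. */
final class EvaluationEngine {
    record Decision(String model, String prompt, String move) {}

    private final List<Decision> log = new ArrayList<>();

    Decision askForMove(ModelClient client, BoardState state) {
        String prompt = state.toPrompt();
        // Trim whitespace only; a real engine would also validate legality.
        String move = client.complete(prompt).trim();
        Decision decision = new Decision(client.name(), prompt, move);
        log.add(decision);
        return decision;
    }

    /** Read-only copy of every decision recorded so far. */
    List<Decision> log() { return List.copyOf(log); }
}
```

Because the engine depends only on `ModelClient`, swapping providers means passing a different implementation to `askForMove`. The recorded `Decision` log is what the metric calculations would later consume.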

## Evaluation Dimensions of Reasoning Consistency

The project proposes three evaluation dimensions:

1. Situation Stability: Whether the magnitude of a decision change is reasonable when the position changes only slightly.
2. Temporal Consistency: Whether the strategy remains coherent over a long game.
3. Explanation Consistency: Whether the explanation given for a decision matches the action actually taken.
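The article does not give formulas for these dimensions, so the following is only a rough sketch of how two of them might be approximated. `ConsistencyMetrics` and its methods are hypothetical. Situation stability is reduced here to a crude agreement rate, and explanation consistency to a check that the first coordinate move mentioned in the explanation equals the move played. Temporal consistency is left out because it would need a definition of "strategy" that the article does not provide.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical proxies, not the project's actual metric definitions.
final class ConsistencyMetrics {

    // Coordinate moves such as "h2e2": file a-i and rank 0-9, twice (assumed format).
    private static final Pattern MOVE = Pattern.compile("\\b[a-i][0-9][a-i][0-9]\\b");

    /**
     * Share of slightly perturbed positions on which the model chose the
     * same move as on the base position. 1.0 means fully stable.
     * Caveat: a small change can legitimately change the best move,
     * so this rate is only a coarse signal.
     */
    static double situationStability(String baseMove, List<String> perturbedMoves) {
        if (perturbedMoves.isEmpty()) {
            throw new IllegalArgumentException("need at least one perturbed move");
        }
        long same = perturbedMoves.stream().filter(baseMove::equals).count();
        return (double) same / perturbedMoves.size();
    }

    /** True when the first move named in the explanation is the move played. */
    static boolean explanationMatches(String explanation, String playedMove) {
        Matcher m = MOVE.matcher(explanation);
        return m.find() && m.group().equals(playedMove);
    }
}
```

For example, if the model plays `h2e2` on the base position and on three of four perturbed variants, `situationStability` returns 0.75. A real metric would likely weight disagreements by how much the perturbation actually changes the position.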

## Application Scenarios and Practical Value

- For model developers: Discover and fix reasoning defects.
- For researchers: Obtain a rigorous benchmark with a distinct cultural character.
- For practical applications: Transfer the approach to fields that require consistent long-term decisions, such as autonomous driving, medical diagnosis, and financial trading, in order to improve model reliability.

## Limitations and Future Outlook

- Limitations: The framework currently supports only single-model evaluation, and the calculation of evaluation metrics needs optimization.
- Future directions: Introduce chess variants to test generalization, develop visualization tools, establish public leaderboards, and explore multimodal processing of chessboard images.

## Conclusion

The Xiangqi-LLMs-reasoning-consistency project combines traditional Chinese culture with modern AI evaluation needs and opens a new path for LLM capability assessment. As AI develops, the reliability and consistency of models in complex decision-making scenarios deserve close attention, and this project is an important step in that direction.
