Zing Forum

Reading

XiangQi-LLM-Arena: Evaluating Long-Range Reasoning Capabilities of Large Language Models Using Chinese Chess

An open-source scientific benchmark environment that quantitatively evaluates the long-range logical reasoning capabilities of large language models through Chinese Chess games.

中国象棋LLM评估基准测试长程推理Pikafish多步推理数据污染量化评估PyQt6NNUE
Published 2026-05-29 23:06Recent activity 2026-05-29 23:22Estimated read 7 min
XiangQi-LLM-Arena: Evaluating Long-Range Reasoning Capabilities of Large Language Models Using Chinese Chess
1

Section 01

[Introduction] XiangQi-LLM-Arena: Evaluating LLM Long-Range Reasoning Capabilities Using Chinese Chess

Introducing XiangQi-LLM-Arena—an open-source scientific benchmark environment designed to quantitatively evaluate the long-range logical reasoning capabilities of large language models (LLMs) through Chinese Chess games. This project addresses issues like data contamination and subjective standards in traditional evaluation benchmarks, providing an objective and contamination-resistant evaluation platform for LLM reasoning capabilities.

2

Section 02

Background: Challenges in LLM Reasoning Evaluation and the Potential of Chinese Chess

As LLM capabilities improve, objectively evaluating their reasoning abilities has become a core issue. Traditional benchmarks have flaws such as data contamination and subjective evaluation standards. Chinese Chess, with its unique characteristics (e.g., long-range dependencies, no risk of data contamination), has emerged as a new gold standard for evaluating LLM long-range reasoning capabilities.

3

Section 03

Core Research Questions and Reasons for Choosing Chinese Chess

Core Questions: How do state-of-the-art LLMs perform in reasoning over complex discrete game states with long-range causal dependencies? Reasons for Selection:

  1. No risk of data contamination: Large branching factor (about 40 legal moves per step) and unique game positions avoid model memorization.
  2. Long-range dependencies: Winning strategies require planning 10-30 steps ahead, testing multi-step reasoning abilities.
  3. Quantifiable standards: Uses the Pikafish engine (superhuman level, based on NNUE) to provide objective metrics like centipawn loss.
  4. Clear illegal moves: The illegal move rate directly measures the model's understanding of rules.
4

Section 04

System Architecture and Functional Features

XiangQi-LLM-Arena provides a complete testing environment with core features including:

  • Interactive chessboard interface: Based on PyQt6, supporting move highlighting, legal move prompts, animation effects, etc.
  • LLM Arena mode: LLM plays against Pikafish, with configurable thinking time, search depth, and difficulty.
  • Real-time evaluation system: Provides real-time charts for WDL probability, centipawn score, engine evaluation value, etc.
  • Research recorder: Outputs game data (moves, token consumption, latency, centipawn loss, etc.) in JSONL format.
  • Multi-provider support: Compatible with OpenAI, Anthropic Claude, and OpenAI-compatible APIs (DeepSeek, Qwen, etc.).
  • Statistical dashboard: Automatically calculates metrics like illegal move rate, average centipawn loss, and token usage.
  • Random baseline: Built-in random agent for comparative testing.
5

Section 05

Technical Implementation Details

Pikafish Engine Integration: A Chinese Chess engine based on the Stockfish architecture, using the NNUE neural network evaluation function to provide objective quality standards. Detailed Evaluation Metrics:

  • Centipawn Loss: Measures the gap between the LLM's move and the engine's optimal move (1 centipawn = 1% of a pawn's value; lower loss is better).
  • Illegal Move Rate: The frequency of illegal moves proposed by the LLM, reflecting its understanding of rules.
  • WDL Evaluation: The engine's assessment of the current position's win/draw/loss probability.
6

Section 06

Research Significance and Application Value

Contributions to LLM Research:

  1. Contamination-resistant evaluation benchmark; 2. Long-range reasoning testbed; 3. Objective performance metrics; 4. Grounding capability detection. Practical Application Scenarios:
  • Model comparison; 2. Exploration of capability boundaries; 3. Training effect verification; 4. Prompt engineering optimization.
7

Section 07

Usage and Extension

The project is developed in Python, relying on PyQt6 and the OpenAI API. Researchers can:

  • Connect their own LLM API keys;
  • Configure game parameters;
  • Export game data for analysis;
  • Extend support to other chess variants or games.
8

Section 08

Conclusion: Towards More Reliable LLM Evaluation

XiangQi-LLM-Arena represents an important evolution in LLM evaluation methods. By using Chinese Chess—a game with clear rules, quantifiable results, and resistance to contamination—as a benchmark, it helps researchers accurately understand the real reasoning capabilities of models. As LLMs are applied in critical fields, reliable and objective evaluation benchmarks become increasingly important. This project provides a valuable tool for promoting the development of AI research towards rigor and verifiability.