XiangQi-LLM-Arena: Evaluating Long-Range Reasoning Capabilities of Large Language Models Using Chinese Chess

An open-source scientific benchmark environment that quantitatively evaluates the long-range logical reasoning capabilities of large language models through Chinese Chess games.

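Tags: Chinese Chess, LLM evaluation, benchmark, long-range reasoning, Pikafish, multi-step reasoning, data contamination, quantitative evaluation, PyQt6, NNUE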

Published 2026-05-29 23:06 | Recent activity 2026-05-29 23:22 | Estimated read 7 min

Section 01

[Introduction] XiangQi-LLM-Arena: Evaluating LLM Long-Range Reasoning Capabilities Using Chinese Chess

XiangQi-LLM-Arena is an open-source scientific benchmark environment that quantitatively evaluates the long-range logical reasoning of large language models (LLMs) through Chinese Chess games. It targets two weaknesses of traditional evaluation benchmarks, data contamination and subjective scoring standards, and aims to provide an objective, contamination-resistant platform for evaluating LLM reasoning.

Section 02

Background: Challenges in LLM Reasoning Evaluation and the Potential of Chinese Chess

As LLM capabilities improve, evaluating their reasoning objectively has become a core problem. Traditional benchmarks suffer from data contamination and subjective evaluation standards. Chinese Chess combines long-range dependencies with strong resistance to data contamination, and the project proposes it as a new standard for evaluating LLM long-range reasoning.

Section 03

Core Research Questions and Reasons for Choosing Chinese Chess

Core Question: How do state-of-the-art LLMs perform when reasoning over complex discrete game states with long-range causal dependencies?

Reasons for Selection:

Resistance to data contamination: With a large branching factor (about 40 legal moves per position), games quickly reach positions that are unlikely to appear in training data, so memorization does not help.
Long-range dependencies: Winning strategies require planning 10 to 30 moves ahead, which tests multi-step reasoning.
Quantifiable standards: Uses the Pikafish engine (superhuman level, based on NNUE) to provide objective metrics like centipawn loss.
Clear illegal moves: The illegal move rate directly measures the model's understanding of rules.
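
The contamination argument above rests on simple arithmetic, which can be sketched as follows. This is a rough upper bound that assumes a constant branching factor of 40 and ignores transpositions (different move orders reaching the same position); the function name is illustrative and not part of the project.

```python
def sequence_count(branching_factor: int, depth: int) -> int:
    """Upper bound on the number of distinct move sequences of a given length."""
    return branching_factor ** depth


# Planning horizons mentioned in the text: 10 to 30 moves ahead.
for depth in (10, 20, 30):
    count = sequence_count(40, depth)
    print(f"{depth} moves ahead: about {count:.2e} possible sequences")
```

Even at the short end of that horizon the count is on the order of 10^16, which is why a verbatim match between a live game and memorized training data becomes unlikely after the opening.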

Section 04

System Architecture and Functional Features

XiangQi-LLM-Arena provides a complete testing environment with core features including:

Interactive chessboard interface: Based on PyQt6, supporting move highlighting, legal move prompts, animation effects, etc.
LLM Arena mode: LLM plays against Pikafish, with configurable thinking time, search depth, and difficulty.
Real-time evaluation system: Provides real-time charts for WDL probability, centipawn score, engine evaluation value, etc.
Research recorder: Outputs game data (moves, token consumption, latency, centipawn loss, etc.) in JSONL format.
Multi-provider support: Compatible with OpenAI, Anthropic Claude, and OpenAI-compatible APIs (DeepSeek, Qwen, etc.).
Statistical dashboard: Automatically calculates metrics like illegal move rate, average centipawn loss, and token usage.
Random baseline: Built-in random agent for comparative testing.
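
As a minimal sketch of how the research recorder and the statistical dashboard could fit together, the example below writes one JSON object per move and then aggregates the metrics named above. The field names (`ply`, `move`, `legal`, `cp_loss`, `tokens`, `latency_s`) are assumptions made for illustration; the project's actual JSONL schema is not described here.

```python
import io
import json
from statistics import mean


def write_record(stream, record: dict) -> None:
    """Append one move record as a single JSON line."""
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def summarize(stream) -> dict:
    """Aggregate per-move records into dashboard-style metrics."""
    records = [json.loads(line) for line in stream if line.strip()]
    legal = [r for r in records if r["legal"]]
    return {
        "moves": len(records),
        "illegal_move_rate": 1 - len(legal) / len(records) if records else 0.0,
        # Centipawn loss is only defined for moves that were actually played.
        "avg_centipawn_loss": mean(r["cp_loss"] for r in legal) if legal else None,
        "total_tokens": sum(r["tokens"] for r in records),
    }


# An in-memory buffer stands in for a .jsonl file on disk.
buffer = io.StringIO()
write_record(buffer, {"ply": 1, "move": "h2e2", "legal": True, "cp_loss": 12, "tokens": 310, "latency_s": 1.8})
write_record(buffer, {"ply": 3, "move": "e2e9", "legal": False, "cp_loss": None, "tokens": 295, "latency_s": 2.1})
write_record(buffer, {"ply": 3, "move": "h0g2", "legal": True, "cp_loss": 48, "tokens": 402, "latency_s": 2.4})

buffer.seek(0)
stats = summarize(buffer)
print(stats)
```

Keeping one record per attempted move, including rejected ones, lets the illegal move rate and the centipawn loss be computed from the same file.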

Section 05

Technical Implementation Details

Pikafish Engine Integration: A Chinese Chess engine based on the Stockfish architecture, using the NNUE neural network evaluation function to provide objective quality standards.

Detailed Evaluation Metrics:

Centipawn Loss: Measures the gap between the LLM's move and the engine's optimal move (1 centipawn = 1% of a pawn's value; lower loss is better).
Illegal Move Rate: The frequency of illegal moves proposed by the LLM, reflecting its understanding of rules.
WDL Evaluation: The engine's assessment of the current position's win/draw/loss probability.
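
The metrics above can be illustrated with a short sketch. It assumes the engine reports evaluations in UCI-style `info` lines of the form `score cp <n> wdl <w> <d> <l>`, with WDL values given per mille, as Stockfish-derived engines commonly do when WDL output is enabled. How XiangQi-LLM-Arena actually reads Pikafish output is not specified, so the helper names and the sample line are hypothetical.

```python
import re

# Matches "score cp 35" optionally followed by "wdl 180 770 50".
INFO_RE = re.compile(r"score cp (-?\d+)(?: wdl (\d+) (\d+) (\d+))?")


def parse_info(line: str):
    """Return (centipawns, (win, draw, loss) or None), or None if no cp score is present."""
    match = INFO_RE.search(line)
    if match is None:
        return None  # e.g. a "score mate N" line, which this sketch does not handle
    cp = int(match.group(1))
    wdl = None
    if match.group(2) is not None:
        w, d, l = (int(match.group(i)) for i in (2, 3, 4))
        total = w + d + l
        wdl = (w / total, d / total, l / total)
    return cp, wdl


def centipawn_loss(best_cp: int, played_cp: int) -> int:
    """Gap between the engine's best move and the played move.

    Both scores must be from the moving side's point of view; the result is
    clamped at zero because search noise can make a played move look better.
    """
    return max(0, best_cp - played_cp)


sample = "info depth 18 score cp 35 wdl 180 770 50 nodes 120000"
cp, wdl = parse_info(sample)
print(cp, wdl)
print(centipawn_loss(best_cp=35, played_cp=-40))
```

Averaging `centipawn_loss` over all legal moves in a game gives the average centipawn loss reported by the dashboard.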

Section 06

Research Significance and Application Value

Contributions to LLM Research:

1. Contamination-resistant evaluation benchmark
2. Long-range reasoning testbed
3. Objective performance metrics
4. Grounding capability detection

Practical Application Scenarios:

1. Model comparison
2. Exploration of capability boundaries
3. Training effect verification
4. Prompt engineering optimization

Section 07

Usage and Extension

The project is developed in Python, relying on PyQt6 and the OpenAI API. Researchers can:

Connect their own LLM API keys;
Configure game parameters;
Export game data for analysis;
Extend support to other chess variants or games.
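
One way to picture the extension points listed above is a small agent interface that any model backend, or the random baseline, could implement. This is a sketch under assumptions: the `Agent` protocol, `RandomAgent`, and `is_legal` are hypothetical names, and moves are written in the coordinate style used by UCI-like Xiangqi engines (files a to i, ranks 0 to 9), which may differ from the project's own format.

```python
import random
from typing import Optional, Protocol, Sequence


class Agent(Protocol):
    """Anything that can pick a move for a given position."""

    def choose_move(self, fen: str, legal_moves: Sequence[str]) -> str: ...


class RandomAgent:
    """Baseline that picks uniformly among the legal moves."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)

    def choose_move(self, fen: str, legal_moves: Sequence[str]) -> str:
        return self._rng.choice(list(legal_moves))


def is_legal(reply: str, legal_moves: Sequence[str]) -> bool:
    """Check a model's reply against the legal move list, ignoring case and padding."""
    return reply.strip().lower() in legal_moves


# Commonly used FEN for the Xiangqi starting position, with a few legal opening moves.
START_FEN = "rnbakabnr/9/1c5c1/p1p1p1p1p/9/9/P1P1P1P1P/1C5C1/9/RNBAKABNR w - - 0 1"
legal = ["h2e2", "b2e2", "h0g2"]

agent: Agent = RandomAgent(seed=0)
move = agent.choose_move(START_FEN, legal)
print(move, is_legal(move, legal))
```

An LLM-backed agent would implement the same `choose_move` method by sending the position to its provider and passing the reply through `is_legal`, so the arena loop does not need to know which backend it is talking to.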

Section 08

Conclusion: Towards More Reliable LLM Evaluation

XiangQi-LLM-Arena offers a different approach to LLM evaluation. By using Chinese Chess, a game with clear rules, quantifiable results, and resistance to contamination, as a benchmark, it helps researchers measure the actual reasoning capabilities of models more accurately. As LLMs are applied in critical fields, reliable and objective evaluation benchmarks become increasingly important, and this project provides a useful tool for making AI research more rigorous and verifiable.
