AI_Go_LLM: Testing the Limits of Large Language Models' Spatial Reasoning with Go

An innovative evaluation framework that quantitatively tests the real capabilities of Large Language Models (LLMs) in complex spatial reasoning and strategic decision-making tasks by comparing their move recommendations with those of KataGo, a professional Go AI.

Tags: Large Language Models · Go · Spatial Reasoning · KataGo · LLM Evaluation · DeepSeek · SGF · Decision-Making · Artificial Intelligence · Reinforcement Learning
Published 2026-05-14 10:44 · Recent activity 2026-05-14 11:01 · Estimated read: 6 min

Section 01

[Introduction] AI_Go_LLM: Exploring the Limits of Large Language Models' Spatial Reasoning with Go

AI_Go_LLM is an innovative evaluation framework that quantitatively tests the real capabilities of Large Language Models (LLMs) in complex spatial reasoning and strategic decision-making tasks by comparing their move recommendations with those of KataGo, a professional Go AI. Go, with its simple rules but extremely complex strategy space, serves as an ideal benchmark for testing AI capabilities. This project aims to answer: Can LLMs, which are primarily trained on text, understand and master Go—a highly structured spatial game?


Section 02

Project Background: Why Go Is a Touchstone for LLM Capability Boundaries

Large language models excel at natural language tasks, but their capability boundaries remain underexplored. Go was chosen as the test scenario for three key reasons:

  1. Spatial Complexity: Tracking how the global position evolves on a 19×19 grid demands strong spatial perception;
  2. Long-term Planning: Victory depends on strategic plans spanning dozens of moves, requiring an understanding of each move's impact on the future;
  3. Creative Decision-making: Strong play requires finding good moves in novel, complex positions.

By comparing against KataGo, the framework can objectively quantify the spatial reasoning performance of LLMs.

Section 03

Technical Architecture: End-to-End Evaluation Pipeline Design

AI_Go_LLM adopts a modular architecture covering the complete evaluation pipeline; illustrative sketches of the main steps follow the list:

  1. Game Record Standardization and Parsing: Use analyze_go.py to process SGF game records, supporting three representation formats: matrix, coordinates, and statistics;
  2. Dataset Construction: make_dataset.py extracts data from the first 6 moves of the opening, outputting JSONL files in Alpaca format;
  3. LLM Integration and Move Recommendation: llm_evaluator.py uses the DeepSeek model to analyze the situation and recommend moves;
  4. KataGo Benchmark Evaluation: evaluate_with_katago.py calls the KataGo engine to obtain benchmark moves;
  5. Evaluation Report Generation: Output results such as consistency ratio, performance analysis, and error statistics.
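To make step 1 concrete, here is a minimal sketch of SGF parsing with sgfmill that produces the matrix and coordinate representations. It is a hypothetical reconstruction of what analyze_go.py might do, not the project's actual code; note the simplification flagged in the docstring.

```python
# Sketch of step 1: parse an SGF record into matrix and coordinate forms.
# Hypothetical reconstruction of analyze_go.py, not the project's code.
from sgfmill import sgf

def sgf_to_matrix(path, max_moves=None):
    """Replay an SGF main line into a character matrix plus a move list.

    Returns (matrix, coords): the board as rows of '.', 'B', 'W' (top row
    first) and the moves as (colour, row, col) tuples. Captures are not
    resolved, which is harmless for opening positions but not full games.
    """
    with open(path, "rb") as f:
        game = sgf.Sgf_game.from_bytes(f.read())
    size = game.get_size()
    matrix = [["."] * size for _ in range(size)]
    coords = []
    for node in game.get_main_sequence():
        colour, move = node.get_move()
        if colour is None or move is None:  # root node or a pass
            continue
        row, col = move  # sgfmill counts rows from the bottom edge
        matrix[size - 1 - row][col] = colour.upper()
        coords.append((colour, row, col))
        if max_moves is not None and len(coords) >= max_moves:
            break
    return matrix, coords
```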
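Step 2 could then turn the first six moves into Alpaca-format records, one JSONL line per position. The instruction/input/output field names are the standard Alpaca schema; the prompt wording and coordinate encoding here are assumptions, not make_dataset.py's actual choices.

```python
# Sketch of step 2: emit Alpaca-format JSONL for the first N opening moves.
# Prompt text and coordinate encoding are illustrative assumptions.
import json

COLS = "ABCDEFGHJKLMNOPQRST"  # standard Go columns skip the letter 'I'

def moves_to_jsonl(coords, out_path, n_moves=6):
    """Write one record per move: history in, the game's actual move out."""
    instruction = ("You are a Go expert. Given the moves played so far on a "
                   "19x19 board, recommend the next move as a coordinate "
                   "like Q16.")
    with open(out_path, "w", encoding="utf-8") as f:
        for i in range(min(n_moves, len(coords))):
            history = [f"{c.upper()} {COLS[col]}{row + 1}"
                       for c, row, col in coords[:i]]
            _, row, col = coords[i]
            record = {
                "instruction": instruction,
                "input": "; ".join(history) or "(empty board)",
                "output": f"{COLS[col]}{row + 1}",
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```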
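For step 3, the DeepSeek call goes through the openai client, as the tech-stack section below notes; that makes swapping models a one-line change. The prompt and the deepseek-chat model name are assumptions; the base URL is DeepSeek's OpenAI-compatible endpoint.

```python
# Sketch of step 3: ask DeepSeek for a move via the OpenAI-compatible API.
# Prompt wording is an assumption; llm_evaluator.py may differ.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # loaded from .env in the project
    base_url="https://api.deepseek.com",     # DeepSeek's OpenAI-compatible endpoint
)

def recommend_move(board_text):
    """Return the model's move recommendation plus a one-line reason."""
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a professional Go player."},
            {"role": "user", "content": (
                f"Current position:\n{board_text}\n"
                "Reply with the best next move as a coordinate (e.g. Q16) "
                "and one sentence explaining why."
            )},
        ],
        temperature=0.0,  # deterministic output simplifies scoring
    )
    return resp.choices[0].message.content
```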
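Step 4 can talk to KataGo's JSON analysis engine, which reads one query per line on stdin and writes JSON responses on stdout. The query fields follow KataGo's analysis protocol; the binary, config, and model paths are placeholders.

```python
# Sketch of step 4: query KataGo's analysis engine for the benchmark move.
# Paths and rules are placeholders; adjust to the local KataGo install.
import json
import subprocess

def katago_best_move(moves, katago="katago", config="analysis.cfg",
                     model="model.bin.gz"):
    """Return KataGo's top move after `moves`, e.g. [["B", "Q16"], ["W", "D4"]]."""
    query = {
        "id": "q1",
        "moves": moves,
        "rules": "chinese",
        "komi": 7.5,
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": [len(moves)],  # analyze the current position
    }
    proc = subprocess.Popen(
        [katago, "analysis", "-config", config, "-model", model],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    out, _ = proc.communicate(json.dumps(query) + "\n")  # EOF ends the engine
    for line in out.splitlines():
        if not line.startswith("{"):
            continue
        resp = json.loads(line)
        if resp.get("id") == "q1" and "moveInfos" in resp:
            return resp["moveInfos"][0]["move"]  # moveInfos is sorted best-first
    return None
```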

Section 04

Tech Stack and Implementation Details

The project is built on Python 3. Key technology choices are listed below, with a configuration sketch after the list:

  • SGF Parsing: Use the sgfmill library to process game records;
  • LLM Access: Call the DeepSeek API via the openai library for easy model switching;
  • Go AI: KataGo as the benchmark, with configuration managed via environment variables;
  • Environment Management: python-dotenv to load sensitive information;
  • Data Format: JSONL for storing training data, supporting stream processing.
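As a concrete illustration of the environment-variable setup above, here is a minimal configuration sketch with python-dotenv; the variable names are illustrative, not necessarily the project's.

```python
# Sketch: load secrets and engine paths from a local .env file.
# Variable names are illustrative assumptions.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ

DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]         # required secret
KATAGO_PATH = os.getenv("KATAGO_PATH", "katago")          # engine binary
KATAGO_MODEL = os.getenv("KATAGO_MODEL", "model.bin.gz")  # network weights
```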

Section 05

Insights from the Evaluation Methodology

The design of AI_Go_LLM offers a reusable methodology for spatial-reasoning evaluation; a scoring sketch follows the list:

  1. Domain Expert Benchmark: Professional AIs (like KataGo) serve as objective evaluation standards, which are more scalable than manual annotations;
  2. Multi-dimensional Capability Decomposition: Evaluate the model's performance in different dimensions such as spatial perception and planning through targeted test scenarios;
  3. Interpretability Priority: Require LLMs to provide reasons for their moves to facilitate the identification of cognitive blind spots.
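One plausible reading of the consistency ratio from the pipeline's report step is the exact-match rate between LLM recommendations and KataGo's top moves; that definition is an assumption, and matching against KataGo's top-N list would give a softer variant. A minimal sketch:

```python
# Sketch: score LLM moves against KataGo's choices as an exact-match rate.
# The metric definition is an assumption, not the project's documented one.
def consistency_ratio(llm_moves, katago_moves):
    """Fraction of positions where the LLM's move equals KataGo's best."""
    hits = sum(m == k for m, k in zip(llm_moves, katago_moves))
    return hits / len(llm_moves) if llm_moves else 0.0

# Example: agreement on 2 of 3 positions -> ~0.67
print(consistency_ratio(["Q16", "D4", "C3"], ["Q16", "D4", "D16"]))
```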

Section 06

Future Outlook

With the development of multimodal models, AI_Go_LLM can expand in the following directions:

  • Vision-Language Integration: Combine chessboard images to test visual-spatial understanding;
  • Real-time Gameplay Ability: Evaluate the quality of continuous decision-making in complete games;
  • Teaching Ability Evaluation: Test the model's ability to explain Go concepts and guide learners.

Go will continue to push the boundaries of AI exploration.