Zing Forum

GeoBuildBench: Evaluating Large Models' Ability to Convert Natural Language Geometric Problems into Executable Constructions

The GeoBuildBench benchmark requires models to generate geometric construction DSL programs from natural language descriptions. Evaluations on 489 Chinese textbook-style problems show that current multimodal models still face issues such as structural hallucinations and constraint satisfaction failures.

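Tags: geometric construction, benchmark, large model evaluation, program synthesis, multimodal models, executable reasoning, DSL, geometric AI
Published 2026-05-13 16:30 | Recent activity 2026-05-14 10:51 | Estimated read 5 min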

Section 01

GeoBuildBench: A New Benchmark for Evaluating Executable Geometric Construction from Natural Language

GeoBuildBench is a benchmark that assesses how well large language models (LLMs) and multimodal agents convert natural language geometric problems into executable domain-specific language (DSL) programs. Existing benchmarks focus on answer correctness or static image understanding; GeoBuildBench instead targets the interactive, constructive nature of geometry. It uses 489 Chinese textbook-style problems and reveals key limitations of current models, such as structural hallucinations and constraint satisfaction failures.


Section 02

Limitations of Traditional Geometric AI Benchmarks

Traditional geometric AI benchmarks have two main flaws:

  1. Focus on answer correctness: They check whether a model reaches the right answer but ignore whether its reasoning is geometrically constructible, so a model may guess through pattern matching rather than understanding.
  2. Static image understanding: They analyze given diagrams and neglect the dynamic, step-by-step nature of geometric construction.

GeoBuildBench addresses both flaws by treating geometric diagrams as interactive tasks that require executable DSL programs.

Section 03

Task Definition, DSL Design, and Dataset Details

Task: Convert a natural language geometric problem (e.g., "Construct the circumcircle of triangle ABC") into a DSL program that generates a valid diagram satisfying all stated constraints.

DSL: The language balances expressiveness and executability. It provides basic primitives (points, lines, circles), composite constructs (such as angle bisectors), and constraint declarations, and its programs can be executed in standard environments.

Dataset: 489 carefully selected problems from Chinese middle and high school textbooks. Quality control checks text completeness, constructibility, and clarity of constraints, and the problems range from basic to complex difficulty.
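To make the task concrete, here is a minimal Python sketch of what an executable construction for the circumcircle example might look like. It is not the GeoBuildBench DSL, whose syntax is not shown here: the step format `(name, op, args)`, the `run` interpreter, and the helper names `circumcircle` and `on_circle` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Circle:
    center: Point
    radius: float


def circumcircle(a: Point, b: Point, c: Point) -> Circle:
    """Composite construct: the circle through three non-collinear points."""
    d = 2.0 * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear; no circumcircle exists")
    a2, b2, c2 = a.x**2 + a.y**2, b.x**2 + b.y**2, c.x**2 + c.y**2
    ux = (a2 * (b.y - c.y) + b2 * (c.y - a.y) + c2 * (a.y - b.y)) / d
    uy = (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d
    return Circle(Point(ux, uy), math.dist((ux, uy), (a.x, a.y)))


def on_circle(p: Point, k: Circle, tol: float = 1e-9) -> bool:
    """Constraint check: does point p lie on circle k (within tol)?"""
    return abs(math.dist((p.x, p.y), (k.center.x, k.center.y)) - k.radius) <= tol


def run(program):
    """Execute (name, op, args) steps in order; string args refer to earlier objects."""
    ops = {"point": Point, "circumcircle": circumcircle}  # hypothetical op table
    env = {}
    for name, op, args in program:
        resolved = [env[a] if isinstance(a, str) else a for a in args]
        env[name] = ops[op](*resolved)
    return env


# Hypothetical program for "Construct the circumcircle of triangle ABC".
program = [
    ("A", "point", (0.0, 0.0)),
    ("B", "point", (4.0, 0.0)),
    ("C", "point", (0.0, 3.0)),
    ("k", "circumcircle", ("A", "B", "C")),
]
env = run(program)

# The construction is valid only if every vertex lies on the circle.
constraints_ok = all(on_circle(env[v], env["k"]) for v in "ABC")
```

The point of the sketch is the separation of concerns: the program only names construction steps, while validity is decided by executing it and checking the declared constraints numerically.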


Section 04

Evaluation Results and Key Challenges

Assessments of state-of-the-art multimodal models show:

  • Limited success: Models solve some problems but struggle with object omission, constraint violations (e.g., circles labeled as tangent that are not actually tangent), and hallucinated constructs (objects the problem never mentions).
  • Poor feedback utilization: Models often fail to correct their errors even when given explicit feedback, which suggests surface pattern matching rather than deep geometric reasoning.

The underlying challenges include semantic gaps (geometric knowledge the problem text only implies), the complexity of program synthesis, the requirement that outputs be verifiable, and the coordination of many interdependent construction steps.
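The failure modes above can be detected mechanically once a construction has been executed. The following Python sketch shows one simple way to do so; it is an illustration under stated assumptions, not the benchmark's actual evaluator. The function names `circles_tangent` and `diagnose`, the set-based comparison of object names, and the tolerance value are all hypothetical.

```python
import math


def circles_tangent(c1, r1, c2, r2, tol=1e-6):
    """True if two circles touch at exactly one point, externally or internally."""
    d = math.dist(c1, c2)
    if d <= tol:
        return False  # concentric circles never touch at a single point
    return abs(d - (r1 + r2)) <= tol or abs(d - abs(r1 - r2)) <= tol


def diagnose(required, constructed, constraints):
    """Sort a construction's errors into the three failure modes.

    required / constructed: sets of object names.
    constraints: mapping from a constraint label to whether it holds.
    Simplification: any object not named in the problem counts as
    hallucinated, which would also flag legitimate auxiliary objects.
    """
    return {
        "omitted": sorted(required - constructed),
        "hallucinated": sorted(constructed - required),
        "violated": sorted(label for label, ok in constraints.items() if not ok),
    }


# Hypothetical model output: point B is missing, point M was invented, and
# the two circles (radius 2, centers 5 apart) are claimed to be tangent.
report = diagnose(
    required={"A", "B", "O1", "O2"},
    constructed={"A", "O1", "O2", "M"},
    constraints={"tangent(O1,O2)": circles_tangent((0, 0), 2, (5, 0), 2)},
)
```

A report of this shape is also the kind of explicit feedback that could be passed back to a model for a correction attempt.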

Section 05

Research Significance of GeoBuildBench

GeoBuildBench goes beyond geometric problem-solving to evaluate grounded reasoning:

  • Groundedness: Anchors understanding in executable formal representations.
  • Verifiability: Uses constraint solvers for objective evaluation.
  • Interpretability: DSL programs serve as transparent reasoning traces.
  • Practicality: Applicable to education software and CAD tools.

Section 06

Future Directions and Open Source Initiative

Future research directions include:

  1. Combining construction and proof tasks.
  2. Interactive learning with feedback for model improvement.
  3. Better multimodal fusion of text and visual reasoning.
  4. Extending to 3D geometry and analytic geometry.

The benchmark and code are open-sourced to encourage community contributions.