# GeoBuildBench: Evaluating Large Models' Ability to Convert Natural Language Geometric Problems into Executable Constructions

> The GeoBuildBench benchmark requires models to generate geometric construction DSL programs from natural language descriptions. Evaluations on 489 Chinese textbook-style problems show that current multimodal models still face issues such as structural hallucinations and constraint satisfaction failures.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-13T08:30:12.000Z
- Last activity: 2026-05-14T02:51:18.766Z
- Popularity: 123.7
- Keywords: geometric construction, benchmarking, large-model evaluation, program synthesis, multimodal models, executable reasoning, DSL, geometric AI
- Page link: https://www.zingnex.cn/en/forum/thread/geobuildbench
- Canonical: https://www.zingnex.cn/forum/thread/geobuildbench
- Markdown source: floors_fallback

---

## GeoBuildBench: A New Benchmark for Evaluating Executable Geometric Construction from Natural Language

GeoBuildBench is a benchmark designed to assess the ability of large language models (LLMs) and multimodal agents to convert natural language geometric problems into executable domain-specific language (DSL) programs. Where existing benchmarks focus on answer correctness or static image understanding, GeoBuildBench fills a gap by emphasizing the interactive, constructive nature of geometry. The benchmark comprises 489 Chinese textbook-style problems, and evaluations on it reveal key limitations of current models, such as structural hallucinations and constraint satisfaction failures.

## Limitations of Traditional Geometric AI Benchmarks

Traditional geometric AI benchmarks have two main flaws:
1. **Focus on answer correctness**: They prioritize whether models reach the right answer but ignore whether the underlying reasoning is geometrically constructible, so models can succeed by pattern matching rather than genuine understanding.
2. **Static image understanding**: They focus on analyzing given diagrams, neglecting geometry's dynamic, step-by-step construction nature.
GeoBuildBench addresses these by treating geometric diagrams as interactive tasks requiring executable DSL programs.

## Task Definition, DSL Design, and Dataset Details

**Task**: Convert natural language geometric problems (e.g., "Construct the circumcircle of triangle ABC") into DSL programs that generate valid diagrams meeting all constraints.
**DSL**: Balances expressiveness and executability, offering basic primitives (points, lines, circles), composite constructs (e.g., angle bisectors), and constraint declarations; the resulting programs run in standard execution environments.
**Dataset**: 489 carefully selected Chinese middle and high school textbook problems, with quality control (text completeness, constructibility, clear constraints) and covering basic to complex difficulty levels.
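The benchmark's actual DSL is not reproduced in this summary, so the following is only an illustrative sketch of what an executable construction for the cited example ("Construct the circumcircle of triangle ABC") might look like: a numeric helper plus a step-list-style program. All names here are hypothetical, not the benchmark's API.

```python
# Hypothetical sketch (not GeoBuildBench's real DSL): an executable
# construction for "Construct the circumcircle of triangle ABC".

def circumcircle(a, b, c):
    """Return (center, radius) of the circle through points a, b, c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    # Twice the signed area; zero means the points are collinear.
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        raise ValueError("degenerate (collinear) triangle")
    # Intersection of the perpendicular bisectors of AB and AC.
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    r = ((ax - ux) ** 2 + (ay - uy) ** 2) ** 0.5
    return (ux, uy), r

# A DSL program could then read as a sequence of named construction steps:
program = [
    ("point", "A", (0.0, 0.0)),
    ("point", "B", (4.0, 0.0)),
    ("point", "C", (0.0, 3.0)),
    ("circumcircle", "O", ["A", "B", "C"]),
]
```

A 3-4-5 right triangle makes the result easy to check by hand: the circumcenter is the midpoint of the hypotenuse, (2, 1.5), with radius 2.5.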

## Evaluation Results and Key Challenges

Assessments of state-of-the-art multimodal models show:
- **Limited success**: Models solve some problems but struggle with core issues like object omission, constraint violations (e.g., non-tangent circles labeled as tangent), and hallucinated constructs (inventing unmentioned objects).
- **Poor feedback utilization**: Models fail to effectively correct errors even with explicit feedback, suggesting surface pattern matching over deep geometric reasoning.
Challenges include semantic gaps (implied geometric knowledge), program synthesis complexity, verifiability requirements, and combinatorial step coordination.
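The tangency failure mentioned above is mechanically checkable, which is why constraint violations are detectable at all. As a minimal sketch (the benchmark's real checker is not described here), two circles are tangent exactly when the distance between their centers equals the sum (external tangency) or the absolute difference (internal tangency) of their radii:

```python
import math

def circles_tangent(c1, r1, c2, r2, tol=1e-6):
    """True if two circles are externally or internally tangent."""
    d = math.dist(c1, c2)
    return abs(d - (r1 + r2)) < tol or abs(d - abs(r1 - r2)) < tol

# A model labels these circles "tangent"; the checker disagrees,
# since the center distance is 3 but the radii sum to only 2.5.
claimed_ok = circles_tangent((0, 0), 1.0, (3, 0), 1.5)   # False
```

A checker of this form turns the qualitative complaint "non-tangent circles labeled as tangent" into an objective pass/fail signal.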

## Research Significance of GeoBuildBench

GeoBuildBench goes beyond geometric problem-solving to evaluate grounded reasoning:
- **Groundedness**: Anchors understanding in executable formal representations.
- **Verifiability**: Uses constraint solvers for objective evaluation.
- **Interpretability**: DSL programs serve as transparent reasoning traces.
- **Practicality**: Applicable to education software and CAD tools.
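One plausible shape for the verifiability claim (an assumption, since the summary does not specify the solver) is residual-based checking: each declared constraint is scored as a numeric residual, and a construction is accepted only when every residual is near zero. All function names below are illustrative.

```python
import math

def residual_on_circle(p, center, r):
    """How far point p is from lying on the given circle (0 = satisfied)."""
    return abs(math.dist(p, center) - r)

def verify(residuals, tol=1e-6):
    """Accept the construction only if every constraint residual is ~0."""
    return all(res <= tol for res in residuals)

# Claimed circumcircle of the triangle (0,0), (4,0), (0,3):
center, r = (2.0, 1.5), 2.5
residuals = [residual_on_circle(p, center, r)
             for p in [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]]
ok = verify(residuals)  # all three vertices lie on the claimed circle
```

Because every constraint reduces to a number, the evaluation is objective, and the list of residuals doubles as an interpretable trace of which constraints a program satisfies.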

## Future Directions and Open Source Initiative

Future research directions include:
1. Combining construction and proof tasks.
2. Interactive learning with feedback for model improvement.
3. Better multimodal fusion of text and visual reasoning.
4. Extending to 3D geometry and analytic geometry.
The benchmark and code are open-sourced to encourage community contributions.
