Zing Forum


GeoBuildBench: Evaluating Large Models' Ability to Convert Natural-Language Geometry Problems into Executable Constructions

The GeoBuildBench benchmark requires models to generate geometric-construction DSL programs from natural-language descriptions. Evaluation on 489 Chinese textbook-style problems shows that current multimodal models still suffer from structural hallucinations and constraint-satisfaction failures.

Tags: geometric construction · benchmark · LLM evaluation · program synthesis · multimodal models · executable reasoning · DSL · geometry AI
Published 2026/05/13 16:30 · Last activity 2026/05/14 10:51 · Estimated reading time: 5 minutes
Section 01

GeoBuildBench: A New Benchmark for Evaluating Executable Geometric Construction from Natural Language

GeoBuildBench is a novel benchmark designed to assess the ability of large language models (LLMs) and multimodal agents to convert natural-language geometric problems into executable domain-specific language (DSL) programs. Unlike existing benchmarks that focus on answer correctness or static image understanding, it emphasizes the interactive, constructive nature of geometry. The benchmark comprises 489 Chinese textbook-style problems and reveals key limitations of current models, such as structural hallucinations and constraint-satisfaction failures.

Section 02

Limitations of Traditional Geometric AI Benchmarks

Traditional geometric AI benchmarks have two main flaws:

  1. Focus on answer correctness: They prioritize whether models reach the right answer but ignore whether the reasoning process is geometrically constructible (models might guess via pattern matching instead of understanding).
  2. Static image understanding: They focus on analyzing given diagrams, neglecting geometry's dynamic, step-by-step construction nature.

GeoBuildBench addresses both flaws by treating geometric diagrams as interactive construction tasks that require executable DSL programs.
Section 03

Task Definition, DSL Design, and Dataset Details

  • Task: Convert natural-language geometric problems (e.g., "Construct the circumcircle of triangle ABC") into DSL programs that generate valid diagrams satisfying all stated constraints.
  • DSL: Balances expressiveness and executability, providing basic primitives (points, lines, circles), composite constructs (e.g., angle bisectors), constraint declarations, and execution in standard environments.
  • Dataset: 489 carefully selected problems from Chinese junior and senior high school textbooks, with quality control (text completeness, constructibility, clear constraints) and coverage from basic to complex difficulty levels.
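The benchmark's actual DSL is not reproduced in this summary, but its flavor can be sketched. Below is a minimal Python stand-in for the example task "Construct the circumcircle of triangle ABC" (the `Point` and `circumcircle` names are hypothetical illustrations, not the benchmark's real primitives); the point is that the output is an executable program whose declared constraint can be checked mechanically:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

def circumcircle(a: Point, b: Point, c: Point):
    """Return (center, radius) of the circle through three non-collinear points."""
    d = 2 * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear: no circumcircle exists")
    ux = ((a.x**2 + a.y**2) * (b.y - c.y)
          + (b.x**2 + b.y**2) * (c.y - a.y)
          + (c.x**2 + c.y**2) * (a.y - b.y)) / d
    uy = ((a.x**2 + a.y**2) * (c.x - b.x)
          + (b.x**2 + b.y**2) * (a.x - c.x)
          + (c.x**2 + c.y**2) * (b.x - a.x)) / d
    center = Point(ux, uy)
    return center, math.hypot(a.x - ux, a.y - uy)

# "Construct the circumcircle of triangle ABC" as an executable program:
A, B, C = Point(0, 0), Point(4, 0), Point(0, 3)
center, r = circumcircle(A, B, C)
# The declared constraint — all three vertices lie on the circle — is checkable:
assert all(math.isclose(math.hypot(p.x - center.x, p.y - center.y), r)
           for p in (A, B, C))
```

Execution plus assertion is what distinguishes a constructive answer from a pattern-matched one: a program either produces a diagram meeting the constraints or it fails verifiably.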

Section 04

Evaluation Results and Key Challenges

Assessments of state-of-the-art multimodal models show:

  • Limited success: Models solve some problems but struggle with core issues like object omission, constraint violations (e.g., non-tangent circles labeled as tangent), and hallucinated constructs (inventing unmentioned objects).
  • Poor feedback utilization: Models fail to correct errors effectively even with explicit feedback, suggesting surface pattern matching over deep geometric reasoning.

Key challenges include semantic gaps (implied geometric knowledge), program-synthesis complexity, verifiability requirements, and combinatorial coordination of construction steps.
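The tangency failure mode mentioned above illustrates why these errors are objectively detectable: a claimed tangency is a numeric predicate, not a judgment call. A minimal sketch (the function name and tolerance are illustrative, not the benchmark's actual checker):

```python
import math

def circles_tangent(c1, r1, c2, r2, eps=1e-9):
    """True if two circles are externally or internally tangent.

    c1, c2: (x, y) centers; r1, r2: radii.
    External tangency: center distance equals r1 + r2.
    Internal tangency: center distance equals |r1 - r2| (centers distinct).
    """
    d = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    external = math.isclose(d, r1 + r2, abs_tol=eps)
    internal = d > eps and math.isclose(d, abs(r1 - r2), abs_tol=eps)
    return external or internal

# Externally tangent: center distance 5 equals 2 + 3
print(circles_tangent((0, 0), 2, (5, 0), 3))   # True
# Intersecting, not tangent — the error class reported in the evaluation
print(circles_tangent((0, 0), 2, (3, 0), 3))   # False
```

A model that emits the second configuration while labeling it "tangent" fails this check regardless of how plausible its rendered diagram looks.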
Section 05

Research Significance of GeoBuildBench

GeoBuildBench goes beyond geometric problem-solving to evaluate grounded reasoning:

  • Groundedness: Anchors understanding in executable formal representations.
  • Verifiability: Uses constraint solvers for objective evaluation.
  • Interpretability: DSL programs serve as transparent reasoning traces.
  • Practicality: Applicable to education software and CAD tools.
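The verifiability point can be made concrete: an executed diagram can be scored against its declared constraints as numeric residuals rather than by inspecting a rendered image. A hedged sketch, assuming simple residual-style constraints (the names `on_circle`, `on_line`, and `satisfied` are illustrative, not the paper's solver interface):

```python
import math

def on_circle(p, center, r):
    """Residual of the constraint 'point p lies on the circle (center, r)'."""
    return abs(math.hypot(p[0] - center[0], p[1] - center[1]) - r)

def on_line(p, a, b):
    """Residual of 'p lies on line AB': perpendicular distance from p to AB."""
    ax, ay = a
    bx, by = b
    length = math.hypot(bx - ax, by - ay)
    return abs((bx - ax) * (ay - p[1]) - (ax - p[0]) * (by - ay)) / length

def satisfied(residuals, eps=1e-6):
    """A diagram passes only if every declared constraint's residual is ~0."""
    return all(res < eps for res in residuals)

# The midpoint M of segment AB must lie on line AB — residual is zero:
A, B, M = (0.0, 0.0), (4.0, 2.0), (2.0, 1.0)
print(satisfied([on_line(M, A, B)]))           # True
# A point placed off the line fails objectively:
print(satisfied([on_line((2.0, 1.5), A, B)]))  # False
```

Because the pass/fail decision is a residual threshold, the same evaluation runs identically for every model, which is what makes the benchmark's grading objective and its DSL programs usable as transparent reasoning traces.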
Section 06

Future Directions and Open Source Initiative

Future research directions include:

  1. Combining construction and proof tasks.
  2. Interactive learning with feedback for model improvement.
  3. Better multimodal fusion of text and visual reasoning.
  4. Extending to 3D geometry and analytic geometry.

The benchmark and code are open-sourced to encourage community contributions.