LLM Math Hallucination Evaluator: Unveiling Algebraic Vulnerabilities of Large Language Models via Perturbation Testing

A research framework based on perturbation testing that systematically detects 'surface dependency' hallucinations in large language models' mathematical reasoning. It reveals systematic model vulnerability, with an overall accuracy of only 57.6%, and verifies the effectiveness of intervention strategies such as symbolic scaffolding.

LLM · math hallucination · perturbation testing · algebra · symbolic reasoning · AI reliability · SymPy · prompt engineering
Published 2026-05-15 23:56 · Recent activity 2026-05-16 00:04 · Estimated read: 5 min

Section 01

[Introduction] LLM Math Hallucination Evaluator: Uncovering Surface Dependency Issues in Algebraic Reasoning

This article introduces the llm-math-hallucination-evaluator project, which uses a perturbation testing framework to systematically detect 'surface dependency' hallucinations in large language models' mathematical reasoning. The study found an overall model accuracy of only 57.6%, exposing significant systematic vulnerability, and verified that intervention strategies such as symbolic scaffolding can effectively improve model reliability.


Section 02

Project Background and Core Issues

Traditional hallucination research focuses mostly on factual errors, but mathematical hallucinations are subtler: a model's answers fluctuate across different surface expressions of equivalent mathematical problems, revealing a reliance on surface form rather than deep logic. The project's core insight is that a model which truly understands mathematics should give consistent answers to equivalent problems; if it does not, it exhibits 'surface dependency' hallucinations.
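
For example (an illustrative case, not drawn from the project's data), "solve 2x + 3 = 7" and "solve 3 + 2x = 7" are one problem in two surface forms, and a quick SymPy check confirms they share the solution x = 2:

    import sympy as sp

    x = sp.symbols("x")

    # Two surface forms of the same equation; a model that understands
    # the math should answer x = 2 for both phrasings.
    print(sp.solve(sp.Eq(2*x + 3, 7), x))  # [2]
    print(sp.solve(sp.Eq(3 + 2*x, 7), x))  # [2]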


Section 03

Technical Architecture: Perturbation Engine and Evaluation System

The project's technical architecture consists of three parts:

  1. Perturbation Engine: Generates 10 expression variants per problem (6 semantic-preserving, 4 adversarial traps) to test model consistency (see the sketch after this list);
  2. Hallucination Classification System: Sorts errors into 6 categories, such as external variable invention (most frequently triggered by identity multiplication traps, e.g. wrapping an equation in a factor like y/y that tempts the model to treat the spurious y as an unknown) and domain transformation;
  3. Evaluation Metrics: Expression Consistency Score (ECS), accuracy, and a robustness score that weights consistency and accuracy together.
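
The sketch below shows how these three parts can fit together in Python. It is a minimal illustration rather than the project's actual code: ask_model() is a hypothetical LLM call, the variant templates are examples, majority-vote consistency is only a proxy for ECS, and the equal weighting in the robustness score is an assumption.

    from collections import Counter

    import sympy as sp

    x = sp.symbols("x")

    def semantic_preserving_variants(lhs, rhs):
        """Rephrase the equation lhs = rhs without changing its solution set."""
        return [
            f"Solve {lhs} = {rhs} for x.",
            f"Solve {sp.expand(lhs - rhs)} = 0 for x.",  # all terms on one side
            f"Find x such that {rhs} = {lhs}.",          # sides swapped
        ]

    def gold_answer(lhs, rhs):
        """SymPy's solution set, used as the gold standard for correctness."""
        return sp.solve(sp.Eq(lhs, rhs), x)

    def consistency(answers):
        """Share of answers matching the majority answer (an ECS-style proxy)."""
        counts = Counter(answers)
        return counts.most_common(1)[0][1] / len(answers)

    def robustness(consistency_score, accuracy):
        """Weight consistency and accuracy together; equal weights are assumed."""
        return 0.5 * consistency_score + 0.5 * accuracy

    # Usage sketch: one answer per variant from a hypothetical ask_model().
    # answers = [ask_model(v) for v in semantic_preserving_variants(2*x + 3, 7)]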

Section 04

Experimental Results: Systematic Vulnerability and Intervention Effects

Results from the large-scale experiment (900 queries):

  • Overall accuracy is only 57.6%;
  • Adversarial traps are markedly destructive: the identity multiplication trap alone triggered 52 external variable invention errors;
  • Symbolic Scaffolding Strategy: Completely eliminated hallucinations in the DeepSeek-Chat model, yielding a robustness score of 1.0 (a prompt sketch follows this list).
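
A sketch of what a symbolic scaffolding prompt can look like. The source does not quote the project's exact prompt, so the wording below, including the step that discards factors identically equal to 1 (the defense against the identity multiplication trap), is an assumption for illustration.

    # Illustrative scaffolding prompt: the model must canonicalize the
    # problem symbolically before solving. Wording is assumed, not the
    # project's actual prompt.
    SCAFFOLD_TEMPLATE = """\
    Step 1: Rewrite the problem as one canonical equation in x, with every
    term on the left-hand side and 0 on the right. Discard any factor that
    is identically equal to 1.
    Step 2: Solve the canonical equation step by step.
    Step 3: Report only the final value of x.

    Problem: {problem}
    """

    def scaffolded(problem: str) -> str:
        """Wrap a raw problem statement in the scaffolding template."""
        return SCAFFOLD_TEMPLATE.format(problem=problem)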

Section 05

Technology Stack and Implementation Details

The project is developed in Python and relies on the SymPy library for symbolic math parsing and canonical normalization, which serves as the 'gold standard' for answer correctness. It integrates models from multiple LLM providers through the OpenRouter API, simplifying cross-model comparison.
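
A minimal sketch of this stack, assuming OpenRouter's OpenAI-compatible endpoint, an OPENROUTER_API_KEY environment variable, a prompt that elicits a bare final expression, and an illustrative model ID; none of these details are confirmed by the source as the project's own configuration.

    import os

    import sympy as sp
    from openai import OpenAI

    # OpenRouter exposes an OpenAI-compatible API, so the standard client
    # works once the base URL is swapped.
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    def ask(model: str, problem: str) -> str:
        """Query one model; assumes the prompt elicits a bare expression."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": problem}],
        )
        return resp.choices[0].message.content.strip()

    def canonical(answer: str) -> sp.Expr:
        """Normalize answers so '2', '4/2', and '1 + 1' all compare equal."""
        return sp.simplify(sp.sympify(answer))

    # Usage sketch with an illustrative model ID:
    # print(canonical(ask("deepseek/deepseek-chat",
    #                     "Solve 2*x + 3 = 7. Reply with the value of x only.")))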


Section 06

Practical Significance and Application Scenarios

The research has important value in multiple fields:

  • Education: Helps screen AI learning tools for reliability so that students are not misled;
  • Code Generation: Guides developers in designing prompts and verification mechanisms;
  • Scientific Research: Provides a reliability assessment tool for AI-assisted data analysis.

Section 07

Summary and Future Outlook

This project shows that LLM mathematical reasoning suffers from systematic vulnerabilities, but also that intervention strategies like symbolic scaffolding work. Future directions include refining the hallucination taxonomy, exploring further intervention methods, and extending the evaluation to mathematical domains such as geometry and probability. The framework helps practitioners map model capability boundaries and make informed technical choices.