Section 01
[Introduction] LLM Math Hallucination Evaluator: Uncovering Surface Dependency Issues in Algebraic Reasoning
This article introduces the llm-math-hallucination-evaluator project, a perturbation-testing framework that systematically detects "surface dependency" hallucinations in large language models' mathematical reasoning: cases where a model latches onto superficial features of a problem (variable names, number formats, phrasing) rather than the underlying algebra. In the study's evaluation, models achieved an overall accuracy of only 57.6% under perturbation, revealing a significant systematic vulnerability, and intervention strategies such as symbolic scaffolding were verified to measurably improve model reliability.
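To make the perturbation idea concrete, here is a minimal sketch of what a surface-perturbation test could look like. This is illustrative only: the `query_model` placeholder and the variable-renaming strategy are assumptions for the sketch, not the project's actual API or full perturbation suite.

```python
# Minimal sketch of a surface-perturbation test for algebraic reasoning.
# Hypothetical: query_model() stands in for whatever LLM call the project
# uses; the real framework may apply richer perturbations (number changes,
# rephrasing, format shifts) than the variable renaming shown here.

import random
import re


def rename_variable(text: str, old: str, new: str) -> str:
    """Swap a variable name, leaving the underlying math unchanged."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


def query_model(prompt: str) -> str:
    """Placeholder for an actual LLM call (hypothetical)."""
    raise NotImplementedError("wire this to your model of choice")


def surface_dependency_score(problem: str, answer: str, trials: int = 5) -> float:
    """Fraction of surface-perturbed variants the model still solves.

    A model that genuinely reasons should score near 1.0; a model relying
    on surface patterns will degrade when only variable names change.
    """
    candidates = ["y", "z", "p", "q", "t"]
    correct = 0
    for _ in range(trials):
        var = random.choice(candidates)
        variant = rename_variable(problem, "x", var)
        expected = rename_variable(answer, "x", var)
        if query_model(variant).strip() == expected.strip():
            correct += 1
    return correct / trials
```

Because each variant is mathematically identical to the original, any drop in accuracy isolates surface dependency from genuine reasoning failure, which is the core measurement the framework's perturbation tests are designed to capture.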