Zing Forum


Diagnosis of Formal Reasoning Capabilities of Large Language Models: Regular Language Tests Reveal 11 Systematic Failure Modes

A systematic study on GPT-5.2, Grok-4.1, Gemini-2.5, and Qwen2.5 identifies 11 systematic failure modes in symbolic reasoning of large language models through regular languages—a fully verifiable formal domain—and proposes the VGNS intervention framework.

Tags: Large Language Models · Formal Reasoning · Regular Languages · Failure Mode Analysis · Symbolic Reasoning · Model Evaluation · Fine-Tuning · Representation Engineering
Published 2026-05-10 22:42 · Recent activity 2026-05-10 22:51 · Estimated read: 6 min

Section 01

[Main Post/Introduction] Diagnosis of Formal Reasoning Capabilities of Large Language Models: Regular Language Tests Reveal 11 Failure Modes and Intervention Framework

This study systematically evaluates the symbolic reasoning capabilities of GPT-5.2, Grok-4.1, Gemini-2.5, and Qwen2.5 series models through regular languages—a fully verifiable formal domain—identifies 11 failure modes, and proposes the VGNS (Vector-Guided Neuron Selection) intervention framework. The results provide important references for evaluating the boundaries and optimizing the formal reasoning capabilities of LLMs.


Section 02

Research Background: Rationality of Regular Languages as a Test Benchmark

Large language models perform well in tasks like code generation and mathematical reasoning, but their formal reasoning boundaries are unclear. As the simplest class of formal languages in computational theory, regular languages have fully verifiable properties (whether a string belongs to a regular language can be deterministically judged), making them an ideal sandbox for testing the symbolic reasoning of LLMs. The study constructs a diagnostic benchmark of 180 questions to test the capability differences of mainstream models across different complexity levels.
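Full verifiability is what makes this benchmark attractive: membership in a regular language is decidable, so every model answer can be checked by machine. A minimal sketch of such a checker in Python (the pattern and test strings are illustrative, not items from the paper's 180-question benchmark):

```python
import re

# Membership in a regular language is decidable: re.fullmatch gives a
# deterministic yes/no, so every model answer can be machine-verified.
pattern = re.compile(r"(ab)*a?")  # alternating a/b strings starting with 'a' (or empty)

def in_language(s: str) -> bool:
    """Return True iff s belongs to the regular language."""
    return pattern.fullmatch(s) is not None

print(in_language("ababa"))  # True
print(in_language("abb"))    # False
```

Because the oracle is exact, disagreements between a model's verdict and `in_language` can be attributed to the model rather than to grading ambiguity.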


Section 03

Testing Method: Design of a Four-Tier Progressive Difficulty Framework

The study designs a four-tier testing framework corresponding to different cognitive complexities:

  • Tier 1: Basic regular-expression understanding (character classes, quantifiers, and their combinations)
  • Tier 2: Constructive tasks (converting natural-language descriptions to regular expressions/finite automata)
  • Tier 3: Equivalence verification and conversion (judging regular-expression equivalence; converting between representation forms)
  • Tier 4: Full subset construction (NFA to DFA, requiring the model to track the power-set state space)
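The Tier 4 task is the textbook subset construction, where each DFA state is a set of NFA states. A compact Python version (the example NFA is a toy chosen here for illustration, not one of the paper's test items):

```python
from collections import deque

def nfa_to_dfa(alphabet, delta, start, accepts):
    """Subset construction: each DFA state is a frozenset of NFA states.
    delta maps (nfa_state, symbol) -> set of successor NFA states."""
    start_set = frozenset([start])
    dfa_delta, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # Union of moves from every NFA state in S on symbol a
            T = frozenset(q for s in S for q in delta.get((s, a), set()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    dfa_accepts = {S for S in seen if S & accepts}
    return seen, dfa_delta, start_set, dfa_accepts

# NFA over {a, b} accepting strings whose second-to-last symbol is 'a'
delta = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"},
    ("q1", "b"): {"q2"},
}
states, dfa_delta, start, accepts = nfa_to_dfa("ab", delta, "q0", {"q2"})
print(len(states))  # 4 reachable DFA states for this NFA
```

Tracking this frontier of state sets, exactly what the `queue` does above, is the bookkeeping that Tier 4 demands from the model.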

Section 04

Core Evidence: Classification of 11 Systematic Failure Modes

The study identifies 11 failure modes in three categories:

  • Constructive tasks: anchor hallucination, nullability neglect, atomic-unit blindness, scope and nesting confusion
  • Derivation process: pseudo-structure hallucination, simple-path bias, complexity avoidance
  • Verification phase: trace forgery, greedy parsing failure, index and position drift, description-operation misalignment
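To make one of these concrete: a greedy parsing failure occurs when a trace treats a greedy quantifier as if it stopped at the earliest possible match. Python's `re` module makes the actual greedy semantics explicit (the strings here are illustrative, not the paper's test items):

```python
import re

# Greedy .* consumes as much as possible; lazy .*? stops as early as possible.
# A model that traces the greedy pattern as if it were lazy exhibits the
# "greedy parsing failure" mode described above.
greedy = re.search(r"<.*>", "<a><b>")
lazy = re.search(r"<.*?>", "<a><b>")
print(greedy.group())  # '<a><b>'
print(lazy.group())    # '<a>'
```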


Section 05

Fine-Tuning Intervention Results: Comparison Between Chain-of-Thought and Non-Chain-of-Thought

Fine-tuning experiments on Qwen2.5 models show:

  • Under CoT settings, the 7B model achieves 100% accuracy on Tiers 1-3 and 96.5% overall, but only 82.9% on Tier 4;
  • No-CoT training performs better on Tier 4: the 14B model reaches 97.7% Tier 4 accuracy and 98.0% overall. This challenges the intuition that chain-of-thought always helps complex reasoning; the authors speculate that direct input-output mapping is more effective for tasks with fixed algorithmic steps.

Section 06

VGNS Intervention Framework: An Attempt to Improve Complex Reasoning

To address the challenges of Tier 4 tasks, the VGNS framework is proposed: by analyzing internal activation differences between successful and failed cases, it identifies "good neurons" and amplifies their contribution through activation patching during inference. After 4 iterations, Tier 4 accuracy rises from 85.3% to 87.7%, better than other intervention methods but still a limited gain, suggesting that the deeper limitations stem from the architecture or the training data.
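The post does not spell out the VGNS procedure, so the following is only a minimal sketch of the core idea under stated assumptions: rank neurons by the gap between their mean activations on successful versus failed cases, then amplify the top-ranked ones at inference time. All function names and the toy data are hypothetical, not from the paper:

```python
def select_good_neurons(success_acts, failure_acts, top_k=2):
    """Rank neurons by the gap between mean activation on successful
    runs and mean activation on failed runs; return the top_k indices.
    success_acts / failure_acts: lists of activation vectors (lists of floats)."""
    n = len(success_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    gaps = [(mean(success_acts, j) - mean(failure_acts, j), j) for j in range(n)]
    return [j for _, j in sorted(gaps, reverse=True)[:top_k]]

def patch(activations, good_neurons, boost=2.0):
    """Activation patching: scale the selected neurons, leave the rest alone."""
    return [a * boost if j in good_neurons else a
            for j, a in enumerate(activations)]

# Toy data: neuron 1 fires consistently higher on successes than failures
success = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]
failure = [[0.1, 0.1, 0.3], [0.2, 0.2, 0.2]]
good = select_good_neurons(success, failure, top_k=1)
print(good)                          # [1]
print(patch([0.0, 0.5, 0.0], good))  # [0.0, 1.0, 0.0]
```

In a real model the patching step would be implemented with a forward hook on the relevant layer rather than on plain lists; this sketch only shows the selection criterion and the intervention shape.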


Section 07

Research Conclusions: Implications for the Boundaries of LLM Formal Reasoning

Research implications:

  1. Value of fully verifiable domain testing: regular-language tasks have clear right/wrong standards, making failure analysis more objective;
  2. Nonlinearity between scale and reasoning ability: larger models perform better on Tiers 1-3, but the Tier 4 bottleneck is largely independent of scale;
  3. Flaws persist even in simple formal domains: application risks in safety-critical systems warrant caution.

Section 08

Future Directions and Open-Source Contributions

The research team has open-sourced experimental code, datasets, training configurations, etc. Future directions include: exploring the Tier4 performance of larger models, developing training data for subset construction tasks, studying the impact of multi-modal inputs, and extending the framework to context-free languages and other more complex formal languages.