# Diagnosis of Formal Reasoning Capabilities of Large Language Models: Regular Language Tests Reveal 11 Systematic Failure Modes

> A systematic study on GPT-5.2, Grok-4.1, Gemini-2.5, and Qwen2.5 identifies 11 systematic failure modes in symbolic reasoning of large language models through regular languages—a fully verifiable formal domain—and proposes the VGNS intervention framework.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-10T14:42:34.000Z
- Last activity: 2026-05-10T14:51:26.946Z
- Heat score: 159.8
- Keywords: large language models, formal reasoning, regular languages, failure-mode analysis, symbolic reasoning, model evaluation, fine-tuning, representation engineering
- Page link: https://www.zingnex.cn/en/forum/thread/11
- Canonical: https://www.zingnex.cn/forum/thread/11
- Markdown source: floors_fallback

---

## [Main Post/Introduction] Diagnosis of Formal Reasoning Capabilities of Large Language Models: Regular Language Tests Reveal 11 Failure Modes and Intervention Framework

This study systematically evaluates the symbolic reasoning capabilities of GPT-5.2, Grok-4.1, Gemini-2.5, and Qwen2.5 series models through regular languages, a fully verifiable formal domain; identifies 11 systematic failure modes; and proposes the VGNS (Vector-Guided Neuron Selection) intervention framework. The results offer a concrete reference for mapping the boundaries of LLM formal reasoning and for optimizing it.

## Research Background: Why Regular Languages Are a Sound Test Benchmark

Large language models perform well on tasks such as code generation and mathematical reasoning, but the boundaries of their formal reasoning remain unclear. Regular languages, the simplest class of formal languages in the theory of computation, are fully verifiable: whether a string belongs to a given regular language can be decided deterministically. This makes them an ideal sandbox for probing the symbolic reasoning of LLMs. The study constructs a 180-question diagnostic benchmark to compare mainstream models across complexity levels.
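The "fully verifiable" property can be sketched in a few lines: membership in a regular language is decided by simulating its DFA, so every benchmark answer can be checked mechanically. The DFA below (our illustrative example, not from the paper's benchmark) accepts binary strings with an even number of 1s.

```python
# Transition table for a 2-state DFA over {0, 1}:
# the state records the parity of 1s seen so far.
TRANSITIONS = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def accepts(s: str, start: str = "even", accepting=("even",)) -> bool:
    """Deterministically decide membership by simulating the DFA."""
    state = start
    for ch in s:
        state = TRANSITIONS[(state, ch)]
    return state in accepting

print(accepts("1011"))  # False: three 1s
print(accepts("1001"))  # True: two 1s
```

Because the checker is this simple, grading a model's yes/no answer requires no human judgment, which is exactly what makes the failure analysis objective.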

## Testing Method: Design of a Four-Tier Progressive Difficulty Framework

The study designs a four-tier testing framework corresponding to different cognitive complexities:
- Tier1: Basic regular expression understanding (combinations of character classes, quantifiers, etc.)
- Tier2: Constructive tasks (converting natural language to regular expressions/finite automata)
- Tier3: Equivalence verification and conversion (judging regular expression equivalence, converting between different representation forms)
- Tier4: Full subset construction (NFA to DFA, requiring tracking of the power set state space)
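To make the Tier4 difficulty concrete, here is a minimal sketch of subset construction (the example NFA and all names are ours, not the paper's; epsilon transitions are omitted for brevity). Each DFA state is a set of NFA states, so a model solving this by hand must track a power-set state space.

```python
from collections import deque

def subset_construction(nfa, start, alphabet):
    """nfa: dict mapping (state, symbol) -> set of next NFA states."""
    start_set = frozenset([start])
    dfa = {}                      # (frozenset, symbol) -> frozenset
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for sym in alphabet:
            nxt = frozenset(s for q in current for s in nfa.get((q, sym), ()))
            dfa[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dfa, seen

# NFA over {a, b} accepting strings whose second-to-last symbol is 'a'
nfa = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
       ("q1", "a"): {"q2"}, ("q1", "b"): {"q2"}}
dfa, states = subset_construction(nfa, "q0", "ab")
print(len(states))  # → 4 reachable DFA states
```

Even this 3-state NFA yields 4 reachable subset states; tracking such sets step by step is precisely where the paper reports Tier4 breakdowns.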

## Core Evidence: Classification of 11 Systematic Failure Modes

The study identifies 11 failure modes, divided into three categories:
**Constructive tasks**: Anchor hallucination, nullability neglect, atomic unit blindness, scope and nesting confusion;
**Derivation process**: Pseudo-structure hallucination, simple path bias, complexity avoidance;
**Verification phase**: Trace forgery, greedy parsing failure, index and position drift, description-operation misalignment.
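As an illustration of one verification-phase mode, "greedy parsing failure" (our example, not taken from the paper's test items): models often predict the shorter lazy match where the regex engine actually matches greedily.

```python
import re

text = '"first" and "second"'
greedy = re.search(r'".*"', text).group()   # greedy .* runs to the LAST quote
lazy = re.search(r'".*?"', text).group()    # lazy .*? stops at the first closing quote
print(greedy)  # "first" and "second"
print(lazy)    # "first"
```

A model that reports `"first"` as the match for the greedy pattern is forging a plausible but wrong trace of the engine's behavior.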

## Fine-Tuning Intervention Results: Comparison Between Chain-of-Thought and Non-Chain-of-Thought

Fine-tuning experiments on Qwen2.5 models show:
- Under CoT settings, the 7B model achieves 100% accuracy in Tier1-3 with an overall accuracy of 96.5%, but only 82.9% in Tier4;
- No-CoT training performs better in Tier4: the 14B model reaches 97.7% Tier4 accuracy and 98.0% overall. This challenges the intuition that chain-of-thought always helps complex reasoning; the authors speculate that a direct input-output mapping is more effective for tasks with fixed algorithmic steps.

## VGNS Intervention Framework: An Attempt to Improve Complex Reasoning

To address the difficulty of Tier4 tasks, the study proposes the VGNS framework: analyze internal activation differences between successful and failed cases, identify "good neurons", and amplify their contribution via activation patching during inference. After 4 iterations, Tier4 accuracy rises from 85.3% to 87.7%. This beats the other intervention methods tested, but the gain is modest, suggesting that the deeper limitation stems from architecture or training data.
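The core VGNS loop, as we read it from this summary, can be sketched with NumPy. Everything here is an illustrative assumption (tensor shapes, the mean-gap ranking, the multiplicative boost), not the paper's implementation.

```python
import numpy as np

# Synthetic stand-ins for cached hidden activations: (n_cases, n_neurons).
rng = np.random.default_rng(0)
success_acts = rng.normal(0.5, 1.0, size=(64, 128))
failure_acts = rng.normal(0.0, 1.0, size=(64, 128))

def select_good_neurons(succ, fail, k=8):
    """Rank neurons by mean success-minus-failure activation gap; keep top k."""
    gap = succ.mean(axis=0) - fail.mean(axis=0)
    return np.argsort(gap)[-k:]

def patch(activations, neurons, alpha=1.5):
    """Activation patching: scale the selected neurons during the forward pass."""
    patched = activations.copy()
    patched[..., neurons] *= alpha
    return patched

good = select_good_neurons(success_acts, failure_acts)
patched = patch(failure_acts, good)
```

In a real model the `patch` step would run inside a forward hook; iterating selection and patching a few times mirrors the "4 iterations" reported above.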

## Research Conclusions: Implications for the Boundaries of LLM Formal Reasoning

Research implications:
1. Value of fully verifiable domain testing: Regular language tasks have clear right/wrong standards, making failure analysis more objective;
2. Nonlinearity between scale and reasoning ability: Larger models perform better in Tier1-3, but the Tier4 bottleneck has little relation to scale;
3. Flaws persist even in simple formal domains: deployment in safety-critical systems therefore warrants caution.

## Future Directions and Open-Source Contributions

The research team has open-sourced experimental code, datasets, training configurations, etc. Future directions include: exploring the Tier4 performance of larger models, developing training data for subset construction tasks, studying the impact of multi-modal inputs, and extending the framework to context-free languages and other more complex formal languages.
