Why do large language models "zone out" in multi-step reasoning? — A diagnostic study on program execution faithfulness

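Tags: large language models, program execution, reasoning ability, multi-step reasoning, AI evaluation, faithfulness, benchmarking, machine learning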

Published 2026-05-02 01:55 | Recent activity 2026-05-05 02:19 | Estimated read 13 min

Section 01

[Introduction] The "zoning out" phenomenon in multi-step reasoning of large language models — A diagnostic study on program execution faithfulness

A new study reveals hidden flaws in how LLMs execute programs step by step: even when the final answer is correct, models often fail to follow the specified instruction flow, and accuracy drops sharply as the number of steps grows. Using a purpose-built diagnostic benchmark, the study analyzes the models' failure modes and argues that current LLMs have systematic bottlenecks in long-range program execution, which is a warning for high-risk applications.

Section 02

Research Background: Hidden Risks in the Process Behind Correct Answers

Large language models (LLMs) perform well on a range of reasoning benchmarks, from mathematical problem-solving to code generation, and appear to demonstrate strong logical reasoning. However, a fundamental question has long been overlooked: when a model gives a correct answer, has it actually executed the program by following the steps we specified? A recent diagnostic study points out that final-answer accuracy does not reflect how faithfully the model executes instructions. In other words, the model may reach the correct answer through "shortcuts" or "guessing" rather than by strictly following the prescribed reasoning path. This is a warning for applications that require precise program execution, such as scientific computing, financial analysis, and automated decision-making.

Section 03

Diagnostic Method: Building a Stress Test Benchmark for Program Execution

To systematically evaluate LLMs' program execution capabilities, the research team designed an elaborate controlled diagnostic benchmark. The core setting of this benchmark is: provide the model with a step-by-step arithmetic algorithm and two numerical inputs, and require the model to return the final calculated value. The design of this benchmark has several key features:

Combination of simple operations and complex structures: The algorithm uses only basic arithmetic operations (addition, subtraction, multiplication, division), but complexity is increased through two mechanisms: longer algorithms and "look-back dependencies" between intermediate variables. A look-back dependency means that a later step may need to reference intermediate results from several earlier steps, which simulates variable reuse in real programming.

Finely controlled difficulty gradient: The test covers algorithm lengths from 5 steps to 95 steps, forming a clear difficulty progression. This allows researchers to accurately measure how model performance decays as complexity increases.

Extensive validation across multiple models and datasets: The study covers 14 different language models and 55 dataset variants to ensure the generality of the conclusions.
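To make the benchmark setup concrete, here is a minimal sketch of how such a program could be generated and executed to obtain a reference trace. It is not the study's generator. The step format `(op, left, right)`, the `max_lookback` parameter, the `LIMIT` bound on intermediate values, and the rule that division is only used when it is exact are all assumptions made for illustration.

```python
import random

LIMIT = 10**6  # assumed bound that keeps intermediate values small


def apply_op(op, x, y):
    """Return the integer result of x op y, or None if it is not allowed."""
    if op == "+":
        result = x + y
    elif op == "-":
        result = x - y
    elif op == "*":
        result = x * y
    else:  # "/" is only allowed when the division is exact
        if y == 0 or x % y != 0:
            return None
        result = x // y
    return result if abs(result) <= LIMIT else None


def make_program(a, b, n_steps, max_lookback, seed=0):
    """Return (program, values) for inputs a and b (both within LIMIT).

    values[0] and values[1] hold the inputs; values[k + 2] holds the result
    of step k. Each step is (op, left, right), where left and right are
    indices into values.
    """
    rng = random.Random(seed)
    values = [a, b]
    program = []
    for _ in range(n_steps):
        newest = len(values) - 1
        oldest = max(0, newest - max_lookback)
        right = rng.randint(oldest, newest)  # look-back dependency
        ops = list("+-*/")
        rng.shuffle(ops)
        # Either "+" or "-" always stays within LIMIT, so one op is accepted.
        for op in ops:
            result = apply_op(op, values[newest], values[right])
            if result is not None:
                break
        program.append((op, newest, right))
        values.append(result)
    return program, values


def execute(program, a, b):
    """Re-run a program from scratch to obtain the reference trace."""
    values = [a, b]
    for op, left, right in program:
        result = apply_op(op, values[left], values[right])
        if result is None:
            raise ValueError(f"step not executable: v{left} {op} v{right}")
        values.append(result)
    return values


def render(program):
    """Format the steps as plain text, e.g. 'Step 1: v2 = v1 + v0'."""
    return "\n".join(
        f"Step {k + 1}: v{k + 2} = v{left} {op} v{right}"
        for k, (op, left, right) in enumerate(program)
    )


program, values = make_program(3, 4, n_steps=5, max_lookback=3, seed=1)
print(render(program))
print("final value:", values[-1])
```

Raising `n_steps` from 5 to 95 and widening `max_lookback` would correspond to the two difficulty mechanisms described above.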

Section 04

Key Findings: Faithfulness Drops Sharply as Steps Increase

The research results reveal a worrying trend: As the number of program steps increases, the model's execution accuracy drops sharply.

Quantified Performance Decay

The data show that average first-answer accuracy drops from 61% for 5-step programs to 20% for 95-step programs. This nearly linear decay indicates that current LLMs have systematic bottlenecks in handling long-range program execution. Note that "accuracy" here measures whether the model strictly follows the given steps, not whether the final value is correct.
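If the decay between the two reported endpoints is taken to be roughly linear, the average loss per added step can be worked out directly. The short calculation below uses only the two figures quoted above. The function name is hypothetical, and no intermediate data points are implied.

```python
def average_drop_per_step(acc_start, acc_end, steps_start, steps_end):
    """Average accuracy loss, in percentage points, per additional step."""
    return (acc_start - acc_end) / (steps_end - steps_start)


# Endpoints quoted in the text: 61% at 5 steps, 20% at 95 steps.
drop = average_drop_per_step(61, 20, 5, 95)
print(f"{drop:.2f} percentage points per added step")  # prints 0.46 ...
```

That is a loss of 41 percentage points spread over 90 additional steps.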

Five Typical Execution Failure Modes

Through fine-grained analysis at the generation level, researchers identified five types of typical errors in the model's execution process:

1. Missing Answers: The model skips the output of certain steps during execution, leading to a broken reasoning chain.

2. Premature Answers: The model gives an answer before completing all steps, showing an "impatient" execution tendency.

3. Self-correction after Initial Error: The model tries to correct after an initial mistake, but this correction often disrupts the original flow of the program, leading to confusion in subsequent steps.

4. Under-executed Traces: The model claims to have completed certain steps, but actually does not perform the corresponding calculation operations.

5. Hallucinated Extra Steps: The model adds non-existent steps on its own, deviating from the given algorithm.

These failure modes collectively point to a core problem: LLMs lack stable "execution discipline" when executing long-range programs, and their internal generation dynamics easily pull them away from the intended path.
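The five failure modes above can be approximated with simple structural checks on a parsed model output. The sketch below is an illustration and not the study's annotation procedure. The event format, the label names, and the proxies (more than one answer counts as self-correction, a step reported without a value counts as under-executed) are assumptions. The labels are not mutually exclusive, and step values are not checked for correctness.

```python
def diagnose(events, n_steps):
    """Return the set of failure labels for one parsed model output.

    events is a list of 3-tuples in output order (hypothetical format):
      ("step", k, value)     the model reports step k; value is None when
                             the step is claimed but no result is shown
      ("answer", None, value) the model states a final answer
    """
    expected = set(range(1, n_steps + 1))
    labels = set()
    seen = set()
    answers = 0
    for kind, step, value in events:
        if kind == "answer":
            # An answer given before every step has appeared is premature.
            if answers == 0 and seen != expected:
                labels.add("premature_answer")
            answers += 1
        elif step not in expected:
            labels.add("hallucinated_extra_step")
        else:
            seen.add(step)
            if value is None:
                labels.add("under_executed_trace")
    if answers > 1:
        labels.add("self_correction")
    if answers == 0 or seen != expected:
        labels.add("missing_output")
    return labels


clean = [("step", 1, 7), ("step", 2, 10), ("answer", None, 10)]
rushed = [("step", 1, 7), ("answer", None, 9),
          ("step", 2, 10), ("answer", None, 10)]
print(diagnose(clean, n_steps=2))
print(diagnose(rushed, n_steps=2))
```

A real pipeline would first need a parser that turns free-form model text into these events, which is where a strictly standardized output format helps.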

Section 05

Deep Insights: The Gap Between Apparent Reasoning and Real Execution

The most important contribution of this study is to reveal the significant gap between apparent reasoning ability and real program execution ability.

Challenges to Existing Evaluation Methods

Traditional LLM evaluation mainly focuses on the correctness of the final answer, and this "result-oriented" evaluation method may seriously overestimate the model's real ability. A model may guess the correct answer through pattern matching or statistical correlation, but never really understand or execute the required reasoning process.
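The difference between result-oriented and process-oriented scoring can be shown with a toy example. The two functions below are hypothetical metrics written for illustration, not the ones used in the study. A trace is assumed to be the list of intermediate values in step order, ending with the final value.

```python
def final_answer_correct(model_trace, reference_trace):
    """Result-oriented check: only the last value is compared."""
    return bool(model_trace) and model_trace[-1] == reference_trace[-1]


def trace_faithfulness(model_trace, reference_trace):
    """Process-oriented check: share of reference steps reproduced in place."""
    matches = sum(m == r for m, r in zip(model_trace, reference_trace))
    return matches / len(reference_trace)


reference = [7, 14, 10, 30]   # values after each of four steps
shortcut = [7, 30]            # jumps to the right answer after one step

print(final_answer_correct(shortcut, reference))  # True
print(trace_faithfulness(shortcut, reference))    # 0.25
```

The shortcut trace passes the result-oriented check while reproducing only one of the four required steps.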

This is particularly important for high-risk application scenarios. In medical diagnosis, legal analysis, or engineering calculations, the interpretability and auditability of the process are often as important as the correctness of the result. If there is no assurance that the model faithfully executes the specified program, its output becomes far less reliable.

Reflection on Model Architecture

The results also prompt reflection on the current Transformer architecture. Under autoregressive generation, the model faces a tension between "continuing generation" and "following instructions" at each step. As the generated sequence lengthens, this tension may cause the model to gradually "zone out", prioritizing local fluency over global faithfulness.

Section 06

Practical Recommendations and Future Research Directions

Recommendations for Application Developers

For developers building LLM-based application systems, this study provides several practical recommendations:

Decompose complex tasks: Split long-range programs into shorter subprograms, and ensure the correct execution of each step through explicit intermediate checkpoints.

Add execution verification: Introduce external verification mechanisms at key steps, such as code interpreters or symbolic computation engines, instead of relying entirely on the model's self-declared execution.

Design process-aware prompts: Explicitly require the model to show intermediate calculation steps in the prompt, and strictly standardize their format for subsequent parsing and verification.
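The three recommendations above can be combined in one small harness: split the program into chunks, ask for a strictly formatted trace, and check every reported value with an external interpreter before continuing. The sketch below is one possible arrangement under stated assumptions. The step format `(op, left, right)`, the `v<slot> = <value>` line format, and the `ask_model` callable (any function that maps a prompt string to reply text) are hypothetical, and the program is assumed not to divide by zero.

```python
import operator
import re
from fractions import Fraction

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
# One reply line per step, e.g. "v4 = 25" or "v5 = 25/7".
LINE = re.compile(r"^v(\d+) = (-?\d+(?:/[1-9]\d*)?)$")


def build_prompt(chunk, first_slot, values):
    """Process-aware prompt: list known values and fix the reply format."""
    known = ", ".join(f"v{i} = {v}" for i, v in enumerate(values))
    steps = "\n".join(
        f"v{first_slot + k} = v{left} {op} v{right}"
        for k, (op, left, right) in enumerate(chunk)
    )
    return (
        f"Known values: {known}\n"
        f"Execute these steps in order:\n{steps}\n"
        "Reply with exactly one line per step, formatted 'v<slot> = <value>', "
        "using exact fractions such as 7/2."
    )


def parse_reply(text):
    """Return {slot: Fraction} for well-formed lines; ignore everything else."""
    parsed = {}
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match:
            parsed[int(match.group(1))] = Fraction(match.group(2))
    return parsed


def run_with_checkpoints(program, a, b, ask_model, chunk_size=5):
    """Return (True, values) on success or (False, failed_slot) otherwise."""
    values = [Fraction(a), Fraction(b)]
    for start in range(0, len(program), chunk_size):
        chunk = program[start:start + chunk_size]
        first_slot = len(values)
        reply = ask_model(build_prompt(chunk, first_slot, values))
        reported = parse_reply(reply)
        for k, (op, left, right) in enumerate(chunk):
            # External verification: recompute the step outside the model.
            expected = OPS[op](values[left], values[right])
            if reported.get(first_slot + k) != expected:
                return False, first_slot + k
            values.append(expected)
    return True, values


# Usage with a canned stand-in for a model (hypothetical replies).
program = [("+", 1, 0), ("*", 2, 1), ("-", 3, 0), ("/", 4, 2)]
replies = iter(["v2 = 7\nv3 = 28", "v4 = 25\nv5 = 25/7"])
ok, result = run_with_checkpoints(program, 3, 4, lambda p: next(replies),
                                  chunk_size=2)
print(ok, result[-1])
```

Carrying forward the interpreter's values rather than the model's own claims means that an error in one chunk cannot silently propagate into the next.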

Implications for Model Research

From a research perspective, this work opens up several directions worth exploring in depth:

Neural mechanisms of program execution: Through intervention experiments, explore the specific circuits the model activates when executing programs, and understand how the mechanisms behind "faithful execution" and "taking shortcuts" differ.

Bias analysis of training data: Investigate the proportion and quality of program code and natural language in pre-trained corpora, and analyze whether this affects the model's execution discipline.

Possibilities for architecture improvement: Explore ways to enhance the model's ability to follow long-range structured instructions while maintaining the advantages of autoregressive generation, such as introducing explicit execution stacks or memory mechanisms.

Section 07

Conclusion: A Calm Reflection on LLM Reliability

This diagnostic study reveals the real limitations of current large language models in program execution in a calm and precise way. It reminds us that while we are amazed by the "intelligent" appearance of LLMs, we should not ignore their basic reliability issues as computing systems.

For practitioners and researchers who aim to apply LLMs to serious production environments, this work is a timely warning: before deploying these powerful models, we need to deeply understand their failure modes and build corresponding safety guarantee mechanisms. After all, an intelligent assistant that occasionally "zones out" may bring far greater risks in critical tasks than a system with limited capabilities but predictable behavior.