The results reveal a worrying trend: as the number of program steps increases, the models' execution accuracy drops sharply.
Quantified Performance Decay
The data show that average first-answer accuracy falls from 61% on 5-step programs to 20% on 95-step programs, a loss of 41 percentage points. The decay is nearly linear, which suggests that current LLMs face a systematic bottleneck in long-range program execution rather than an occasional lapse. Note that "accuracy" here measures whether the model strictly follows the given steps, not whether the final value is correct.
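As a rough back-of-the-envelope check, the two reported endpoints can be turned into an average decay rate. The sketch below assumes the decay is exactly linear between them, which the text only describes as "nearly" true. The function names are illustrative, and the interpolated value is not a reported result.

```python
# Endpoints reported in the text: (program length in steps, first-answer accuracy in %).
SHORT = (5, 61.0)
LONG = (95, 20.0)

def slope_per_step(p0, p1):
    """Average change in accuracy (percentage points) per additional step."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0) / (x1 - x0)

def interpolate(steps, p0=SHORT, p1=LONG):
    """Accuracy at `steps` IF the decay were exactly linear (an assumption, not data)."""
    return p0[1] + slope_per_step(p0, p1) * (steps - p0[0])

print(round(slope_per_step(SHORT, LONG), 3))  # -0.456
print(round(interpolate(50), 1))              # 40.5
```

Under that assumption, each additional step costs a little under half a percentage point of accuracy on average.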
Five Typical Execution Failure Modes
Through fine-grained analysis at the generation level, the researchers identified five recurring error types in the models' execution process:
1. Missing Answers: The model skips the output of certain steps during execution, leading to a broken reasoning chain.
2. Premature Answers: The model gives an answer before completing all steps, showing an "impatient" execution tendency.
3. Self-correction after Initial Error: The model attempts to correct itself after an initial mistake, but the correction often disrupts the original flow of the program and causes confusion in subsequent steps.
4. Under-executed Traces: The model claims to have completed certain steps without actually performing the corresponding computations.
5. Hallucinated Extra Steps: The model adds non-existent steps on its own, deviating from the given algorithm.
These failure modes collectively point to one core problem: LLMs lack stable "execution discipline" on long-range programs, and their own generation dynamics easily pull them off the intended path.
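To make the five failure modes concrete, here is a toy checker that flags them in a simplified trace. This is a sketch under stated assumptions, not the researchers' analysis method: the `Event` format, the `diagnose` function, and the detection rules (for example, treating a repeated step as a self-correction, or a step with no value as under-executed) are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Event:
    """One item in a model's output trace (hypothetical format)."""
    kind: str                    # "step" or "answer"
    step: Optional[int] = None   # 1-based step index for "step" events
    value: Optional[int] = None  # reported result; None means no result was shown

def diagnose(trace: List[Event], num_steps: int) -> Set[str]:
    """Return the failure-mode labels triggered by a trace (toy rules)."""
    flags: Set[str] = set()
    seen: Set[int] = set()
    expected = set(range(1, num_steps + 1))
    for ev in trace:
        if ev.kind == "answer":
            # An answer given before every expected step has appeared is premature.
            if not expected <= seen:
                flags.add("premature_answer")
            continue
        if ev.step not in expected:
            # The step does not exist in the given program.
            flags.add("hallucinated_extra_step")
            continue
        if ev.step in seen:
            # Revisiting a step is treated here as a self-correction attempt.
            flags.add("self_correction")
        if ev.value is None:
            # The step is mentioned, but no computed result is shown.
            flags.add("under_executed")
        seen.add(ev.step)
    if expected - seen:
        # At least one expected step never produced any output.
        flags.add("missing_answer")
    return flags

clean = [Event("step", 1, 3), Event("step", 2, 5), Event("answer", value=5)]
messy = [Event("step", 1, 3), Event("step", 1, 4), Event("step", 2, None),
         Event("step", 4, 9), Event("answer", value=9)]
print(sorted(diagnose(clean, num_steps=2)))  # []
print(sorted(diagnose(messy, num_steps=3)))
```

The second trace trips all five flags at once, which illustrates that the modes are not mutually exclusive: a single generation can skip a step, invent another, and answer early.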