Section 01
[Introduction] Procedural Execution Deficits in Large Language Models: Accuracy Plummets as Step Count Grows
Using controlled diagnostic benchmarks, this study finds that large language models have a significant weakness in procedural execution tasks: as the number of steps grows from 5 to 95, average first-answer accuracy falls from 61% to 20%, a drop of 41 percentage points. Failure modes include missing answers and premature termination, among others. The results point to a gap in execution faithfulness behind the models' apparent reasoning ability.
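To make the headline metric concrete, here is a minimal sketch of how first-answer accuracy could be tallied per step count. It is an illustration under stated assumptions, not the study's actual evaluation code: the function name `first_answer_accuracy`, the record layout `(steps, first_answer, expected)`, and the toy records are all hypothetical, and a missing answer is assumed to count as incorrect.

```python
from collections import defaultdict


def first_answer_accuracy(records):
    """Return {step_count: fraction of records whose first answer is correct}.

    Each record is a tuple (steps, first_answer, expected).
    Assumption: a first_answer of None models a missing answer and is
    scored as incorrect.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for steps, first_answer, expected in records:
        totals[steps] += 1
        if first_answer is not None and first_answer == expected:
            correct[steps] += 1
    return {s: correct[s] / totals[s] for s in sorted(totals)}


# Toy records invented purely for illustration (NOT data from the study).
toy_records = [
    (5, 12, 12),
    (5, 7, 7),
    (5, 3, 4),       # wrong final value
    (5, 9, 9),
    (95, None, 40),  # missing answer
    (95, 18, 40),    # e.g. stopped early and reported an intermediate value
    (95, 40, 40),
    (95, 2, 40),
]

accuracy_by_steps = first_answer_accuracy(toy_records)
print(accuracy_by_steps)
```

Grouping by step count rather than pooling all records is what exposes a trend like the one reported above; a single pooled accuracy would hide how performance changes as procedures get longer.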