Section 01
Benchmarking the Probabilistic Reasoning of Large Language Models: Strong Performance on Standard Problems, Core Flaws Exposed by Counterintuitive Ones
This article summarizes a systematic benchmark study of the probabilistic reasoning abilities of large language models. The research team evaluated 8 mainstream models and found that they reached 96% accuracy on standard discrete probability problems but only 59% on counterintuitive ones. The study also reports two significant effects on model performance: token bias (performance fell by more than 20% when words were replaced with semantically equivalent alternatives) and misleading prompts (performance fell by 34%). The original authors are Luca Avena, Gianmarco Bet, and Bernardo Busoni; the source is arXiv (published 2026-06-05, link: https://arxiv.org/abs/2606.07515).
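A drop in accuracy can be read either as a difference in percentage points or as a change relative to the baseline, and the two readings give different numbers. As a minimal sketch, the following computes both for the 96% and 59% accuracies quoted above. The helper name `accuracy_drop` is hypothetical and not taken from the study, and which reading the study uses for its token-bias and misleading-prompt figures is not stated here.

```python
from typing import Tuple


def accuracy_drop(baseline: float, perturbed: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration; inputs are accuracies in percent.
    """
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    absolute = baseline - perturbed          # difference in percentage points
    relative = 100.0 * absolute / baseline   # drop as a share of the baseline
    return absolute, relative


# Accuracies quoted in the summary: standard vs. counterintuitive problems.
abs_drop, rel_drop = accuracy_drop(96.0, 59.0)
print(f"absolute drop: {abs_drop:.1f} points, relative drop: {rel_drop:.1f}%")
# prints "absolute drop: 37.0 points, relative drop: 38.5%"
```

Under these figures the gap between standard and counterintuitive problems is 37 percentage points, or roughly 38.5% of the baseline accuracy.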