Section 01
[Introduction] Key Points of the Benchmark Test on LLM Probabilistic Reasoning Capabilities
This study benchmarks the discrete probabilistic reasoning capabilities of large language models (LLMs). The results show that LLMs reach an average accuracy of 96% on regular probability problems, but fall sharply to 59% on counterintuitive problems. The models are also extremely sensitive to prompt wording: changes in wording alone lead to a performance drop of over 20%. Chain-of-thought (CoT) prompting brings only limited improvement on counterintuitive problems. The source is the paper "How reliable are LLMs when it comes to playing dice?", posted to arXiv on June 5, 2026 (link: http://arxiv.org/abs/2606.07515v1). It cautions that LLMs should be used carefully in high-risk decision-making fields.
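To make the distinction between regular and counterintuitive problems concrete, here is a minimal sketch of how an exact reference answer for a discrete dice problem can be computed by enumeration. The two problems and the helper name `probability` are illustrative assumptions and are not taken from the paper; the paper's actual problem set and scoring procedure may differ.

```python
from fractions import Fraction
from itertools import product


def probability(event, given=lambda outcome: True, dice=2, sides=6):
    """Exact P(event | given) by enumerating all equally likely rolls.

    Hypothetical helper for illustration, not the paper's evaluation code.
    """
    # Keep only the rolls that satisfy the conditioning event.
    outcomes = [o for o in product(range(1, sides + 1), repeat=dice) if given(o)]
    if not outcomes:
        raise ValueError("conditioning event has probability zero")
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))


# A "regular" problem: two fair dice sum to 7.
p_sum_seven = probability(lambda o: sum(o) == 7)

# A problem many people find counterintuitive: given that at least one
# die shows a six, what is the chance that both do? (It is not 1/6.)
p_both_six = probability(lambda o: o == (6, 6), given=lambda o: 6 in o)

print(p_sum_seven)  # 1/6
print(p_both_six)   # 1/11
```

Using `Fraction` keeps the reference answers exact, so a model's answer can be compared against them without floating-point tolerance issues.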