Reasoning Benchmark: A Lightweight Evaluation Dataset for Exposing Reasoning Flaws in Large Language Models
Abstract: Reasoning Benchmark is a 100-question evaluation dataset designed to expose reasoning flaws that large language models exhibit in seemingly simple scenarios. Its questions cover several dimensions, including goal anchoring, world state tracking, and social pragmatic reasoning.
The dataset is community-maintained. It aims to fill a gap in current large language model evaluation, where simple everyday reasoning problems receive too little coverage, and to help practitioners quickly identify a model's reasoning blind spots.
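To make the dataset's shape concrete, here is a minimal sketch of how an entry and a simple exact-match scoring pass might look. The field names (`question`, `answer`, `category`) and the example item are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical sketch; field names and categories are assumptions,
# not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str   # the reasoning question posed to the model
    answer: str     # reference answer used for scoring
    category: str   # e.g. "goal_anchoring", "world_state_tracking"

def score(items, predictions):
    """Exact-match accuracy (case-insensitive) over benchmark items."""
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if pred.strip().lower() == item.answer.strip().lower()
    )
    return correct / len(items)

items = [
    BenchmarkItem(
        question="I put my keys in my coat pocket, then hung the coat in the closet. Where are the keys?",
        answer="in the coat pocket",
        category="world_state_tracking",
    ),
]
print(score(items, ["In the coat pocket"]))  # 1.0
```

Real evaluations of such a benchmark typically go beyond exact match (e.g. judging free-form answers), but this shows the basic item-and-score structure a 100-question dataset of this kind implies.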