Section 01
[Introduction] FALSIFYBENCH: A New Framework for Testing Large Models' Scientific Reasoning Ability
FALSIFYBENCH is an evaluation framework inspired by the classic Wason 2-4-6 task, designed to test hypothesis-driven reasoning in large language models (LLMs). Key findings: reasoning-optimized models outperform instruction-tuned models; models that actively try to falsify their own hypotheses succeed more often; yet all models still fall well short of optimal performance. The framework offers a new perspective on evaluating the scientific reasoning ability of LLMs.
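For readers unfamiliar with the Wason 2-4-6 task, the following is a minimal sketch of why seeking falsification matters. It is not FALSIFYBENCH's implementation: the function names, the hidden rule ("any strictly ascending triple", as in Wason's original experiment), and the example probes are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

Triple = Tuple[int, int, int]
Rule = Callable[[Triple], bool]


def hidden_rule(t: Triple) -> bool:
    """Assumed hidden rule from the classic task: strictly ascending numbers."""
    a, b, c = t
    return a < b < c


def hypothesis(t: Triple) -> bool:
    """A typical first guess after seeing 2-4-6: each number is 2 more than the last."""
    a, b, c = t
    return b - a == 2 and c - b == 2


def is_informative(t: Triple, rule: Rule, hyp: Rule) -> bool:
    """A probe can refute the hypothesis only if the rule and the hypothesis disagree on it."""
    return rule(t) != hyp(t)


# Probes chosen to fit the hypothesis (confirmation-seeking).
confirming_probes: List[Triple] = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]

# Probes chosen to violate the hypothesis (falsification-seeking).
falsifying_probes: List[Triple] = [(1, 2, 3), (5, 10, 20), (6, 4, 2)]

for name, probes in [("confirming", confirming_probes),
                     ("falsifying", falsifying_probes)]:
    informative = [p for p in probes if is_informative(p, hidden_rule, hypothesis)]
    print(f"{name}: informative probes = {informative}")
```

In this sketch, probes that merely fit the guessed rule are all accepted by the hidden rule too, so they can never expose the guess as too narrow. Probes that violate the guess can: when the hidden rule still accepts one of them, the guess is refuted.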