Section 01
Introduction: KCSAT-ML, a New Reasoning Model Evaluation Benchmark Based on Real Human Difficulty Signals
KCSAT-ML is a benchmark for evaluating reasoning models, built from ten years of mathematics questions from the Korean College Scholastic Ability Test (KCSAT). It makes three main contributions. First, it introduces real human difficulty signals: official per-question error rates drawn from the data of hundreds of thousands of examinees. Second, it proposes the DRG metric, which reveals where a model's sense of difficulty diverges from human difficulty perception. Third, it reports key findings, such as the double-edged effect of test-time scaling, that offer a new perspective on evaluating mathematical reasoning models.