Zing Forum

KCSAT-ML: A Reasoning Model Evaluation Benchmark Based on Real Human Difficulty Signals

A new benchmark built from ten years of Korean College Scholastic Ability Test (KCSAT) math questions. It introduces the DRG metric to reveal how well model errors align with human difficulty, and finds that test-time scaling has a double-edged effect.

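Mathematical reasoning, Benchmark, Korean college entrance exam, Difficulty alignment, Test-time scaling, DRG metric, Human-AI alignment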
Published 2026-06-09 12:25 | Recent activity 2026-06-10 09:21 | Estimated read 7 min

Section 01

Introduction: KCSAT-ML—A New Reasoning Model Evaluation Benchmark Based on Real Human Difficulty Signals

KCSAT-ML is a benchmark for evaluating reasoning models, built from ten years of math questions from the Korean College Scholastic Ability Test (KCSAT). It makes three contributions: it brings in real human difficulty signals (official per-question error rates derived from hundreds of thousands of examinees); it proposes the DRG metric, which shows how closely a model's errors track human difficulty; and it reports findings such as the double-edged effect of test-time scaling. Together these offer a new perspective on evaluating mathematical reasoning models.

Section 02

Background: Core Dilemmas of Existing Mathematical Reasoning Benchmarks

Existing mathematical reasoning benchmarks generally lack per-question difficulty signals grounded in real human performance; most rely on heuristic estimates or assume uniform difficulty. This has three consequences: accuracy can mislead (models with the same accuracy can differ widely in the types of errors they make); difficulty is invisible (there is no way to separate a model's errors on questions that are easy for humans from its errors on questions that are hard for them); and ability is evaluated one-sidedly (human-model difficulty alignment is ignored).

Section 03

Methodology: Construction of KCSAT-ML Benchmark and Design of DRG Metric

KCSAT-ML Benchmark

The benchmark covers 664 math questions from the 2014-2025 KCSAT exams. A core subset of 339 questions carries official per-question error rates (drawn from millions of examinee samples in total). Because difficulty is measured from real examinee performance, the subset spans the full difficulty spectrum and avoids subjective bias.

DRG Metric

Difficulty-Aligned Reasoning Gain (DRG) measures the overlap between a model's errors and the questions humans find difficult. A high DRG means the model's errors are concentrated on human-difficult questions, so its difficulty profile is aligned with human perception; a low DRG means the opposite, with errors falling on questions humans find easy. The metric thereby exposes model differences that accuracy alone cannot capture.
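
This summary does not give the DRG formula, so the sketch below is only an illustrative proxy for the idea of difficulty alignment, not the benchmark's definition. It assumes each question has a human error rate in [0, 1] and a boolean for whether the model answered correctly. The function name `alignment_gap` and the sample data are hypothetical.

```python
from statistics import mean


def alignment_gap(human_error_rates, model_correct):
    """Illustrative proxy, NOT the official DRG formula.

    Returns the mean human error rate over the questions the model missed
    minus the mean human error rate over all questions.
    Positive: model errors lean toward human-hard questions.
    Negative: model errors lean toward human-easy questions.
    """
    if len(human_error_rates) != len(model_correct):
        raise ValueError("inputs must have the same length")
    missed = [h for h, ok in zip(human_error_rates, model_correct) if not ok]
    if not missed:
        return 0.0  # no errors, so there is nothing to align
    return mean(missed) - mean(human_error_rates)


# Hypothetical data: six questions, two models with identical accuracy (4/6).
human = [0.10, 0.20, 0.30, 0.60, 0.80, 0.90]
model_a = [True, True, True, True, False, False]  # misses the two hardest
model_b = [False, False, True, True, True, True]  # misses the two easiest

gap_a = alignment_gap(human, model_a)
gap_b = alignment_gap(human, model_b)
print(f"model A gap: {gap_a:+.3f}")
print(f"model B gap: {gap_b:+.3f}")
```

Both hypothetical models score 4 out of 6, yet the proxy is positive for model A and negative for model B, which is the kind of difference an accuracy figure alone would hide.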

Section 04

Key Findings: Three Important Patterns in Model Performance

  1. Low-budget accuracy collapses on the hard tail: Under low computational budgets, accuracy drops sharply on the questions that are hardest for humans, and simply increasing scale does not solve these hard questions.
  2. Double-edged effect of test-time scaling: Token usage increases linearly with human error rates, but accuracy gains are non-monotonic. Within the same model family, anti-scaling (more computation leading to worse performance) occurs on hard questions, while overthinking occurs on easy questions.
  3. DRG reveals hidden differences: Models with similar accuracy can have vastly different DRG values. Some models struggle with hard questions as humans do, while others fail on easy questions, contrary to human performance.
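
One way to look for the patterns above is to bin questions by human error rate and compare accuracy and token usage per bin across compute budgets. The following is a minimal sketch under stated assumptions: the record layout `(human_error_rate, correct, tokens)`, the bin edges, and every number in the sample data are hypothetical and chosen only to illustrate the shape of the analysis, not taken from the benchmark.

```python
def accuracy_by_difficulty(records, edges=(0.0, 0.3, 0.6, 1.0)):
    """Group (human_error_rate, correct, tokens) records into difficulty bins.

    Returns one dict per non-empty bin with the question count, accuracy,
    and mean token usage. The bin edges are an arbitrary illustrative choice.
    """
    rows = []
    last_index = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        # Bins are half-open [lo, hi); the last bin also includes hi itself.
        in_bin = [
            r for r in records
            if lo <= r[0] < hi or (i == last_index and r[0] == hi)
        ]
        if not in_bin:
            continue
        rows.append({
            "range": (lo, hi),
            "n": len(in_bin),
            "accuracy": sum(1 for r in in_bin if r[1]) / len(in_bin),
            "mean_tokens": sum(r[2] for r in in_bin) / len(in_bin),
        })
    return rows


# Made-up runs of one hypothetical model at two compute budgets.
low_budget = [(0.1, True, 200), (0.2, True, 250), (0.4, True, 400),
              (0.5, False, 450), (0.8, True, 700), (0.9, False, 800)]
high_budget = [(0.1, True, 900), (0.2, True, 1000), (0.4, True, 1500),
               (0.5, True, 1600), (0.8, False, 3000), (0.9, False, 3500)]

low = accuracy_by_difficulty(low_budget)
high = accuracy_by_difficulty(high_budget)
for low_row, high_row in zip(low, high):
    delta = high_row["accuracy"] - low_row["accuracy"]
    print(low_row["range"], f"accuracy change: {delta:+.2f}",
          f"tokens: {low_row['mean_tokens']:.0f} -> {high_row['mean_tokens']:.0f}")
```

In this made-up data the easy bin spends more tokens for no accuracy gain (overthinking), the middle bin improves, and the hardest bin gets worse with the larger budget (anti-scaling). A per-bin view like this makes non-monotonic gains visible where a single overall accuracy would average them away.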

Section 05

Technical Implementation: OCR Processing and Support for Visual Language Model Evaluation

  • OCR Processing: Converts the math questions into text, so that text-only LLMs can take part in an evaluation that would otherwise require visual input.
  • VLM Evaluation: Natively supports visual language models, directly processing questions containing charts and geometric figures, expanding the benchmark's scope of application.

Section 06

Research Implications: Recommendations for AI Reasoning Development

  1. Diversification of evaluation metrics: Evaluation should include difficulty alignment metrics grounded in human performance, focusing on how errors are distributed rather than only on how many there are.
  2. Optimization of test-time scaling: Computational budgets should be adjusted dynamically, to avoid overthinking on easy questions and to find effective reasoning paths for hard ones.
  3. New dimension of human-model alignment: Difficulty perception alignment deserves emphasis; an ideal model's errors should follow a difficulty distribution similar to that of humans.
  4. Open-source contribution: Releasing code and dataset tools as open source promotes community research and model optimization.
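
The dynamic-budget recommendation can be sketched as a simple mapping from a difficulty estimate to a token cap. This is an assumption-laden illustration, not a method from the source: the function `token_budget`, the linear mapping, and the default limits are all hypothetical, and where a difficulty estimate would come from is left open.

```python
def token_budget(predicted_difficulty, min_tokens=256, max_tokens=4096):
    """Map a difficulty estimate in [0, 1] to a reasoning-token cap.

    Hypothetical linear rule: easy questions get a small cap to limit
    overthinking, harder questions get a larger one. Out-of-range
    estimates are clamped to [0, 1].
    """
    if min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")
    d = min(1.0, max(0.0, predicted_difficulty))
    return round(min_tokens + d * (max_tokens - min_tokens))


for difficulty in (0.0, 0.5, 1.0):
    print(difficulty, token_budget(difficulty))
```

A rule like this only addresses the easy-question side. Given the anti-scaling finding, a larger cap on hard questions is not guaranteed to help, so the upper limit would need to be tuned against per-difficulty accuracy rather than assumed.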

Section 07

Conclusion: Value of KCSAT-ML and Future Directions

KCSAT-ML addresses a gap in existing benchmarks by combining real human difficulty signals with the DRG metric. Its findings help clarify what models can actually do and how reasoning strategies might be optimized. As reasoning models are increasingly applied in education, scientific research, and other fields, improving how well they track human difficulty is likely to become a key research direction.