# KCSAT-ML: A Reasoning Model Evaluation Benchmark Based on Real Human Difficulty Signals

> A new benchmark built from ten years of Korean College Scholastic Ability Test (KCSAT) math questions. It introduces the DRG metric to expose how well model errors align with human difficulty, and it finds that test-time scaling is a double-edged sword.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-09T04:25:44.000Z
- Last activity: 2026-06-10T01:21:48.081Z
- Popularity: 128.1
- Keywords: mathematical reasoning, benchmarking, Korean college entrance exam, difficulty alignment, test-time scaling, DRG metric, human-AI alignment
- Page link: https://www.zingnex.cn/en/forum/thread/kcsat-ml
- Canonical: https://www.zingnex.cn/forum/thread/kcsat-ml
- Markdown source: floors_fallback

---

## Introduction: KCSAT-ML, a New Reasoning Model Evaluation Benchmark Based on Real Human Difficulty Signals

KCSAT-ML is a reasoning model evaluation benchmark built from ten years of math questions from the Korean College Scholastic Ability Test (KCSAT). It makes three contributions. First, it grounds difficulty in real human signals: official per-question error rates derived from hundreds of thousands of examinees. Second, it proposes the DRG metric, which measures how closely a model's errors track the questions humans find hard. Third, it reports several findings about model behavior, most notably the double-edged effect of test-time scaling. Together these give a new perspective for evaluating mathematical reasoning models.

## Background: Core Dilemmas of Existing Mathematical Reasoning Benchmarks

Existing mathematical reasoning benchmarks generally lack per-question difficulty signals based on real human performance. Most rely on heuristic estimates or assume that all questions are equally difficult. This causes three problems:

- **Misleading accuracy metrics**: models with the same accuracy can fail on very different kinds of questions.
- **No difficulty awareness**: it is impossible to tell whether a model's errors fall on questions that humans find easy or on questions that humans find hard.
- **One-sided ability evaluation**: the alignment between human difficulty and model difficulty is ignored.

## Methodology: Construction of KCSAT-ML Benchmark and Design of DRG Metric

### KCSAT-ML Benchmark

The benchmark covers 664 math questions from the 2014-2025 KCSAT. A core subset of 339 questions carries official per-question error rates, aggregated from millions of examinee responses in total. Because difficulty comes from observed human performance, the benchmark spans the full difficulty range and avoids the bias of subjective difficulty labels.

### DRG Metric

Difficulty-Aligned Reasoning Gain (DRG) measures the overlap between a model's errors and the questions humans find difficult. A high DRG indicates that the model's errors are concentrated on human-difficult questions, so the model is aligned with human difficulty perception. A low DRG indicates that the errors fall elsewhere, including on questions humans find easy. DRG therefore exposes differences between models that accuracy alone cannot capture.
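
The exact DRG formula is not given here, so the following is only an illustrative stand-in for the idea of difficulty alignment, not the benchmark's definition. It assumes each question has a human error rate in [0, 1] and a boolean model outcome, and it compares the mean human error rate of the questions the model missed against the questions it solved. The function name `difficulty_alignment_gap` is hypothetical.

```python
from statistics import mean


def difficulty_alignment_gap(human_error_rates, model_correct):
    """Illustrative alignment score (NOT the official DRG formula).

    Returns mean human error rate on questions the model missed minus the
    mean human error rate on questions the model solved. A positive value
    means the model tends to fail where humans also fail.
    """
    if len(human_error_rates) != len(model_correct):
        raise ValueError("inputs must have the same length")
    missed = [r for r, ok in zip(human_error_rates, model_correct) if not ok]
    solved = [r for r, ok in zip(human_error_rates, model_correct) if ok]
    if not missed or not solved:
        # The gap is undefined when the model is all-right or all-wrong;
        # this sketch reports 0.0 in that case.
        return 0.0
    return mean(missed) - mean(solved)


# Two hypothetical models with identical accuracy (50%) on four questions.
rates = [0.1, 0.2, 0.8, 0.9]
human_like = [True, True, False, False]   # misses the human-hard questions
unlike_human = [False, False, True, True]  # misses the human-easy questions

print(difficulty_alignment_gap(rates, human_like))    # positive gap
print(difficulty_alignment_gap(rates, unlike_human))  # negative gap
```

Both toy models score the same accuracy, yet the sign of the gap separates them, which is the kind of hidden difference a difficulty-alignment metric is meant to surface.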

## Key Findings: Three Important Patterns in Model Performance

1. **Low-cost accuracy collapses on the hard tail**: Under low computational budgets, model performance drops sharply on the questions that are hardest for humans. Simply increasing scale does not solve these questions.
2. **Double-edged effect of test-time scaling**: Token usage increases linearly with the human error rate, but the accuracy gains are non-monotonic. Within the same model family, anti-scaling (more computation leading to lower performance) occurs on hard questions, while overthinking occurs on easy questions.
3. **DRG reveals hidden differences**: Models with similar accuracy can have very different DRG values. Some models struggle with hard questions as humans do, while others fail on questions humans find easy.
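
The patterns above come from looking at accuracy and token usage per difficulty band rather than in aggregate. A minimal sketch of that kind of breakdown follows. The record layout `(human_error_rate, model_correct, tokens_used)`, the function name, and the band edges 0.3 and 0.6 are all assumptions chosen for illustration, not values from the benchmark.

```python
def accuracy_by_difficulty(records, edges=(0.3, 0.6)):
    """Group results into difficulty bands by human error rate.

    records: iterable of (human_error_rate, model_correct, tokens_used).
    edges:   ascending band boundaries (hypothetical defaults).
    Returns one summary dict per band, or None for an empty band.
    """
    buckets = [[] for _ in range(len(edges) + 1)]
    for rate, correct, tokens in records:
        # Band index = number of boundaries the error rate reaches.
        idx = sum(rate >= e for e in edges)
        buckets[idx].append((correct, tokens))

    summary = []
    for items in buckets:
        if not items:
            summary.append(None)
            continue
        n = len(items)
        summary.append({
            "n": n,
            "accuracy": sum(c for c, _ in items) / n,
            "mean_tokens": sum(t for _, t in items) / n,
        })
    return summary


# Made-up results for one model, for demonstration only.
demo = [
    (0.1, True, 100), (0.2, True, 300),     # easy for humans
    (0.5, True, 800), (0.5, False, 1200),   # medium
    (0.9, False, 4000),                     # hard
]
for band, stats in zip(("easy", "medium", "hard"), accuracy_by_difficulty(demo)):
    print(band, stats)
```

Running the same breakdown at several compute budgets would make non-monotonic accuracy gains visible band by band, instead of being averaged away in a single accuracy figure.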

## Technical Implementation: OCR Processing and Support for Vision-Language Model Evaluation

- **OCR Processing**: Converts math questions into text so that text-only LLMs can be evaluated on the same questions.
- **VLM Evaluation**: Natively supports vision-language models (VLMs), which process questions containing charts and geometric figures directly. This broadens the range of models the benchmark can evaluate.

## Research Implications: Recommendations for AI Reasoning Development

1. **Diversify evaluation metrics**: Introduce difficulty alignment metrics grounded in human performance, and examine how errors are distributed across difficulty levels rather than only counting them.
2. **Optimize test-time scaling**: Adjust the computational budget dynamically to avoid overthinking on easy questions and to find effective reasoning paths on hard questions.
3. **Treat difficulty as a dimension of human-model alignment**: Emphasize alignment in difficulty perception. An ideal model should make its errors on a difficulty distribution similar to that of humans.
4. **Contribute to open source**: Release the code and dataset tools to support community research and model optimization.

## Conclusion: Value of KCSAT-ML and Future Directions

KCSAT-ML addresses a gap in existing benchmarks by combining real human difficulty signals with the DRG metric. Its findings help clarify what models can actually do and how reasoning strategies should be tuned. As reasoning models are increasingly applied in education, scientific research, and other fields, improving their alignment with human difficulty perception is likely to become a key research direction.