Zing Forum

The "Step Confusion" Trap in Large Model Reasoning Data Selection: How to Identify and Correct Systematic Biases in Data Quality Evaluation

Recent research has found that naturalness-based data selection is systematically biased when applied to large model reasoning data: it tends to select samples with longer reasoning steps rather than samples of higher quality. The researchers proposed two corrections, ASLEC-DROP and ASLEC-CASL, which significantly improve the accuracy of reasoning data screening by removing the interference caused by the low probability of each step's first token.

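Large Language Models, Reasoning Data, Data Selection, Step-Length Confusion, Supervised Fine-Tuning, Naturalness Evaluation, Causal Inference, ASLEC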
Published 2026-04-08 16:51 | Recent activity 2026-04-09 09:18 | Estimated read 5 min

Section 01

[Introduction] The Step Confusion Trap in Large Model Reasoning Data Selection and Its Correction Methods

This thread first explains why naturalness-based selection favors reasoning samples with longer steps over samples of higher quality, then covers the two proposed corrections, ASLEC-DROP and ASLEC-CASL, followed by the experimental results and practical takeaways, one section at a time.

Section 02

Background: Hidden Concerns in Reasoning Data Screening and Problems with Naturalness Methods

Large reasoning models (LRMs) rely on high-quality reasoning datasets for supervised fine-tuning (SFT). When building these datasets, practitioners commonly screen samples automatically by naturalness, ranking each sample by the average log probability a model assigns to its tokens. On reasoning data, however, this score carries a hidden bias.

Section 03

The Step Confusion Trap: Causes and Mathematical Explanation

Step confusion phenomenon: naturalness scoring prefers samples with longer reasoning steps over samples of higher quality. The cause is that the first token of a reasoning step (e.g., "Firstly") tends to have a low probability, while the tokens that follow it are comparatively predictable. For a step of N tokens the score is the average log probability, (1/N) × Σ log p(token_i), so the first token's low log probability contributes only a 1/N share. The larger N is, the more that penalty is diluted by the high-probability tokens after it, and the higher the score a verbose sample receives.
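The dilution effect can be checked with a few lines of arithmetic. The sketch below is only an illustration: the log probabilities (-5.0 for the first token, -0.5 for every other token) are made-up values, and `mean_log_prob` is a hypothetical helper, not something taken from the research.

```python
def mean_log_prob(first_token_lp, other_token_lp, n_tokens):
    """Average log probability of one reasoning step.

    The step has a single low-probability first token followed by
    n_tokens - 1 tokens that all share the same log probability.
    """
    total = first_token_lp + (n_tokens - 1) * other_token_lp
    return total / n_tokens


# Same first-token penalty and same per-token quality; only the step length differs.
short_step = mean_log_prob(-5.0, -0.5, n_tokens=5)
long_step = mean_log_prob(-5.0, -0.5, n_tokens=20)

# The longer step scores higher purely because the penalty is spread over more tokens.
print(f"short step: {short_step:.3f}, long step: {long_step:.3f}")
```

Under these assumed numbers the 20-token step scores about -0.725 against -1.4 for the 5-token step, even though every token after the first is equally predictable in both.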

Section 04

Solutions: Two De-Biasing Strategies

  1. ASLEC-DROP: Exclude the first token of each reasoning step when calculating the average log probability. This removes the first token's interference directly and is simple and cheap to implement.
  2. ASLEC-CASL: Use a causal regression model to remove the confounding effect of first-token probability while retaining the useful information the first token carries, giving a finer-grained correction.
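The first strategy is easy to sketch. The code below is a minimal illustration of the drop-the-first-token idea as described above, not the researchers' implementation: the function names, the representation of a sample as a list of per-step log-probability lists, and the toy values are all assumptions. ASLEC-CASL is not sketched because the form of its regression model is not given here.

```python
from statistics import mean


def naturalness(steps):
    """Plain naturalness: average log probability over every token of every step."""
    tokens = [lp for step in steps for lp in step]
    return mean(tokens)


def naturalness_drop_first(steps):
    """DROP-style score: the same average, with each step's first token excluded."""
    tokens = [lp for step in steps for lp in step[1:]]
    # Assumption: if every step has only one token, fall back to the plain score.
    return mean(tokens) if tokens else naturalness(steps)


# Two toy samples with 12 tokens each and identical per-token log probabilities
# after the first token of a step (illustrative values only).
sample_a = [[-5.0, -0.5, -0.5]] * 4    # four short steps
sample_b = [[-5.0] + [-0.5] * 11]      # one long step

print("plain:", naturalness(sample_a), naturalness(sample_b))
print("drop first token:", naturalness_drop_first(sample_a), naturalness_drop_first(sample_b))
```

With these toy inputs the plain score ranks the single long step well above the four short steps, while the drop-first score treats the two samples as equal, which is the behavior the correction is meant to produce.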

Section 05

Experimental Validation: Robust Improvements Across Models and Benchmarks

Experiments on 4 models and 5 reasoning benchmarks show:

  • A significantly higher correlation with human quality evaluation;
  • Better performance on downstream tasks (mathematical reasoning, code generation, etc.);
  • Consistent gains across models;
  • Manageable computational overhead.

Section 06

Practical Insights: New Perspectives on Reasoning Data Engineering

  • Watch for hidden biases that evaluation metrics can carry in specific scenarios;
  • Reasoning data has a step structure, so its evaluation must account for that structure;
  • Open-source tools can be applied directly to improve data quality at no additional cost.

Section 07

Conclusion: Significance of Bias Correction and Future Outlook

Solving the step confusion problem shows how much a careful understanding of metric bias matters in data selection. The ASLEC methods could become key tools for building high-quality reasoning data and, in turn, help large models handle complex tasks.