Zing Forum

The Secret in Preference Pairs: The Quality Code of DPO/KTO Training Data

The study examines how two types of quality difference in preference optimization data (generator-level and sample-level) affect reasoning performance, and proposes a two-pronged strategy: maximize generator-level differences, then filter for data with high sample-level differences.

Tags: preference optimization, DPO, KTO, data quality, reasoning ability, large model alignment, LLM training
Published 2026-04-10 03:28 | Recent activity 2026-04-13 10:22 | Estimated read 6 min

Section 01

The Secret in Preference Pairs: The Quality Code of DPO/KTO Training Data (Introduction)

This study examines the quality characteristics of training data for preference optimization methods such as DPO and KTO. It finds that two kinds of difference strongly affect a model's reasoning performance: generator-level differences (the capability gap between the models that generate the two responses) and sample-level differences (the quality gap between the two responses in a single pair). Based on this, it proposes a two-pronged strategy for building high-quality preference datasets: maximize generator-level differences, then filter for data with high sample-level differences.

Section 02

Preference Optimization: Core Technology and Unsolved Problems for Large Model Alignment

Preference optimization is the mainstream approach to aligning large models. Compared with RLHF, DPO and KTO are simpler to implement, more stable to train, and more efficient. They train models on preference data, typically pairs made up of a chosen (higher-quality) response and a rejected (lower-quality) response. A long-neglected question remains: which characteristics of preference data drive improvements on reasoning tasks? In other words, what kind of preference pairs help models learn to reason better?
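
To make concrete what "training through preference pairs" means, here is a minimal sketch of the standard per-pair DPO loss. It is not taken from the study; the function name, argument names, and the default `beta` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of each response under the policy being trained and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(margin)).

    beta=0.1 is an assumed default for illustration, not a value from the study.
    """
    # Implicit rewards: how far the policy has moved from the reference
    # on each response, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


# The loss drops below log(2) once the policy favors the chosen response
# more strongly than the reference model does.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss depends only on the contrast between the two responses, which is why the properties of the pair itself matter so much for what the model learns.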

Section 03

Dual Perspectives on Quality Differences and Experimental Design

The study proposes a two-dimensional analysis framework:

1. Generator-level difference: the capability gap between the models that generate the two responses (e.g., differences in scale or model family).
2. Sample-level difference: the quality gap between the two responses in a single pair, regardless of which models generated them.

Experimental design: generator-level differences are manipulated by changing model scale and family; sample-level differences are evaluated with LLM-as-a-Judge along multiple dimensions, including reasoning quality, expression quality, and factual accuracy.
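
The two dimensions can be pictured as metadata attached to each preference pair. The sketch below is one possible representation, not the study's implementation: the `PreferencePair` fields, the dimension keys, the score scale, and the weighted-average aggregation are all assumptions made for illustration. Judge scores are assumed to have been produced already by an LLM-as-a-Judge step.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical dimension keys mirroring the three judged dimensions.
DIMENSIONS = ("reasoning", "expression", "factuality")


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    chosen_generator: str    # generator-level metadata (model name)
    rejected_generator: str
    chosen_scores: Dict[str, float]    # judge score per dimension
    rejected_scores: Dict[str, float]


def sample_level_gap(pair: PreferencePair,
                     weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-dimension score gaps (chosen minus rejected).

    The aggregation rule is an assumption; other rules are equally plausible.
    """
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    gap = sum(weights[d] * (pair.chosen_scores[d] - pair.rejected_scores[d])
              for d in DIMENSIONS)
    return gap / total


pair = PreferencePair(
    prompt="What is 17 * 24?",
    chosen="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
    rejected="17 * 24 is about 400",
    chosen_generator="large-model",    # placeholder names
    rejected_generator="small-model",
    chosen_scores={"reasoning": 9, "expression": 8, "factuality": 9},
    rejected_scores={"reasoning": 4, "expression": 6, "factuality": 5},
)
print(sample_level_gap(pair))
```

Keeping generator names alongside judge scores lets the two kinds of difference be analyzed separately, as the framework requires.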

Section 04

Core Findings of the Two-Pronged Strategy

1. Generator-level differences improve generalization: preference pairs generated by models with a large capability gap help models learn more general and robust reasoning patterns.
2. Sample-level differences improve efficiency: filtering for preference pairs with high sample-level differences achieves the same performance with less data.
3. Synergistic effect: maximizing generator-level differences in the generation phase and retaining pairs with high sample-level differences in the filtering phase yields better results than either step alone.
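
The filtering step in the findings above can be sketched in two common forms: an absolute threshold and a keep-the-top-fraction rule. Both helpers are hypothetical; the `gap` key, the threshold of 2.0, and the fraction of 0.5 are placeholder assumptions, not values reported by the study.

```python
import math
from typing import Dict, List


def filter_by_threshold(pairs: List[Dict], min_gap: float = 2.0,
                        gap_key: str = "gap") -> List[Dict]:
    """Keep pairs whose sample-level gap is at least min_gap."""
    return [p for p in pairs if p[gap_key] >= min_gap]


def keep_top_fraction(pairs: List[Dict], fraction: float = 0.5,
                      gap_key: str = "gap") -> List[Dict]:
    """Keep the given fraction of pairs with the largest gaps."""
    if not pairs:
        return []
    k = max(1, math.ceil(len(pairs) * fraction))
    return sorted(pairs, key=lambda p: p[gap_key], reverse=True)[:k]


pairs = [
    {"id": "a", "gap": 0.5},
    {"id": "b", "gap": 3.0},
    {"id": "c", "gap": 1.5},
    {"id": "d", "gap": 4.0},
]
print([p["id"] for p in filter_by_threshold(pairs)])  # ['b', 'd']
print([p["id"] for p in keep_top_fraction(pairs)])    # ['d', 'b']
```

A fixed threshold keeps the quality bar constant across datasets, while a top-fraction rule keeps the data budget constant; which one fits depends on whether quality or dataset size is the binding constraint.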

Section 05

Recommendations for Building High-Quality Preference Data

1. Generator selection: combine strong and weak models (the strongest model generates the chosen responses, weaker models generate the rejected ones), and use diverse combinations.
2. Data filtering: score pairs automatically with LLM-as-a-Judge, apply threshold-based filtering to keep high-difference samples, and assess quality comprehensively across multiple dimensions.
3. Iterative optimization: monitor training performance and adjust the strategy dynamically.
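
The generator-selection recommendation can be sketched as pairing every strong-model response with every weak-model response for the same prompt. This is an illustrative sketch only: `build_pairs` and the stub generators are hypothetical, and real generators would call actual models. Labeling the strong model's output as chosen is itself an assumption that will not hold for every sample, which is why the judge-based filtering step follows.

```python
from itertools import product
from typing import Callable, Dict, List

Generator = Callable[[str], str]  # hypothetical: prompt -> response text


def build_pairs(prompts: List[str],
                strong_generators: Dict[str, Generator],
                weak_generators: Dict[str, Generator]) -> List[Dict[str, str]]:
    """Pair each strong-model response (chosen) with each weak-model
    response (rejected) for the same prompt."""
    pairs = []
    for prompt in prompts:
        combos = product(strong_generators.items(), weak_generators.items())
        for (s_name, s_fn), (w_name, w_fn) in combos:
            pairs.append({
                "prompt": prompt,
                "chosen": s_fn(prompt),
                "rejected": w_fn(prompt),
                "chosen_generator": s_name,
                "rejected_generator": w_name,
            })
    return pairs


# Stub generators standing in for real model calls.
strong = {"strong-model": lambda p: f"careful answer to: {p}"}
weak = {
    "weak-a": lambda p: f"rough answer to: {p}",
    "weak-b": lambda p: f"guess for: {p}",
}
dataset = build_pairs(["q1", "q2"], strong, weak)
print(len(dataset))  # 4
```

Using several weak generators per strong one is one way to get the "diverse combinations" mentioned above while keeping the generator-level gap large.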

Section 06

New Insights into Preference Optimization Theory

1. Signal strength affects learning: larger generator-level differences mean stronger signals, similar to high-quality labels in supervised learning.
2. Preference optimization is essentially contrastive learning: larger sample-level differences mean stronger contrast signals.
3. Data engineering is central: well-designed data has a more significant impact than model or algorithm improvements, echoing the data-centric AI concept.

Section 07

Research Limitations and Future Exploration Directions

Limitations: LLM judges may be biased; the findings were verified only on reasoning tasks; the model scales tested were limited.

Future directions: adaptive data generation, multi-modal expansion, and analysis of the underlying theoretical mechanisms.

Section 08

Research Summary and Significance

This study analyzes the characteristics of preference optimization data, shows that generator-level and sample-level differences both strongly affect reasoning performance, and proposes a two-pronged strategy that exploits them. It provides clear guidance for practitioners, offers a new perspective for theoretical research, and lays a foundation for quality control of large model training data.