Section 01
The Secret in Preference Pairs: The Quality Code of DPO/KTO Training Data (Introduction)
This study examines what makes training data effective for preference optimization methods such as DPO and KTO. It identifies two properties with a key impact on the trained model's reasoning performance: generator-level differences (the capability gap between the models that generate the two responses) and sample-level differences (the quality gap between the two responses within a single pair). Based on these findings, it proposes a two-pronged strategy for building high-quality preference datasets: maximize generator-level differences, and filter for pairs with high sample-level differences.
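The filtering half of this strategy can be illustrated with a minimal sketch. It assumes each response already carries a scalar quality score (for example from a reward model or a grader); the study's actual measure of sample-level difference is not stated in this section, so the `PreferencePair` fields, the `sample_level_gap` and `filter_by_gap` helpers, and the `min_gap` threshold are all hypothetical. The generator-level half is decided earlier, when choosing which models produce the chosen and rejected responses, and is not shown here.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    chosen_score: float    # hypothetical quality score of the chosen response
    rejected_score: float  # hypothetical quality score of the rejected response


def sample_level_gap(pair: PreferencePair) -> float:
    """Quality gap within one pair (assumed here to be a simple score difference)."""
    return pair.chosen_score - pair.rejected_score


def filter_by_gap(pairs: list[PreferencePair], min_gap: float) -> list[PreferencePair]:
    """Keep only pairs whose sample-level gap is at least min_gap."""
    return [p for p in pairs if sample_level_gap(p) >= min_gap]


# Toy data: the scores are made up purely for illustration.
pairs = [
    PreferencePair("q1", "strong answer", "weak answer", 9.0, 3.0),
    PreferencePair("q2", "decent answer", "similar answer", 6.0, 5.5),
    PreferencePair("q3", "good answer", "poor answer", 8.0, 4.0),
]

kept = filter_by_gap(pairs, min_gap=2.0)
print([p.prompt for p in kept])
```

In this toy run the pair for `q2` is dropped because its two responses are nearly equal in score. A suitable threshold would have to be tuned per dataset and scoring method.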