Section 01
Introduction: Core Findings on How Large Language Models Learn to Reason Under Weak Supervision
This article examines how large language models learn to reason under weak supervision. It reports three core findings: (1) reward saturation dynamics during Reinforcement Learning with Verifiable Rewards (RLVR) training determine how well a model generalizes; (2) reasoning faithfulness is a key pre-training attribute for predicting whether learning under weak supervision will succeed; and (3) combining Supervised Fine-Tuning (SFT) with continual pre-training effectively improves reasoning generalization under weak supervision.