Section 01
Introduction: RLVR, A New Paradigm for Training Large Language Models Without Manual Annotation
RLVR (Reinforcement Learning with Verifiable Rewards) is a training paradigm that breaks large language models' traditional reliance on manual annotation. Instead of learned reward models or human judgments, it derives reward signals from objective criteria that can be checked automatically, such as whether a final answer is correct or whether generated code passes its tests, allowing the model to optimize autonomously. This sidesteps two bottlenecks of earlier approaches: Supervised Fine-Tuning (SFT), which requires large amounts of annotated data, and Reinforcement Learning from Human Feedback (RLHF), which depends on expensive human preference annotations. RLVR has shown particular promise in tasks with verifiable answers, such as mathematical reasoning and code generation.
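To make the reward mechanism concrete, below is a minimal sketch of a verifiable reward function for a math task. It assumes the common convention that the model wraps its final answer in \boxed{...}; the function names and the exact-match check are illustrative assumptions, not the API of any particular RLVR implementation.

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression from a model completion.

    Many math benchmarks ask models to wrap the final answer this way;
    the convention here is an illustrative assumption.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the
    known ground truth, else 0.0. No human judgment or learned reward
    model is involved; the check is fully automatic.
    """
    answer = extract_boxed_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Usage: score a batch of sampled completions for an RL update.
completions = [
    "The sum is 2 + 3 = 5, so the answer is \\boxed{5}.",
    "I believe the answer is \\boxed{6}.",
]
rewards = [verifiable_reward(c, ground_truth="5") for c in completions]
print(rewards)  # [1.0, 0.0]
```

In practice, RLVR systems pair a scorer like this with a policy-gradient update (for example, PPO or GRPO) over completions sampled from the model.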