Section 01
Introduction: Label-Free RLVR—A New Paradigm for LLM Training
Large language model (LLM) development faces a core tension: model capabilities keep improving, but that progress depends heavily on high-quality manual annotation. Traditional supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) both require human-labeled data, which drives up training costs and limits their use in specialized domains and low-resource languages. Label-Free RLVR (Label-Free Reinforcement Learning with Verifiable Rewards) performs reinforcement learning without manual annotations by designing reward functions that can be verified automatically. This offers a new approach to reducing costs and improving model generalization.
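To make the idea of an automatically verifiable reward concrete, here is a minimal sketch in Python. The task (integer factorization), the function name `factorization_reward`, and the expected answer format (`"7 * 13"`) are all illustrative assumptions and are not taken from any particular Label-Free RLVR system. The point is only that the reward is computed by checking the model's output against the problem itself, so no human-written reference answer is needed.

```python
import re
from math import prod


def factorization_reward(n: int, completion: str) -> float:
    """Hypothetical verifiable reward: 1.0 if `completion` is a valid
    nontrivial factorization of `n`, written like "7 * 13", else 0.0.

    No labeled answer is consulted; the check is multiplication.
    """
    text = completion.strip()
    # Assumed output format: integers separated by "*", at least two factors.
    if not re.fullmatch(r"\d+(\s*\*\s*\d+)+", text):
        return 0.0
    factors = [int(part) for part in text.split("*")]
    # Reject trivial answers such as "1 * 91".
    if any(f < 2 for f in factors):
        return 0.0
    return 1.0 if prod(factors) == n else 0.0


if __name__ == "__main__":
    for answer in ["7 * 13", "1 * 91", "7 * 12", "the answer is 7 and 13"]:
        print(repr(answer), factorization_reward(91, answer))
```

In a reinforcement learning loop, a function of this kind would score each sampled completion in place of a human preference label. Real systems would likely need a more tolerant answer parser and a reward signal richer than this binary one.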