Section 01
EGSPO-SA: A New Paradigm for Infusing Reinforcement Learning into Diffusion Language Models (Introduction)
A team at Texas A&M University proposes EGSPO-SA (Entropy-Guided Stepwise Policy Optimization with Stepwise Advantages), a framework for RL fine-tuning of diffusion language models. It addresses the core challenges of that setting by combining entropy-guided step selection with lightweight advantage estimators. The team reports significant gains on core benchmarks for code generation, logical reasoning, and mathematical reasoning, and has open-sourced the implementation code and model checkpoints.