Section 01
Introduction: The Tsallis Loss Continuum, a New Paradigm for the Cold-Start Problem in Reasoning Models
This paper proposes a family of loss functions built on the Tsallis q-logarithm that interpolates between RLVR (Reinforcement Learning from Verifiable Rewards) and density estimation. Through a gradient-amplification mechanism, it addresses the training stagnation that reasoning models suffer when initial success rates are low. The study introduces two algorithms, GARL (Gradient Amplified RL) and PAFT (Posterior Attenuated Fine-Tuning), and validates them on reasoning benchmarks such as FinQA and HotPotQA, offering a new paradigm for the post-training of reasoning models.
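The Tsallis q-logarithm behind this loss family can be sketched as follows. This is a minimal illustration of the standard definition ln_q(x) = (x^(1-q) - 1)/(1 - q) and of why varying q changes gradient magnitudes on low-probability samples; it is not the paper's exact GARL or PAFT objective, and the function name `q_log` and the example probability `p` are illustrative assumptions.

```python
import math

def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: ln_q(x) = (x**(1-q) - 1) / (1-q).
    Recovers the natural logarithm in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

# The derivative of -ln_q(p) with respect to p has magnitude p**(-q),
# so for q > 1 a low-probability (hard) sample receives a much larger
# gradient than under plain log-likelihood (q = 1) -- one generic way
# to read the "gradient amplification" described above.
for q in (0.5, 1.0, 2.0):
    p = 0.01  # a low-probability sample, e.g. a rarely-successful trace
    grad_mag = p ** (-q)  # |d/dp of -ln_q(p)|
    print(f"q={q}: ln_q(p)={q_log(p, q):.3f}, |grad|={grad_mag:.1f}")
```

Sweeping q thus traces a continuum: q = 1 gives the ordinary log-likelihood of density estimation, while moving q away from 1 reweights how strongly rare successes drive the update.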