Tsallis Loss Continuum: A New Training Paradigm to Solve the Cold Start Dilemma of Reasoning Models

This paper proposes a family of loss functions defined via the Tsallis q-logarithm, which interpolates between RLVR (Reinforcement Learning from Verifiable Rewards) and density estimation. Through a gradient amplification mechanism, it resolves the training stagnation that reasoning models suffer when initial success rates are low.

Reinforcement Learning · Reasoning Models · Cold Start · Tsallis Entropy · Post-Training · Large Language Models · Gradient Optimization
Published 2026-04-29 01:52 · Recent activity 2026-04-29 12:30 · Estimated read 7 min

Section 01

Introduction: Tsallis Loss Continuum—A New Paradigm to Solve the Cold Start of Reasoning Models

This paper proposes a family of loss functions defined via the Tsallis q-logarithm, interpolating between RLVR (Reinforcement Learning from Verifiable Rewards) and density estimation. Through a gradient amplification mechanism, it resolves the training stagnation that reasoning models suffer when initial success rates are low. The study introduces two algorithms, GARL (Gradient Amplified RL) and PAFT (Posterior Attenuated Fine-Tuning), and verifies their effectiveness on reasoning benchmarks such as FinQA and HotPotQA, providing a new paradigm for the post-training of reasoning models.


Section 02

Research Background: Cold Start Dilemma in Post-Training of Reasoning Models

Post-training must adapt modern large language models to specific reasoning tasks (e.g., mathematical problem solving, multi-hop QA), yet often only output-level supervision is available. RLVR is the mainstream approach, but when initial success rates are low it falls into a cold start: reward signals are sparse, and the model rarely receives positive feedback. Traditional remedies (SFT warm-up, reward shaping, curriculum learning) require additional annotations or introduce significant complexity.


Section 03

Theoretical Framework: Core Mechanism of Tsallis Loss Continuum

Inspired by Tsallis entropy, the study defines a family of loss functions J_q parameterized by q ∈ [0,1]: q = 0 recovers RLVR (the exploitation extreme), while q = 1 recovers density estimation (the exploration extreme). All members share the same gradient direction and differ only in the scalar amplification factor P_θ^(-q), where P_θ is the policy's success probability. This mechanism accelerates escape from cold starts: with initial success probability p_0, the expected escape time is Ω(1/p_0) at q = 0 and shortens to Θ(log(1/p_0)) at q = 1. Intermediate q values trade off escape speed against noise memorization.
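
As a concrete illustration, here is a minimal PyTorch sketch of the q-logarithm and the resulting loss family; treating P_θ as a scalar per-prompt success probability, and the names tsallis_qlog and j_q_loss, are assumptions of this summary, not the paper's implementation:

```python
import torch

def tsallis_qlog(p: torch.Tensor, q: float) -> torch.Tensor:
    # ln_q(p) = (p^(1-q) - 1) / (1 - q); the q -> 1 limit is the natural log.
    if abs(q - 1.0) < 1e-8:
        return torch.log(p)
    return (p.pow(1.0 - q) - 1.0) / (1.0 - q)

def j_q_loss(p_theta: torch.Tensor, q: float) -> torch.Tensor:
    # Negated objective -J_q = -E[ln_q(P_theta)], so minimizing maximizes J_q.
    # q = 0: E[P_theta - 1], expected reward up to a constant (RLVR extreme);
    # q = 1: E[log P_theta], log-likelihood (density-estimation extreme).
    return -tsallis_qlog(p_theta, q).mean()

# Gradient amplification: d ln_q(p) / dp = p^(-q), so hard prompts with small
# success probability receive gradients amplified by P_theta^(-q).
p = torch.tensor([1e-3], requires_grad=True)  # a hard prompt: 0.1% success
for q in (0.0, 0.5, 1.0):
    (grad,) = torch.autograd.grad(j_q_loss(p, q), p)
    print(q, grad.abs().item())  # ~1.0, ~31.6, ~1000.0: scales as p^(-q)
```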


Section 04

Training Algorithms: Implementation and Analysis of GARL and PAFT

Since P_θ is difficult to compute exactly, two Monte Carlo estimators are introduced (a hedged sketch follows the list):

  1. GARL (Gradient Amplified RL): samples trajectories from the current policy and amplifies the gradients of successful trajectories; it has low variance.
  2. PAFT (Posterior Attenuated Fine-Tuning): performs SFT after importance resampling of successful trajectories, yielding semantically coherent gradients.

Both estimators have bias O(q/(M·P_θ^(q+1))), where M is the number of sampled trajectories, so more samples are needed to control the bias as q grows.
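
A minimal sketch of a GARL-style surrogate under these definitions; the function name, the batch-level estimate of P_θ, and the 0/1 reward convention are illustrative assumptions rather than the paper's exact estimator:

```python
import torch

def garl_surrogate(logps: torch.Tensor, rewards: torch.Tensor,
                   q: float, eps: float = 1e-6) -> torch.Tensor:
    """GARL-style Monte Carlo surrogate (sketch).

    logps:   (M,) summed token log-probs of M trajectories sampled on-policy
    rewards: (M,) verifiable 0/1 rewards for those trajectories
    """
    p_hat = rewards.float().mean().clamp_min(eps)  # batch estimate of P_theta
    amp = p_hat.detach().pow(-q)                   # amplification P_theta^(-q)
    # REINFORCE-style term: only successful trajectories contribute, and
    # their gradients are scaled up when the estimated success rate is low.
    return -(amp * rewards * logps).mean()
```

Because the estimate p_hat enters nonlinearly through the power -q, plugging it into the amplification factor is exactly what produces the O(q/(M·P_θ^(q+1))) bias noted above.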

Section 05

Experimental Verification: Results in Cold Start and Warm Start Scenarios

Experiments were run on the FinQA, HotPotQA, and MuSiQue benchmarks:

  • Cold start scenario: GARL (q=0.75) successfully escapes the cold start, while GRPO (an RLVR variant) fails to;
  • Warm start scenario: GARL with low q values is best on FinQA; PAFT (q=0.75) is stable on HotPotQA/MuSiQue, with a 14.4-percentage-point improvement on HotPotQA;
  • Stability: GARL is stable on structured tasks (FinQA), while PAFT is better on open-ended reasoning tasks (HotPotQA).

Section 06

Practical Guidance: Trade-offs Between q Value and Algorithm Selection

q Value Selection:

q value        Cold start escape   Training stability   Applicable scenarios
Close to 0     Slow                High                 Warm start, stable tasks
0.5-0.75       Medium              Medium               General choice
Close to 1     Fast                Possibly low         Difficult cold start

Algorithm Selection: GARL suits structured tasks and settings that demand low variance; PAFT suits open-ended reasoning and stability-first scenarios. The framework also unifies existing methods such as RLVR and SFT warm-up.
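
As a purely illustrative sketch, the guidance table above could be operationalized as a simple q schedule; the pick_q helper and its thresholds are assumptions of this summary (the paper itself lists adaptive q adjustment only as future work):

```python
def pick_q(success_rate: float) -> float:
    # Thresholds are illustrative guesses mapping the guidance table to code.
    if success_rate < 0.05:   # difficult cold start: prioritize escape speed
        return 1.0
    if success_rate < 0.30:   # middle ground: the table's "general choice"
        return 0.75
    return 0.1                # warm start / stable tasks: prioritize stability
```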

Section 07

Research Significance and Future Directions

  • Theoretical contributions: unifies the training dynamics of reasoning models under one framework and systematically maps the exploitation-exploration trade-off;
  • Practical value: provides algorithms and hyperparameter guidance for escaping cold starts;
  • Future directions: adaptive q adjustment, integration with techniques such as PPO/DPO, deeper theoretical analysis, and extension to more tasks (code generation, theorem proving).


Section 08

Summary: Value and Application Prospects of Tsallis Loss Continuum

The Tsallis Loss Continuum framework provides a powerful tool for training reasoning models, solving the cold start problem while unifying different training strategies. GARL and PAFT perform strongly in experiments, outperforming traditional RLVR especially in cold start scenarios. This work offers new ideas to researchers and engineers and is expected to become an important component of reasoning-model post-training.