# Tsallis Loss Continuum: A New Training Paradigm to Solve the Cold Start Dilemma of Reasoning Models

> This paper proposes a family of loss functions defined via the Tsallis q-logarithm that interpolates between RLVR (Reinforcement Learning from Verifiable Rewards) and density estimation. Through a gradient-amplification mechanism, it resolves the training stagnation that reasoning models face when initial success rates are low.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-28T17:52:38.000Z
- Last activity: 2026-04-29T04:30:17.709Z
- Popularity: 147.4
- Keywords: Reinforcement Learning, Reasoning Models, Cold Start, Tsallis Entropy, Post-Training, Large Language Models, Gradient Optimization
- Page link: https://www.zingnex.cn/en/forum/thread/tsallis
- Canonical: https://www.zingnex.cn/forum/thread/tsallis
- Markdown source: floors_fallback

---

## Introduction: Tsallis Loss Continuum, a New Paradigm for the Cold Start of Reasoning Models

The paper defines a family of loss functions based on the Tsallis q-logarithm that interpolates between RLVR (Reinforcement Learning from Verifiable Rewards) and density estimation, using a gradient-amplification mechanism to overcome the training stagnation that arises when initial success rates are low. It introduces two algorithms, GARL (Gradient-Amplified RL) and PAFT (Posterior-Attenuated Fine-Tuning), and validates them on reasoning benchmarks such as FinQA and HotPotQA, offering a new paradigm for the post-training of reasoning models.

## Research Background: Cold Start Dilemma in Post-Training of Reasoning Models

Post-training of modern large language models must adapt them to specific reasoning tasks (e.g., mathematical problem solving, multi-hop QA), yet often only output-level supervision is available. RLVR is the mainstream approach, but when initial success rates are low it stalls in a cold start: reward signals are sparse, so the model rarely receives positive feedback. Traditional remedies (SFT warm-up, reward shaping, curriculum learning) either require additional annotation or add significant complexity.

## Theoretical Framework: Core Mechanism of Tsallis Loss Continuum

Inspired by Tsallis entropy, the study defines a family of loss functions J_q parameterized by q ∈ [0, 1]: q = 0 recovers RLVR (the exploitation extreme), and q = 1 recovers density estimation (the exploration extreme). All members share the same gradient direction, differing only in the scalar amplification factor P_θ^(−q). This mechanism accelerates escape from the cold start: with initial success probability p₀, the escape time is Ω(1/p₀) at q = 0 and shortens to Θ(log(1/p₀)) at q = 1. Intermediate q values trade off escape speed against memorization of noise.
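The mechanism can be sketched numerically: the Tsallis q-logarithm ln_q(x) = (x^(1−q) − 1)/(1 − q) has derivative x^(−q), which is exactly the scalar amplification factor above. A minimal illustration (function names are ours, not the paper's):

```python
import math

def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1)/(1-q); ln_1 = natural log."""
    if abs(1.0 - q) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def grad_scale(p: float, q: float) -> float:
    """Derivative d/dp ln_q(p) = p^(-q): the gradient amplification factor."""
    return p ** (-q)

# With a tiny initial success probability p0, successful trajectories
# have their gradient amplified by p0^(-q):
p0 = 1e-3
for q in (0.0, 0.5, 1.0):
    print(f"q={q}: amplification = {grad_scale(p0, q):.1f}")
# q=0 gives factor 1 (plain RLVR); q=1 gives factor ~1000 (i.e., 1/p0,
# the log-likelihood / density-estimation endpoint).
```

At q = 0, ln_q(p) = p − 1, so the loss is linear in the success probability (expected reward, as in RLVR); at q = 1 it is log p (maximum likelihood over successful trajectories).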

## Training Algorithms: Implementation and Analysis of GARL and PAFT

Since P_θ is difficult to compute exactly, two Monte Carlo estimators are introduced:
1. GARL: samples trajectories from the current policy and amplifies the gradients of successful trajectories; it has low variance;
2. PAFT: performs SFT after importance-resampling the successful trajectories, yielding semantically coherent gradients.
The bias of both estimators is O(q/(M·P_θ^(q+1))), so larger sample sizes M are needed to keep the bias controlled as q increases.
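A minimal sketch of the GARL-style weighting, assuming binary rewards and using the batch success rate as the Monte Carlo estimate of P_θ (the function name and details are our illustration, not the paper's implementation):

```python
import numpy as np

def garl_weights(rewards: np.ndarray, q: float) -> np.ndarray:
    """Per-trajectory weights for a GARL-style policy-gradient update (sketch).

    Successful trajectories (reward 1) have their REINFORCE gradient scaled
    by p_hat^(-q), where p_hat is the batch success rate, a Monte Carlo
    estimate of P_theta. Failed trajectories get weight 0."""
    p_hat = max(float(rewards.mean()), 1e-8)  # guard against all-failure batches
    return rewards * p_hat ** (-q)

rewards = np.array([0.0, 1.0, 0.0, 0.0])      # 1 success in M = 4 rollouts
print(garl_weights(rewards, 0.0))  # q=0: plain RLVR, weight 1 on the success
print(garl_weights(rewards, 1.0))  # q=1: weight 1/p_hat = 4 on the success
```

In a real training loop these weights would multiply the per-trajectory log-probability gradients; PAFT instead resamples the successful trajectories with probabilities proportional to these weights and runs standard SFT on the resampled set.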

## Experimental Verification: Results in Cold Start and Warm Start Scenarios

Experiments on FinQA, HotPotQA, and MuSiQue benchmarks:
- Cold start scenario: GARL (q=0.75) successfully escapes cold start, while GRPO (a variant of RLVR) fails;
- Warm start scenario: GARL with low q values is optimal for FinQA; PAFT (q=0.75) is stable for HotPotQA/MuSiQue, with a 14.4 percentage point improvement on HotPotQA;
- Stability: GARL is stable for structured tasks (FinQA), while PAFT is better for open reasoning tasks (HotPotQA).

## Practical Guidance: Trade-offs Between q Value and Algorithm Selection

**q Value Selection**:

| q value | Cold-start escape | Training stability | Suitable scenarios |
|---------|-------------------|--------------------|--------------------|
| Near 0 | Slow | High | Warm start, stable tasks |
| 0.5-0.75 | Medium | Medium | General-purpose choice |
| Near 1 | Fast | Possibly low | Hard cold starts |

**Algorithm Selection**: GARL suits structured tasks and low-variance requirements; PAFT suits open-ended reasoning and scenarios where stability is the priority. The framework unifies existing methods such as RLVR and SFT warm-up.
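As a concrete reading of the table above, a q-selection heuristic keyed to the initial success rate might look like the following (the thresholds are our own illustrative assumptions, not values from the paper):

```python
def choose_q(initial_success_rate: float) -> float:
    """Illustrative heuristic for picking q (thresholds are assumptions).

    Harder cold starts (tiny success rates) call for q near 1 to escape
    quickly; warm starts favor small q for stability."""
    if initial_success_rate < 0.01:
        return 0.9   # hard cold start: fast escape, monitor stability
    if initial_success_rate < 0.2:
        return 0.75  # general-purpose middle ground
    return 0.25      # warm start: favor stable training
```

The same success-rate estimate that drives the GARL/PAFT weighting can feed this choice, which also points toward the adaptive q scheduling discussed as future work.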

## Research Significance and Future Directions

**Theoretical Contributions**: a unified dynamical framework for reasoning-model training and a systematic treatment of the exploitation-exploration trade-off;
**Practical Value**: concrete algorithms and hyperparameter guidance for escaping the cold start;
**Future Directions**: adaptive q scheduling, integration with PPO, DPO, and related techniques, deeper theoretical analysis, and extension to more tasks (code generation, theorem proving).

## Summary: Value and Application Prospects of Tsallis Loss Continuum

The Tsallis Loss Continuum framework provides a powerful tool for reasoning-model training, addressing the cold start problem while unifying different training strategies under one family of losses. The GARL and PAFT algorithms perform strongly in experiments, outperforming traditional RLVR especially in cold start scenarios. This work offers new ideas for researchers and engineers and is positioned to become an important component of reasoning-model post-training.
