# DelTA: A Discriminative Token Credit Assignment Method in Reinforcement Learning with Verifiable Rewards

> DelTA is a training method for Reinforcement Learning with Verifiable Rewards (RLVR). Its discriminative token credit assignment mechanism amplifies the gradient directions of discriminative tokens and suppresses shared high-frequency patterns. Averaged over mathematical reasoning benchmarks, it improves on the strongest same-scale baseline by 3.26 percentage points with Qwen3-8B-Base and by 2.62 percentage points with Qwen3-14B-Base.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-20T17:53:09.000Z
- Last activity: 2026-05-21T03:20:19.645Z
- Popularity: 136.6
- Keywords: Reinforcement Learning, RLVR, Large Language Models, Reasoning Ability, Credit Assignment, Token-Level Optimization, Mathematical Reasoning, GRPO, Policy Gradient, Machine Learning
- Page link: https://www.zingnex.cn/en/forum/thread/delta-token
- Canonical: https://www.zingnex.cn/forum/thread/delta-token
- Markdown source: floors_fallback

---

## DelTA Method Guide: Improving Token-Level Credit Assignment Efficiency in RLVR

DelTA (Discriminative Token Credit Assignment) is a training method for Reinforcement Learning with Verifiable Rewards (RLVR). Its core idea is a discriminative token credit assignment mechanism that amplifies the gradient directions of discriminative tokens and suppresses shared high-frequency patterns. Averaged over mathematical reasoning benchmarks, Qwen3-8B-Base improves by 3.26 percentage points over the strongest baseline of the same scale, and Qwen3-14B-Base improves by 2.62 percentage points. The method targets a known weakness of traditional RLVR: averaging a response-level reward over all tokens dilutes the signal carried by the key tokens.

## The Rise and Core Challenges of RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) has become a core technique for improving the reasoning capabilities of large language models (e.g., DeepSeek-R1 and the OpenAI o-series models), with notable gains on tasks such as mathematical reasoning and code generation. However, RLVR faces a fundamental problem: the reward is given once per response, so how should it be converted into token-level probability updates? Traditional methods spread the reward of the entire response evenly across all of its tokens. This coarse-grained credit assignment can dilute the signal of the few tokens where the critical decisions are actually made.
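
The uniform credit assignment described above can be illustrated with a minimal sketch. It assumes GRPO-style group normalization (each reward is standardized against the other responses sampled for the same prompt); the function name `broadcast_group_advantages` and the toy rewards and lengths are illustrative assumptions, not taken from the DelTA work.

```python
import numpy as np

def broadcast_group_advantages(rewards, lengths, eps=1e-8):
    """Give every token of a response the same group-normalized advantage."""
    rewards = np.asarray(rewards, dtype=float)
    # Standardize each response-level reward against its sampling group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token in response i receives the identical scalar adv[i].
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Toy group: four sampled responses, two verified correct (reward 1.0).
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [3, 5, 2, 4]  # number of tokens in each response
token_adv = broadcast_group_advantages(rewards, lengths)
```

Within each response, a formatting token and a decisive reasoning token receive exactly the same weight, which is the dilution that token-level credit assignment tries to avoid.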

## Core Method Design of DelTA

DelTA re-examines the RLVR update process from the discriminator's perspective:
1. **Linear Discrimination of Token Gradient Vectors**: The policy gradient update direction acts as a linear discriminator in the space of token gradient vectors. It is built from the centroids of the positive and negative samples, so it is easily dominated by shared high-frequency patterns (e.g., format tokens);
2. **Token Coefficient Estimation**: Learn a coefficient for each token that amplifies the gradients of discriminative tokens and suppresses shared or weakly discriminative tokens;
3. **Self-Normalized RLVR Surrogate Objective**: Reweight the objective function with these coefficients to sharpen the contrast between the centroids of positive and negative samples;
4. **Margin-Coupled GRPO**: Jointly optimize rollout-based relational reasoning and continuous boundary regression to align interpretable comparison reasons with fine-grained numerical differences.
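
The centroid view in steps 1 to 3 can be sketched on toy data. This is a hedged illustration only: the 2-D "gradient vectors", the helper `centroid_direction`, and the hand-set coefficients are all assumptions. This summary does not describe how DelTA actually learns its coefficients, so they are fixed by hand here purely to show the effect of self-normalized reweighting.

```python
import numpy as np

def centroid_direction(grads, is_positive, coeffs=None):
    """Self-normalized weighted centroid difference: mu_pos - mu_neg."""
    grads = np.asarray(grads, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    if coeffs is None:
        coeffs = np.ones(len(grads))  # uniform credit, one unit per token
    coeffs = np.asarray(coeffs, dtype=float)

    def centroid(mask):
        w = coeffs[mask]
        # Dividing by the weight sum is the self-normalization step.
        return (w[:, None] * grads[mask]).sum(axis=0) / w.sum()

    return centroid(is_positive) - centroid(~is_positive)

# Axis 0: shared pattern (e.g. format tokens); axis 1: discriminative signal.
grads = [[2, 0], [2, 0], [2, 0], [0, 1],   # tokens from positive responses
         [2, 0], [0, -1]]                  # tokens from negative responses
is_positive = [True, True, True, True, False, False]

uniform = centroid_direction(grads, is_positive)  # -> [0.5, 0.75]

# Hypothetical coefficients: down-weight shared tokens, keep discriminative ones.
coeffs = [0.1, 0.1, 0.1, 1.0, 0.1, 1.0]
weighted = centroid_direction(grads, is_positive, coeffs)

shared_share_uniform = abs(uniform[0] / uniform[1])
shared_share_weighted = abs(weighted[0] / weighted[1])
```

With uniform weights the shared axis contributes a large part of the update direction; after reweighting, its share relative to the discriminative axis shrinks, which is the contrast enhancement the list above describes.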

## DelTA Experimental Results: Verification of Mathematical Reasoning and Generalization Capabilities

Evaluation results on 7 mathematical reasoning benchmarks:
- **Key Improvements**: Qwen3-8B-Base achieves an average improvement of 3.26 percentage points compared to the strongest baseline of the same scale, and Qwen3-14B-Base improves by 2.62 percentage points;
- **Generalization Capability**: The gains persist on code generation tasks, with different backbone models, and on out-of-domain tasks, which suggests that DelTA works as a general-purpose improvement to RLVR rather than a math-specific one.

## Technical Significance and Application Value of DelTA

**Technical Significance**:
- Importance of Fine-Grained Credit Assignment: Identifying the relative importance of tokens within responses improves learning efficiency, similar to how humans focus on key reasoning steps;
- Automatic Discriminative Feature Discovery: The coefficient learning mechanism automatically selects tokens that distinguish good and bad responses, reducing reliance on manual reward shaping;
- Compatibility: It can be combined with existing RLVR algorithms such as PPO and GRPO, enabling plug-and-play use.

**Application Value**:
- More Efficient Training: More precise credit assignment can reduce the number of training steps required;
- Better Interpretability: Token coefficients reveal which decision points the model focuses on;
- Reduced Hyperparameter Cost: The method is less sensitive to hyperparameters such as reward scaling.
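
As a small illustration of the interpretability point, per-token coefficients could be inspected by ranking tokens. The sketch below is hypothetical: the helper `top_tokens`, the token strings, and the coefficient values are made up for demonstration and do not come from a trained DelTA model.

```python
def top_tokens(tokens, coeffs, k=3):
    """Return the k (token, coefficient) pairs with the largest coefficients."""
    ranked = sorted(zip(tokens, coeffs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical response tokens and their coefficients.
tokens = ["Step", "1", ":", "therefore", "x", "=", "4"]
coeffs = [0.1, 0.1, 0.05, 1.4, 0.6, 0.2, 1.9]

highlights = top_tokens(tokens, coeffs, k=2)
```

In this toy case the highest-ranked tokens are the final answer and the inference word, while formatting tokens such as `Step` and `:` fall to the bottom.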

## Limitations of DelTA and Future Exploration Directions

Despite the progress made by DelTA, several directions remain open:
- **Long Sequence Optimization**: Reducing the computational cost of token-level credit assignment for extremely long responses;
- **Multi-Turn Dialogue**: Extending the method to multi-turn interaction scenarios;
- **Technical Synergy**: Studying the effect of combining DelTA with methods such as process supervision and Monte Carlo Tree Search.
