DelTA: A Discriminative Token Credit Assignment Method in Reinforcement Learning with Verifiable Rewards

DelTA proposes a new RLVR training method. Its discriminative token credit assignment mechanism amplifies the gradient direction of discriminative tokens and suppresses shared high-frequency patterns. On mathematical reasoning benchmarks, it achieves average improvements of 3.26 percentage points (Qwen3-8B-Base) and 2.62 percentage points (Qwen3-14B-Base) over the strongest baseline of the same scale.

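Tags: Reinforcement Learning, RLVR, Large Language Models, Reasoning Capability, Credit Assignment, Token-Level Optimization, Mathematical Reasoning, GRPO, Policy Gradient, Machine Learning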

Published 2026-05-21 01:53 | Recent activity 2026-05-21 11:20 | Estimated read 7 min

Section 01

DelTA Method Guide: Improving Token-Level Credit Assignment Efficiency in RLVR

DelTA (Discriminative Token Credit Assignment) is a training method for Reinforcement Learning with Verifiable Rewards (RLVR). Its core is a discriminative token credit assignment mechanism that amplifies the gradient direction of discriminative tokens and suppresses shared high-frequency patterns. On mathematical reasoning benchmarks, Qwen3-8B-Base improves by an average of 3.26 percentage points over the strongest baseline of the same scale, and Qwen3-14B-Base improves by 2.62 percentage points. The method targets a weakness of traditional RLVR: response-level reward averaging dilutes the signals of key tokens.

Section 02

The Rise and Core Challenges of RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) has become a core technology for enhancing the reasoning capabilities of large language models such as DeepSeek-R1 and the OpenAI o-series, with significant effects on tasks such as mathematical reasoning and code generation. However, RLVR faces a fundamental problem: how should a response-level reward be converted into token-level probability updates? Traditional methods spread the reward of the entire response evenly across all of its tokens, and this coarse-grained credit assignment may dilute the signals of the truly critical decision tokens.
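
To make the dilution concrete, here is a minimal NumPy sketch of coarse-grained credit assignment, assuming a GRPO-style setup in which rewards are standardized within a group of sampled responses. The function names, rewards, and response lengths are hypothetical and chosen only for illustration. Real implementations differ in normalization details.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize each response's reward within its sampled group (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def broadcast_to_tokens(advantages, lengths):
    """Assign every token in a response the same response-level advantage."""
    return [np.full(n, a) for a, n in zip(advantages, lengths)]

# Hypothetical group of 4 responses to one prompt: 1.0 = verified correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [5, 3, 4, 6]  # tokens per response (made-up values)

adv = group_normalized_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)

for a, t in zip(adv, token_adv):
    print(f"response advantage {a:+.2f} -> per-token weights {np.round(t, 2)}")
```

Every token in a response receives an identical weight, so a format token and a pivotal reasoning step are treated alike. This is the behavior that token-level credit assignment aims to refine.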

Section 03

Core Method Design of DelTA

DelTA re-examines the RLVR update process from the discriminator's perspective:

Linear Discrimination of Token Gradient Vectors: The policy gradient update direction is a linear discriminator in the token gradient vector space, constructed from the centroids of positive and negative samples, but it is easily dominated by shared high-frequency patterns (e.g., format tokens);
Token Coefficient Estimation: Learn to estimate coefficients for each token to amplify the gradients of discriminative tokens and suppress shared/weakly discriminative tokens;
Self-Normalized RLVR Alternative Objective: Reweight the objective function using coefficients to enhance the contrast between the centroids of positive and negative samples;
Margin-Coupled GRPO: Jointly optimize rollout-based relational reasoning and continuous boundary regression to align interpretable comparison reasons with fine-grained numerical differences.
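
The discriminator view and the effect of coefficient reweighting can be sketched on toy data. This is an illustration under stated assumptions, not DelTA's implementation: the 2-D "token gradients" are made up, the function names are hypothetical, and the coefficients are hand-picked, whereas DelTA learns to estimate them.

```python
import numpy as np

def weighted_centroid(grads, coeffs):
    """Self-normalized weighted mean: sum_t c_t * g_t / sum_t c_t."""
    coeffs = np.asarray(coeffs, dtype=float)
    return (coeffs[:, None] * grads).sum(axis=0) / coeffs.sum()

def direction(pos, neg, c_pos=None, c_neg=None):
    """Difference between the (optionally reweighted) positive and negative centroids."""
    c_pos = np.ones(len(pos)) if c_pos is None else c_pos
    c_neg = np.ones(len(neg)) if c_neg is None else c_neg
    return weighted_centroid(pos, c_pos) - weighted_centroid(neg, c_neg)

# Toy token gradients (assumed). Axis 0: shared format pattern. Axis 1: reasoning step.
fmt, good, bad = [4.0, 0.0], [0.0, 1.0], [0.0, -1.0]
pos = np.array([fmt] * 8 + [good] * 2)  # tokens from correct responses
neg = np.array([fmt] * 9 + [bad] * 1)   # tokens from incorrect responses

# Uniform coefficients: plain centroid difference.
w_plain = direction(pos, neg)

# Hand-picked coefficients (assumption): small for format tokens, large for key tokens.
c_pos = np.array([0.1] * 8 + [1.0] * 2)
c_neg = np.array([0.1] * 9 + [1.0] * 1)
w_reweighted = direction(pos, neg, c_pos, c_neg)

print("plain direction:     ", np.round(w_plain, 3))
print("reweighted direction:", np.round(w_reweighted, 3))
```

In this toy setup the plain direction is dominated by the shared format axis, because format tokens are frequent and occur at slightly different rates in the two groups. After reweighting, the reasoning axis carries most of the direction, which is the contrast enhancement the list above describes.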

Section 04

DelTA Experimental Results: Verification of Mathematical Reasoning and Generalization Capabilities

Evaluation results on 7 mathematical reasoning benchmarks:

Key Improvements: Qwen3-8B-Base achieves an average improvement of 3.26 percentage points compared to the strongest baseline of the same scale, and Qwen3-14B-Base improves by 2.62 percentage points;
Generalization Capability: The improvements persist on code generation tasks, with different backbone models, and on out-of-domain tasks, which suggests that DelTA works as a general RLVR improvement strategy rather than a math-specific one.

Section 05

Technical Significance and Application Value of DelTA

Technical Significance:

Importance of Fine-Grained Credit Assignment: Identifying the relative importance of tokens within responses improves learning efficiency, similar to how humans focus on key reasoning steps;
Automatic Discriminative Feature Discovery: The coefficient learning mechanism automatically selects tokens that distinguish good and bad responses, reducing reliance on manual reward shaping;
Compatibility: It can be integrated with existing RLVR algorithms such as PPO and GRPO, enabling plug-and-play use.

Application Value:

More Efficient Training: More precise credit assignment can reduce the number of training steps required;
Better Interpretability: Token coefficients reveal the decision points that the model focuses on;
Reduced Hyperparameter Cost: It reduces sensitivity to hyperparameters such as reward scaling.

Section 06

Limitations of DelTA and Future Exploration Directions

Despite the significant progress made by DelTA, further exploration is needed:

Long Sequence Optimization: Optimization of computational costs for token-level credit assignment in extremely long responses;
Multi-Turn Dialogue: Expansion to multi-turn interaction scenarios;
Technical Synergy: Effects of combining with methods such as process supervision and Monte Carlo Tree Search.
