In-depth Exploration of How RLVR Training Reshapes the Internal Representations of Large Language Models

By comparing base models, SFT models, and RLVR models using mechanistic interpretability techniques, this study reveals the internal mechanism by which reinforcement learning from verifiable rewards optimizes the reasoning capabilities of models.

Tags: RLVR · Reinforcement Learning · Mechanistic Interpretability · Large Language Models · Reasoning Ability · Transformer · DeepSeek · Mathematical Reasoning
Published 2026-05-02 18:44 · Recent activity 2026-05-02 18:49 · Estimated read 6 min

Section 01

[Main Post/Introduction] In-depth Exploration of How RLVR Training Reshapes the Internal Representations of Large Language Models

This article examines how Reinforcement Learning from Verifiable Rewards (RLVR) training affects the internal representations of Large Language Models (LLMs). By comparing base models, SFT models, and RLVR models, it adjudicates between two competing hypotheses: the "Routing Hypothesis" (RLVR merely steers retrieval of existing knowledge) and the "Representation Learning Hypothesis" (RLVR creates new reasoning features). Using mechanistic interpretability techniques to analyze internal changes in the Transformer architecture, the study aims to reveal the mechanism by which RLVR improves reasoning capabilities and to provide a theoretical basis for efficient training strategies.

Section 02

Research Background and Core Issues

Current debate around RLVR centers on two hypotheses:

  1. Routing Hypothesis: RLVR only adjusts attention circuits to steer retrieval of existing knowledge, without creating new MLP features;
  2. Representation Learning Hypothesis: RLVR consolidates entirely new logical circuits and changes how the latent layers encode information.

Distinguishing between them requires a close look at changes in the Transformer's residual stream; the answer matters both for understanding the RLVR mechanism and for guiding future training strategies.

Section 03

Three-Stage Comparative Experiment Design

The experiment constructs three model variants from the same base model:

  • Base stage: Pre-trained model without task-specific training, serving as the initial knowledge baseline;
  • Supervised Fine-Tuning (SFT) stage: Uses the NuminaMath-CoT dataset to learn mathematical reasoning patterns and formats, with high token accuracy indicating successful imitation (a minimal sketch of this stage follows the list);
  • RLVR stage: Starts from the SFT model and trains on heterogeneous mathematical datasets such as GSM8K, optimizing problem-solving trajectories via verifiable correctness rewards rather than simple imitation.
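
As a reference point, here is a minimal sketch of the SFT stage, assuming the Hugging Face TRL SFTTrainer API and the public AI-MO/NuminaMath-CoT dataset ID on the Hub; the base-model name is a placeholder, since the article does not specify one:

```python
# Hypothetical sketch of the SFT stage (not the article's actual script).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# NuminaMath-CoT: chain-of-thought solutions to competition math problems.
train_dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")

trainer = SFTTrainer(
    model="base-model-name",                # placeholder for the shared base model
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="sft-math"),  # default SFT settings otherwise
)
trainer.train()
```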

Section 04

Reward Function and Configuration for RLVR Training

RLVR training uses the GRPOTrainer from the Hugging Face TRL library, combined with DeepSpeed for distributed optimization and vLLM for accelerated generation. Core hyperparameters: learning rate 2e-6, maximum generation length 2000 tokens. The reward function is R = R_accuracy + 0.01 × R_format, where R_accuracy = +1.0 for a correct answer and R_format = +1.0 for format compliance; the low format weight ensures correctness dominates.
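
A minimal sketch of this setup, assuming TRL's GRPOTrainer interface (reward functions as callables scored per completion, combined via reward_weights); the \boxed{...} verification is a simplified stand-in for the study's verifier, and the model path is a placeholder:

```python
# Hypothetical sketch of the RLVR reward and trainer configuration above.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def accuracy_reward(completions, answer, **kwargs):
    # R_accuracy: +1.0 when the extracted \boxed{...} answer matches the
    # ground truth (simplified; GSM8K answers need extra preprocessing).
    rewards = []
    for completion, truth in zip(completions, answer):
        m = re.search(r"\\boxed\{(.+?)\}", completion)
        rewards.append(1.0 if m and m.group(1).strip() == str(truth).strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    # R_format: +1.0 when the completion contains a \boxed{...} final answer.
    return [1.0 if re.search(r"\\boxed\{.+?\}", c) else 0.0 for c in completions]

# GRPOTrainer expects a "prompt" column; GSM8K ships "question"/"answer".
gsm8k_train = load_dataset("openai/gsm8k", "main", split="train")
gsm8k_train = gsm8k_train.rename_column("question", "prompt")

config = GRPOConfig(
    output_dir="rlvr-math",
    learning_rate=2e-6,            # from the article
    max_completion_length=2000,    # maximum generation length, from the article
    use_vllm=True,                 # vLLM-accelerated rollouts
    reward_weights=[1.0, 0.01],    # R = R_accuracy + 0.01 * R_format
)
trainer = GRPOTrainer(
    model="path/to/sft-checkpoint",                 # start from the SFT model
    reward_funcs=[accuracy_reward, format_reward],
    args=config,
    train_dataset=gsm8k_train,
)
trainer.train()
```

DeepSpeed distribution would typically be layered on top via an Accelerate config rather than in this script.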

Section 05

Mechanistic Interpretability Analysis Methods

Multiple techniques are used for analysis:

  1. Component-level representation comparison: Extract hidden states, separate attention and MLP outputs, and measure cross-model similarity with Centered Kernel Alignment (CKA; see the first sketch after this list);
  2. Linear probing and causal intervention: Train classifiers to predict intermediate reasoning steps, use the Logit Lens to project intermediate states onto the vocabulary (second sketch below), and verify the causal role of key layers via activation patching;
  3. Weight distance and spectral analysis: Compute the L2 norm of weight differences, and use SVD to test whether updates are low-rank (consistent with the Routing Hypothesis) or concentrated in the MLP layers (consistent with the Representation Learning Hypothesis); see the third sketch below.
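
For the representation comparison in item 1, a minimal NumPy sketch of linear CKA (the standard Kornblith et al. formulation; the hidden-state matrices are assumed to be extracted beforehand):

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, d) activations of the same layer on the same inputs
    # from two models (e.g., SFT vs. RLVR). Returns 1.0 for representations
    # identical up to rotation/scale, near 0.0 for unrelated ones.
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator
```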
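For the Logit Lens in item 2, a sketch assuming a Llama-style Hugging Face model layout (the attribute names model.model.norm and model.lm_head are assumptions about the architecture, not confirmed by the article):

```python
import torch

@torch.no_grad()
def logit_lens_topk(model, hidden_state, k=5):
    # Project an intermediate residual-stream state through the final norm
    # and the unembedding matrix to read off which tokens it already encodes.
    h = model.model.norm(hidden_state)  # final RMSNorm (Llama-style naming)
    logits = model.lm_head(h)           # map to vocabulary space
    return logits.topk(k).indices       # top-k candidate token ids
```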
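And for item 3, a sketch of the weight-distance and spectral test; W_before and W_after stand for corresponding weight matrices (e.g., an MLP projection) from the SFT and RLVR checkpoints:

```python
import numpy as np

def update_stats(W_before, W_after, energy=0.90):
    # L2 norm of the weight update, plus the fraction of singular values
    # needed to capture `energy` of its spectral mass: a small fraction
    # suggests a low-rank update (Routing Hypothesis); a large one suggests
    # a dense rewrite (Representation Learning Hypothesis).
    delta = W_after - W_before
    l2_distance = np.linalg.norm(delta)
    s = np.linalg.svd(delta, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return l2_distance, k / len(s)
```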

Section 06

Research Significance and Future Outlook

The research has significant practical implications: if the Routing Hypothesis holds, pre-stored capabilities can be activated with lightweight methods; if the Representation Learning Hypothesis holds, continued investment in RLVR training infrastructure is warranted. Either way, the study sheds light on how LLMs learn, supports the development of efficient training strategies and interpretable AI, and deserves attention from researchers and engineers alike.