Zing Forum

RELEX: A Minimalist RLVR Training Method Based on Rank-1 Trajectory Extrapolation

The study finds that RLVR weight trajectories are extremely low-rank and highly predictable. It proposes RELEX, which estimates a rank-1 subspace from a short observation window and linearly extrapolates future checkpoints. Using only 15% of the full training steps, RELEX matches or surpasses complete RLVR training, and it can extrapolate to steps 10-20 times beyond the observation window.

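Tags: RLVR, reinforcement learning, low-rank approximation, training extrapolation, reasoning ability, parameter trajectory, Qwen, compute efficiency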
Published 2026-05-21 01:53 | Recent activity 2026-05-21 10:51 | Estimated read 9 min

Section 01

Introduction: RELEX—An Efficient RLVR Training Method Based on Low-Rank Trajectory Extrapolation

The study finds that RLVR weight trajectories are extremely low-rank and highly predictable. Building on this, RELEX estimates a rank-1 subspace from a short observation window and linearly extrapolates future checkpoints. With only 15% of the full training steps it matches or surpasses complete RLVR training, and it can extrapolate to steps 10-20 times beyond the observation window, offering a new way to reduce the high cost of RLVR training.

Section 02

Background: The High Cost Bottleneck of RLVR Training

Reinforcement Learning with Verifiable Rewards (RLVR) has become a mainstream paradigm for improving the reasoning ability of Large Language Models (LLMs), with significant results on tasks such as mathematical reasoning and code generation. However, RLVR training is computationally expensive: it usually requires thousands of gradient updates and consumes a large amount of GPU resources. Existing lines of improvement (reward models, policy gradient optimization, etc.) still follow the "train until convergence" paradigm. The core question is whether the same performance can be reached more efficiently.

Section 03

Core Finding: Low-Rank Characteristics of RLVR Weight Trajectories

The research team analyzed the geometry of the parameter trajectory during RLVR training and found that the weight trajectory has an extremely low effective rank: a rank-1 approximation captures most of the information in the parameter increments, and the magnitude of the rank-1 projection grows approximately linearly with the number of training steps. In other words, training is essentially moving the model along a single direction. Once that dominant direction is identified, future parameter changes can be predicted without running the remaining training steps.
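One way to check the low-rank claim on a saved trajectory is to ask how much of the increments' energy the top singular direction holds. The sketch below is only an illustration under assumptions not stated in the summary: increments are taken as Δθ_t = θ_t − θ_0 and stacked as rows of a matrix, the function name `rank1_energy_fraction` is hypothetical, and the trajectory is synthetic rather than real RLVR data.

```python
import numpy as np


def rank1_energy_fraction(deltas: np.ndarray) -> float:
    """Share of the squared Frobenius norm captured by the top singular direction.

    deltas has shape (T, D); row t is the flattened increment theta_t - theta_0.
    """
    singular_values = np.linalg.svd(deltas, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))


# Synthetic stand-in for a weight trajectory (not real RLVR data):
# drift along one fixed direction, growing linearly with the step, plus noise.
rng = np.random.default_rng(0)
num_steps, dim = 40, 200
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)
steps = np.arange(1, num_steps + 1)
deltas = 0.01 * steps[:, None] * direction[None, :]
deltas += 0.002 * rng.standard_normal((num_steps, dim))

energy = rank1_energy_fraction(deltas)
print(f"rank-1 energy fraction: {energy:.3f}")
```

A value close to 1 would indicate that a single direction explains nearly all of the observed movement; on real checkpoints the matrix would be built from flattened weight differences instead of the toy data used here.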

Section 04

RELEX Method Design: Minimalist Extrapolation Process

Building on the low-rank finding, the authors propose RELEX (REinforcement Learning EXtrapolation). Its core idea is to estimate the rank-1 subspace from a short training trajectory and linearly extrapolate future checkpoints.

Algorithm Flow

Step 1: Observation window collection. Run standard RLVR training for a short time (e.g., 50-100 steps) and collect the parameter increments Δθ_t.
Step 2: Rank-1 subspace estimation. Perform SVD on the matrix of increments Δθ_t and take the singular vector associated with the largest singular value as the rank-1 subspace.
Step 3: Linear extrapolation. Fit the linear relationship between the rank-1 projection magnitude and the number of steps, and use it to predict future increments.
Step 4: Checkpoint synthesis. Add the extrapolated increments to the initial parameters to generate future checkpoints.

The computational overhead of the entire process is negligible, far lower than that of RLVR training itself.
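The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats the whole model as one flattened vector, defines Δθ_t as θ_t − θ_0, uses an ordinary least-squares line for the fit, and runs on a synthetic trajectory. The name `relex_extrapolate` and its parameters are hypothetical.

```python
import numpy as np


def relex_extrapolate(theta0, window_thetas, window_steps, target_step):
    """Rank-1 fit on an observation window, then linear extrapolation.

    theta0:        (D,) initial parameters, flattened
    window_thetas: (T, D) checkpoints observed at window_steps
    window_steps:  (T,) training-step index of each observed checkpoint
    target_step:   step index of the checkpoint to synthesize
    """
    # Step 1: increments relative to the initial parameters (assumed definition).
    deltas = np.asarray(window_thetas) - theta0
    # Step 2: the top right singular vector spans the rank-1 subspace.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    v = vt[0]
    # Step 3: projection magnitude at each observed step, fitted by a line.
    magnitudes = deltas @ v
    slope, intercept = np.polyfit(window_steps, magnitudes, deg=1)
    # Step 4: add the extrapolated rank-1 increment to the initial parameters.
    return theta0 + (slope * target_step + intercept) * v


# Toy trajectory (synthetic, for illustration only).
rng = np.random.default_rng(0)
dim = 200
theta0 = rng.standard_normal(dim)
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)
window_steps = np.arange(1, 51)  # observe 50 steps
window_thetas = (theta0 + 0.01 * window_steps[:, None] * direction
                 + 0.001 * rng.standard_normal((50, dim)))

theta_pred = relex_extrapolate(theta0, window_thetas, window_steps, target_step=1000)
theta_true = theta0 + 0.01 * 1000 * direction  # noise-free toy target
rel_err = np.linalg.norm(theta_pred - theta_true) / np.linalg.norm(theta_true - theta0)
```

The sign of the singular vector is arbitrary, but the projection magnitudes flip with it, so the synthesized checkpoint is unaffected. The only costs here are one SVD of a T×D matrix and a one-dimensional line fit, which is in line with the claim that the overhead is small relative to training.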

Section 05

Experimental Validation: Significant Improvement in Efficiency and Generalization Ability

RELEX was validated on the Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base models, on tasks including mathematical reasoning and code generation.

Training efficiency improvement: Only 15% of the full training steps are needed to match or surpass full-training performance (e.g., 150 observed steps for a 1000-step training run).
Ultra-far extrapolation ability: A 50-step observation window can be extrapolated to step 1000 (20x), with performance continuing to improve.
Cross-domain generalization: On unseen tasks, the generated checkpoints generalize comparably to fully trained models.
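The two ratios quoted above follow directly from the step counts. As a small worked check, using hypothetical helper names:

```python
def observed_fraction(observed_steps: int, total_steps: int) -> float:
    """Fraction of the full training run that is actually trained."""
    return observed_steps / total_steps


def extrapolation_ratio(observed_steps: int, target_step: int) -> float:
    """How many window lengths away the extrapolated checkpoint lies."""
    return target_step / observed_steps


fraction = observed_fraction(150, 1000)   # 150 observed steps out of 1000
ratio = extrapolation_ratio(50, 1000)     # 50-step window extrapolated to step 1000
```

Here `fraction` corresponds to the 15% figure and `ratio` to the 20x figure in the results above.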

Section 06

Ablation Analysis: Sufficiency of Rank-1 and Linear Models

Sufficiency of rank-1: Increasing the subspace rank (rank 2 or rank 5) does not improve performance, which supports the view that the dominant dynamics are concentrated in a single direction.
Sufficiency of the linear model: Nonlinear models (neural networks, higher-order polynomials) do not improve performance, indicating that the projection magnitude is approximately linear in the number of steps.
Denoising explanation: RELEX filters out the random optimization noise in RLVR updates and keeps the signal that drives performance improvement, avoiding the degradation caused by accumulated noise.
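A linear-versus-polynomial comparison of the kind described above could be set up as follows. This sketch does not reproduce the reported ablation: the projection magnitudes are synthetic and linear by construction, so it only shows the mechanics of the comparison and why a higher-degree fit on a short window tends to be fragile when extrapolated 20x. All names and constants are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
window_steps = np.arange(1, 51)  # 50 observed steps
# Synthetic projection magnitudes: linear growth plus small noise.
magnitudes = 0.01 * window_steps + 0.002 * rng.standard_normal(50)
target_step = 1000
true_value = 0.01 * target_step  # noise-free value at the target step


def extrapolation_error(degree: int) -> float:
    """Absolute error at target_step for a polynomial fit of the given degree."""
    coeffs = np.polyfit(window_steps, magnitudes, deg=degree)
    return float(abs(np.polyval(coeffs, target_step) - true_value))


linear_error = extrapolation_error(1)
cubic_error = extrapolation_error(3)
print(f"linear: {linear_error:.4f}, cubic: {cubic_error:.4f}")
```

On this toy data the cubic fit chases noise inside the window, and the small spurious higher-order coefficients are amplified far outside it. With real checkpoints the comparison would use the measured projection magnitudes and downstream task accuracy rather than a known ground-truth line.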

Section 07

Implications: Multiple Significance for RLVR Practice

  1. RLVR training may be able to converge faster: more efficient algorithms could be designed to search directly in low-dimensional subspaces;
  2. RELEX offers a way to preview training: a short exploratory run can predict the benefit of full training, which helps with hyperparameter search and ablation studies;
  3. The finding reveals the geometric structure of RLVR training and provides a new perspective on how reinforcement learning changes the reasoning behavior of LLMs.

Section 08

Limitations and Future Directions

Limitations: RELEX currently targets policy-gradient RLVR, and its applicability to other reinforcement learning variants still needs to be verified. It also remains open whether the rank-1 assumption holds in the later stages of training and how to handle multiple dominant directions in multi-task training.

Future Directions: Develop adaptive rank adjustment methods; explore combinations with model merging techniques; apply low-rank extrapolation to other training dynamics such as supervised fine-tuning and continual learning.