Zing Forum

DynaMO-RL: An Efficient Reinforcement Learning Training Framework for Large Language Models

DynaMO-RL is an open-source reinforcement learning training framework that optimizes policy learning for large language models by dynamically allocating rollout resources and modulating the advantage function. Extending the VERL framework, the project implements the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) algorithm, supports the FSDP and Megatron distributed training strategies, and aims to reduce training cost while improving convergence efficiency.

reinforcement learning, large language models, PPO, RLHF, distributed training, Ray, FSDP, policy optimization, DAPO, ByteDance
Published 2026-06-05 10:45 | Recent activity 2026-06-05 10:50 | Estimated read 10 min

Section 02

Original Author and Source

  • Original Author/Maintainer: Rudge0 (GitHub)
  • Source Platform: GitHub
  • Original Title: DynaMO-RL
  • Original Link: https://github.com/Rudge0/DynaMO-RL
  • Open Source License: Apache License 2.0
  • Source Code Update Time: 2026-06-05

Section 03

Background and Motivation

As large language models (LLMs) demonstrate strong capabilities across a wide range of tasks, aligning model behavior with human preferences via reinforcement learning (RL) has become an active research area. However, the traditional RLHF (Reinforcement Learning from Human Feedback) training process faces two core challenges:

  1. High computational resource consumption: Generating rollouts (outputs sampled from the model) requires a large amount of forward inference. Early in training, when policy quality is still low, many of these rollouts contribute little to policy improvement.

  2. Unstable advantage estimation: The advantage function calculation in the standard PPO (Proximal Policy Optimization) algorithm treats all samples uniformly, failing to handle them differently based on their potential learning value.

To address these issues, DynaMO-RL proposes dynamic resource allocation and advantage adjustment mechanisms that aim to reduce computational overhead without sacrificing training effectiveness.

Section 04

Project Overview

DynaMO-RL is a reinforcement learning training framework designed for large language models. The name DynaMO stands for "Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization". The project extends VERL (Volcano Engine Reinforcement Learning), ByteDance's open-source framework, and its core is an implementation of the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) training algorithm. It is released under the Apache 2.0 license, with a clear code structure and support for multiple distributed training backends.

Section 05

Core Architecture Components

The repository is organized into the following key modules:

  • verl/: Core training framework, including FSDP/Megatron workers, PPO trainers, resource pool managers, etc.
  • rllm/: RL-related data processing and reward calculation modules
  • recipe/: Predefined training configurations and scripts, including implementations of algorithms such as dynamo, dapo, r1, and prime
  • recipe/dynamo/: Core implementation of the DAPO algorithm, including Ray distributed trainers

Section 06

1. Dynamic Rollout Allocation Strategy

The core innovation of DynaMO-RL is its dynamic rollout allocation mechanism. Traditional PPO training usually generates a fixed number of rollouts for each prompt (e.g., 8 responses per prompt), whereas DynaMO-RL varies the number per prompt within a configurable budget:

Key Parameters:

  • rollout_n_min: Minimum number of rollouts generated per prompt (default: 4)
  • rollout_n_max: Maximum number of rollouts generated per prompt (default: 24)
  • initial_budget: Initial exploration budget (default: 8)
  • total_rollout_n: Total rollout budget
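
To make the relationship between these parameters concrete, here is a minimal sketch. The RolloutBudget class and its methods are hypothetical and are not taken from the DynaMO-RL code; only the parameter names and defaults come from the list above. It shows the range of total budgets that can satisfy the per-prompt min/max bounds for a given batch size.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RolloutBudget:
    """Hypothetical container for the four parameters listed above."""

    total_rollout_n: int      # no default is stated in the post
    rollout_n_min: int = 4    # default from the post
    rollout_n_max: int = 24   # default from the post
    initial_budget: int = 8   # default from the post

    def feasible_range(self, num_prompts: int) -> Tuple[int, int]:
        """Smallest and largest total budget that respects min/max per prompt."""
        return (num_prompts * self.rollout_n_min, num_prompts * self.rollout_n_max)

    def is_feasible(self, num_prompts: int) -> bool:
        """True if total_rollout_n can be split without violating the bounds."""
        low, high = self.feasible_range(num_prompts)
        return low <= self.total_rollout_n <= high


# 128 prompts at the fixed rate of 8 per prompt would cost 1024 rollouts.
budget = RolloutBudget(total_rollout_n=1024)
print(budget.feasible_range(128), budget.is_feasible(128))
```

With the same total budget as a fixed 8-per-prompt scheme, each prompt can still receive anywhere between 4 and 24 rollouts, which is the room the allocator has to work with.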

Allocation Strategy Logic:

The system first identifies prompts that need additional exploration (i.e., prompts that have been sampled fewer than initial_budget times) and gives them priority when allocating rollouts. The intuition is that, early in training, some prompts have not yet been explored enough, and increasing their sampling diversity helps the policy converge faster.

In the code, the get_rollout_n_per_prompt function implements the budget allocation in four steps:

  1. Exploration Phase: Allocate additional rollouts to under-sampled prompts
  2. Waterfall Filling: Allocate the remaining budget across prompts according to priority weights
  3. Boundary Constraints: Ensure the number of rollouts per prompt is within the min/max range
  4. Fault Tolerance Handling: When the budget is insufficient, scale proportionally or distribute evenly

This dynamic allocation strategy allows computational resources to be concentrated on "more valuable" training samples, avoiding waste of computing power on samples with low information gain.
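
The four steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's get_rollout_n_per_prompt: the function name allocate_rollouts, the sample_counts and priorities inputs, and the exact waterfall rule (fill the highest-priority prompt to the cap, then let the remainder flow to the next) are all hypothetical.

```python
from typing import List


def allocate_rollouts(
    sample_counts: List[int],
    priorities: List[float],
    total_rollout_n: int,
    rollout_n_min: int = 4,
    rollout_n_max: int = 24,
    initial_budget: int = 8,
) -> List[int]:
    """Hypothetical sketch: split total_rollout_n across prompts."""
    n = len(sample_counts)
    if n != len(priorities):
        raise ValueError("sample_counts and priorities must have equal length")
    if n == 0:
        return []

    # Fault tolerance: the budget cannot cover the minimum for every prompt,
    # so distribute it as evenly as possible.
    if total_rollout_n < n * rollout_n_min:
        base, extra = divmod(total_rollout_n, n)
        return [base + (1 if i < extra else 0) for i in range(n)]

    # Boundary constraint (lower bound): every prompt starts at the minimum.
    alloc = [rollout_n_min] * n
    remaining = total_rollout_n - sum(alloc)

    # Exploration phase: top up prompts sampled fewer than initial_budget
    # times, least-sampled first.
    target = min(initial_budget, rollout_n_max)
    for i in sorted(range(n), key=lambda k: sample_counts[k]):
        if sample_counts[i] < initial_budget:
            give = max(0, min(target - alloc[i], remaining))
            alloc[i] += give
            remaining -= give

    # Waterfall filling: the rest flows down the priority order, and each
    # prompt is capped at rollout_n_max (upper boundary constraint).
    for i in sorted(range(n), key=lambda k: priorities[k], reverse=True):
        give = min(rollout_n_max - alloc[i], remaining)
        alloc[i] += give
        remaining -= give

    return alloc  # any budget left once every prompt is at the cap is unused


print(allocate_rollouts([0, 10, 10, 3], [0.1, 0.9, 0.5, 0.2], total_rollout_n=40))
```

In this example the two under-sampled prompts are topped up to initial_budget first, and the remaining budget goes to the highest-priority prompt. A proportional split by priority would be an equally plausible reading of "priority weighting".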

Section 07

2. Advantage Function Modulation Mechanism

In addition to rollout allocation, DynaMO-RL also modulates the advantage function. In reinforcement learning, the advantage function measures how much better an action is than the expected (baseline) return, and it is the key signal driving policy updates.

DynaMO-RL's advantage modulation has three aspects:

  • Differentiated Handling: Dynamically adjusts advantage weights based on the quality and diversity of rollouts
  • Stability Optimization: Prevents overly aggressive policy updates through KL divergence constraints
  • Multi-source Reward Fusion: Supports combining rule-based rewards, model rewards, and code execution feedback
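
The post does not give the modulation formula, so the following is only a sketch of what reward fusion and differentiated handling could look like. Everything here is an assumption: the function names fuse_rewards and modulated_advantages, the weighted-sum fusion, rewards lying in [0, 1], and the choice of reward spread within a prompt's rollout group as the "diversity" signal. The baseline is a group-relative normalization, not necessarily what DynaMO-RL uses.

```python
from statistics import mean, pstdev
from typing import Dict, List


def fuse_rewards(sources: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical fusion: weighted sum of rule, model, and code-exec rewards."""
    return sum(weights[name] * value for name, value in sources.items())


def modulated_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages scaled by a spread-based weight (assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)

    # Baseline: advantage of each rollout relative to its own prompt group.
    base = [(r - mu) / (sigma + eps) for r in rewards]

    # Assumed modulation: a group whose rollouts all score the same carries
    # no learning signal (weight 0). For rewards in [0, 1] the spread peaks
    # at 0.5, which maps to weight 1.
    weight = min(1.0, sigma / 0.5)
    return [weight * a for a in base]


reward = fuse_rewards(
    {"rule": 1.0, "model": 0.5, "code_exec": 0.0},
    {"rule": 0.5, "model": 0.3, "code_exec": 0.2},
)
print(reward)
print(modulated_advantages([1, 0, 1, 0]))  # mixed outcomes: full weight
print(modulated_advantages([1, 1, 1, 1]))  # identical outcomes: zero signal
```

The KL divergence constraint mentioned above is left out of this sketch; it would typically enter as a penalty term in the policy loss rather than in the advantage itself.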

Section 08

3. Distributed Training Support

DynaMO-RL implements efficient distributed training based on the Ray framework:

  • FSDP/FSDP2 Backend: Suitable for consumer GPUs and small-scale clusters
  • Megatron Backend: Supports distributed training of large-scale models (10 billion to 100 billion parameters)
  • Ray Actor Architecture: Components such as Actor, Critic, and Reward Model run as independent Ray Actors, supporting flexible resource scheduling