The core innovation of DynaMO-RL lies in its dynamic rollout allocation mechanism. Traditional PPO training usually generates a fixed number of rollouts for each prompt (e.g., 8 responses per prompt), while DynaMO-RL adjusts the number of rollouts per prompt dynamically, controlled by the parameters below.
Key Parameters:
rollout_n_min: Minimum number of rollouts generated per prompt (default: 4)
rollout_n_max: Maximum number of rollouts generated per prompt (default: 24)
initial_budget: Initial exploration budget (default: 8)
total_rollout_n: Total rollout budget
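As a minimal sketch, the four parameters above could be grouped and sanity-checked as follows. The class name RolloutConfig, the validation rules, and the budget_range helper are hypothetical and only illustrate how the bounds relate to the total budget; they are not taken from the DynaMO-RL code base. No default is stated for total_rollout_n, so it is a required field here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RolloutConfig:
    """Hypothetical container for the parameters listed above."""

    total_rollout_n: int       # total rollout budget (no default given in the text)
    rollout_n_min: int = 4     # minimum rollouts per prompt
    rollout_n_max: int = 24    # maximum rollouts per prompt
    initial_budget: int = 8    # initial exploration budget

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real implementation may differ.
        if self.rollout_n_min < 1:
            raise ValueError("rollout_n_min must be at least 1")
        if self.rollout_n_max < self.rollout_n_min:
            raise ValueError("rollout_n_max must be >= rollout_n_min")
        if self.initial_budget < 0 or self.total_rollout_n < 0:
            raise ValueError("budgets must be non-negative")

    def budget_range(self, num_prompts: int) -> Tuple[int, int]:
        """Smallest and largest total budget that the per-prompt bounds allow."""
        return num_prompts * self.rollout_n_min, num_prompts * self.rollout_n_max


cfg = RolloutConfig(total_rollout_n=64)
low, high = cfg.budget_range(num_prompts=8)
```

With the defaults, a batch of 8 prompts can absorb between 32 and 192 rollouts; a total budget outside that range is the case the fault-tolerance step has to handle.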
Allocation Strategy Logic:
The system first identifies prompts that need additional exploration (i.e., prompts that have been sampled fewer than initial_budget times) and prioritizes allocating rollout resources to them. This design rests on a simple intuition: in the early training stage, some prompts may not yet have been fully explored, and increasing their sampling diversity helps the policy converge faster.
In the code, the get_rollout_n_per_prompt function implements the budget allocation in four steps:
- Exploration Phase: Allocate additional rollouts to prompts that have not yet been sampled enough
- Waterfall Filling: Distribute the remaining budget across prompts according to their priority weights
- Boundary Constraints: Ensure the number of rollouts per prompt is within the min/max range
- Fault Tolerance Handling: When the budget is insufficient, scale the allocation proportionally or distribute it evenly
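The four steps above can be sketched roughly as follows. This is an illustrative approximation, not the actual get_rollout_n_per_prompt implementation: the function name allocate_rollouts, the sample_counts and priorities inputs, and the specific rules (least-sampled prompts first, filling each prompt to the maximum in descending priority order, an even split when the budget cannot cover the minimum) are all assumptions made for the example.

```python
from typing import List, Sequence


def allocate_rollouts(
    sample_counts: Sequence[int],   # hypothetical: how often each prompt was sampled so far
    priorities: Sequence[float],    # hypothetical: per-prompt priority weights
    total_rollout_n: int,
    rollout_n_min: int = 4,
    rollout_n_max: int = 24,
    initial_budget: int = 8,
) -> List[int]:
    n = len(sample_counts)
    if n == 0:
        return []

    # Fault tolerance (assumed rule): if the budget cannot cover the
    # per-prompt minimum, split it as evenly as possible.
    if total_rollout_n < n * rollout_n_min:
        base, extra = divmod(max(total_rollout_n, 0), n)
        return [base + (1 if i < extra else 0) for i in range(n)]

    # Boundary constraint (lower bound): every prompt starts at the minimum.
    alloc = [rollout_n_min] * n
    remaining = total_rollout_n - n * rollout_n_min

    # Exploration phase: top up under-sampled prompts, least-sampled first.
    under = sorted(
        (i for i in range(n) if sample_counts[i] < initial_budget),
        key=lambda i: sample_counts[i],
    )
    for i in under:
        deficit = initial_budget - sample_counts[i]
        give = min(deficit, rollout_n_max - alloc[i], remaining)
        alloc[i] += give
        remaining -= give

    # Waterfall filling: pour what is left into prompts in descending
    # priority order, spilling to the next prompt once one hits the maximum.
    for i in sorted(range(n), key=lambda i: priorities[i], reverse=True):
        give = min(rollout_n_max - alloc[i], remaining)
        alloc[i] += give
        remaining -= give

    # Any budget still remaining here is left unused (all prompts are capped).
    return alloc


result = allocate_rollouts([0, 10, 3], [0.1, 0.9, 0.5], total_rollout_n=30)
```

In this example the two under-sampled prompts (sampled 0 and 3 times) are topped up first, and the remaining 5 rollouts go to the highest-priority prompt. A proportional split by priority weight would be an equally plausible reading of "waterfall filling"; the cascade version is used here only because it is simpler to follow.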
This dynamic allocation strategy allows computational resources to be concentrated on "more valuable" training samples, avoiding waste of computing power on samples with low information gain.