Zing Forum

HIVE: Dynamically Selecting High-Value Prompts at the 'Learning Edge' to Improve RL Training Efficiency

The HIVE framework uses historical reward trajectories and real-time prompt-entropy filtering to locate the 'medium difficulty + high uncertainty' learning-edge region, enabling data-efficient reinforcement learning training on mathematical reasoning tasks.

Reinforcement Learning · Large Language Models · Prompt Selection · GRPO · Data Efficiency
Published 2026-03-26 16:52 · Recent activity 2026-03-27 13:23 · Estimated read 3 min

Section 02

Problem Background

Reinforcement Learning (RL) has become a key technique for post-training large language models, but its computational cost is a bottleneck. In group-based algorithms such as GRPO, each prompt requires multiple rollouts, yet many prompts contribute a negligible gradient signal.
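To see why many prompts are wasted, consider the group-relative advantage used in GRPO-style training: each rollout's reward is standardized against the other rollouts for the same prompt. A minimal sketch (function name and epsilon handling are illustrative, not the paper's implementation):

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages, GRPO-style (sketch).

    Each rollout's advantage is its reward standardized against the
    group of rollouts for the same prompt. If every rollout earns the
    same reward (prompt too easy or too hard), all advantages are zero
    and the prompt contributes no policy gradient.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# A prompt the model always solves yields no learning signal:
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all zeros
# A medium-difficulty prompt (mixed rewards) yields non-zero advantages:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

This is the root of the inefficiency: rollouts spent on uniformly solved or uniformly failed prompts produce zero advantage everywhere.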

Section 03

Key Finding: Learning Edge

Experimental analysis reveals two key properties of sample utility:

  • Non-uniform distribution: The strongest learning signals are concentrated in specific regions
  • Dynamic evolution: This region shifts as training progresses

Learning Edge = Intersection of medium difficulty × high uncertainty
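The intersection above can be expressed as a simple membership test. The sketch below is an assumed formulation, not the paper's exact criterion: `hist_rewards`, `prompt_entropy`, and the thresholds are all hypothetical stand-ins for HIVE's historical-trajectory and entropy signals.

```python
def at_learning_edge(hist_rewards, prompt_entropy,
                     diff_band=(0.2, 0.8), entropy_min=0.5):
    """Illustrative 'learning edge' test (names/thresholds assumed).

    hist_rewards   : past 0/1 rewards for this prompt (difficulty proxy)
    prompt_entropy : current policy entropy on the prompt (uncertainty proxy)
    """
    if not hist_rewards:
        return True  # no history yet: keep the prompt by default
    success_rate = sum(hist_rewards) / len(hist_rewards)
    medium_difficulty = diff_band[0] <= success_rate <= diff_band[1]
    high_uncertainty = prompt_entropy >= entropy_min
    return medium_difficulty and high_uncertainty

# Mixed history + high entropy -> at the edge:
print(at_learning_edge([1, 0, 1, 0], 0.9))
# Always solved -> outside the edge regardless of entropy:
print(at_learning_edge([1, 1, 1, 1], 0.9))
```

Because the policy keeps improving, both signals drift over training, which is why the region must be re-estimated rather than fixed once.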

Section 04

HIVE Framework

Two-stage data-efficient RL framework:

  1. Coarse screening with historical information: Preliminary filtering using historical reward trajectories
  2. Fine pruning via online validation: Using prompt entropy as a real-time proxy to prune instances with outdated utility
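The two stages above can be sketched as a single selection pass. This is a structural illustration only: the function, the history layout, and the thresholds are assumptions, not HIVE's actual implementation.

```python
def select_prompts(pool, history, entropy_fn,
                   band=(0.2, 0.8), keep_frac=0.5):
    """Two-stage selection sketch (names and thresholds hypothetical).

    Stage 1: coarse screen using historical reward trajectories.
    Stage 2: fine prune by real-time prompt entropy, keeping only the
             most uncertain fraction of the stage-1 survivors.
    """
    def success_rate(prompt):
        rewards = history.get(prompt, [])
        return sum(rewards) / len(rewards) if rewards else 0.5

    # Stage 1: keep prompts whose historical success rate is medium.
    coarse = [p for p in pool if band[0] <= success_rate(p) <= band[1]]
    # Stage 2: rank survivors by current entropy; keep the top fraction.
    scored = sorted(coarse, key=entropy_fn, reverse=True)
    k = max(1, int(len(scored) * keep_frac)) if scored else 0
    return scored[:k]
```

The coarse stage is cheap (it reuses rewards already logged during training), while the entropy stage catches prompts whose historical statistics no longer reflect the current policy.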

Section 05

Experimental Results

Evaluations on multiple mathematical reasoning benchmarks and models show:

  • Significant improvement in rollout efficiency
  • No sacrifice in model performance
  • Dynamic adaptation to training progress

Section 06

Technical Value

This work provides a smarter data selection strategy for RL training, allowing computational resources to focus on samples with the highest learning value.