# HIVE: Dynamically Selecting High-Value Prompts at the 'Learning Edge' to Improve RL Training Efficiency

> The HIVE framework precisely locates the 'medium difficulty + high uncertainty' learning edge region through historical reward trajectories and real-time prompt entropy filtering, enabling efficient reinforcement learning training on mathematical reasoning tasks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-26T08:52:35.000Z
- Last activity: 2026-03-27T05:23:25.169Z
- Popularity: 115.5
- Keywords: reinforcement learning, large language models, prompt selection, GRPO, data efficiency
- Page link: https://www.zingnex.cn/en/forum/thread/hive-rl
- Canonical: https://www.zingnex.cn/forum/thread/hive-rl
- Markdown source: floors_fallback

---

## Problem Background

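Reinforcement learning (RL) has become a key technique for post-training large language models, but its computational cost is a bottleneck. In algorithms such as GRPO, every prompt requires multiple rollouts, yet many prompts contribute a negligible gradient: when all rollouts for a prompt receive the same reward (all correct or all incorrect), the group-relative advantages are zero and the compute spent on those rollouts yields no learning signal.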
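To make the wasted-rollout problem concrete, here is a minimal sketch of group-relative advantages in the commonly described GRPO form (rewards standardized within one prompt's rollout group). The function name and the `eps` stabilizer are assumptions for illustration, not details taken from the HIVE work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a single prompt's rollout group.

    `eps` is a hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt the model already solves on every rollout: all advantages are zero,
# so these rollouts contribute nothing to the policy gradient.
solved_every_time = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A prompt with mixed outcomes: advantages are non-zero and carry signal.
mixed_outcomes = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Prompts of the first kind consume the same rollout budget as the second but leave the policy unchanged, which is the inefficiency that prompt selection aims to avoid.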

## Key Finding: Learning Edge

Experimental analysis reveals two key properties of sample utility:
- **Non-uniform distribution**: The strongest learning signals are concentrated in specific regions
- **Dynamic evolution**: This region shifts as training progresses

**Learning Edge** = the intersection of medium difficulty and high uncertainty
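The definition above can be read as a simple membership test. The sketch below is illustrative only: measuring difficulty by pass rate, scaling uncertainty to [0, 1], and the thresholds `low`, `high`, and `min_uncertainty` are all assumptions, since the post does not give concrete values.

```python
def in_learning_edge(pass_rate, uncertainty,
                     low=0.2, high=0.8, min_uncertainty=0.5):
    """Return True when a prompt is both medium-difficulty and high-uncertainty.

    All thresholds are hypothetical placeholders.
    """
    medium_difficulty = low <= pass_rate <= high
    high_uncertainty = uncertainty >= min_uncertainty
    return medium_difficulty and high_uncertainty


# The same hypothetical prompt observed at three points in training:
# (stage, pass rate, uncertainty score)
snapshots = [
    ("early", 0.05, 0.9),   # still too hard
    ("middle", 0.50, 0.8),  # inside the learning edge
    ("late", 0.95, 0.2),    # mastered and confident
]
membership = {stage: in_learning_edge(p, u) for stage, p, u in snapshots}
```

The toy snapshots show why a static filter is not enough: a prompt can enter and then leave the region as the model improves.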

## HIVE Framework

HIVE is a two-stage, data-efficient RL framework:

1. **Coarse screening with historical information**: Preliminary filtering using historical reward trajectories
2. **Fine pruning via online validation**: Using prompt entropy as a real-time proxy to prune instances with outdated utility
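The two stages above can be sketched roughly as follows. This is not the authors' implementation: the post does not define how prompt entropy is computed, so the sketch assumes it is the mean Shannon entropy of the current policy's next-token distributions, and every function name, threshold, and data value here is hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_token_entropy(token_distributions):
    """Assumed stand-in for 'prompt entropy': average per-token entropy."""
    total = sum(shannon_entropy(d) for d in token_distributions)
    return total / len(token_distributions)


def coarse_screen(reward_history, low=0.2, high=0.8):
    """Stage 1: keep prompts whose historical mean reward is in a medium band."""
    kept = []
    for prompt_id, rewards in reward_history.items():
        avg = sum(rewards) / len(rewards)
        if low <= avg <= high:
            kept.append(prompt_id)
    return kept


def fine_prune(candidates, entropy_of, min_entropy=0.5):
    """Stage 2: drop candidates whose current entropy is below a threshold."""
    return [p for p in candidates if entropy_of(p) >= min_entropy]


# Toy data: past rewards per prompt, and the current policy's token distributions.
reward_history = {
    "easy": [1, 1, 1, 1],
    "hard": [0, 0, 0, 0],
    "stale": [0, 1, 0, 1],  # looked useful in the past
    "edge": [1, 0, 0, 1],
}
token_dists = {
    "stale": [[0.98, 0.02], [0.99, 0.01]],  # the model is now confident
    "edge": [[0.5, 0.5], [0.6, 0.4]],       # the model is still uncertain
}

candidates = coarse_screen(reward_history)
selected = fine_prune(candidates, lambda p: mean_token_entropy(token_dists[p]))
```

In this toy run, `stale` passes the historical screen but is pruned online because its entropy is now low, which illustrates why a real-time check complements reward history.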

## Experimental Results

Evaluations on multiple mathematical reasoning benchmarks and models show:
- Significant improvement in rollout efficiency
- No sacrifice in model performance
- Dynamic adaptation to training progress

## Technical Value

This work provides a smarter data selection strategy for RL training, allowing computational resources to focus on samples with the highest learning value.