Zing Forum

AgentFlow-Pro: A Multi-step Reasoning Agent Training Framework Based on Process-Supervised Reinforcement Learning

AgentFlow-Pro is a ground-up reimplementation of AgentFlow, introducing a learned Process Reward Model (PRM) and the DAPO algorithm. It upgrades multi-step reasoning agent training from trajectory-level feedback to step-by-step fine-grained supervision, significantly improving credit assignment efficiency.

Tags: Reinforcement Learning, Process Reward Model, DAPO, AgentFlow, Multi-step Reasoning, Agent Training, Qwen3, PRM, LLM, Open-Source Project
Published 2026-05-30 16:55 | Recent activity 2026-05-30 17:22 | Estimated read 9 min

Section 01

AgentFlow-Pro Framework Guide: Process-Supervised Reinforcement Learning Boosts Multi-step Reasoning Efficiency

AgentFlow-Pro is a modern, ground-up implementation of the ICLR 2026 paper AgentFlow. It adds two things: a learned Process Reward Model (PRM) and the DAPO algorithm. Together they move multi-step reasoning agent training from trajectory-level feedback to fine-grained, step-level supervision, which improves credit assignment. The framework targets two flaws of outcome-oriented rewards in multi-step reasoning: they cannot tell good steps from bad ones, and they produce zero gradient when every sampled trajectory for a prompt receives the same reward. The aim is a reproducible path for building reliable multi-step reasoning agents.

Section 02

Research Background: Limitations of Outcome-Oriented Rewards in Multi-step Reasoning

Mainstream reinforcement learning methods for large language models, such as GRPO and Flow-GRPO, use outcome-oriented rewards: the agent receives a single feedback signal, based on the final answer, at the end of the task. This works for single-step tasks but has two serious flaws in multi-step reasoning. First, it cannot distinguish the quality of individual decisions: in a five-step trajectory where the first step is excellent and the third is redundant, both steps receive the same gradient signal. Second, when all trajectories sampled for a prompt are correct, or all are wrong, the group-relative advantages are zero, so the prompt contributes no gradient and the compute spent on it is wasted. AgentFlow-Pro was created to address both problems.
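
The zero-gradient case is easy to see numerically. The sketch below assumes the usual group-relative normalization used by GRPO-style methods (reward minus group mean, divided by group standard deviation); the function name and the `eps` value are illustrative and not taken from the project.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Groups where every trajectory gets the same outcome reward yield all zeros,
# so the policy-gradient term for that prompt vanishes.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the uniform groups every advantage is exactly zero, which is why such prompts consume rollout compute without moving the policy.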

Section 03

Project Overview and Core Architecture Design

AgentFlow-Pro adopts a Planner→Executor→Verifier loop architecture in which only the Planner is trainable. The core modules are:

  1. Planner: Qwen3-8B fine-tuned with LoRA; at each step it outputs JSON containing thought, action, and action_input;
  2. Executor: dispatches tool calls (e.g., Tavily search, a sandboxed Python REPL);
  3. Verifier: decides whether the loop should continue;
  4. Memory: manages in-task state, with Qdrant integration for cross-task persistence planned.

The training goal is for the model to make better decisions at each step, rather than merely to guess the final answer correctly.
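
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `PlannerStep`, `Memory`, `run_episode`, and the three stub callables are hypothetical names, and the stubs stand in for the real LLM Planner, tool Executor, and Verifier. Only the three JSON fields (thought, action, action_input) come from the description.

```python
import json
from dataclasses import dataclass, field

@dataclass
class PlannerStep:
    thought: str
    action: str
    action_input: str

@dataclass
class Memory:
    # In-task state only: a list of (step, observation) pairs.
    steps: list = field(default_factory=list)

def run_episode(planner, executor, verifier, question, max_steps=5):
    """Run Planner -> Executor -> Verifier until the verifier says stop."""
    memory = Memory()
    for _ in range(max_steps):
        raw = planner(question, memory)           # JSON string from the Planner
        step = PlannerStep(**json.loads(raw))     # thought / action / action_input
        observation = executor(step)              # tool call result
        memory.steps.append((step, observation))
        if not verifier(question, memory):        # False means: stop looping
            break
    return memory

# Hypothetical stubs standing in for the real components.
def stub_planner(question, memory):
    if not memory.steps:
        return json.dumps({"thought": "need a calculation",
                           "action": "python", "action_input": "print(2 + 3)"})
    return json.dumps({"thought": "result is available",
                       "action": "finish", "action_input": "5"})

def stub_executor(step):
    return f"ran {step.action} with input {step.action_input!r}"

def stub_verifier(question, memory):
    return memory.steps[-1][0].action != "finish"

memory = run_episode(stub_planner, stub_executor, stub_verifier, "What is 2 + 3?")
```

Because only the Planner is trainable, the Executor and Verifier can be treated as fixed parts of the environment during rollouts.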

Section 04

Technical Innovations: DAPO Algorithm and Learned Process Reward Model

Contribution 1: Complete DAPO implementation. A dynamic sampling module is added to the GRPOTrainer from TRL 1.4: before training, each candidate prompt is sampled G times, the rollouts are scored with the PRM, and prompts whose rewards show near-zero spread (pstdev < 1e-3) are discarded, so that every retained group carries a usable training signal.
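
A minimal sketch of that filtering step, under stated assumptions: `dynamic_sample`, `rollout`, and `score` are hypothetical names, the stubs replace the real policy and PRM, and the sketch is independent of TRL. Only the group size G and the pstdev < 1e-3 threshold come from the description.

```python
from itertools import count
from statistics import pstdev

def dynamic_sample(prompts, rollout, score, group_size=4, min_std=1e-3):
    """Keep only prompts whose G sampled rewards actually differ."""
    kept = []
    for prompt in prompts:
        completions = [rollout(prompt) for _ in range(group_size)]
        rewards = [score(prompt, c) for c in completions]
        if pstdev(rewards) < min_std:
            continue  # uniform rewards: no learning signal, skip this prompt
        kept.append((prompt, completions, rewards))
    return kept

# Hypothetical fixed rewards so the example is deterministic.
FAKE_REWARDS = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],
    "too_hard": [0.0, 0.0, 0.0, 0.0],
    "useful": [0.2, 0.9, 0.4, 0.7],
}
_counters = {}

def stub_rollout(prompt):
    # Returns 0, 1, 2, ... per prompt; stands in for a sampled trajectory.
    return next(_counters.setdefault(prompt, count()))

def stub_score(prompt, completion):
    # Stands in for the PRM score of that trajectory.
    return FAKE_REWARDS[prompt][completion]

kept = dynamic_sample(list(FAKE_REWARDS), stub_rollout, stub_score)
kept_prompts = [prompt for prompt, _, _ in kept]
```

Only the prompt with varied rewards survives; the two uniform groups are dropped before they reach the trainer.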

Contribution 2: Learned PRM. The PRM does not rely on heuristic rules and is built in four stages: (1) collect AIME trajectories from the untrained agent; (2) use DeepSeek as an LLM judge to label each step automatically with a score between 0 and 1 (cost under $1); (3) train a sequence regression head on Qwen3-0.6B with MSE loss; (4) score Planner actions in real time, assigning 0 to output with format errors and a value in the 0-1 range otherwise. Note that the PRM evaluates the decision itself and intentionally excludes tool return results.
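
The scoring rule in stage four, and the loss in stage three, might look roughly like this. It is a sketch under assumptions: `prm_step_reward`, `mse_loss`, and the `regress` callable are hypothetical, and `regress` stands in for the Qwen3-0.6B regression head. Note that only the Planner's own output is passed in, never the tool result.

```python
import json

REQUIRED_KEYS = ("thought", "action", "action_input")

def prm_step_reward(raw_action, regress):
    """Return 0.0 for malformed Planner output, else a score clamped to [0, 1]."""
    try:
        step = json.loads(raw_action)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(step, dict) or any(k not in step for k in REQUIRED_KEYS):
        return 0.0
    score = float(regress(step))        # assumed: regression head output
    return min(1.0, max(0.0, score))    # keep the reward in the 0-1 range

def mse_loss(predictions, targets):
    """Mean squared error between PRM predictions and LLM-judge labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical regressor that overshoots, to show the clamp.
overshooting = lambda step: 1.3
valid = '{"thought": "try sympy", "action": "python", "action_input": "1 + 1"}'

reward_valid = prm_step_reward(valid, overshooting)
reward_garbage = prm_step_reward("not json at all", overshooting)
reward_missing = prm_step_reward('{"thought": "no action"}', overshooting)
loss = mse_loss([0.5, 1.0], [1.0, 1.0])
```

Giving malformed output a hard zero means format discipline is learned through the same reward channel as decision quality.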

Section 05

Engineering Practice Highlights: Performance Optimization and Robustness Design

  1. 53x Performance Improvement: Switching the Ollama endpoint from /v1 to the native /api/chat fixed an issue where the think:false parameter was ignored, cutting a single call from 11 minutes 27 seconds (687 s) to 13 seconds;
  2. Syntax-Constrained Output: The Planner and Verifier use Pydantic schemas to enforce structured output, with retries and a graceful fallback so that malformed output does not crash the run;
  3. Sandboxed Python REPL: Supports math libraries such as sympy (needed for AIME problems), automatically prints the final expression, and tolerates indentation errors;
  4. Leakage Prevention Evaluation: Training data uses AIME 1983-2023 (918 questions), deduplicated against the 2024 test set to ensure no data leakage.
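
The retry-and-fallback idea from item 2 can be sketched as follows. The project is described as using Pydantic schemas; to stay dependency-free this sketch substitutes a standard-library dataclass, so it is a stand-in rather than the project's implementation. `PlannerAction`, `parse_action`, `call_with_retry`, and `FALLBACK` are hypothetical names.

```python
import json
from dataclasses import dataclass

@dataclass
class PlannerAction:
    thought: str
    action: str
    action_input: str

# Assumed safe default when every attempt fails to parse.
FALLBACK = PlannerAction(thought="parse failure", action="finish", action_input="")

def parse_action(raw):
    """Validate raw model text against the three expected fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return PlannerAction(**{k: str(data[k])
                            for k in ("thought", "action", "action_input")})

def call_with_retry(generate, max_attempts=3):
    """Re-query the model on malformed output; degrade to FALLBACK, never raise."""
    for attempt in range(max_attempts):
        try:
            return parse_action(generate(attempt))
        except (ValueError, KeyError, TypeError):  # JSONDecodeError is a ValueError
            continue
    return FALLBACK

# Hypothetical model replies: two malformed, then one valid.
replies = ["oops", '{"thought": "x"}',
           '{"thought": "use sympy", "action": "python", "action_input": "2**10"}']
recovered = call_with_retry(lambda attempt: replies[attempt])
degraded = call_with_retry(lambda attempt: "still not json")
```

The first call recovers on the third attempt, while the second exhausts its retries and returns the fallback action instead of crashing the rollout.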

Section 06

Comparative Analysis with Original AgentFlow

Feature           | AgentFlow (Paper)          | AgentFlow-Pro
------------------+----------------------------+---------------------------------------------
Base Model        | Qwen2.5-7B                 | Qwen3-8B (bf16 + LoRA)
RL Algorithm      | Flow-GRPO (Outcome Reward) | DAPO (Decoupled Clipping + Dynamic Sampling)
Credit Assignment | Trajectory-level           | Step-level (via Learned PRM)
Reward Model      | None                       | Qwen3-0.6B Regression Head
Tool Layer        | Custom Implementation      | FastMCP Server + Sandboxed Python
LLM Service       | Not Specified              | Ollama Native /api/chat
Memory System     | In-task                    | In-task + Qdrant Cross-task (Planned)

Section 07

Practical Significance and Application Prospects

The value of AgentFlow-Pro lies in providing a reproducible path toward reliable multi-step reasoning agents: 1. Precise error localization (a failure can be traced back to a specific step); 2. Better training efficiency (dynamic sampling filters out prompts that carry no signal); 3. Lower annotation cost (the LLM judge is used once for labeling, after which training relies on the lightweight PRM); 4. Support for tool learning (fine-grained supervision makes it easier to learn when to call a tool). It gives domain developers a complete open-source pipeline from data collection to deployment.

Section 08

Key Insights and Open Source Information

Core insight of AgentFlow-Pro: the quality of the reinforcement learning signal matters more than its quantity. By converting sparse trajectory-level feedback into dense step-level supervision and combining it with dynamic sampling, even an 8B model can make progress in multi-step reasoning, which offers a feasible path for resource-constrained users. The project is open-sourced under the MIT license, with a clear code structure and complete documentation, making it a useful resource for learning process-supervised reinforcement learning.