# AgentFlow-Pro: A Multi-step Reasoning Agent Training Framework Based on Process-Supervised Reinforcement Learning

> AgentFlow-Pro is a ground-up reimplementation of AgentFlow, introducing a learned Process Reward Model (PRM) and the DAPO algorithm. It upgrades multi-step reasoning agent training from trajectory-level feedback to step-by-step fine-grained supervision, significantly improving credit assignment efficiency.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-05-30T08:55:58.000Z
- 最近活动: 2026-05-30T09:22:32.306Z
- 热度: 163.6
- 关键词: 强化学习, 过程奖励模型, DAPO, AgentFlow, 多步推理, 智能体训练, Qwen3, PRM, LLM, 开源项目
- 页面链接: https://www.zingnex.cn/en/forum/thread/agentflow-pro
- Canonical: https://www.zingnex.cn/forum/thread/agentflow-pro
- Markdown 来源: floors_fallback

---

## AgentFlow-Pro Framework Guide: Process-Supervised Reinforcement Learning Boosts Multi-step Reasoning Efficiency

AgentFlow-Pro is a modern, ground-up implementation of the ICLR 2026 paper *AgentFlow*. Its core innovations include introducing a learned Process Reward Model (PRM) and the DAPO algorithm, upgrading multi-step reasoning agent training from trajectory-level feedback to step-by-step fine-grained supervision, which significantly improves credit assignment efficiency. This framework addresses the flaws of current outcome-oriented reward mechanisms in multi-step reasoning (e.g., inability to distinguish between good and bad steps, zero-gradient issues), providing a reproducible path for building reliable multi-step reasoning agents.

## Research Background: Limitations of Outcome-Oriented Rewards in Multi-step Reasoning

Current mainstream reinforcement learning methods for large language models (e.g., GRPO, Flow-GRPO) use outcome-oriented reward mechanisms, providing only a single feedback based on the final answer at the end of the task. This design works for single-step tasks but has serious flaws in multi-step reasoning: it cannot distinguish between the quality of decisions at each step (e.g., in a five-step reasoning process, the first step is excellent and the third is redundant, but both get the same gradient); when all trajectories are either all correct or all wrong for a certain prompt, zero gradients are generated, wasting computing resources. AgentFlow-Pro was created to solve this problem.

## Project Overview and Core Architecture Design

AgentFlow-Pro adopts a Planner→Executor→Verifier loop architecture, with only the Planner module being trainable. The core modules include: 1. Planner: Fine-tuned based on Qwen3-8B + LoRA, outputting JSON containing thought, action, and action_input at each step; 2. Executor: Schedules tools (e.g., Tavily search, sandboxed Python REPL); 3. Verifier: Determines whether to continue the loop; 4. Memory: Manages in-task state, with plans to integrate Qdrant for cross-task persistence in the future. The training goal is to enable the model to make better decisions at each step, rather than just guessing the final answer correctly.

## Technical Innovations: DAPO Algorithm and Learned Process Reward Model

**Contribution 1: Complete DAPO Implementation**: Added a dynamic sampling module to the GRPOTrainer from TRL 1.4—before training, sample candidate prompts G times, score them with PRM, and discard samples with near-zero variance (pstdev < 1e-3) to ensure effective training signals.

**Contribution 2: Learned PRM**: Does not rely on heuristic rules, trained through four stages: collect AIME trajectories of untrained agents → use DeepSeek as LLM Judge to automatically label steps (0-1 score, cost < $1) → train a sequence regression head based on Qwen3-0.6B (MSE loss) → score Planner actions in real time (0 points for format errors, 0-1 range otherwise). Note: PRM evaluates the decision itself and intentionally excludes tool return results.

## Engineering Practice Highlights: Performance Optimization and Robustness Design

1. **53x Performance Improvement**: Switched Ollama endpoint from /v1 to native /api/chat, fixed the issue where the think:false parameter was ignored, reducing a single call from 11 minutes 27 seconds to 13 seconds;
2. **Syntax-Constrained Output**: Planner/Verifier use Pydantic schema to ensure structured output, with a retry degradation mechanism to avoid crashes;
3. **Sandboxed Python REPL**: Supports math libraries like sympy (meeting AIME requirements), automatically prints the final expression, and tolerates indentation errors;
4. **Leakage Prevention Evaluation**: Training data uses AIME 1983-2023 (918 questions), deduplicated with the 2024 test set to ensure no data leakage.

## Comparative Analysis with Original AgentFlow

| Feature | AgentFlow (Paper) | AgentFlow-Pro |
|---|---|---|
| Base Model | Qwen2.5-7B | Qwen3-8B (bf16 + LoRA) |
| RL Algorithm | Flow-GRPO (Outcome Reward) | DAPO (Decoupled Clipping + Dynamic Sampling) |
| Credit Assignment | Trajectory-level | Step-level (via Learned PRM) |
| Reward Model | None | Qwen3-0.6B Regression Head |
| Tool Layer | Custom Implementation | FastMCP Server + Sandboxed Python |
| LLM Service | Not Specified | Ollama Native /api/chat |
| Memory System | In-task | In-task + Qdrant Cross-task (Planned) |

## Practical Significance and Application Prospects

The value of AgentFlow-Pro lies in providing a reproducible path for reliable multi-step reasoning agents: 1. Precisely locate errors (trace back to specific step issues); 2. Improve training efficiency (dynamic sampling filters invalid samples); 3. Reduce annotation costs (LLM Judge is used once, training relies on a lightweight PRM); 4. Support tool learning (fine-grained supervision makes it easy to learn when to call tools). It provides domain developers with a complete open-source pipeline from data collection to deployment.

## Key Insights and Open Source Information

Core Insight of AgentFlow-Pro: The quality of reinforcement learning signals is more important than quantity. By converting sparse trajectory-level feedback into dense step-level supervision + dynamic sampling, even an 8B model can make progress in multi-step reasoning, providing a feasible path for resource-constrained users. The project is open-sourced under the MIT license, with clear code structure and complete documentation, making it a high-quality resource for learning process-supervised reinforcement learning.