Reading

AgentFlow-Pro: A Multi-step Reasoning Agent Training Framework Based on Process-Supervised Reinforcement Learning

AgentFlow-Pro is a ground-up reimplementation of AgentFlow, introducing a learned Process Reward Model (PRM) and the DAPO algorithm. It upgrades multi-step reasoning agent training from trajectory-level feedback to step-by-step fine-grained supervision, significantly improving credit assignment efficiency.

强化学习过程奖励模型DAPOAgentFlow多步推理智能体训练Qwen3PRMLLM开源项目

Published 2026-05-30 16:55Recent activity 2026-05-30 17:22Estimated read 9 min

AgentFlow-Pro: A Multi-step Reasoning Agent Training Framework Based on Process-Supervised Reinforcement Learning

Section 01

AgentFlow-Pro Framework Guide: Process-Supervised Reinforcement Learning Boosts Multi-step Reasoning Efficiency

AgentFlow-Pro is a modern, ground-up implementation of the ICLR 2026 paper AgentFlow. Its core innovations include introducing a learned Process Reward Model (PRM) and the DAPO algorithm, upgrading multi-step reasoning agent training from trajectory-level feedback to step-by-step fine-grained supervision, which significantly improves credit assignment efficiency. This framework addresses the flaws of current outcome-oriented reward mechanisms in multi-step reasoning (e.g., inability to distinguish between good and bad steps, zero-gradient issues), providing a reproducible path for building reliable multi-step reasoning agents.

Section 02

Research Background: Limitations of Outcome-Oriented Rewards in Multi-step Reasoning

Current mainstream reinforcement learning methods for large language models (e.g., GRPO, Flow-GRPO) use outcome-oriented reward mechanisms, providing only a single feedback based on the final answer at the end of the task. This design works for single-step tasks but has serious flaws in multi-step reasoning: it cannot distinguish between the quality of decisions at each step (e.g., in a five-step reasoning process, the first step is excellent and the third is redundant, but both get the same gradient); when all trajectories are either all correct or all wrong for a certain prompt, zero gradients are generated, wasting computing resources. AgentFlow-Pro was created to solve this problem.

Section 03

Project Overview and Core Architecture Design

AgentFlow-Pro adopts a Planner→Executor→Verifier loop architecture, with only the Planner module being trainable. The core modules include: 1. Planner: Fine-tuned based on Qwen3-8B + LoRA, outputting JSON containing thought, action, and action_input at each step; 2. Executor: Schedules tools (e.g., Tavily search, sandboxed Python REPL); 3. Verifier: Determines whether to continue the loop; 4. Memory: Manages in-task state, with plans to integrate Qdrant for cross-task persistence in the future. The training goal is to enable the model to make better decisions at each step, rather than just guessing the final answer correctly.

Section 04

Technical Innovations: DAPO Algorithm and Learned Process Reward Model

Contribution 1: Complete DAPO Implementation: Added a dynamic sampling module to the GRPOTrainer from TRL 1.4—before training, sample candidate prompts G times, score them with PRM, and discard samples with near-zero variance (pstdev < 1e-3) to ensure effective training signals.

Contribution 2: Learned PRM: Does not rely on heuristic rules, trained through four stages: collect AIME trajectories of untrained agents → use DeepSeek as LLM Judge to automatically label steps (0-1 score, cost < $1) → train a sequence regression head based on Qwen3-0.6B (MSE loss) → score Planner actions in real time (0 points for format errors, 0-1 range otherwise). Note: PRM evaluates the decision itself and intentionally excludes tool return results.

Section 05

Engineering Practice Highlights: Performance Optimization and Robustness Design

53x Performance Improvement: Switched Ollama endpoint from /v1 to native /api/chat, fixed the issue where the think:false parameter was ignored, reducing a single call from 11 minutes 27 seconds to 13 seconds;
Syntax-Constrained Output: Planner/Verifier use Pydantic schema to ensure structured output, with a retry degradation mechanism to avoid crashes;
Sandboxed Python REPL: Supports math libraries like sympy (meeting AIME requirements), automatically prints the final expression, and tolerates indentation errors;
Leakage Prevention Evaluation: Training data uses AIME 1983-2023 (918 questions), deduplicated with the 2024 test set to ensure no data leakage.

Section 06

Comparative Analysis with Original AgentFlow

Feature	AgentFlow (Paper)	AgentFlow-Pro
Base Model	Qwen2.5-7B	Qwen3-8B (bf16 + LoRA)
RL Algorithm	Flow-GRPO (Outcome Reward)	DAPO (Decoupled Clipping + Dynamic Sampling)
Credit Assignment	Trajectory-level	Step-level (via Learned PRM)
Reward Model	None	Qwen3-0.6B Regression Head
Tool Layer	Custom Implementation	FastMCP Server + Sandboxed Python
LLM Service	Not Specified	Ollama Native /api/chat
Memory System	In-task	In-task + Qdrant Cross-task (Planned)

Section 07

Practical Significance and Application Prospects

The value of AgentFlow-Pro lies in providing a reproducible path for reliable multi-step reasoning agents: 1. Precisely locate errors (trace back to specific step issues); 2. Improve training efficiency (dynamic sampling filters invalid samples); 3. Reduce annotation costs (LLM Judge is used once, training relies on a lightweight PRM); 4. Support tool learning (fine-grained supervision makes it easy to learn when to call tools). It provides domain developers with a complete open-source pipeline from data collection to deployment.

Section 08

Key Insights and Open Source Information

Core Insight of AgentFlow-Pro: The quality of reinforcement learning signals is more important than quantity. By converting sparse trajectory-level feedback into dense step-level supervision + dynamic sampling, even an 8B model can make progress in multi-step reasoning, providing a feasible path for resource-constrained users. The project is open-sourced under the MIT license, with clear code structure and complete documentation, making it a high-quality resource for learning process-supervised reinforcement learning.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

Folkering OS: When the Operating System Itself Is AI—A Self-Evolving Bare-Metal Rust System

Folkering OS is the world's first AI-native bare-metal operating system, entirely written in Rust no_std without relying on Linux, POSIX, or libc. It can generate commands from scratch, compile them into WASM, and run them in 10 seconds, achieving true self-evolution.

Recent activity 2026-04-09 16:15