Zing Forum


Panoramic View of Large Model Reinforcement Learning Papers: The awesome-agentic Repository Organizes Four Cutting-Edge Directions

The awesome-agentic repository maintained by yingyingxia666 systematically organizes over 200 large model reinforcement learning papers, categorized into four cutting-edge directions: Reasoning RL, Agentic RL, OPD, and Multi-Agent. It is an essential resource for LLM RL research.

Tags: Large Model Reinforcement Learning, LLM RL, Reasoning RL, Agentic RL, GRPO, Process Reward Model, PRM, DeepSeek-R1, Paper Survey
Published 2026-05-25 20:39 | Recent activity 2026-05-25 20:49 | Estimated read 8 min

Section 01

Panoramic View of Large Model Reinforcement Learning Papers: Guide to the Core Value of the awesome-agentic Repository

The GitHub repository awesome-agentic maintained by yingyingxia666 systematically organizes over 200 large model reinforcement learning (LLM RL) papers, categorized into four cutting-edge directions: Reasoning RL, Agentic RL, OPD, and Multi-Agent. It provides a structured knowledge map for researchers and is an essential resource in the LLM RL field.

Section 02

Repository Background and Basic Information

Reinforcement learning (RL) for large language models is growing explosively, and with so many subfields and papers, researchers can easily lose track of how the work fits together. The awesome-agentic repository addresses this problem:

  • Maintainer: yingyingxia666
  • Source: GitHub (link: https://github.com/yingyingxia666/awesome-agentic)
  • Included: Over 200 papers from January 2023 to May 2026
  • Last updated: May 2026

The repository's structured categorization helps readers quickly locate a subfield and understand how papers relate to one another.

Section 03

Cutting-Edge Direction 1: Reasoning Reinforcement Learning (Reasoning RL)

Focuses on single-turn long chain-of-thought reasoning tasks (math, code, formal proof, etc.). The core challenge is the generation and self-correction of long reasoning chains. Key technologies:

  1. RLVR (Reinforcement Learning with Verifiable Rewards): Uses automatically checkable signals (e.g., whether a math answer is correct) as rewards, reducing annotation costs. Representative works: DeepSeek-R1, Tülu3;
  2. GRPO and its variants: A critic-free algorithm proposed in DeepSeekMath, followed by DAPO (asymmetric clipping), VAPO (length-adaptive GAE), and Dr. GRPO (fixing length-normalization bias);
  3. Process Reward Model (PRM): Provides fine-grained step-level feedback, evolving from manual annotation (PRM800K) to automatic annotation (OmegaPRM, Math-Shepherd) and then to implicit process reward theory (Free Process Rewards).
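
As a rough illustration of how the first two items fit together, the sketch below scores sampled answers with a binary verifiable reward and then normalizes the rewards within the group, which is the critic-free idea behind GRPO. The function names, the exact-match check, and the use of the population standard deviation are assumptions made for this example; real implementations differ in answer parsing and normalization details.

```python
import statistics

def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary reward (assumed exact-match check): 1.0 if the answer matches the reference."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std, so no critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt whose reference answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Correct samples receive positive advantages and incorrect ones negative advantages. If every sample in a group gets the same reward, all advantages are zero and the group contributes no learning signal.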

Section 04

Cutting-Edge Direction 2: Agentic Reinforcement Learning (Agentic RL)

Focuses on multi-turn interaction tasks (tool use, web browsing, GUI operations, etc.), characterized by partial observability and long horizons. Core challenges and works:

  • Tool use and multi-turn interaction: SWE-RL, ToolRL, and Search-R1 explore tool calling, where credit assignment across turns is the main difficulty;
  • GUI and computer operations: GiGPO, SWEET-RL extend to graphical interface operations, requiring visual perception and action decision-making;
  • Memory and long-term planning: RAGEN, HCAPO focus on multi-turn memory maintenance and long-span planning.
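
To make the credit assignment difficulty concrete, here is a minimal sketch of the simplest baseline: spreading a sparse end-of-episode reward back over earlier turns with a discounted return-to-go. This is a generic RL baseline, not the method of any paper listed above; the function name and the discount value are assumptions for illustration.

```python
def discounted_returns(turn_rewards, gamma=0.95):
    """Return-to-go for each turn: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the discounted future reward.
    for r in reversed(turn_rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# A four-turn episode where only the final turn is rewarded (task solved).
episode = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(episode, gamma=0.5)
print(returns)
```

Every earlier turn receives some credit purely because of its distance from the final reward, regardless of whether that turn actually helped. That is the weakness finer-grained credit assignment methods try to address.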

Section 05

Cutting-Edge Direction 3: OPD (Off-Policy/On-Policy Distillation/Drift)

Focuses on training stability and technical details, which are critical for practical deployment. Key topics:

  1. Off-Policy and Importance Sampling: GSPO, MinPRO, M2PO explore IS clipping strategies to balance sample utilization and stability;
  2. Asynchronous training and system optimization: Asynchronous architectures for large-scale RL training, in which generators sample while learners update in parallel, require efficient pipelines and memory optimization;
  3. Policy drift monitoring: AReaL, IcePop propose methods to monitor and mitigate policy drift (e.g., length explosion, repeated loops).
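
The trade-off in item 1 can be sketched with the standard PPO-style clipped surrogate, in which the importance sampling ratio between the current policy and the policy that generated the sample is clipped to a trust interval. This is a generic per-sample form written for illustration, not the specific objective of GSPO, MinPRO, or M2PO; the parameter names `eps_low` and `eps_high` are assumptions.

```python
import math

def clipped_is_objective(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate for a single sample.

    logp_new: log-probability under the policy being trained
    logp_old: log-probability under the policy that generated the sample
    """
    ratio = math.exp(logp_new - logp_old)  # importance sampling weight
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic estimate of the two.
    return min(ratio * advantage, clipped * advantage)

# The new policy assigns 1.5x the old probability to this sample.
gain = clipped_is_objective(math.log(1.5), 0.0, advantage=1.0)
loss = clipped_is_objective(math.log(1.5), 0.0, advantage=-1.0)
print(gain, loss)
```

With a positive advantage the ratio is capped at `1 + eps_high`, so stale samples cannot push the policy arbitrarily far; with a negative advantage the unclipped term is kept, so the penalty is not softened. Widening the interval reuses off-policy samples more aggressively at the cost of stability.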

Section 06

Cutting-Edge Direction 4: Multi-Agent Reinforcement Learning (Multi-Agent)

Explores multi-LLM collaboration, competition, or self-play. Core scenarios:

  1. Collaboration and debate: The LLM Debate series improves reasoning accuracy by having models critique one another's answers;
  2. Self-play and self-improvement: AlphaLLM, rStar-Math generate new data via self-play, forming a data flywheel;
  3. Coordinators and game theory: FlowReasoner, eva introduce coordination mechanisms to resolve multi-agent conflicts.
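
The debate scenario in item 1 can be sketched as a loop in which each agent sees its peers' previous answers, may revise its own, and a majority vote picks the final answer. This is a toy protocol under stated assumptions, not the procedure of any listed paper: the agents are plain Python callables standing in for LLM calls, and `stubborn` and `follower` are hypothetical helpers invented for the example.

```python
from collections import Counter

def debate(agents, question, rounds=2):
    """Run a simple debate: agents revise after seeing peers, then majority vote."""
    # Round 0: every agent answers independently (no peer answers yet).
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]

def stubborn(answer):
    """Hypothetical agent that never changes its answer."""
    return lambda question, peers: answer

def follower(initial):
    """Hypothetical agent that adopts the most common peer answer once it sees peers."""
    def agent(question, peers):
        if not peers:
            return initial
        return Counter(peers).most_common(1)[0][0]
    return agent

agents = [stubborn("12"), stubborn("12"), follower("15")]
final = debate(agents, "What is 3 * 4?")
print(final)
```

In a real system each callable would wrap a model call and the peer answers would be included in the prompt. The vote at the end is one possible aggregation rule among several.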

Section 07

Technical Trends and Recommendations for Researchers

Technical Trends:

  1. Critic-Free vs Critic-Based Tug-of-War: GRPO (critic-free) avoids training a separate value model, while VAPO (critic-based) keeps one for finer-grained advantage estimates; each has its advantages;
  2. Automatic Annotation and Synthetic Data: Math-Shepherd, OmegaPRM, etc., explore automatic construction of process supervision signals;
  3. Training-Inference Consistency: TIM research focuses on the inconsistency between training greedy decoding and inference sampling.

Recommendations for Researchers:

  1. Getting Started: Read the technical reports of DeepSeek-R1 and Tülu3 to understand the RLVR paradigm;
  2. In-Depth Study: Choose a direction and read surveys (e.g., PRM Survey);
  3. Follow-Up: Pay attention to the latest works like DAPO, VAPO, Magistral;
  4. Practice: Reproduce SimpleRL-Zoo experiments to build intuition.

Section 08

Summary of Repository Value and Recommendations

The awesome-agentic repository not only collects over 200 papers but also provides a framework for understanding the field: Reasoning RL pursues single-turn depth, Agentic RL expands multi-turn breadth, OPD solidifies training foundations, and Multi-Agent explores collective intelligence. For LLM RL researchers, it is a rare map of the field. We recommend bookmarking it and revisiting it regularly for updates.