Zing Forum


Panoramic View of Agentic AI Infrastructure: A Deep Review of 72 Top Conference Papers

This community-maintained review systematically organizes 72 top conference papers from 2023 to 2026, comprehensively covering infrastructure optimization techniques for Agentic LLM workloads.

Tags: Agentic AI · LLM infrastructure · KV Cache · Prefill-Decode separation · systematic review · top-conference papers · inference optimization
Published 2026-03-29 12:16 · Recent activity 2026-03-29 12:23 · Estimated read 8 min

Section 01

Panoramic View of Agentic AI Infrastructure: A Deep Review of 72 Top Conference Papers (Introduction)

This is an open-source review maintained by the community, systematically organizing 72 Agentic AI infrastructure papers published from 2023 to 2026 at 12 top conferences, including OSDI, SOSP, and ISCA. The review covers seven key technical areas, among them workload characterization, Prefill-Decode separation, and KV Cache management, and provides an interactive Chinese-English web interface (https://hungchun0201.github.io/agentic-ai-survey/) that serves as a technical map for researchers and engineers. Agentic AI systems (e.g., AutoGPT, Claude Computer Use) pose new challenges to infrastructure through features like multi-turn dialogue and tool calling, and this review aims to help address those challenges.


Section 02

The Rise of Agentic AI and Workload Characteristics

Since 2023, Agentic AI systems (LLM-centered intelligent agents) have rapidly become a research focus, with representative examples including AutoGPT, Claude's Computer Use, Devin, and various coding assistants. Compared with traditional LLM inference, agentic workloads involve multi-turn dialogue, tool calling, long-context retention, and dynamic task planning, all of which pose new challenges to the underlying infrastructure. The S1 area of the review (5 papers) focuses on workload characterization, covering traffic patterns, CPU bottleneck identification and optimization, and system sustainability assessment, providing the data foundation for the optimizations that follow.
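To make these workload characteristics concrete, here is a minimal sketch of an agentic inference loop in which the conversation context only ever grows and a tool call pauses generation mid-request. `llm` and `run_tool` are hypothetical stand-ins for a real model server and tool runtime, not any system surveyed here.

```python
# Minimal sketch of an agentic inference loop. `llm` and `run_tool`
# are hypothetical stand-ins for a real model server and tool runtime.
def llm(context):
    # Pretend the model asks for one tool call, then finishes.
    if not any(m["role"] == "tool" for m in context):
        return {"type": "tool_call", "name": "search", "args": {"q": "agentic ai"}}
    return {"type": "answer", "text": "done"}

def run_tool(name, args):
    return f"results for {args['q']}"

def agent(task, max_turns=8):
    context = [{"role": "user", "content": task}]  # context only ever grows
    for _ in range(max_turns):
        out = llm(context)
        if out["type"] == "tool_call":
            # Inference pauses here; the KV cache built for `context`
            # sits idle until the tool returns -- the pause problem
            # that the S4 papers below target.
            obs = run_tool(out["name"], out["args"])
            context.append({"role": "tool", "content": obs})
        else:
            return out["text"], len(context)
    return None, len(context)
```

Even this toy loop shows why agentic traffic differs from one-shot inference: every turn re-submits a longer context, and GPU-resident state idles across tool-call round trips.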


Section 03

Core Optimization Directions: Prefill-Decode Separation and KV Cache Management

The S2 area of the review (13 papers) focuses on Prefill-Decode separation, a currently hot optimization direction. In traditional LLM inference, the compute-bound prefill phase and the memory-bandwidth-bound decode phase share the same GPUs and interfere with each other. Representative works include DistServe (OSDI'24, throughput optimization), Splitwise (ISCA'24, scheduling strategy), and Mooncake (FAST'25, best paper, a KV Cache-centric disaggregated architecture). The S3 area (18 papers) focuses on KV Cache management, the memory bottleneck of LLM inference, with key works including vLLM (SOSP'23, PagedAttention), SGLang (NeurIPS'24, RadixAttention prefix caching), and CacheBlend (EuroSys'25, non-prefix KV reuse).
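As a rough illustration of the PagedAttention idea behind vLLM, the toy sketch below maps each sequence's growing KV cache onto fixed-size physical blocks allocated on demand, so no sequence needs contiguous memory. The class and field names are invented for illustration and are not vLLM's actual API.

```python
# Toy sketch of PagedAttention-style KV cache management: logical token
# positions map through a per-sequence block table to fixed-size
# physical blocks allocated on demand. Names are illustrative only.
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> token count

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                     # current block is full
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())    # grab a new block
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        # When a sequence finishes, all its blocks return to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The design point this captures is that fragmentation is bounded to less than one block per sequence, which is what lets paged serving pack many more concurrent sequences into the same GPU memory.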


Section 04

Advanced Optimizations: KV Lifecycle, Scheduling, and Adjacent Technologies

The S4 area (4 papers) addresses the inference pauses caused by tool calling in agentic scenarios, studying KV Cache lifecycle management; examples include InferCept (ICML'24, KV retention during tool calls) and Concur (AIMD admission control). The S5 area (11 papers) focuses on scheduling and routing for multi-agent collaboration, with representative works including Autellix (program-level DAG scheduling) and Preble (ICLR'25, cluster-level KV-aware scheduling). The S6 area (10 papers) applies learning-based methods to caching policy, such as LeCaR (regret-minimizing weighting of eviction experts) and RLCache (multi-task reinforcement learning). The S7 area (11 papers) covers adjacent optimizations, such as Sarathi-Serve (OSDI'24, chunked prefill) and FlashInfer (MLSys'25, best paper, a customizable attention engine).
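To illustrate the regret-minimizing flavor of LeCaR-style caching, here is a heavily simplified sketch, not the published algorithm's exact update rule: two eviction experts, LRU and LFU, carry weights that shrink multiplicatively whenever a miss lands in that expert's ghost history of past evictions.

```python
import random
from collections import OrderedDict

# Simplified sketch of LeCaR-style regret-minimizing caching (not the
# paper's exact update rule): two eviction experts, LRU and LFU, are
# chosen probabilistically by weight; a weight shrinks when a miss hits
# that expert's ghost history, i.e. when its past eviction was a mistake.
class LeCaRCache:
    def __init__(self, capacity, discount=0.9, seed=0):
        self.cap = capacity
        self.cache = OrderedDict()                 # key -> access count
        self.hist = {"lru": set(), "lfu": set()}   # ghost entries per expert
        self.w = {"lru": 0.5, "lfu": 0.5}
        self.discount = discount
        self.rng = random.Random(seed)

    def _victim(self, policy):
        if policy == "lru":
            return next(iter(self.cache))          # least recently used
        return min(self.cache, key=self.cache.get) # least frequently used

    def access(self, key):
        if key in self.cache:
            self.cache[key] += 1
            self.cache.move_to_end(key)
            return True                            # hit
        for p in ("lru", "lfu"):                   # regret update on a miss
            if key in self.hist[p]:
                self.w[p] *= self.discount         # penalize the erring expert
                self.hist[p].discard(key)
        total = self.w["lru"] + self.w["lfu"]
        self.w = {p: v / total for p, v in self.w.items()}
        if len(self.cache) >= self.cap:
            policy = "lru" if self.rng.random() < self.w["lru"] else "lfu"
            victim = self._victim(policy)
            del self.cache[victim]
            self.hist[policy].add(victim)          # remember who evicted it
        self.cache[key] = 1
        return False                               # miss
```

The appeal of this family of policies for agentic KV caching is that the weighting adapts online to whichever access pattern the workload currently exhibits, without offline training.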


Section 05

Top Conference Paper Evidence and Key Contributions

The 72 papers included in this review all come from top conferences: OSDI, SOSP, ISCA, FAST, MLSys, NeurIPS, ICML, EuroSys, ASPLOS, NSDI, ATC, and SIGCOMM. Among them, Mooncake (FAST'25) and FlashInfer (MLSys'25) won best paper awards, the PagedAttention mechanism of vLLM (SOSP'23) opened a new direction in KV Cache management, and DistServe (OSDI'24) pushed Prefill-Decode separation into practical use. Together these papers provide solid academic evidence for Agentic AI infrastructure optimization.


Section 06

Technical Trends in Agentic AI Infrastructure

The review's analysis surfaces four major technical trends:

1. From unified to disaggregated architectures: Prefill-Decode separation has become a consensus, and more dedicated phases may be split out in the future.
2. KV Cache as the core optimization target: its management grows more complex in agentic scenarios, and research remains active.
3. Intelligent scheduling decisions: a shift from static heuristics to learning-based dynamic strategies.
4. Multi-agent collaboration optimization: single-agent optimization is maturing, and multi-agent collaboration will become the next hot topic.
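Trend 1 can be sketched in a few lines: a prefill worker builds the KV cache for the whole prompt in one compute-bound pass, then hands it to a separate decode worker that generates tokens one at a time. All names here are hypothetical stand-ins; real systems such as DistServe or Mooncake transfer the KV cache across GPU pools over NVLink or RDMA rather than in-process.

```python
# Minimal sketch of Prefill-Decode separation. All names are hypothetical
# stand-ins; real tensors and transports are replaced with plain lists.
def prefill_worker(prompt_tokens):
    # Compute-bound: one big batched pass over all prompt tokens,
    # producing the KV cache entries for the whole prompt.
    return [("kv", t) for t in prompt_tokens]

def decode_worker(kv_cache, max_new_tokens):
    # Memory-bandwidth-bound: one token per step, reusing the
    # transferred KV cache and growing it as decoding proceeds.
    output = []
    for step in range(max_new_tokens):
        token = f"tok{step}"            # stand-in for real sampling
        kv_cache.append(("kv", token))  # KV keeps growing during decode
        output.append(token)
    return output

def serve(prompt_tokens, max_new_tokens=4):
    kv = prefill_worker(prompt_tokens)  # runs on the prefill GPU pool
    # In disaggregated systems the KV cache is shipped over NVLink/RDMA
    # at this point; here it is simply passed in-process.
    return decode_worker(kv, max_new_tokens)
```

Splitting the two phases lets each pool be sized and batched for its own bottleneck, which is why the architecture has converged into a consensus despite the cost of the KV transfer.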


Section 07

Practical Value and Usage Recommendations

This review offers differentiated value for different roles: system researchers can quickly locate cutting-edge topics; algorithm engineers can understand the underlying optimization principles that should inform model design; infrastructure teams can draw on its reference cases for architecture design; and technical decision-makers can track trends when formulating R&D roadmaps. Both beginners and experts can benefit from bookmarking and studying it.


Section 08

Conclusion: The Future of Agentic AI Infrastructure

Agentic AI is moving from the laboratory into production, and the maturity of its infrastructure directly determines how quickly it lands. This review not only systematically organizes existing research but also points toward future directions for innovation. Covering 72 top conference papers, it is a valuable resource for the field and deserves the attention of every relevant practitioner.