Zing Forum


Panoramic View of Agentic AI Infrastructure: A Deep Review of 72 Top Conference Papers

This community-maintained review systematically organizes 72 top conference papers from 2023 to 2026, comprehensively covering infrastructure optimization techniques for Agentic LLM workloads.

Tags: Agentic AI · LLM infrastructure · KV Cache · Prefill-Decode separation · systematic review · top-conference papers · inference optimization
Published 2026-03-29 12:16 · Recent activity 2026-03-29 12:23 · Estimated read 8 min

Section 01

Panoramic View of Agentic AI Infrastructure: A Deep Review of 72 Top Conference Papers (Introduction)

This is an open-source review maintained by the community, systematically organizing 72 Agentic AI infrastructure papers published from 2023 to 2026 at 12 top conferences, including OSDI, SOSP, and ISCA. The review covers seven key technical areas, among them workload characterization, Prefill-Decode separation, and KV Cache management, and provides an interactive Chinese-English web interface (https://hungchun0201.github.io/agentic-ai-survey/) that serves as a technical map for researchers and engineers. Agentic AI systems (e.g., AutoGPT, Claude Computer Use) pose new challenges to infrastructure through features like multi-turn dialogue and tool calling, and this review aims to help address those challenges.


Section 02

The Rise of Agentic AI and Workload Characteristics

Since 2023, Agentic AI systems (LLM-centered intelligent agents) have rapidly become a research focus, with representative examples including AutoGPT, Claude's Computer Use, Devin, and various coding assistants. Compared with traditional LLM inference, agentic workloads involve multi-turn dialogue, tool calling, long-context retention, and dynamic task planning, all of which pose new challenges to the underlying infrastructure. The S1 area of the review (5 papers) focuses on workload characterization, covering traffic patterns, CPU bottleneck identification and optimization, and system sustainability assessment, providing the data foundation for the optimizations that follow.
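To make these workload characteristics concrete, here is a minimal sketch of an agentic inference loop in which the conversation context only ever grows and a tool call pauses generation mid-request. `llm` and `run_tool` are hypothetical stand-ins for a real model server and tool runtime, not any system surveyed here.

```python
# Minimal sketch of an agentic inference loop. `llm` and `run_tool`
# are hypothetical stand-ins for a real model server and tool runtime.
def llm(context):
    # Pretend the model asks for one tool call, then finishes.
    if not any(m["role"] == "tool" for m in context):
        return {"type": "tool_call", "name": "search", "args": {"q": "agentic ai"}}
    return {"type": "answer", "text": "done"}

def run_tool(name, args):
    return f"results for {args['q']}"

def agent(task, max_turns=8):
    context = [{"role": "user", "content": task}]  # context only ever grows
    for _ in range(max_turns):
        out = llm(context)
        if out["type"] == "tool_call":
            # Inference pauses here; the KV cache built for `context`
            # sits idle until the tool returns -- the pause problem
            # that the S4 papers below target.
            obs = run_tool(out["name"], out["args"])
            context.append({"role": "tool", "content": obs})
        else:
            return out["text"], len(context)
    return None, len(context)
```

Even this toy loop shows why agentic traffic differs from one-shot inference: every turn re-submits a longer context, and GPU-resident state idles across tool-call round trips.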


Section 03

Core Optimization Directions: Prefill-Decode Separation and KV Cache Management

The S2 area of the review (13 papers) focuses on Prefill-Decode separation, a currently hot optimization direction. In traditional LLM inference, the compute-bound prefill phase and the memory-bandwidth-bound decode phase share the same GPUs and interfere with each other. Representative works include DistServe (OSDI'24, throughput optimization), Splitwise (ISCA'24, scheduling strategy), and Mooncake (FAST'25, best paper, a KV Cache-centric disaggregated architecture). The S3 area (18 papers) focuses on KV Cache management, the memory bottleneck of LLM inference, with key works including vLLM (SOSP'23, PagedAttention), SGLang (NeurIPS'24, RadixAttention prefix caching), and CacheBlend (EuroSys'25, non-prefix KV reuse).
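As a rough illustration of the PagedAttention idea behind vLLM, the toy sketch below maps each sequence's growing KV cache onto fixed-size physical blocks allocated on demand, so no sequence needs contiguous memory. The class and field names are invented for illustration and are not vLLM's actual API.

```python
# Toy sketch of PagedAttention-style KV cache management: logical token
# positions map through a per-sequence block table to fixed-size
# physical blocks allocated on demand. Names are illustrative only.
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> token count

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                     # current block is full
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())    # grab a new block
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        # When a sequence finishes, all its blocks return to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The design point this captures is that fragmentation is bounded to less than one block per sequence, which is what lets paged serving pack many more concurrent sequences into the same GPU memory.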


Section 04

Advanced Optimizations: KV Lifecycle, Scheduling, and Adjacent Technologies

The S4 area (4 papers) addresses the inference pauses caused by tool calling in agentic scenarios, studying KV Cache lifecycle management; examples include InferCept (ICML'24, KV retention during tool calls) and Concur (AIMD admission control). The S5 area (11 papers) focuses on scheduling and routing for multi-agent collaboration, with representative works including Autellix (program-level DAG scheduling) and Preble (ICLR'25, cluster-level KV-aware scheduling). The S6 area (10 papers) applies learning-based methods to caching policy, such as LeCaR (regret-minimizing weighting of eviction experts) and RLCache (multi-task reinforcement learning). The S7 area (11 papers) covers adjacent optimizations, such as Sarathi-Serve (OSDI'24, chunked prefill) and FlashInfer (MLSys'25, best paper, a customizable attention engine).
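To illustrate the regret-minimizing flavor of LeCaR-style caching, here is a heavily simplified sketch, not the published algorithm's exact update rule: two eviction experts, LRU and LFU, carry weights that shrink multiplicatively whenever a miss lands in that expert's ghost history of past evictions.

```python
import random
from collections import OrderedDict

# Simplified sketch of LeCaR-style regret-minimizing caching (not the
# paper's exact update rule): two eviction experts, LRU and LFU, are
# chosen probabilistically by weight; a weight shrinks when a miss hits
# that expert's ghost history, i.e. when its past eviction was a mistake.
class LeCaRCache:
    def __init__(self, capacity, discount=0.9, seed=0):
        self.cap = capacity
        self.cache = OrderedDict()                 # key -> access count
        self.hist = {"lru": set(), "lfu": set()}   # ghost entries per expert
        self.w = {"lru": 0.5, "lfu": 0.5}
        self.discount = discount
        self.rng = random.Random(seed)

    def _victim(self, policy):
        if policy == "lru":
            return next(iter(self.cache))          # least recently used
        return min(self.cache, key=self.cache.get) # least frequently used

    def access(self, key):
        if key in self.cache:
            self.cache[key] += 1
            self.cache.move_to_end(key)
            return True                            # hit
        for p in ("lru", "lfu"):                   # regret update on a miss
            if key in self.hist[p]:
                self.w[p] *= self.discount         # penalize the erring expert
                self.hist[p].discard(key)
        total = self.w["lru"] + self.w["lfu"]
        self.w = {p: v / total for p, v in self.w.items()}
        if len(self.cache) >= self.cap:
            policy = "lru" if self.rng.random() < self.w["lru"] else "lfu"
            victim = self._victim(policy)
            del self.cache[victim]
            self.hist[policy].add(victim)          # remember who evicted it
        self.cache[key] = 1
        return False                               # miss
```

The appeal of this family of policies for agentic KV caching is that the weighting adapts online to whichever access pattern the workload currently exhibits, without offline training.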


Section 05

Top Conference Paper Evidence and Key Contributions

The 72 papers included in this review all come from top conferences: OSDI, SOSP, ISCA, FAST, MLSys, NeurIPS, ICML, EuroSys, ASPLOS, NSDI, ATC, and SIGCOMM. Among them, Mooncake (FAST'25) and FlashInfer (MLSys'25) won best paper awards, the PagedAttention mechanism of vLLM (SOSP'23) opened a new direction in KV Cache management, and DistServe (OSDI'24) pushed Prefill-Decode separation into practical use. Together these papers provide solid academic evidence for Agentic AI infrastructure optimization.


Section 06

Technical Trends in Agentic AI Infrastructure

The review's analysis surfaces four major technical trends:

1. From unified to disaggregated architectures: Prefill-Decode separation has become a consensus, and more dedicated phases may be split out in the future.
2. KV Cache as the core optimization target: its management grows more complex in agentic scenarios, and research remains active.
3. Intelligent scheduling decisions: a shift from static heuristics to learning-based dynamic strategies.
4. Multi-agent collaboration optimization: single-agent optimization is maturing, and multi-agent collaboration will become the next hot topic.
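Trend 1 can be sketched in a few lines: a prefill worker builds the KV cache for the whole prompt in one compute-bound pass, then hands it to a separate decode worker that generates tokens one at a time. All names here are hypothetical stand-ins; real systems such as DistServe or Mooncake transfer the KV cache across GPU pools over NVLink or RDMA rather than in-process.

```python
# Minimal sketch of Prefill-Decode separation. All names are hypothetical
# stand-ins; real tensors and transports are replaced with plain lists.
def prefill_worker(prompt_tokens):
    # Compute-bound: one big batched pass over all prompt tokens,
    # producing the KV cache entries for the whole prompt.
    return [("kv", t) for t in prompt_tokens]

def decode_worker(kv_cache, max_new_tokens):
    # Memory-bandwidth-bound: one token per step, reusing the
    # transferred KV cache and growing it as decoding proceeds.
    output = []
    for step in range(max_new_tokens):
        token = f"tok{step}"            # stand-in for real sampling
        kv_cache.append(("kv", token))  # KV keeps growing during decode
        output.append(token)
    return output

def serve(prompt_tokens, max_new_tokens=4):
    kv = prefill_worker(prompt_tokens)  # runs on the prefill GPU pool
    # In disaggregated systems the KV cache is shipped over NVLink/RDMA
    # at this point; here it is simply passed in-process.
    return decode_worker(kv, max_new_tokens)
```

Splitting the two phases lets each pool be sized and batched for its own bottleneck, which is why the architecture has converged into a consensus despite the cost of the KV transfer.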


Section 07

Practical Value and Usage Recommendations

This review offers differentiated value for different roles: system researchers can quickly locate cutting-edge topics; algorithm engineers can understand the underlying optimization principles that should inform model design; infrastructure teams can draw on its reference cases for architecture design; and technical decision-makers can track trends when formulating R&D roadmaps. Both beginners and experts can benefit from bookmarking and studying it.


Section 08

Conclusion: The Future of Agentic AI Infrastructure

Agentic AI is moving from the laboratory into production, and the maturity of its infrastructure directly determines how quickly it lands. This review not only systematically organizes existing research but also points toward future directions for innovation. Covering 72 top conference papers, it is a valuable resource for the field and deserves the attention of every relevant practitioner.