Agentic Plan Caching: Optimizing LLM Agent Efficiency via Semantic Plan Caching and Dynamic Model Selection

An innovative Agentic AI framework that significantly reduces the inference latency and computational costs of LLM Agents by introducing semantic plan caching, dynamic model selection, and semantic memory mechanisms, providing an efficient engineering solution for large-scale AI application deployment.

Tags: LLM Agent · Semantic Caching · Dynamic Model Selection · Semantic Memory · Inference Optimization · Cost Optimization · Agent Efficiency · Vector Retrieval
Published 2026-05-15 00:45 · Recent activity 2026-05-15 00:55 · Estimated read 7 min

Section 01

Introduction: Core Solutions of the Agentic Plan Caching Framework for Optimizing LLM Agent Efficiency

The Agentic Plan Caching project addresses two pain points in the large-scale deployment of LLM Agents: high inference costs and high response latency. Through three core technical innovations (semantic plan caching, dynamic model selection, and semantic memory), it significantly improves the operational efficiency of LLM Agents without sacrificing intelligence, providing an efficient engineering solution for large-scale AI application deployment.

Section 02

Problem Background: Practical Challenges in LLM Agent Efficiency

Modern AI Agents complete tasks through a 'think-act-observe' loop. Because each iteration calls an LLM for decision-making, complex tasks accumulate latency and cost. Take a data analysis Agent as an example: both the initial planning step and each mid-task plan adjustment require an LLM call, and similar tasks tend to regenerate near-identical plans, wasting computation.
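
As a concrete illustration, a minimal think-act-observe loop might look like the sketch below (`llm_plan`, `execute_tool`, and `is_done` are hypothetical callables standing in for real components):

```python
# A minimal think-act-observe loop. Every iteration calls the LLM,
# which is where latency and cost accumulate on complex tasks.
def run_agent(task, llm_plan, execute_tool, is_done, max_steps=10):
    context = {"task": task, "history": []}
    for _ in range(max_steps):
        action = llm_plan(context)                         # think: one LLM call per step
        observation = execute_tool(action)                 # act: run the chosen tool
        context["history"].append((action, observation))   # observe: record the result
        if is_done(context):
            break
    return context
```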

Section 03

Core Innovation 1: Semantic Plan Caching

Working Principle

Semantic plan caching addresses the limitations of traditional exact key-value matching. It reuses plans semantically through query embedding (converting queries to semantic vectors), similarity retrieval (thresholding on cosine similarity), plan adaptation (template plus parameter substitution), and dynamic cache updates (LRU eviction, effectiveness tracking, active learning).
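
A minimal sketch of this lookup path, assuming an `embed` function that returns unit-normalized vectors (e.g. from any sentence-embedding model) and an illustrative similarity threshold of 0.9; the framework's actual API may differ:

```python
import numpy as np
from collections import OrderedDict

class SemanticPlanCache:
    """Plan cache keyed by query embeddings, with LRU eviction."""

    def __init__(self, embed, threshold=0.9, max_size=1000):
        self.embed = embed            # query -> unit-normalized np.ndarray
        self.threshold = threshold    # cosine-similarity cutoff for a hit
        self.max_size = max_size
        self.entries = OrderedDict()  # query -> (embedding, plan_template)

    def get(self, query):
        q = self.embed(query)
        best_key, best_sim = None, -1.0
        for key, (vec, _) in self.entries.items():
            sim = float(np.dot(q, vec))         # cosine similarity of unit vectors
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.threshold:
            self.entries.move_to_end(best_key)  # refresh LRU position
            return self.entries[best_key][1]    # hit: reusable plan template
        return None                             # miss: caller falls back to the LLM

    def put(self, query, plan_template):
        if len(self.entries) >= self.max_size:
            self.entries.popitem(last=False)    # evict the least recently used entry
        self.entries[query] = (self.embed(query), plan_template)
```

On a hit, the returned template is instantiated with the new query's parameters (the plan adaptation step) instead of invoking the LLM again.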

Performance Benefits

Cache hits can reduce latency to the millisecond level, cut LLM call costs by 60%-80%, and improve plan consistency.

Section 04

Core Innovation 2: Dynamic Model Selection

Task Complexity Evaluation

Task complexity is evaluated along several dimensions: semantic complexity (query length, number of concepts, reasoning depth), context dependency (external knowledge, cross-step state, long context), and output requirements (structure, accuracy versus creativity, evaluation criteria).
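
One way to turn these dimensions into a routing signal is a simple weighted score; the features and weights below are illustrative assumptions, not the framework's exact heuristic:

```python
def complexity_score(query, needs_external_knowledge, spans_multiple_steps,
                     needs_structured_output):
    """Combine rough complexity signals into a single score in [0, 1]."""
    length_signal = min(len(query.split()) / 200, 1.0)  # proxy for semantic complexity
    context_signal = 0.5 * needs_external_knowledge + 0.5 * spans_multiple_steps
    output_signal = 1.0 if needs_structured_output else 0.3
    # Illustrative weights; in practice, tune them against quality feedback.
    return 0.4 * length_signal + 0.35 * context_signal + 0.25 * output_signal
```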

Model Routing Strategy

Select models hierarchically based on task type: GPT-3.5/Claude 3 Haiku for simple tasks, GPT-4o mini/Claude 3 Sonnet for medium tasks, and GPT-4o/Claude 3 Opus for complex tasks, adjusting for latency budget, cost constraints, and quality feedback.
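
The tiered routing might be expressed as a lookup like the following sketch (`route_model` is a hypothetical helper, and the thresholds are arbitrary illustrations):

```python
from typing import Optional

def route_model(score, latency_budget_ms: Optional[int] = None):
    """Map a complexity score to a model tier (thresholds are assumptions)."""
    if latency_budget_ms is not None and latency_budget_ms < 500:
        return "gpt-3.5-turbo"   # tight latency budget: always take the fast tier
    if score < 0.35:
        return "gpt-3.5-turbo"   # simple tasks
    if score < 0.7:
        return "gpt-4o-mini"     # medium tasks
    return "gpt-4o"              # complex tasks
```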

Cascaded Reasoning

A lightweight model is tried first; if its confidence is insufficient, the query escalates to a stronger model, balancing quality against cost.
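
Cascading can be written as a try-then-escalate loop; here `call_model` is assumed to return an answer plus a confidence estimate (e.g. derived from log-probabilities or a self-check prompt):

```python
def cascade(query, call_model, tiers=("gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"),
            min_confidence=0.8):
    """Try cheap models first; escalate while confidence stays below the bar."""
    answer, confidence = None, 0.0
    for model in tiers:
        answer, confidence = call_model(model, query)
        if confidence >= min_confidence:
            break   # good enough: stop paying for larger models
    return answer, confidence
```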

Section 05

Core Innovation 3: Semantic Memory

Memory Architecture

  • Working Memory: Stores the context of the current task; cleared/archived after the task ends.
  • Episodic Memory: Stores historical task execution records and supports semantic retrieval.
  • Semantic Memory: Extracts general knowledge (standard processes, best practices, etc.) from episodic memory; see the data-structure sketch after this list.
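
A minimal data-structure sketch of the three tiers, assuming plain Python containers where a real deployment would use a vector database:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    steps: list     # (action, observation) pairs from one completed task
    outcome: str

@dataclass
class AgentMemory:
    working: dict = field(default_factory=dict)   # current-task context
    episodic: list = field(default_factory=list)  # past Episode records
    semantic: list = field(default_factory=list)  # distilled general knowledge

    def archive(self, episode):
        """Move a finished task out of working memory into episodic memory."""
        self.episodic.append(episode)
        self.working.clear()
```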

Memory Acquisition and Utilization

For a new task, retrieve similar past experiences and apply general knowledge to generate an initial plan; update working memory during execution; after completion, archive the run to long-term memory, so the agent 'gets smarter with use'.
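
The retrieve-then-plan step could then look like this sketch, reusing the `AgentMemory` above and the same unit-normalized `embed` assumption as the cache example (`llm_plan` is again a hypothetical planner callable):

```python
import numpy as np

def plan_with_memory(task, memory, embed, llm_plan, top_k=3):
    """Retrieve the most similar past episodes and hand them to the planner."""
    q = embed(task)
    ranked = sorted(memory.episodic,
                    key=lambda ep: float(np.dot(q, embed(ep.task))),
                    reverse=True)
    hints = ranked[:top_k]                         # most similar prior experiences
    return llm_plan(task, hints, memory.semantic)  # initial plan informed by memory
```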

Section 06

System Architecture and Implementation Key Points

The framework includes four major components, wired together in the sketch after this list:

  • Plan Generator: Instantiates parameterized plan templates on a cache hit; calls the LLM to generate a new plan on a miss.
  • Execution Engine: Orchestrates tool calls, tracks status, and handles exceptions.
  • Memory Manager: Implements semantic retrieval and memory maintenance based on vector databases.
  • Model Router: Selects the appropriate LLM based on task characteristics and supports multiple backends.
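
Wiring the four components together might look like this end-to-end sketch, reusing the pieces defined in earlier sections; every name here is an illustrative assumption, not the framework's actual interface:

```python
def handle_request(query, cache, memory, call_model, execute_tool):
    """Cache-first request path tying the four components together."""
    plan = cache.get(query)                        # Plan Generator: cached path
    if plan is None:                               # miss: route the query to an LLM
        tier = route_model(complexity_score(query,
                                            needs_external_knowledge=False,
                                            spans_multiple_steps=True,
                                            needs_structured_output=True))
        plan, _conf = cascade(query, call_model, tiers=(tier, "gpt-4o"))
        cache.put(query, plan)                     # populate the cache for reuse
    result = execute_tool(plan)                    # Execution Engine (stubbed)
    memory.archive(Episode(task=query, steps=[(plan, result)], outcome="done"))
    return result                                  # Memory Manager archived the run
```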

Section 07

Application Scenarios and Deployment Recommendations

Agentic Plan Caching is suitable for the following scenarios:

  • High-frequency repetitive tasks (customer service Q&A, report generation, etc.);
  • Multi-agent collaboration systems;
  • Cost-sensitive applications (B-end products);
  • Real-time interaction scenarios (chatbots, intelligent assistants).

Section 08

Conclusion: Important Direction for LLM Agent Engineering

Agentic Plan Caching represents the direction of engineering optimization for LLM Agents, balancing intelligence levels and cost efficiency. As LLM applications move toward production, semantic caching, dynamic model selection, and semantic memory are key technical points that developers need to study in depth.