Zing Forum

make-agents-cheaper: Optimizing Prompt Cache Hit Rate for Coding Agents with Rust

This article introduces a CLI tool written in Rust that improves the prompt cache hit rate of coding agent workflows. By analyzing and restructuring prompts, it reduces the cost of LLM API calls.

Tags: Prompt Caching, Cost Optimization, Rust, Coding Agent, LLM API, Cache Hit Rate, OpenCode, Cursor
Published 2026-05-27 10:48 | Recent activity 2026-05-27 10:57 | Estimated read 6 min
1

Section 01

Introduction / Main Post: make-agents-cheaper: Optimizing Prompt Cache Hit Rate for Coding Agents with Rust

make-agents-cheaper is a command-line tool written in Rust. It analyzes how a coding agent assembles its prompts and recommends a structure that keeps the prompt prefix stable, so that more requests hit the provider's prompt cache and LLM API costs go down.

2

Section 02

Original Author and Source

3

Section 03

The Pain of Costs: Hidden Expenses of Coding Agents

As large models such as Claude and GPT-4 grow more capable, agent-based coding tools such as Cursor, Devin, and OpenCode are transforming software development workflows. Behind these tools, however, lies a substantial bill for API calls.

A typical coding agent session may include:

  • System prompts (thousands of tokens)
  • Project context (file tree, dependencies, code snippets)
  • Conversation history (accumulated from multiple rounds of interaction)
  • Current task description

A single request can easily reach tens of thousands of tokens. At the current pricing of mainstream models, a complex task can cost anywhere from a few cents to several dollars. For teams that use these tools heavily, monthly API bills can reach thousands of dollars.

4

Section 04

Prompt Caching: An Overlooked Money-Saving Tool

Major LLM providers such as OpenAI and Anthropic offer a prompt caching mechanism: if the current request's prompt begins with the same prefix as a recent request, the provider can reuse the KV cache already computed for that prefix and only run inference on the new part.
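
The prefix rule can be sketched in a few lines of Rust. This is a minimal illustration, not any provider's actual implementation: plain strings stand in for tokens, whereas real providers match on tokenizer output and typically require a minimum prefix length. The name `shared_prefix_len` is hypothetical.

```rust
/// Counts how many leading tokens two requests have in common.
/// Only this shared prefix is a candidate for a cache hit; everything
/// after the first mismatch has to be processed again.
/// (Hypothetical helper; strings stand in for real tokenizer tokens.)
fn shared_prefix_len(previous: &[&str], current: &[&str]) -> usize {
    previous
        .iter()
        .zip(current.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

For `["system", "project", "question-1"]` followed by `["system", "project", "question-2"]` the function returns 2. If the very first token differs, it returns 0 and nothing can be reused.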

The benefits of cache hits are significant:

  • Anthropic Claude 3.5 Sonnet: Cache-hit tokens cost about 90% less than regular input tokens
  • OpenAI GPT-4o: Cached input tokens are billed at about 50% of the normal input price
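
To see what these discounts mean for a whole request, the arithmetic can be sketched as follows. It assumes a simplified model: the uncached input price is normalised to 1.0, output tokens are ignored, and so is any surcharge a provider may apply when the cache is first written. The function name and parameters are hypothetical.

```rust
/// Relative input cost of one request, with the uncached input price
/// normalised to 1.0.
///
/// `hit_fraction`: share of input tokens served from the cache (0.0..=1.0).
/// `cached_price_ratio`: price of a cached token relative to an uncached
/// one, e.g. 0.10 for a 90% discount or 0.50 for a 50% discount.
///
/// Assumption: ignores output tokens and any cache-write surcharge.
fn relative_input_cost(hit_fraction: f64, cached_price_ratio: f64) -> f64 {
    let hit = hit_fraction.clamp(0.0, 1.0);
    // Cached tokens pay the discounted rate, the rest pay the full rate.
    hit * cached_price_ratio + (1.0 - hit)
}
```

With 80% of the input served from the cache, a ratio of 0.10 gives a relative cost of about 0.28 and a ratio of 0.50 gives about 0.60. With a hit fraction of 0 the discount has no effect at all, which is why the hit rate matters as much as the discount.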

In practice, however, cache hit rates are often disappointing. Why is that?

5

Section 05

Common Reasons for Cache Invalidation

  1. Unstable prompt structure: The order of system prompt, context, and user input changes between requests
  2. Dynamic content contamination: Timestamps, random IDs, session identifiers, and other dynamic fields break prefix matching
  3. Poor context window management: Truncation strategies change the prefix
  4. Accumulating multi-round conversations: The order or content of earlier messages changes
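
The truncation problem in particular is easy to reproduce. The sketch below assumes a naive sliding window that keeps only the newest messages; `sliding_window` is a hypothetical helper, not part of any agent or of make-agents-cheaper.

```rust
/// Keeps only the newest `max` messages (a naive sliding window).
/// Hypothetical helper for illustration only.
fn sliding_window<'a>(history: &'a [&'a str], max: usize) -> &'a [&'a str] {
    // saturating_sub avoids underflow when the history is shorter than `max`.
    let start = history.len().saturating_sub(max);
    &history[start..]
}
```

Suppose the history grows from `["m1", "m2", "m3"]` to `["m1", "m2", "m3", "m4"]`. Sent in full, the second request starts with the first one, so the old prefix can be reused. With a window of 3, the second request becomes `["m2", "m3", "m4"]`: its first message has changed, and the shared prefix is gone.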

6

Section 06

Core Idea of make-agents-cheaper

This project is a Rust-implemented CLI tool focused on analyzing and optimizing the prompt structure of coding agents to maximize cache hit rates.

7

Section 07

Technical Strategies

  1. Prompt normalization: Standardize prompt formatting to eliminate unnecessary variation
  2. Static/dynamic separation: Separate stable content (system prompts, project structure) from dynamic content (user input, current files)
  3. Prefix stability analysis: Detect which parts can be safely cached
  4. Restructuring recommendations: Suggest how to reorder the prompt so that the stable prefix is as long as possible
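
The first three strategies can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tool's actual code: the function names, the `---` separator, and the prompt layout are all hypothetical, and the recommendation step is left out.

```rust
/// Strategy 1 (normalization): unify line endings and strip trailing
/// whitespace so cosmetic differences do not change the prefix.
fn normalize(text: &str) -> String {
    // `lines()` already splits on both "\n" and "\r\n".
    text.lines().map(str::trim_end).collect::<Vec<_>>().join("\n")
}

/// Strategy 2 (static/dynamic separation): stable parts first and in a
/// deterministic order, volatile user input last.
fn assemble(system: &str, project_files: &[&str], user_input: &str) -> String {
    let mut files: Vec<&str> = project_files.to_vec();
    files.sort_unstable(); // same order however the caller listed them
    format!(
        "{}\n{}\n---\n{}",
        normalize(system),
        files.join("\n"),
        normalize(user_input)
    )
}

/// Strategy 3 (prefix stability analysis): length in bytes of the prefix
/// shared by all recorded prompts.
fn stable_prefix_len(prompts: &[String]) -> usize {
    let (first, rest) = match prompts.split_first() {
        Some(parts) => parts,
        None => return 0,
    };
    // Shrink the candidate length with every further prompt.
    rest.iter().fold(first.len(), |len, p| {
        first
            .bytes()
            .zip(p.bytes())
            .take(len)
            .take_while(|(a, b)| a == b)
            .count()
    })
}
```

Two requests built with `assemble` from the same system prompt and the same set of files share everything up to and including the separator line, even if the files were listed in a different order or the system prompt carried stray trailing whitespace. `stable_prefix_len` then reports that shared length, which is the part worth caching.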

8

Section 08

Why Implement with Rust?

Rust was chosen as the implementation language for several reasons:

  • Performance: Efficient string operations are needed when processing large codebases and complex prompts
  • Memory safety: Avoid introducing memory issues when handling user code
  • Portability: Compile to a single binary file, easy to integrate into various workflows
  • Modern toolchain: Excellent CLI development ecosystem (clap, serde, tokio, etc.)