
make-agents-cheaper: Optimizing Prompt Cache Hit Rate for Coding Agents with Rust

This article introduces a CLI tool written in Rust that aims to improve the prompt cache hit rate in coding agent workflows, reducing LLM API costs by analyzing and restructuring prompts.

Tags: Prompt Caching, Cost Optimization, Rust, Coding Agent, LLM API, Cache Hit Rate, OpenCode, Cursor

Published 2026-05-27 10:48 · Recent activity 2026-05-27 10:57 · Estimated read 6 min

Section 01

Introduction: make-agents-cheaper: Optimizing Prompt Cache Hit Rate for Coding Agents with Rust

Section 02

Original Author and Source

Original Author/Maintainer: Just-Agent
Source Platform: GitHub
Original Title: make-agents-cheaper
Original Link: https://github.com/Just-Agent/make-agents-cheaper
Source Publish/Update Time: 2026-05-27T02:48:05Z

Section 03

The Pain of Costs: Hidden Expenses of Coding Agents

As the capabilities of large models like Claude and GPT-4 continue to improve, agent-based coding tools such as Cursor, Devin, and OpenCode are transforming software development workflows. Behind these tools, however, lies a substantial API bill.

A typical coding agent session may include:

System prompts (thousands of tokens)
Project context (file tree, dependencies, code snippets)
Conversation history (accumulated from multiple rounds of interaction)
Current task description

A single request can easily reach tens of thousands of tokens. Based on the pricing of current mainstream models, the cost of a complex task can range from a few cents to several dollars. For teams that use these tools frequently, monthly API bills can reach thousands of dollars.

Section 04

Prompt Caching: An Overlooked Money-Saving Tool

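Major LLM providers such as OpenAI and Anthropic offer a prompt caching mechanism: if the current request's prompt starts with exactly the same prefix as a recent request, the provider can reuse the KV cache already computed for that prefix and only process the new part. The match is on an identical prefix, not on general similarity, so a single changed token early in the prompt invalidates everything after it.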
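To make the prefix rule concrete, here is a minimal sketch that counts how many leading tokens two prompts share. It assumes whitespace-separated words as a stand-in for model tokens; real providers use their own tokenizers and may apply minimum lengths or block granularity, which this ignores. The function names are hypothetical and are not taken from make-agents-cheaper.

```rust
/// Counts how many leading tokens two prompts share.
/// Assumption: whitespace-separated words stand in for model tokens.
fn shared_prefix_tokens(previous: &str, current: &str) -> usize {
    previous
        .split_whitespace()
        .zip(current.split_whitespace())
        .take_while(|(a, b)| a == b)
        .count()
}

/// Upper bound on the fraction of the current prompt that a prefix cache
/// could reuse, given the previous prompt. Returns 0.0 for an empty prompt.
fn reusable_fraction(previous: &str, current: &str) -> f64 {
    let total = current.split_whitespace().count();
    if total == 0 {
        return 0.0;
    }
    shared_prefix_tokens(previous, current) as f64 / total as f64
}
```

Under this approximation, two prompts that share the opening "You are a coding agent." and then differ have five reusable tokens, while two prompts that differ only in a leading timestamp have none.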

The benefits of cache hits are significant:

Anthropic Claude 3.5 Sonnet: cache reads are billed at about 10% of the normal input price, a 90% cost reduction for the cache-hit portion
OpenAI GPT-4o: cached input tokens are billed at about 50% of the normal input price
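The effect of these discounts on a single request can be estimated with simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, the base price is a parameter the caller supplies rather than a quoted provider price, and it ignores any surcharge for writing to the cache and any minimum cacheable length.

```rust
/// Estimated input cost in dollars for one request.
/// `price_per_million` (dollars per million input tokens) and
/// `cached_multiplier` (e.g. 0.1 for a 90% discount) are caller-supplied
/// assumptions, not quoted provider prices.
fn input_cost(
    total_tokens: u64,
    cached_tokens: u64,
    price_per_million: f64,
    cached_multiplier: f64,
) -> f64 {
    // Cached tokens can never exceed the request size.
    let cached = cached_tokens.min(total_tokens);
    let uncached = total_tokens - cached;
    (uncached as f64 + cached as f64 * cached_multiplier) * price_per_million / 1_000_000.0
}

/// Relative saving compared with the same request served with no cache hit.
fn saving_ratio(total_tokens: u64, cached_tokens: u64, cached_multiplier: f64) -> f64 {
    if total_tokens == 0 {
        return 0.0;
    }
    let cached = cached_tokens.min(total_tokens) as f64;
    cached * (1.0 - cached_multiplier) / total_tokens as f64
}
```

For a 40,000-token request in which 36,000 tokens hit the cache, a multiplier of 0.1 gives a saving of 81% on input cost, and a multiplier of 0.5 gives 45%. The saving scales directly with the cached share of the prompt, which is why the hit rate matters as much as the discount itself.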

In practice, however, cache hit rates are often disappointing. Why?

Section 05

Common Reasons for Cache Invalidation

Unstable prompt structure: The order of system prompts, context, and user input changes between requests
Dynamic content contamination: Dynamic fields such as timestamps, random IDs, and session identifiers placed early in the prompt break prefix matching
Improper context window management: Truncation strategies that drop or rewrite earlier content change the prefix
Accumulation of multi-round conversations: The order or content of historical messages changes between rounds
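One way to diagnose which of these causes applies is to compare two consecutive serialized prompts and report the first line where they differ. The following is a minimal sketch under the assumption that prompts are available as plain text; the function name is hypothetical and this is not the analysis make-agents-cheaper itself performs.

```rust
/// Returns the 1-based line number and both versions of the first line at
/// which two serialized prompts differ, or None if they contain the same lines.
/// A missing line (one prompt is shorter) is reported as an empty string, so
/// a pure append shows up as a divergence at the first newly added line.
fn first_divergence<'a>(
    previous: &'a str,
    current: &'a str,
) -> Option<(usize, &'a str, &'a str)> {
    let mut prev_lines = previous.lines();
    let mut curr_lines = current.lines();
    let mut line_no = 1;
    loop {
        match (prev_lines.next(), curr_lines.next()) {
            // Both prompts ended without any difference.
            (None, None) => return None,
            // Lines match: keep scanning.
            (a, b) if a == b => line_no += 1,
            // First mismatch: everything from here on cannot be served from cache.
            (a, b) => return Some((line_no, a.unwrap_or(""), b.unwrap_or(""))),
        }
    }
}
```

If the reported line is near the top and contains a session ID or timestamp, the problem is dynamic content contamination. If it falls inside the conversation history, truncation or message rewriting is the more likely cause.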

Section 06

Core Idea of make-agents-cheaper

make-agents-cheaper is a CLI tool written in Rust that analyzes and optimizes the prompt structure of coding agents to maximize cache hit rates.

Section 07

Technical Strategies

Prompt normalization: Standardize prompt formats to eliminate unnecessary format changes
Static/dynamic separation: Separate stable content (system prompts, project structure) from dynamic content (user input, current files)
Prefix stability analysis: Detect which parts can be safely cached
Restructuring recommendations: Suggest how to reorder prompt sections so that the stable prefix is as long as possible
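The first two strategies can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the tool's actual implementation: the `Segment` type, the `stable` flag, the `[name]` section header format, and both function names are hypothetical.

```rust
/// A prompt section plus a flag saying whether it stays the same across requests.
struct Segment {
    name: &'static str,
    text: String,
    stable: bool,
}

/// Removes formatting noise: CRLF line endings become LF, trailing whitespace
/// on each line is dropped, and trailing blank lines are removed.
fn normalize(text: &str) -> String {
    let lines: Vec<&str> = text.lines().map(|line| line.trim_end()).collect();
    lines.join("\n").trim_end().to_string()
}

/// Places stable segments first, keeping the relative order within each group,
/// so that the unchanging content forms the longest possible prefix.
fn build_prompt(mut segments: Vec<Segment>) -> String {
    // `sort_by_key` is a stable sort and `false` orders before `true`,
    // so keying on "not stable" moves stable segments to the front.
    segments.sort_by_key(|s| !s.stable);
    segments
        .iter()
        .map(|s| format!("[{}]\n{}", s.name, normalize(&s.text)))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

With this ordering, two requests that differ only in their dynamic segments produce prompts that are identical up to the first dynamic section, which is the property a prefix cache needs.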

Section 08

Why Implement with Rust?

Rust was chosen as the implementation language for several reasons:

Performance: Efficient string operations are needed when processing large codebases and complex prompts
Memory safety: Avoid introducing memory issues when handling user code
Portability: Compiles to a single binary that is easy to integrate into various workflows
Modern toolchain: Excellent CLI development ecosystem (clap, serde, tokio, etc.)
