Reading

DeepLossless: An Inference-Aware Runtime for AI Programming Agents, Significantly Reducing Token Consumption and Redundant Computation

DeepLossless is an open-source inference-aware runtime system that helps AI programming agents reduce token consumption by up to 36% and redundant planning by 64% through reusing execution states, caching tool results, memorizing failed paths, and persisting execution plans.

AI编程代理推理优化token效率执行状态缓存DeepLossless运行时系统OpenAI兼容Rust

Published 2026-05-20 14:44Recent activity 2026-05-20 15:20Estimated read 6 min

DeepLossless: An Inference-Aware Runtime for AI Programming Agents, Significantly Reducing Token Consumption and Redundant Computation

Section 01

[Introduction] DeepLossless: An Inference Optimization Tool for AI Programming Agents, Delivering Significant Cost Reduction and Efficiency Improvement

DeepLossless is an open-source inference-aware runtime system designed specifically for AI programming agents. It helps AI programming agents reduce token consumption by up to 36% and redundant planning by 64% through methods like reusing execution states, caching tool results, memorizing failed paths, and persisting execution plans, effectively addressing the pain point of repeated inference in long sessions.

Section 02

Background: The Hidden Cost Problem of AI Programming Agents

With the widespread application of LLMs in programming assistance, developers have found that long sessions have significant hidden costs from repeated inference: repeatedly reading unchanged files, re-planning the same tasks, retrying known failed solutions, etc. These not only consume API quotas but also slow down the development pace. These issues led to the creation of DeepLossless.

Section 03

Core Design: Execution State as Memory, Two-Layer Agent Architecture

DeepLossless's design philosophy is 'Long context windows are not memory; repeated inference is waste.' It adopts a two-layer agent architecture:

Semantic DAG

Embedding deduplication (automatic merging when cosine similarity ≥0.85)
BM25 retrieval for fast information location
Sentence-level traceability

Execution Memory System

Tool result caching (deterministic hashing + partial file invalidation)
Failed path memory (recording failed paths to avoid loops)
Plan persistence (storing execution states instead of text)
Code difference memory (recording changes instead of full code)
Abstracted inference trajectory (compressing verbose inference processes)

Section 04

Runtime Strategies & API Design: Flexible Configuration and Transparent Integration

Configurable Runtime Strategies

Configuration Mode	Cache Rate	Retry Count	Speculative Execution	Context Ratio	Freeze Plan	Token Budget
Minimal	100%	1	No	20%	Yes	30%
Efficient	80%	2	No	50%	No	60%
Exploratory	50%	3	Yes	80%	No	80%
Autonomous	30%	5	Yes	100%	No	95%
Custom	User-defined	User-defined	User-defined	User-defined	User-defined	User-defined

API Design

Transparent proxy endpoint: POST /v1/chat/completions (OpenAI-compatible)
LCM endpoint: Provides functions like search, expansion, status query, traceability, compression, rollback, etc.
Monitoring: Prometheus metrics endpoint and runtime reports

Section 05

Performance Testing: 36% Reduction in Token Consumption, 64% Reduction in Redundant Planning

In a long session test with 3 tasks and 86 rounds:

Metric	Regular Agent	DeepLossless	Reduction
Total Token Count	21070	13500	↓36%
Redundant Planning Count	14	5	↓64%
Redundant Failure Count	8	3	↓62%
Repository Re-read Count	11	2 (9 avoided)	-

The optimization does not depend on specific models and supports any model in OpenAI API format.

Section 06

Use Cases: More Suitable for Long Sessions, Iterative Development, etc.

DeepLossless is particularly suitable for the following scenarios:

Long programming sessions (multiple related tasks)
Iterative development (frequent modification and debugging)
Resource-constrained environments (limited token budget)
Automated workflows (CI/CD pipeline integration)

Section 07

Conclusion: Runtime Optimization is Key to AI Agent Efficiency Improvement

DeepLossless optimizes the runtime system to make AI agents work smarter, rather than relying on larger models or longer contexts. Its design draws on incremental compilation ideas and emphasizes the importance of runtime-level optimization. The project is implemented in Rust, with reliable performance, and is a noteworthy open-source project for reducing the cost of AI programming agents.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

Folkering OS: When the Operating System Itself Is AI—A Self-Evolving Bare-Metal Rust System

Folkering OS is the world's first AI-native bare-metal operating system, entirely written in Rust no_std without relying on Linux, POSIX, or libc. It can generate commands from scratch, compile them into WASM, and run them in 10 seconds, achieving true self-evolution.

Recent activity 2026-04-09 16:15