Zing Forum

Reading

DeepLossless: An Inference-Aware Runtime for AI Programming Agents, Significantly Reducing Token Consumption and Redundant Computation

DeepLossless is an open-source inference-aware runtime system that helps AI programming agents reduce token consumption by up to 36% and redundant planning by 64% through reusing execution states, caching tool results, memorizing failed paths, and persisting execution plans.

AI编程代理推理优化token效率执行状态缓存DeepLossless运行时系统OpenAI兼容Rust
Published 2026-05-20 14:44Recent activity 2026-05-20 15:20Estimated read 6 min
DeepLossless: An Inference-Aware Runtime for AI Programming Agents, Significantly Reducing Token Consumption and Redundant Computation
1

Section 01

[Introduction] DeepLossless: An Inference Optimization Tool for AI Programming Agents, Delivering Significant Cost Reduction and Efficiency Improvement

DeepLossless is an open-source inference-aware runtime system designed specifically for AI programming agents. It helps AI programming agents reduce token consumption by up to 36% and redundant planning by 64% through methods like reusing execution states, caching tool results, memorizing failed paths, and persisting execution plans, effectively addressing the pain point of repeated inference in long sessions.

2

Section 02

Background: The Hidden Cost Problem of AI Programming Agents

With the widespread application of LLMs in programming assistance, developers have found that long sessions have significant hidden costs from repeated inference: repeatedly reading unchanged files, re-planning the same tasks, retrying known failed solutions, etc. These not only consume API quotas but also slow down the development pace. These issues led to the creation of DeepLossless.

3

Section 03

Core Design: Execution State as Memory, Two-Layer Agent Architecture

DeepLossless's design philosophy is 'Long context windows are not memory; repeated inference is waste.' It adopts a two-layer agent architecture:

Semantic DAG

  • Embedding deduplication (automatic merging when cosine similarity ≥0.85)
  • BM25 retrieval for fast information location
  • Sentence-level traceability

Execution Memory System

  • Tool result caching (deterministic hashing + partial file invalidation)
  • Failed path memory (recording failed paths to avoid loops)
  • Plan persistence (storing execution states instead of text)
  • Code difference memory (recording changes instead of full code)
  • Abstracted inference trajectory (compressing verbose inference processes)
4

Section 04

Runtime Strategies & API Design: Flexible Configuration and Transparent Integration

Configurable Runtime Strategies

Configuration Mode Cache Rate Retry Count Speculative Execution Context Ratio Freeze Plan Token Budget
Minimal 100% 1 No 20% Yes 30%
Efficient 80% 2 No 50% No 60%
Exploratory 50% 3 Yes 80% No 80%
Autonomous 30% 5 Yes 100% No 95%
Custom User-defined User-defined User-defined User-defined User-defined User-defined

API Design

  • Transparent proxy endpoint: POST /v1/chat/completions (OpenAI-compatible)
  • LCM endpoint: Provides functions like search, expansion, status query, traceability, compression, rollback, etc.
  • Monitoring: Prometheus metrics endpoint and runtime reports
5

Section 05

Performance Testing: 36% Reduction in Token Consumption, 64% Reduction in Redundant Planning

In a long session test with 3 tasks and 86 rounds:

Metric Regular Agent DeepLossless Reduction
Total Token Count 21070 13500 ↓36%
Redundant Planning Count 14 5 ↓64%
Redundant Failure Count 8 3 ↓62%
Repository Re-read Count 11 2 (9 avoided) -

The optimization does not depend on specific models and supports any model in OpenAI API format.

6

Section 06

Use Cases: More Suitable for Long Sessions, Iterative Development, etc.

DeepLossless is particularly suitable for the following scenarios:

  1. Long programming sessions (multiple related tasks)
  2. Iterative development (frequent modification and debugging)
  3. Resource-constrained environments (limited token budget)
  4. Automated workflows (CI/CD pipeline integration)
7

Section 07

Conclusion: Runtime Optimization is Key to AI Agent Efficiency Improvement

DeepLossless optimizes the runtime system to make AI agents work smarter, rather than relying on larger models or longer contexts. Its design draws on incremental compilation ideas and emphasizes the importance of runtime-level optimization. The project is implemented in Rust, with reliable performance, and is a noteworthy open-source project for reducing the cost of AI programming agents.