# DeepLossless: An Inference-Aware Runtime for AI Programming Agents, Significantly Reducing Token Consumption and Redundant Computation

> DeepLossless is an open-source inference-aware runtime system that helps AI programming agents reduce token consumption by up to 36% and redundant planning by 64% through reusing execution states, caching tool results, memorizing failed paths, and persisting execution plans.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-05-20T06:44:13.000Z
- 最近活动: 2026-05-20T07:20:39.046Z
- 热度: 150.4
- 关键词: AI编程代理, 推理优化, token效率, 执行状态缓存, DeepLossless, 运行时系统, OpenAI兼容, Rust
- 页面链接: https://www.zingnex.cn/en/forum/thread/deeplossless-ai-token
- Canonical: https://www.zingnex.cn/forum/thread/deeplossless-ai-token
- Markdown 来源: floors_fallback

---

## [Introduction] DeepLossless: An Inference Optimization Tool for AI Programming Agents, Delivering Significant Cost Reduction and Efficiency Improvement

DeepLossless is an open-source inference-aware runtime system designed specifically for AI programming agents. It helps AI programming agents reduce token consumption by up to 36% and redundant planning by 64% through methods like reusing execution states, caching tool results, memorizing failed paths, and persisting execution plans, effectively addressing the pain point of repeated inference in long sessions.

## Background: The Hidden Cost Problem of AI Programming Agents

With the widespread application of LLMs in programming assistance, developers have found that long sessions have significant hidden costs from repeated inference: repeatedly reading unchanged files, re-planning the same tasks, retrying known failed solutions, etc. These not only consume API quotas but also slow down the development pace. These issues led to the creation of DeepLossless.

## Core Design: Execution State as Memory, Two-Layer Agent Architecture

DeepLossless's design philosophy is 'Long context windows are not memory; repeated inference is waste.' It adopts a two-layer agent architecture:

### Semantic DAG
- Embedding deduplication (automatic merging when cosine similarity ≥0.85)
- BM25 retrieval for fast information location
- Sentence-level traceability

### Execution Memory System
- Tool result caching (deterministic hashing + partial file invalidation)
- Failed path memory (recording failed paths to avoid loops)
- Plan persistence (storing execution states instead of text)
- Code difference memory (recording changes instead of full code)
- Abstracted inference trajectory (compressing verbose inference processes)

## Runtime Strategies & API Design: Flexible Configuration and Transparent Integration

#### Configurable Runtime Strategies
| Configuration Mode | Cache Rate | Retry Count | Speculative Execution | Context Ratio | Freeze Plan | Token Budget |
|---------|-------|---------|---------|-----------|---------|----------|
| Minimal | 100% | 1 | No | 20% | Yes | 30% |
| Efficient | 80% | 2 | No |50% | No |60% |
| Exploratory |50%|3|Yes|80%|No|80%|
| Autonomous |30%|5|Yes|100%|No|95%|
| Custom | User-defined | User-defined | User-defined | User-defined | User-defined | User-defined |

#### API Design
- Transparent proxy endpoint: `POST /v1/chat/completions` (OpenAI-compatible)
- LCM endpoint: Provides functions like search, expansion, status query, traceability, compression, rollback, etc.
- Monitoring: Prometheus metrics endpoint and runtime reports

## Performance Testing: 36% Reduction in Token Consumption, 64% Reduction in Redundant Planning

In a long session test with 3 tasks and 86 rounds:
| Metric | Regular Agent | DeepLossless | Reduction |
|-----|---------|-------------|------|
| Total Token Count |21070|13500|↓36%|
| Redundant Planning Count |14|5|↓64%|
| Redundant Failure Count |8|3|↓62%|
| Repository Re-read Count |11|2 (9 avoided)|-|

The optimization does not depend on specific models and supports any model in OpenAI API format.

## Use Cases: More Suitable for Long Sessions, Iterative Development, etc.

DeepLossless is particularly suitable for the following scenarios:
1. Long programming sessions (multiple related tasks)
2. Iterative development (frequent modification and debugging)
3. Resource-constrained environments (limited token budget)
4. Automated workflows (CI/CD pipeline integration)

## Conclusion: Runtime Optimization is Key to AI Agent Efficiency Improvement

DeepLossless optimizes the runtime system to make AI agents work smarter, rather than relying on larger models or longer contexts. Its design draws on incremental compilation ideas and emphasizes the importance of runtime-level optimization. The project is implemented in Rust, with reliable performance, and is a noteworthy open-source project for reducing the cost of AI programming agents.