# AgenticCodingBench: An LLM Inference Benchmark Tool Designed for Agentic Programming Scenarios

> AgenticCodingBench, open-sourced by SwarmOne, is the first LLM inference benchmark tool specifically designed for agentic programming workloads. It can simulate multi-turn context growth scenarios in real coding sessions and measure key metrics such as TTFT, token throughput, and cache hit rate.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-04-10T11:04:24.000Z
- 最近活动: 2026-04-10T11:16:06.182Z
- 热度: 161.8
- 关键词: LLM, benchmark, agentic-coding, inference, RAG, performance-testing, SwarmOne, vLLM, SGLang
- 页面链接: https://www.zingnex.cn/en/forum/thread/agenticcodingbench-llm
- Canonical: https://www.zingnex.cn/forum/thread/agenticcodingbench-llm
- Markdown 来源: floors_fallback

---

## Introduction / Main Floor: AgenticCodingBench: An LLM Inference Benchmark Tool Designed for Agentic Programming Scenarios

AgenticCodingBench, open-sourced by SwarmOne, is the first LLM inference benchmark tool specifically designed for agentic programming workloads. It can simulate multi-turn context growth scenarios in real coding sessions and measure key metrics such as TTFT, token throughput, and cache hit rate.

## Background: Why Do We Need a Specialized Agentic Programming Benchmark?

When Claude Code opens a file, reads 2000 lines of code, edits three functions, runs tests, and reads error outputs, this involves more than 5 rounds of LLM interactions, with each round's context window ranging from 40K to 83K tokens and accumulating as the session progresses. This scenario is fundamentally different from ordinary chatbot requests.

Existing benchmarks have obvious limitations:

- **SWE-bench** focuses on the model's ability to solve GitHub issues but does not measure inference speed
- **LMSys/Chatbot Arena** tests throughput in scenarios with around 2K context, while agentic programming contexts are usually 20-80 times larger than this
- **General LLM benchmarks** send uniformly distributed requests, while agentic programming includes system prompts, tool mode definitions, multi-turn conversation history, code files, and a continuously growing context window

AgenticCodingBench was created to fill this gap; it can benchmark LLM service stacks against real access patterns generated by tools like Claude Code, Cursor, Windsurf, and Copilot.

## Realistic Agentic Programming Contexts

AgenticCodingBench's requests are filled with realistic coding session content, including:

- System prompts with tool definitions (Read, Write, Edit, Bash, Grep, etc.)
- Previous conversation rounds containing file content
- Tool call results and error traces
- Continuously growing context that simulates real session evolution

## Dynamic Context Growth Simulation

The tool can simulate the context growth process in coding sessions:

| Context Configuration | Token Count | Simulated Scenario |
|-----------------------|-------------|--------------------|
| fresh                 | ~6K         | Just opened the project — system prompt + first question |
| short                 | ~20K        | After a few rounds of conversation — read several files and made one edit |
| medium                | ~40K        | Mid-session — multiple file reads, tool calls, error traces |
| long                  | ~70K        | Deep session — multiple edits, test runs, debugging loops |
| full                  | ~83K        | Long session near context limit — all accumulated content |

## Prefix Cache Invalidation Mechanism

Each request includes a unique random salt value to ensure that what is measured is true cold-start inference performance, not cache hits. This is crucial for accurately evaluating inference costs.

## Cache Impact Measurement

Using the `--cache-mode both` parameter, the tool first runs a cold-start test and then a warm-start test to show the precise prefix cache acceleration effect. Taking Anthropic as an example, the cost of cached tokens is 1/10 that of uncached ones ($0.30 vs $3.00 per million tokens).

## Reasoning Token Detection

Automatically detects `reasoning_content` in responses, supports reasoning models like DeepSeek R1, o3, and Claude Extended Thinking, and reports the comparison between thinking overhead and visible output latency.

## Three Major Operation Modes

AgenticCodingBench provides three complementary testing modes:
