Zing Forum


AgenticCodingBench: An LLM Inference Benchmark Tool Designed for Agentic Programming Scenarios


Tags: LLM, benchmark, agentic-coding, inference, RAG, performance-testing, SwarmOne, vLLM, SGLang
Published 2026-04-10 19:04 | Recent activity 2026-04-10 19:16 | Estimated read: 6 min

Section 01

Introduction

AgenticCodingBench, open-sourced by SwarmOne, is the first LLM inference benchmark tool specifically designed for agentic programming workloads. It can simulate multi-turn context growth scenarios in real coding sessions and measure key metrics such as TTFT, token throughput, and cache hit rate.


Section 02

Background: Why Do We Need a Specialized Agentic Programming Benchmark?

When Claude Code opens a file, reads 2,000 lines of code, edits three functions, runs the tests, and reads the error output, that single task involves five or more rounds of LLM interaction. Each round's context window ranges from 40K to 83K tokens and keeps accumulating as the session progresses. This access pattern is fundamentally different from an ordinary chatbot request.

Existing benchmarks have obvious limitations:

  • SWE-bench focuses on the model's ability to solve GitHub issues but does not measure inference speed
  • LMSys/Chatbot Arena tests throughput in scenarios with around 2K context, while agentic programming contexts are usually 20-80 times larger than this
  • General LLM benchmarks send uniformly sized requests, while agentic programming requests include system prompts, tool schema definitions, multi-turn conversation history, code files, and a continuously growing context window

AgenticCodingBench was created to fill this gap; it can benchmark LLM service stacks against real access patterns generated by tools like Claude Code, Cursor, Windsurf, and Copilot.
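The lede names TTFT, token throughput, and cache hit rate as the core metrics. As a minimal sketch (not AgenticCodingBench's actual code), TTFT and decode throughput can be derived from three per-request timestamps; the names below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # decode throughput after the first token

def metrics_from_timestamps(start: float, first_token: float,
                            last_token: float, n_tokens: int) -> RequestMetrics:
    # TTFT is the wait before any output appears; throughput covers the
    # remaining n_tokens - 1 tokens over the remaining wall-clock time.
    ttft = first_token - start
    decode_time = max(last_token - first_token, 1e-9)  # avoid divide-by-zero
    return RequestMetrics(ttft, (n_tokens - 1) / decode_time)
```

With a streaming API, `first_token` and `last_token` would be captured as chunks arrive; here they are passed in directly to keep the sketch self-contained.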


Section 03

Realistic Agentic Programming Contexts

AgenticCodingBench's requests are filled with realistic coding session content, including:

  • System prompts with tool definitions (Read, Write, Edit, Bash, Grep, etc.)
  • Previous conversation rounds containing file content
  • Tool call results and error traces
  • Continuously growing context that simulates real session evolution
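A request combining these ingredients might be assembled as follows; this is a hedged sketch of an OpenAI-style chat payload, not the tool's actual implementation, and the model name and helper are hypothetical:

```python
def build_agentic_request(system_prompt: str, tools: list,
                          history: list, user_turn: str) -> dict:
    """Assemble a chat-completions payload resembling an agentic coding request."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history  # prior assistant turns, file contents, tool results
    messages.append({"role": "user", "content": user_turn})
    return {"model": "coding-model", "messages": messages, "tools": tools}
```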

Section 04

Dynamic Context Growth Simulation

The tool can simulate the context growth process in coding sessions:

Context config   Token count   Simulated scenario
fresh            ~6K           Just opened the project: system prompt + first question
short            ~20K          After a few rounds of conversation: several files read, one edit made
medium           ~40K          Mid-session: multiple file reads, tool calls, error traces
long             ~70K          Deep session: multiple edits, test runs, debugging loops
full             ~83K          Long session near the context limit: all accumulated content
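These tiers can be modeled as token budgets that each request is padded up to. A minimal sketch, assuming the tier names and sizes from the table above (the dict and function names are illustrative, not the tool's API):

```python
# Token budgets mirroring the tier table (sizes taken from the article).
CONTEXT_TIERS = {
    "fresh": 6_000,
    "short": 20_000,
    "medium": 40_000,
    "long": 70_000,
    "full": 83_000,
}

def padding_tokens(tier: str, base_tokens: int) -> int:
    """Tokens of simulated session history to add so a request hits its tier budget."""
    return max(CONTEXT_TIERS[tier] - base_tokens, 0)
```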

Section 05

Prefix Cache Invalidation Mechanism

Each request includes a unique random salt value to ensure that what is measured is true cold-start inference performance, not cache hits. This is crucial for accurately evaluating inference costs.
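The salting idea can be sketched in a few lines; this assumes the salt is prepended to the system prompt, which is one plausible placement rather than the tool's documented behavior:

```python
import uuid

def salt_system_prompt(system_prompt: str) -> str:
    """Prepend a unique salt so no two requests share a cacheable prefix,
    forcing the server to do cold prefill work on every request."""
    return f"[bench-salt:{uuid.uuid4().hex}]\n{system_prompt}"
```

Because prefix caches match from the first token onward, a unique leading salt invalidates the entire cached prefix for that request.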


Section 06

Cache Impact Measurement

With the --cache-mode both parameter, the tool first runs a cold-start pass and then a warm-start pass, showing the exact speedup contributed by the prefix cache. Taking Anthropic as an example, cached input tokens cost one tenth of uncached ones ($0.30 vs. $3.00 per million tokens).
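The cost impact follows directly from blending the two per-million-token rates; a small sketch using the Anthropic prices quoted above (function name and defaults are illustrative):

```python
def prompt_cost_usd(tokens: int, cached_fraction: float,
                    uncached_per_m: float = 3.00,
                    cached_per_m: float = 0.30) -> float:
    """Blend cached and uncached per-million-token prices for one request."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return (uncached * uncached_per_m + cached * cached_per_m) / 1_000_000
```

For an 80K-token context with a 90% cache hit rate, the blended cost is roughly 5 cents per request instead of 24 cents fully uncached, which is why cache hit rate is a first-class metric here.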


Section 07

Reasoning Token Detection

The tool automatically detects reasoning_content in responses, supports reasoning models such as DeepSeek R1, o3, and Claude Extended Thinking, and reports thinking overhead alongside visible-output latency.
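Detection can be as simple as splitting the two fields of a response message; this sketch assumes a DeepSeek-R1-style reasoning_content field (the exact field name varies across APIs, and the function name is illustrative):

```python
def split_reasoning(choice_message: dict) -> tuple[str, str]:
    """Separate hidden reasoning text from visible output in a response message.
    Returns (reasoning, visible); reasoning is empty for non-reasoning models."""
    reasoning = choice_message.get("reasoning_content") or ""
    visible = choice_message.get("content") or ""
    return reasoning, visible
```

Comparing the token counts (or streaming durations) of the two parts gives the thinking-overhead figure the tool reports.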


Section 08

Three Major Operation Modes

AgenticCodingBench provides three complementary testing modes: