Zing Forum


AgenticCodingBench: An LLM Inference Benchmark Tool Designed for Agentic Programming Scenarios


Tags: LLM, benchmark, agentic-coding, inference, RAG, performance-testing, SwarmOne, vLLM, SGLang
Published 2026-04-10 19:04 | Recent activity 2026-04-10 19:16 | Estimated read: 6 min

Section 01

Introduction

AgenticCodingBench, open-sourced by SwarmOne, is the first LLM inference benchmark tool specifically designed for agentic programming workloads. It can simulate multi-turn context growth scenarios in real coding sessions and measure key metrics such as TTFT, token throughput, and cache hit rate.


Section 02

Background: Why Do We Need a Specialized Agentic Programming Benchmark?

When Claude Code opens a file, reads 2,000 lines of code, edits three functions, runs the tests, and reads the error output, that single task involves five or more rounds of LLM interaction. Each round's context window ranges from 40K to 83K tokens and keeps accumulating as the session progresses. This access pattern is fundamentally different from an ordinary chatbot request.

Existing benchmarks have obvious limitations:

  • SWE-bench focuses on the model's ability to solve GitHub issues but does not measure inference speed
  • LMSys/Chatbot Arena tests throughput in scenarios with around 2K context, while agentic programming contexts are usually 20-80 times larger than this
  • General LLM benchmarks send uniformly sized requests, while agentic programming requests include system prompts, tool schema definitions, multi-turn conversation history, code files, and a continuously growing context window

AgenticCodingBench was created to fill this gap; it can benchmark LLM service stacks against real access patterns generated by tools like Claude Code, Cursor, Windsurf, and Copilot.
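The lede names TTFT, token throughput, and cache hit rate as the core metrics. As a minimal sketch (not AgenticCodingBench's actual code), TTFT and decode throughput can be derived from three per-request timestamps; the names below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # decode throughput after the first token

def metrics_from_timestamps(start: float, first_token: float,
                            last_token: float, n_tokens: int) -> RequestMetrics:
    # TTFT is the wait before any output appears; throughput covers the
    # remaining n_tokens - 1 tokens over the remaining wall-clock time.
    ttft = first_token - start
    decode_time = max(last_token - first_token, 1e-9)  # avoid divide-by-zero
    return RequestMetrics(ttft, (n_tokens - 1) / decode_time)
```

With a streaming API, `first_token` and `last_token` would be captured as chunks arrive; here they are passed in directly to keep the sketch self-contained.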


Section 03

Realistic Agentic Programming Contexts

AgenticCodingBench's requests are filled with realistic coding session content, including:

  • System prompts with tool definitions (Read, Write, Edit, Bash, Grep, etc.)
  • Previous conversation rounds containing file content
  • Tool call results and error traces
  • Continuously growing context that simulates real session evolution
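A request combining these ingredients might be assembled as follows; this is a hedged sketch of an OpenAI-style chat payload, not the tool's actual implementation, and the model name and helper are hypothetical:

```python
def build_agentic_request(system_prompt: str, tools: list,
                          history: list, user_turn: str) -> dict:
    """Assemble a chat-completions payload resembling an agentic coding request."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history  # prior assistant turns, file contents, tool results
    messages.append({"role": "user", "content": user_turn})
    return {"model": "coding-model", "messages": messages, "tools": tools}
```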

Section 04

Dynamic Context Growth Simulation

The tool can simulate the context growth process in coding sessions:

Context config   Token count   Simulated scenario
fresh            ~6K           Just opened the project: system prompt + first question
short            ~20K          After a few rounds of conversation: several files read, one edit made
medium           ~40K          Mid-session: multiple file reads, tool calls, error traces
long             ~70K          Deep session: multiple edits, test runs, debugging loops
full             ~83K          Long session near the context limit: all accumulated content
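These tiers can be modeled as token budgets that each request is padded up to. A minimal sketch, assuming the tier names and sizes from the table above (the dict and function names are illustrative, not the tool's API):

```python
# Token budgets mirroring the tier table (sizes taken from the article).
CONTEXT_TIERS = {
    "fresh": 6_000,
    "short": 20_000,
    "medium": 40_000,
    "long": 70_000,
    "full": 83_000,
}

def padding_tokens(tier: str, base_tokens: int) -> int:
    """Tokens of simulated session history to add so a request hits its tier budget."""
    return max(CONTEXT_TIERS[tier] - base_tokens, 0)
```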

Section 05

Prefix Cache Invalidation Mechanism

Each request includes a unique random salt value to ensure that what is measured is true cold-start inference performance, not cache hits. This is crucial for accurately evaluating inference costs.
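The salting idea can be sketched in a few lines; this assumes the salt is prepended to the system prompt, which is one plausible placement rather than the tool's documented behavior:

```python
import uuid

def salt_system_prompt(system_prompt: str) -> str:
    """Prepend a unique salt so no two requests share a cacheable prefix,
    forcing the server to do cold prefill work on every request."""
    return f"[bench-salt:{uuid.uuid4().hex}]\n{system_prompt}"
```

Because prefix caches match from the first token onward, a unique leading salt invalidates the entire cached prefix for that request.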


Section 06

Cache Impact Measurement

With the --cache-mode both parameter, the tool first runs a cold-start pass and then a warm-start pass, showing the exact speedup contributed by the prefix cache. Taking Anthropic as an example, cached input tokens cost one tenth of uncached ones ($0.30 vs. $3.00 per million tokens).
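The cost impact follows directly from blending the two per-million-token rates; a small sketch using the Anthropic prices quoted above (function name and defaults are illustrative):

```python
def prompt_cost_usd(tokens: int, cached_fraction: float,
                    uncached_per_m: float = 3.00,
                    cached_per_m: float = 0.30) -> float:
    """Blend cached and uncached per-million-token prices for one request."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return (uncached * uncached_per_m + cached * cached_per_m) / 1_000_000
```

For an 80K-token context with a 90% cache hit rate, the blended cost is roughly 5 cents per request instead of 24 cents fully uncached, which is why cache hit rate is a first-class metric here.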


Section 07

Reasoning Token Detection

The tool automatically detects reasoning_content in responses, supports reasoning models such as DeepSeek R1, o3, and Claude Extended Thinking, and reports thinking overhead alongside visible-output latency.
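Detection can be as simple as splitting the two fields of a response message; this sketch assumes a DeepSeek-R1-style reasoning_content field (the exact field name varies across APIs, and the function name is illustrative):

```python
def split_reasoning(choice_message: dict) -> tuple[str, str]:
    """Separate hidden reasoning text from visible output in a response message.
    Returns (reasoning, visible); reasoning is empty for non-reasoning models."""
    reasoning = choice_message.get("reasoning_content") or ""
    visible = choice_message.get("content") or ""
    return reasoning, visible
```

Comparing the token counts (or streaming durations) of the two parts gives the thinking-overhead figure the tool reports.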


Section 08

Three Major Operation Modes

AgenticCodingBench provides three complementary testing modes: