When Claude Code opens a file, reads 2,000 lines of code, edits three functions, runs the tests, and reads the error output, the session involves more than five rounds of LLM interaction, with each round's context window ranging from 40K to 83K tokens and accumulating as the session progresses. This access pattern is fundamentally different from an ordinary chatbot request.
Existing benchmarks have clear limitations:
- SWE-bench measures a model's ability to resolve GitHub issues but says nothing about inference speed
- LMSYS Chatbot Arena tests throughput on conversations with roughly 2K tokens of context, while agentic coding contexts are typically 20-80x larger
- General LLM benchmarks send uniformly distributed requests, whereas agentic coding workloads involve system prompts, tool definitions, multi-turn conversation history, code files, and a continuously growing context window
AgenticCodingBench was created to fill this gap: it benchmarks LLM serving stacks against the real access patterns generated by tools like Claude Code, Cursor, Windsurf, and Copilot.
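To make the workload shape concrete, the sketch below generates a synthetic per-round prompt-size trace for a single agentic session of the kind described above: the context starts large (system prompt, tool definitions, opened files) and grows each round as tool outputs accumulate. The function name, parameters, and growth increments are illustrative assumptions, not code from the benchmark itself.

```python
import random

def synthetic_agentic_trace(rounds=6, base_context=40_000,
                            max_context=83_000, seed=0):
    """Hypothetical prompt-size trace (tokens per round) for one
    agentic coding session. The context starts near `base_context`
    tokens and grows each round as conversation history and tool
    outputs (diffs, test logs) are appended, capped at `max_context`.
    All numbers here are illustrative, not measurements."""
    rng = random.Random(seed)
    context = base_context
    trace = []
    for _ in range(rounds):
        trace.append(context)
        # Each round appends a few thousand tokens of new material
        # before the next LLM call.
        context = min(max_context, context + rng.randint(3_000, 9_000))
    return trace

print(synthetic_agentic_trace())
```

A load generator built on traces like this differs sharply from uniform-request benchmarks: prefill cost rises every round while decode length stays modest, which is exactly the regime a serving stack must handle well for agentic coding.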