Zing Forum


Deep Research on LLM Inference Systems: From KV Cache to Production-Level Benchmarking

A research-grade repository for ML infrastructure interviews, systematically exploring KV cache behavior, scheduling strategies, and performance benchmarking methodologies in LLM services.

LLM inference, KV cache, benchmarking, model serving, scheduling strategies, latency optimization, vLLM, Modal
Published 2026-06-08 03:44 | Recent activity 2026-06-08 03:52 | Estimated read 8 min

Section 01

Guide to the LLM Inference System Deep Research Repository

The GitHub repository llm-inference-benchmark, maintained by devinnicholson and released on 2026-06-07, is a research-grade learning resource derived from the 568 Systems and Machine Learning course. Its core goal is to build inference-system artifacts for ML infrastructure interviews by systematically exploring KV cache behavior, scheduling strategies, and benchmarking methodology in LLM serving. The project first pins down the measurement model with simplified simulators and then moves to real inference engines (e.g., vLLM, TensorRT-LLM), so that learners understand the core logic of inference systems and can prepare for interview questions.


Section 02

Project Background and Positioning

Project Source

  • Original author/maintainer: devinnicholson
  • Source platform: GitHub
  • Original title: llm-inference-benchmark
  • Release time: 2026-06-07

Positioning and Goals

The project is a research-grade learning repository. Its artifacts are meant to be usable in ML infrastructure interviews and cover workload definition, request lifecycle tracking, benchmarking methodology, scheduler experiments, KV cache pressure research, and comparisons against real engines. Unlike tools that only report metrics, it emphasizes understanding the measurement model itself: the logic is first made explicit in simulators, and only then connected to real GPUs or inference engines.


Section 03

Core Methods and Concepts

Core Concept Terms

  • Latency metrics: TTFT (Time To First Token), TPOT (Time Per Output Token), p95/p99 tail latency, end-to-end latency
  • KV cache: KV-cache footprint (memory usage), active KV-cache timeline, memory pressure
  • Scheduling strategies: FIFO (First-In-First-Out), Shortest-cache (prioritizes requests with a small cache), Memory-aware-deadline (considers both memory and deadlines)
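The latency terms above can be made concrete with a few lines of Python. This is a minimal sketch, not the repository's code: the RequestTiming record and its field names are hypothetical, and the percentile helper uses the nearest-rank definition, which is only one of several common conventions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps in seconds (not the repository's schema)."""
    arrival: float
    first_token: float
    last_token: float
    output_tokens: int


def ttft(r: RequestTiming) -> float:
    # Time To First Token: from arrival until the first output token is emitted.
    return r.first_token - r.arrival


def tpot(r: RequestTiming) -> float:
    # Time Per Output Token: average gap between tokens after the first one.
    if r.output_tokens < 2:
        return 0.0
    return (r.last_token - r.first_token) / (r.output_tokens - 1)


def e2e_latency(r: RequestTiming) -> float:
    # End-to-end latency: from arrival until the last output token.
    return r.last_token - r.arrival


def percentile(values: list[float], p: int) -> float:
    # Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Tail latency is then percentile([ttft(r) for r in requests], 95) or the same call with 99. With only a handful of requests, p95 and p99 both collapse to the slowest sample, which is one reason benchmark runs need enough requests before tail numbers mean anything.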

Key Method: Request Lifecycle Simulator

The Week1 artifact provides a simplified simulator with core components:

  1. Workload pattern definition: Takes parameters such as input/output token counts and arrival times, and generates deterministic bursty workloads;
  2. Request lifecycle tracking: Decomposes each request into five stages (queue waiting, tokenization, prefill, decoding, streaming) and produces detailed tracking data.
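The two components above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's simulator: the Request fields, the generate_bursty and trace helpers, the token ranges, and every cost constant are hypothetical, and the stages run strictly one after another, whereas a real engine overlaps streaming with decoding.

```python
import random
from dataclasses import dataclass

STAGES = ("queue", "tokenize", "prefill", "decode", "stream")


@dataclass
class Request:
    request_id: int
    arrival: float       # seconds
    input_tokens: int
    output_tokens: int


def generate_bursty(n: int, seed: int, burst_size: int = 4, gap: float = 1.0) -> list[Request]:
    """Deterministic bursty workload: bursts of `burst_size` requests, `gap` seconds apart."""
    rng = random.Random(seed)  # private RNG, so the same seed always yields the same workload
    return [
        Request(i, (i // burst_size) * gap, rng.randint(16, 512), rng.randint(8, 128))
        for i in range(n)
    ]


def trace(req: Request, start: float) -> dict[str, tuple[float, float]]:
    """Return a (begin, end) span per stage for a request that starts service at `start`."""
    # Illustrative cost model; every constant below is made up.
    durations = {
        "queue": max(0.0, start - req.arrival),
        "tokenize": 1e-5 * req.input_tokens,
        "prefill": 2e-4 * req.input_tokens,
        "decode": 2e-2 * req.output_tokens,
        "stream": 1e-3,
    }
    spans: dict[str, tuple[float, float]] = {}
    t = req.arrival
    for stage in STAGES:
        end = t + durations[stage]
        spans[stage] = (t, end)
        t = end
    return spans
```

Calling generate_bursty(32, seed=568) twice returns identical lists, which is the property that makes a replayed benchmark comparable across runs.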

Section 04

Experimental Evidence and Practice

Experimental Evidence

  1. Week1 simulator run examples:

    • Basic workload: python3 scripts/replay_workload.py workloads/week01_mixed_requests.json --model-config configs/models/llama-7b-gqa-fp16.json
    • Generate bursty workload: python3 scripts/generate_workload.py mixed_bursty --requests 32 --seed 568 --output workloads/generated/mixed_bursty_32_seed568.json
    • Compare scheduling strategies: FIFO versus Shortest-cache
  2. Capacity-aware scheduling experiments:

    • Run capacity sweep: python3 scripts/run_sweep.py
    • Restricted KV cache test: python3 scripts/replay_workload.py ... --capacity-config configs/capacity/tight-1gb-kv.json
  3. Modal cloud execution:

    • GPU probe: modal run modal_app.py --mode gpu-probe
    • vLLM-related tests: Inference baseline, streaming, concurrent workloads, etc.

Experimental results are written to the results/ directory in JSON and CSV formats for analysis.
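The FIFO versus Shortest-cache comparison listed above can be illustrated without the repository's scripts. The following is a toy sketch with a single non-preemptive server; the Job record and simulate function are hypothetical, total token count stands in for KV-cache size, and "shortest-cache" is interpreted here as "serve the waiting request with the smallest projected cache first", which may differ from the repository's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    arrival: float  # seconds
    tokens: int     # input + output tokens, used as a proxy for KV-cache size


def simulate(jobs: list[Job], policy: str, sec_per_token: float = 0.01) -> dict[str, float]:
    """Serve one job at a time without preemption; return the completion time of each job."""
    pending = sorted(jobs, key=lambda j: j.arrival)  # stable sort keeps submission order on ties
    clock = 0.0
    done: dict[str, float] = {}
    while pending:
        ready = [j for j in pending if j.arrival <= clock]
        if not ready:
            clock = pending[0].arrival  # server idles until the next arrival
            continue
        if policy == "fifo":
            job = min(ready, key=lambda j: j.arrival)
        elif policy == "shortest-cache":
            job = min(ready, key=lambda j: j.tokens)
        else:
            raise ValueError(f"unknown policy: {policy}")
        clock += job.tokens * sec_per_token  # assumed constant cost per token
        done[job.name] = clock
        pending.remove(job)
    return done
```

For a long request (1000 tokens) and a short one (100 tokens) arriving together, FIFO finishes the short request only after the long one, while shortest-cache finishes it first at the cost of delaying the long request. That trade between mean latency and fairness to large requests is what the comparison is meant to expose.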


Section 05

KV Cache Research Focus and Challenges

KV Cache Research Focus

The KV cache is a key optimization for Transformer inference (it avoids recomputing keys and values for earlier tokens), but it raises three major challenges:

  1. Memory usage: The cache grows with batch size and sequence length and can occupy a large share of GPU memory;
  2. Fragmentation: Requests with different sequence lengths lead to memory fragmentation;
  3. Eviction strategy: The system must decide which KV data to discard when the cache is full.

The project systematically explores these issues through capacity-aware scheduling experiments (e.g., experiment-001-capacity-sweep).
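To see why memory is the first challenge, it helps to estimate the cache size directly. The sketch below uses the standard per-token estimate (one key and one value vector per layer and per KV head) plus a simple capacity check. The ModelShape record, the admit helper, and the example shape (32 layers, 8 KV heads, head dimension 128, fp16) are assumptions chosen for illustration; they are not read from the repository's config files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model description; field names are illustrative."""
    n_layers: int
    n_kv_heads: int      # fewer than the query heads when grouped-query attention is used
    head_dim: int
    bytes_per_elem: int  # 2 for fp16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def kv_footprint(m: ModelShape, seq_lens: list[int]) -> int:
    # Total cache bytes for a batch: grows linearly with both batch size and sequence length.
    return kv_bytes_per_token(m) * sum(seq_lens)


def admit(m: ModelShape, active: list[int], candidate: int, capacity_bytes: int) -> bool:
    # Capacity-aware admission: accept the candidate only if the whole batch still fits.
    return kv_footprint(m, active + [candidate]) <= capacity_bytes
```

With the assumed shape, each token costs 2 × 32 × 8 × 128 × 2 = 131072 bytes (128 KiB), so a 1 GiB cache budget holds 8192 tokens in total across all active requests. Two requests of 4096 tokens each fill it exactly, and a third must wait, be rejected, or trigger eviction.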


Section 06

Conclusions and Future Directions

Project Value Summary

This repository is an excellent example of ML systems education:

  • Provides a progressive learning path from simulator to real backend, allowing core concepts to be understood without expensive GPUs;
  • Emphasizes the measurement model itself, helping learners master the underlying logic of inference systems;
  • Covers frequently asked interview topics (e.g., request lifecycle, latency metric optimization, reproducing benchmarks).

Future Roadmap

  • Integrate real inference backends (vLLM, SGLang, TensorRT-LLM);
  • Triton kernel optimization;
  • Distributed inference and placement strategies;
  • Improve workload realism using recorded traces.

Learning Suggestions

The repository suits engineers and researchers who want to understand LLM inference systems in depth. They can iterate quickly with the local simulators and then validate results in the cloud.