Zing Forum

Pico-vLLM: Implementing an Industrial-Grade LLM Inference Engine from Scratch

How a personal learning project reimplements the core technology stack of vLLM and SGLang, reaches 97 tok/s on a single RTX 5070, and approaches industrial-grade performance with Prefix Caching and Prefill-Decode (PD) separation.

Tags: LLM inference, vLLM, PagedAttention, Prefix Caching, Triton, CUDA optimization, distributed inference, Qwen, learning project
Published 2026-05-30 20:33 | Recent activity 2026-05-30 20:50 | Estimated read 5 min

Section 01

Pico-vLLM: A Personal Learning Project Replicating Industrial-Grade LLM Inference Engines

Pico-vLLM is a personal learning project by Koas-W, hosted on GitHub, that reimplements the key components of vLLM and SGLang from scratch to help developers understand how LLM inference engines work. On a single RTX 5070 it reports an inference speed of 97 tok/s at 78% bandwidth utilization, slightly ahead of vLLM's 95 tok/s. Key optimizations include Prefix Caching and Prefill-Decode (PD) separation. The project targets the Qwen2.5-1.5B model and is meant for teaching, not as a replacement for production tools.

Section 02

Project Background & Positioning

The project starts from the observation that reading the vLLM and SGLang source code is not enough to build a complete picture of how they work internally. It is positioned as a teaching tool: not a production replacement, but a way to learn how the core components fit together. For Qwen2.5-1.5B in bfloat16 it reaches 97 tok/s on an RTX 5070 (vLLM: 95 tok/s) at 78% bandwidth utilization, which indicates that the low-level optimizations are effective.
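To make the bandwidth-utilization figure concrete, here is a minimal sketch of how such a number is commonly estimated. It assumes single-request decoding is memory bound and that each generated token reads all model weights once, ignoring KV cache and activation traffic. The function name and the round input numbers are illustrative assumptions, not the project's code or measurements.

```python
def decode_bandwidth_utilization(tokens_per_s: float,
                                 weight_bytes: float,
                                 peak_bytes_per_s: float) -> float:
    """Estimate the fraction of peak memory bandwidth used during decode.

    Assumption: every generated token streams all weights from memory once,
    so achieved bandwidth is roughly tokens_per_s * weight_bytes.
    """
    if peak_bytes_per_s <= 0:
        raise ValueError("peak_bytes_per_s must be positive")
    achieved_bytes_per_s = tokens_per_s * weight_bytes
    return achieved_bytes_per_s / peak_bytes_per_s


# Illustrative round numbers only (not measurements from the project):
# 3 GB of bfloat16 weights, 100 tok/s, 400 GB/s peak memory bandwidth.
util = decode_bandwidth_utilization(100.0, 3e9, 400e9)
print(f"{util:.0%}")  # 75%
```

Substituting a real parameter count (two bytes per parameter in bfloat16) and the card's published peak bandwidth gives an estimate that can be compared with a reported utilization figure.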

Section 03

Core Technical Architecture

Model Layer: A handwritten Qwen2.5-1.5B implementation that does not depend on Hugging Face transformers. It covers RoPE, GQA, SwiGLU, and RMSNorm, and fuses several operations: the QKV projection, the gate_up projection, and an in-place rotate_half.

Kernel Layer: Custom GPU kernels written in Triton: PagedAttention for prefill and decode, fused RoPE + KV store, RMSNorm + residual add, and SwiGLU. They are tuned to use Tensor Cores and to reduce HBM accesses.

Scheduling & Cache: Continuous Batching with an FCFS scheduler keeps the GPU busy. Prefix Caching combines a block-level BlockManager with a token-level radix tree, double reference counting, and LRU eviction with lazy deletion; it yields a 2.56x average speedup in TTFT (time to first token).

Distributed: Tensor Parallelism over NCCL in sync and async modes. PD separation, with heterogeneous parallelism and KV head remapping, reduces ITL (inter-token latency) from 10 ms to 2 ms (5.2x) and tail latency from 50 ms to 2 ms (25x).
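The block-level side of Prefix Caching can be illustrated with a toy cache. This is a simplified sketch, not the project's BlockManager: it uses a single reference count, has no radix tree, and evicts eagerly instead of lazily. The class name `PrefixBlockCache` and its methods are hypothetical. Each full block is keyed by the entire token prefix up to its end, so two requests that share a prefix map to the same physical blocks.

```python
from collections import OrderedDict


class PrefixBlockCache:
    """Toy block-level prefix cache (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # never-used physical block ids
        self.by_key = {}                     # token prefix -> block id
        self.key_of = {}                     # block id -> token prefix
        self.refs = {}                       # block id -> number of active users
        self.lru = OrderedDict()             # unreferenced cached blocks, oldest first

    def _take_block(self) -> int:
        if self.free:
            return self.free.pop()
        if not self.lru:
            raise MemoryError("no free or evictable blocks")
        block, _ = self.lru.popitem(last=False)   # evict least recently released
        del self.by_key[self.key_of.pop(block)]   # forget its old contents
        return block

    def allocate(self, tokens):
        """Return (block_ids, cached_tokens) for the full blocks of `tokens`.

        The trailing partial block, if any, is not handled in this sketch.
        """
        blocks, cached, contiguous = [], 0, True
        for i in range(len(tokens) // self.block_size):
            key = tuple(tokens[: (i + 1) * self.block_size])
            block = self.by_key.get(key)
            if block is None:
                contiguous = False            # later hits no longer shorten prefill
                block = self._take_block()
                self.by_key[key] = block
                self.key_of[block] = key
            else:
                self.lru.pop(block, None)     # in use again, so not evictable
                if contiguous:
                    cached += self.block_size
            self.refs[block] = self.refs.get(block, 0) + 1
            blocks.append(block)
        return blocks, cached

    def release(self, blocks):
        # Release tail blocks first so they are evicted before shared prefixes.
        for block in reversed(blocks):
            self.refs[block] -= 1
            if self.refs[block] == 0:
                self.lru[block] = None


cache = PrefixBlockCache(num_blocks=4, block_size=2)
a_blocks, a_hit = cache.allocate([1, 2, 3, 4, 5])     # cold: nothing cached yet
cache.release(a_blocks)
b_blocks, b_hit = cache.allocate([1, 2, 3, 4, 9, 9])  # shares the first two blocks
print(a_hit, b_hit)  # 0 4
```

The second request reuses the first two blocks, so only its last block would need prefill. That reuse is the mechanism behind a TTFT speedup from prefix caching.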

Section 04

Performance Data Deep Dive

Consumer Hardware: RTX 5070 (PCIe, bfloat16): 97 tok/s (vLLM: 95 tok/s) at 78% bandwidth utilization (vLLM: 77%).

H200: Throughput is 1.05-1.12x that of vLLM for 64-512 input tokens and 16-1024 output tokens. It falls behind only at 8192 input tokens, where the prefill path is less optimized.

TTFT: Pico-vLLM is 1.19-1.65x slower than vLLM because of differences in the prefill kernels, which are a focus for future improvement.

Section 05

Development Tools & Engineering Practices

CI System: A full test chain runs from environment checks through operator tests to single-card and multi-card inference, with CPU-only support.

Benchmark: End-to-end comparisons against vLLM and SGLang, with reports written as JSONL, CSV, Markdown, and PNG.

Profiling: nsys support and cross-hardware comparison (RTX 5070 over PCIe vs. B200 over NVLink). For Qwen2.5-1.5B with 2000-token requests, CPU overhead is only 6%, which indicates that CPU and GPU work overlap well.

Section 06

Future Roadmap & Key Takeaways

Roadmap: Async TP with inter-layer communication-compute overlap; NIXL for PD transport; Chunked Prefill; copy-on-write (COW) for prefix blocks; eviction with GPU-to-CPU offload.

Takeaways: Implementing a system from scratch is an effective way to understand it; Pico-vLLM is a useful learning resource with clear code and documentation; a personal project can reach industrial-level performance; and a deep understanding of the underlying principles is valuable for AI infrastructure work.
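Chunked Prefill is listed as future work, so the following is only a sketch of the general idea, not the project's design. Under a fixed per-step token budget, running decode requests get one token each and the remaining budget goes to the next slice of a pending prompt. The function name `plan_step` and the budget values are hypothetical.

```python
def plan_step(token_budget: int, num_decoding: int, prefill_remaining: int):
    """Split one step's token budget between decode and a prefill chunk.

    Decode requests are served first (one token each), then whatever budget
    is left is spent on the next chunk of the pending prompt.
    """
    decode_tokens = min(num_decoding, token_budget)
    prefill_tokens = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_tokens


# Illustrative numbers: budget of 8 tokens per step, 3 requests decoding,
# and one new 12-token prompt waiting for prefill.
remaining = 12
steps = []
while remaining > 0:
    decode_tokens, prefill_tokens = plan_step(8, 3, remaining)
    steps.append((decode_tokens, prefill_tokens))
    remaining -= prefill_tokens

print(steps)  # [(3, 5), (3, 5), (3, 2)]
```

The prompt is prefilled over three steps while the three decoding requests keep producing a token every step, instead of stalling behind one long prefill.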