Pico-vLLM: Implementing an Industrial-Grade LLM Inference Engine from Scratch

How a personal learning project fully replicates the core tech stacks of vLLM and SGLang, achieves an inference speed of 97 tok/s on a single RTX 5070 card, and reaches industrial-grade performance via Prefix Caching and PD separation.

Tags: LLM Inference, vLLM, PagedAttention, Prefix Caching, Triton, CUDA Optimization, Distributed Inference, Qwen, Learning Project

Published 2026-05-30 20:33 | Recent activity 2026-05-30 20:50 | Estimated read 5 min

Section 01

Pico-vLLM: A Personal Learning Project Replicating Industrial-Grade LLM Inference Engines

Pico-vLLM is a personal learning project by Koas-W, hosted on GitHub, that reimplements the core stacks of vLLM and SGLang from scratch so that developers can see how LLM inference engines work internally. Its reported performance is on par with industrial engines: on a single RTX 5070 it reaches 97 tok/s (versus 95 tok/s for vLLM) at 78% bandwidth utilization. The key optimizations are Prefix Caching and Prefill-Decode (PD) separation. The project targets the Qwen2.5-1.5B model and is meant for teaching, not as a replacement for production tools.

Section 02

Project Background & Positioning

The project starts from a common pain point: reading the vLLM or SGLang source code is not enough to build a complete picture of how the engines work internally. Pico-vLLM is positioned as a teaching tool that shows how the core components fit together, not as a production replacement. Even so, for Qwen2.5-1.5B in bfloat16 it reaches 97 tok/s on an RTX 5070 (vLLM: 95 tok/s) at 78% bandwidth utilization, which suggests the low-level optimizations are implemented well.

Section 03

Core Technical Architecture

Model Layer: a handwritten Qwen2.5-1.5B implementation that does not use Hugging Face transformers. It covers RoPE, GQA, SwiGLU, and RMSNorm, plus fusions for QKV, gate_up, and an in-place rotate_half.

Kernel Layer: custom kernels written in Triton (PagedAttention prefill and decode, fused RoPE + KV store, RMSNorm + residual add, SwiGLU), optimized for Tensor Cores and for fewer HBM accesses.

Scheduling & Cache: Continuous Batching with an FCFS scheduler keeps the GPU busy. Prefix Caching combines a block-level BlockManager with a token-level radix tree, double reference counting, and LRU eviction with lazy deletion, giving a 2.56x average TTFT speedup.

Distributed: Tensor Parallelism over NCCL in synchronous and asynchronous modes, and PD separation with heterogeneous parallelism and KV head remapping. PD separation reduces ITL from roughly 10 ms to 2 ms (reported as 5.2x) and tail latency from 50 ms to 2 ms (25x).
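To make the Prefix Caching idea above concrete, here is a minimal sketch of a block-level cache with reference counts and lazy LRU eviction. It is not Pico-vLLM's BlockManager: the class name `PrefixBlockCache`, the `BLOCK_SIZE` value, and the choice to key each block by its full token prefix are assumptions made for illustration, and the radix tree and double reference counting mentioned above are left out.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per KV block; illustrative value only


class PrefixBlockCache:
    """Toy block-level prefix cache with ref counts and lazy LRU eviction."""

    def __init__(self, num_blocks):
        self.free_ids = list(range(num_blocks))  # blocks holding no cached data
        self.block_of = {}         # token prefix (tuple) -> physical block id
        self.prefix_of = {}        # physical block id -> token prefix
        self.ref = {}              # physical block id -> number of active users
        self.idle = OrderedDict()  # ref == 0 blocks, in eviction order

    def _take_block(self):
        if self.free_ids:
            return self.free_ids.pop()
        if not self.idle:
            raise MemoryError("no free or evictable KV blocks")
        # Lazy deletion: cached contents are dropped only when space is needed.
        victim, _ = self.idle.popitem(last=False)
        del self.block_of[self.prefix_of.pop(victim)]
        del self.ref[victim]
        return victim

    def allocate(self, token_ids):
        """Map each full block of a prompt to a physical block.

        Returns (block_ids, cached_tokens), where cached_tokens counts the
        leading tokens whose KV blocks were already in the cache.
        """
        block_ids, cached_tokens, still_prefix = [], 0, True
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            # Simplification: the key is the whole prefix up to this block.
            prefix = tuple(token_ids[:end])
            block = self.block_of.get(prefix)
            if block is not None:
                self.idle.pop(block, None)  # block is in use again
                if still_prefix:
                    cached_tokens = end
            else:
                still_prefix = False
                block = self._take_block()
                self.block_of[prefix] = block
                self.prefix_of[block] = prefix
                self.ref[block] = 0
            self.ref[block] += 1
            block_ids.append(block)
        return block_ids, cached_tokens

    def release(self, block_ids):
        """Drop one reference per block; unreferenced blocks stay cached."""
        for block in reversed(block_ids):  # tail blocks become evictable first
            self.ref[block] -= 1
            if self.ref[block] == 0:
                self.idle[block] = None
```

Two prompts that share their first eight tokens would share two physical blocks, so the second call to `allocate` reports eight cached tokens and prefill only has to process the remainder. Releasing in reverse order means the tail of a sequence is evicted before the shared prefix it depends on.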

Section 04

Performance Data Deep Dive

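Consumer hardware: on an RTX 5070 (PCIe, bfloat16), Pico-vLLM reaches 97 tok/s at 78% bandwidth utilization, versus 95 tok/s and 77% for vLLM.

H200: throughput is 1.05x to 1.12x that of vLLM for input lengths of 64 to 512 tokens and output lengths of 16 to 1024 tokens. Pico-vLLM lags only at 8192 input tokens, where the gap in prefill optimization shows.

TTFT: Pico-vLLM is 1.19x to 1.65x slower than vLLM on time to first token because of differences in the prefill kernels. This is the focus of future improvement.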
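The article does not say how the bandwidth utilization figures were measured. One common back-of-the-envelope estimate for single-stream decoding assumes that every generated token requires reading all model weights once, and ignores KV cache and activation traffic. The sketch below encodes that estimate; the function names and every input value are hypothetical and are not taken from the project.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Bytes of weights streamed per decode step (2 bytes per bfloat16 value)."""
    return num_params * bytes_per_param


def decode_bandwidth_utilization(tokens_per_s, bytes_read_per_token, peak_bytes_per_s):
    """Estimated fraction of peak bandwidth used by batch-size-1 decoding.

    Assumes each token reads bytes_read_per_token from memory exactly once.
    """
    if peak_bytes_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return tokens_per_s * bytes_read_per_token / peak_bytes_per_s


# Illustrative round numbers, not measurements from the article:
# a 1e9-parameter bfloat16 model decoding at 100 tok/s on a 400 GB/s device.
utilization = decode_bandwidth_utilization(
    tokens_per_s=100,
    bytes_read_per_token=weight_bytes(1_000_000_000),
    peak_bytes_per_s=400e9,
)
print(f"{utilization:.0%}")
```

Under this model, tok/s and bandwidth utilization move together for a fixed model and device, which is why the two figures are reported side by side.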

Section 05

Development Tools & Engineering Practices

CI system: a full test chain runs from environment checks through operator tests to single-card and multi-card inference, with CPU-only support.

Benchmark: end-to-end comparisons against vLLM and SGLang, with reports in JSONL, CSV, Markdown, and PNG.

Profiling: nsys support and cross-hardware comparison (RTX 5070 over PCIe versus B200 over NVLink). For Qwen2.5-1.5B with 2000-token requests, CPU overhead is only 6%, which indicates that CPU and GPU work overlap well.
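The latency metrics quoted in this article, TTFT and ITL, can be derived from per-token arrival timestamps. The sketch below shows one way an end-to-end benchmark might compute them for a single request. It is not the project's benchmark code; `latency_metrics` and its dictionary keys are hypothetical names.

```python
from statistics import mean


def latency_metrics(request_start, token_times):
    """Compute TTFT, ITL, and throughput for one request.

    request_start: timestamp (seconds) at which the request was submitted.
    token_times: timestamps (seconds) at which each output token arrived,
                 in increasing order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # Time to first token: queueing plus prefill plus the first decode step.
    ttft = token_times[0] - request_start
    # Inter-token latencies: gaps between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "max_itl_s": max(gaps, default=0.0),  # a simple tail-latency proxy
        "tokens_per_s": len(token_times) / total if total > 0 else float("inf"),
    }


metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)
```

In a real harness the timestamps would come from a monotonic clock such as `time.perf_counter()`, and the per-request results would be aggregated into percentiles before being written to a report.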

Section 06

Future Roadmap & Key Takeaways

Roadmap: asynchronous TP with inter-layer communication-compute overlap; NIXL for PD transport; Chunked Prefill; copy-on-write (COW) for prefix blocks; and eviction by GPU-to-CPU offload.

Takeaways: implementing a system from scratch is an effective way to understand it; Pico-vLLM is a useful learning resource with clear code and documentation; a personal project can reach industrial-level performance; and a deep understanding of underlying principles is valuable for AI infrastructure work.
