
Building from Scratch for 15x Speedup: A Technical Deep Dive into a Pure PyTorch LLM Inference Engine

This article provides an in-depth analysis of an LLM inference engine built from scratch, which achieves a 15x throughput improvement over naive inference on a T4 GPU through three core technologies: continuous batching, paged KV cache, and dynamic injection.

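Tags: LLM inference, PyTorch, KV cache, continuous batching, vLLM, GPU optimization, large language models, inference engine, T4 GPU, open-source project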

Published 2026-06-13 14:13 | Recent activity 2026-06-13 14:19 | Estimated read 7 min

Section 01

[Introduction] Pure PyTorch LLM Inference Engine: Three Core Technologies Behind the 15x Speedup

This article analyzes an open-source LLM inference engine written from scratch in pure PyTorch. It achieves a 15x throughput improvement over naive inference on a T4 GPU through three core technologies: continuous batching, paged KV cache, and dynamic injection. Rather than hiding the internals behind a black-box wrapper, the project exposes core components such as the scheduler and the KV cache, giving developers a chance to learn the underlying mechanisms of modern inference systems.

Section 02

Three Core Bottlenecks of Traditional LLM Inference

Traditional LLM inference faces three major challenges:

GPU Idleness and Static Batching Inefficiency: Static batching requires waiting for all requests in a batch to complete before starting a new one, leading to GPU resource waste;
KV Cache Memory Fragmentation: Naive implementations pre-allocate memory for the maximum sequence length for each request, causing over-allocation and memory waste;
New Request Queuing Delay: In static batching architectures, new requests must wait for the current batch to finish, leading to severe delay accumulation under high concurrency.
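The cost of static batching can be made concrete with a small calculation. The sketch below is illustrative only: the output lengths are hypothetical and `static_batch_utilization` is a helper invented for this example, not part of the project. It assumes every request in the batch occupies its slot until the longest request finishes.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps doing useful work in one static batch."""
    steps = max(output_lengths)                  # batch is held until the longest request ends
    total_slot_steps = steps * len(output_lengths)
    useful_slot_steps = sum(output_lengths)      # each request only needs its own length
    return useful_slot_steps / total_slot_steps


lengths = [10, 50, 20, 100]  # hypothetical output lengths, in tokens
utilization = static_batch_utilization(lengths)
print(f"useful slot-steps: {utilization:.0%}")
```

With these made-up lengths, fewer than half of the batch's slot-steps produce a token that anyone asked for; the rest are spent on slots whose request has already finished.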

Section 03

Three Technical Pillars: Paged KV Cache, Continuous Batching, and Dynamic Injection

The project proposes three solutions to address these bottlenecks:

Paged KV Cache: Drawing inspiration from operating system virtual memory management, it divides the KV cache into fixed-size pages and dynamically allocates them on demand, improving memory utilization and reducing fragmentation;
Continuous Batching: The scheduler maintains a waiting queue and fills freed batch slots with new requests as soon as existing ones complete, breaking batch boundaries and keeping the GPU busy;
Dynamic Request Injection: Allows injecting new requests during the decode phase, mixing prefill and decode tasks to fully utilize GPU computing power and memory bandwidth.
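To illustrate the paging idea, here is a minimal sketch of an on-demand block allocator. It is not the project's memory.py: the class layout, method names, and block size are assumptions made for this example, and it tracks only block bookkeeping, not the actual key/value tensors.

```python
class BlockPool:
    """Fixed pool of equally sized KV cache blocks, handed out on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size                # tokens per block
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # request id -> list of physical block ids
        self.num_tokens = {}                        # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token; return (block_id, offset) for its K/V entry."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:   # every block the request owns is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1
        return table[count // self.block_size], count % self.block_size

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


pool = BlockPool(num_blocks=4, block_size=16)
for _ in range(20):                  # a 20-token request needs ceil(20 / 16) = 2 blocks
    pool.append_token("req-a")
blocks_in_use = len(pool.block_tables["req-a"])
pool.free("req-a")
```

The per-request block table plays the role of a page table: logical token positions map to physical blocks that need not be contiguous, so a request only ever holds memory for tokens it has actually produced, rounded up to one block.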

Section 04

Performance Test Results: Evidence of 15x Throughput Improvement on T4 GPU

On the T4 GPU in Google Colab, the project achieves significant performance improvements:

Mode	Throughput
Naive Inference (Single Request)	~30 tokens/sec
This Engine (Continuous Batching, batch=8)	458 tokens/sec
Performance Improvement	~15x
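The gain is in aggregate throughput: the baseline serves a single request at a time, while the engine decodes a batch of eight requests together. The result comes from the three core technologies working in combination.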
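The arithmetic behind the table can be checked directly. The snippet below uses only the figures reported above; the per-request rate assumes the eight requests share the aggregate throughput evenly, which the source does not state.

```python
naive_tps = 30     # approximate single-request throughput from the table
engine_tps = 458   # aggregate throughput at batch=8 from the table
batch_size = 8

speedup = engine_tps / naive_tps            # aggregate gain over the baseline
per_request_tps = engine_tps / batch_size   # average rate per request, assuming an even split

print(f"speedup: {speedup:.1f}x, per-request: {per_request_tps:.2f} tokens/sec")
```

Under that even-split assumption, each request would also decode faster than the naive baseline, so the gain is not purely a matter of running more requests side by side.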

Section 05

System Architecture: Complete Flow from Request Arrival to Completion

Request processing flow:

A request enters the scheduler's waiting queue; the scheduler decides whether to add it to the current batch based on system load;
The memory manager dynamically allocates KV cache pages from the BlockPool;
The inference engine executes the prefill phase to generate the first token, then enters the decode loop, generating a new token each step while checking request status or injecting new requests.

The project has a clear code structure: request.py manages the request lifecycle, scheduler.py implements scheduling logic, memory.py handles paged KV cache, continuous_engine.py implements core inference, and benchmark.py is used for throughput testing.
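The request flow described above can be sketched as a toy simulation that counts decode steps. This is a simplified model, not the project's scheduler.py or continuous_engine.py: the function names and output lengths are hypothetical, prefill is treated as free, and each request is reduced to the number of tokens it still has to generate.

```python
from collections import deque


def run_continuous(output_lengths, max_batch):
    """Count decode steps when free slots are refilled from the queue every step."""
    waiting = deque(output_lengths)   # remaining tokens to generate, per queued request
    running = []                      # remaining tokens, per active request
    steps = 0
    while waiting or running:
        # Inject queued requests into any free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every active request.
        running = [left - 1 for left in running]
        # Retire finished requests so their slots free up immediately.
        running = [left for left in running if left > 0]
        steps += 1
    return steps


def run_static(output_lengths, max_batch):
    """Count decode steps when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[start:start + max_batch])
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # hypothetical output lengths
static_steps = run_static(lengths, max_batch=4)
continuous_steps = run_continuous(lengths, max_batch=4)
print(static_steps, continuous_steps)
```

For this workload the static schedule needs 200 steps and the continuous one 110, because short requests vacate their slots early and the second long request starts decoding at step 11 instead of step 101.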

Section 06

Engineering Insights: Key Understandings of GPU Utilization, Scheduler, and KV Cache

Core insights from the project author:

GPU Utilization: High memory usage does not necessarily improve performance. If memory bandwidth is already a bottleneck, adding more KV cache blocks will exacerbate resource competition;
Scheduler Importance: The correctness of scheduling logic takes precedence over micro-optimizations; a well-designed scheduler keeps the system stable;
KV Cache Paging Strategy: It is not an optional optimization but a necessity for large-scale deployment, directly affecting the number of concurrent requests.
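A back-of-the-envelope calculation shows why paging bounds the number of concurrent requests. The model shape, memory budget, and sequence lengths below are hypothetical values chosen for illustration, not measurements from the project. The only fixed relationship used is that each token stores one key and one value vector per layer and per KV head.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent(budget_bytes, reserved_tokens, bytes_per_token):
    """How many requests fit if each one reserves `reserved_tokens` of KV cache."""
    return budget_bytes // (reserved_tokens * bytes_per_token)


# Hypothetical model: 24 layers, 16 KV heads, head_dim 64, FP16 (2 bytes).
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=16, head_dim=64, dtype_bytes=2)
budget = 4 * 1024**3          # assume 4 GiB of GPU memory is set aside for KV cache

max_len = 2048                # naive scheme reserves the maximum length up front
actual_len = 250              # assumed typical sequence length
block_size = 16
paged_len = -(-actual_len // block_size) * block_size  # round up to whole blocks

preallocated = max_concurrent(budget, max_len, per_token)
paged = max_concurrent(budget, paged_len, per_token)
print(per_token, preallocated, paged)
```

Under these assumptions the same budget holds 21 requests with up-front reservation and 170 with paging, which is the sense in which the paging strategy sets the concurrency ceiling.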

Section 07

Future Outlook and Community Value: From Student Project to Open-Source Learning Model

Future plans include integrating the Flash Attention CUDA kernel, implementing speculative decoding, introducing INT8/FP16 quantization, and developing a streaming output API. The project was completed independently by a BCA student who wanted to understand how vLLM works under the hood, embodying the open-source community's spirit of learning from first principles.

Section 08

Conclusion: The Path from API Caller to System Understander

In today's rapidly evolving LLM technology landscape, engineers who deeply understand underlying systems have a competitive edge. This project provides clear code and documentation to help developers advance from 'API callers' to 'system understanders'. The 15x performance improvement is the result of deep problem understanding and careful engineering implementation.
