Reading

Building an LLM Inference Engine from Scratch: Full Implementation of PagedAttention, Continuous Batching, and OpenAI-Compatible API

This article provides an in-depth analysis of the llm-serving-engine project—a production-grade LLM inference engine built from scratch that fully replicates vLLM's core features, including PagedAttention, Continuous Batching, custom Transformer forward pass, etc. It supports running on Apple Silicon M2 and offers an OpenAI-compatible API.

LLM推理引擎PagedAttentionContinuous BatchingvLLMFastAPIApple Silicon本地部署OpenAI APIKV Cache

Published 2026-05-18 19:45Recent activity 2026-05-18 19:48Estimated read 6 min

Building an LLM Inference Engine from Scratch: Full Implementation of PagedAttention, Continuous Batching, and OpenAI-Compatible API

Section 01

Main Floor | Building an LLM Inference Engine from Scratch: Overview of Core Features and Value

This article introduces the open-source project llm-serving-engine—a production-grade, local-first LLM inference engine that implements vLLM's core features from scratch (PagedAttention, Continuous Batching, custom Transformer forward pass). It supports running on Apple Silicon M2 and provides an OpenAI-compatible API. For developers who want to deeply understand LLM inference systems, it is an excellent learning resource: you can see every line of implementation code and understand internal mechanisms like KV Cache management and request scheduling.

Section 02

Background | Problems Solved by the Project and Design Intent

The llm-serving-engine was developed by SuStackx0 and open-sourced on GitHub. It aims to help developers deeply understand the internal working principles of LLM inference systems, rather than just calling APIs. Unlike using vLLM directly, this project provides complete implementation details and supports running on consumer-grade hardware (e.g., M2), meeting needs like edge deployment and privacy protection.

Section 03

Core Technologies | Implementation of PagedAttention, Continuous Batching, etc.

PagedAttention: Splits KV Cache into fixed blocks to eliminate memory fragmentation. A physical block manager maintains block lifecycles, and shared memory improves utilization. For example, using TinyLlama on M2, the KV Cache only takes up 176MB, with total memory around 4.6GB (including model weights of 4.4GB)
Continuous Batching: Dynamically schedules requests, divided into Prefill (parallel processing of input tokens) and Decode (generating outputs one by one) phases. It supports request preemption and maintains stability under high load
RoPE: Implements Rotary Position Embedding from scratch, including sine/cosine table caching and rotation matrix application, verifying relative position invariance
Custom Forward Pass: After loading HuggingFace weights, injects custom implementations. It is compatible with pre-trained models while providing optimization space and cross-platform portability

Section 04

API Service | OpenAI-Compatible Interfaces and Quick Start

The engine provides OpenAI-compatible interfaces via FastAPI, including /v1/chat/completions (streaming output), /v1/completions, /v1/models, etc. Quick start steps:

Install dependencies: pip install -r requirements.txt
Download model: python scripts/download_model.py
Start service: python scripts/run_server.py
Call API: Send requests using curl or OpenAI client without modifying code

Section 05

Performance Testing | Benchmark Results on M2

Test results for 8 concurrent requests on an M2 machine:

Total output tokens: 713
Total time: 41.3 seconds
Throughput: 17.3 tokens/second
TTFT(p50): 1721.5ms (time to first token)
TPOT(p50): 298.4ms (time per subsequent token)
KV block usage: All released after test (0/256). With a pure Python implementation, these metrics are impressive on consumer-grade hardware

Section 06

Application Scenarios | Who Can Benefit from This?

The project's value targets four groups:

AI system learners: Deeply understand internal mechanisms of LLM services (memory management, scheduling, attention, etc.)
Edge deployment developers: Supports MPS/CPU backends, controllable memory, suitable for resource-constrained environments
vLLM contributors: Simplified reference implementation to help understand vLLM's design and code structure
Privacy-sensitive scenarios: Local inference, data does not leave the device, ensuring sensitive information security

Section 07

Summary and Outlook | Project Significance and Future Potential

The llm-serving-engine proves that a production-grade LLM inference engine can be implemented with pure Python + PyTorch. Its complete implementation provides an excellent platform for learning, experimentation, and deployment. As LLM applications become more widespread, the demand for efficient and understandable inference systems grows—this project lays the foundation for future optimization and expansion, making it worth researching and trying

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

Folkering OS: When the Operating System Itself Is AI—A Self-Evolving Bare-Metal Rust System

Folkering OS is the world's first AI-native bare-metal operating system, entirely written in Rust no_std without relying on Linux, POSIX, or libc. It can generate commands from scratch, compile them into WASM, and run them in 10 seconds, achieving true self-evolution.

Recent activity 2026-04-09 16:15