Zing Forum

Reading

Building an LLM Inference Engine from Scratch: Full Implementation of PagedAttention, Continuous Batching, and OpenAI-Compatible API

This article provides an in-depth analysis of the llm-serving-engine project—a production-grade LLM inference engine built from scratch that fully replicates vLLM's core features, including PagedAttention, Continuous Batching, custom Transformer forward pass, etc. It supports running on Apple Silicon M2 and offers an OpenAI-compatible API.

LLM推理引擎PagedAttentionContinuous BatchingvLLMFastAPIApple Silicon本地部署OpenAI APIKV Cache
Published 2026-05-18 19:45Recent activity 2026-05-18 19:48Estimated read 6 min
Building an LLM Inference Engine from Scratch: Full Implementation of PagedAttention, Continuous Batching, and OpenAI-Compatible API
1

Section 01

Main Floor | Building an LLM Inference Engine from Scratch: Overview of Core Features and Value

This article introduces the open-source project llm-serving-engine—a production-grade, local-first LLM inference engine that implements vLLM's core features from scratch (PagedAttention, Continuous Batching, custom Transformer forward pass). It supports running on Apple Silicon M2 and provides an OpenAI-compatible API. For developers who want to deeply understand LLM inference systems, it is an excellent learning resource: you can see every line of implementation code and understand internal mechanisms like KV Cache management and request scheduling.

2

Section 02

Background | Problems Solved by the Project and Design Intent

The llm-serving-engine was developed by SuStackx0 and open-sourced on GitHub. It aims to help developers deeply understand the internal working principles of LLM inference systems, rather than just calling APIs. Unlike using vLLM directly, this project provides complete implementation details and supports running on consumer-grade hardware (e.g., M2), meeting needs like edge deployment and privacy protection.

3

Section 03

Core Technologies | Implementation of PagedAttention, Continuous Batching, etc.

  • PagedAttention: Splits KV Cache into fixed blocks to eliminate memory fragmentation. A physical block manager maintains block lifecycles, and shared memory improves utilization. For example, using TinyLlama on M2, the KV Cache only takes up 176MB, with total memory around 4.6GB (including model weights of 4.4GB)
  • Continuous Batching: Dynamically schedules requests, divided into Prefill (parallel processing of input tokens) and Decode (generating outputs one by one) phases. It supports request preemption and maintains stability under high load
  • RoPE: Implements Rotary Position Embedding from scratch, including sine/cosine table caching and rotation matrix application, verifying relative position invariance
  • Custom Forward Pass: After loading HuggingFace weights, injects custom implementations. It is compatible with pre-trained models while providing optimization space and cross-platform portability
4

Section 04

API Service | OpenAI-Compatible Interfaces and Quick Start

The engine provides OpenAI-compatible interfaces via FastAPI, including /v1/chat/completions (streaming output), /v1/completions, /v1/models, etc. Quick start steps:

  1. Install dependencies: pip install -r requirements.txt
  2. Download model: python scripts/download_model.py
  3. Start service: python scripts/run_server.py
  4. Call API: Send requests using curl or OpenAI client without modifying code
5

Section 05

Performance Testing | Benchmark Results on M2

Test results for 8 concurrent requests on an M2 machine:

  • Total output tokens: 713
  • Total time: 41.3 seconds
  • Throughput: 17.3 tokens/second
  • TTFT(p50): 1721.5ms (time to first token)
  • TPOT(p50): 298.4ms (time per subsequent token)
  • KV block usage: All released after test (0/256). With a pure Python implementation, these metrics are impressive on consumer-grade hardware
6

Section 06

Application Scenarios | Who Can Benefit from This?

The project's value targets four groups:

  1. AI system learners: Deeply understand internal mechanisms of LLM services (memory management, scheduling, attention, etc.)
  2. Edge deployment developers: Supports MPS/CPU backends, controllable memory, suitable for resource-constrained environments
  3. vLLM contributors: Simplified reference implementation to help understand vLLM's design and code structure
  4. Privacy-sensitive scenarios: Local inference, data does not leave the device, ensuring sensitive information security
7

Section 07

Summary and Outlook | Project Significance and Future Potential

The llm-serving-engine proves that a production-grade LLM inference engine can be implemented with pure Python + PyTorch. Its complete implementation provides an excellent platform for learning, experimentation, and deployment. As LLM applications become more widespread, the demand for efficient and understandable inference systems grows—this project lays the foundation for future optimization and expansion, making it worth researching and trying