# Building from Scratch for 15x Speedup: A Technical Deep Dive into a Pure PyTorch LLM Inference Engine

> This article provides an in-depth analysis of an LLM inference engine built from scratch, which achieves a 15x throughput improvement over naive inference on a T4 GPU through three core technologies: continuous batching, paged KV cache, and dynamic injection.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-13T06:13:22.000Z
- Last activity: 2026-06-13T06:19:36.546Z
- Popularity: 163.9
- Keywords: LLM inference, PyTorch, KV cache, continuous batching, vLLM, GPU optimization, large language models, inference engine, T4 GPU, open-source project
- Page link: https://www.zingnex.cn/en/forum/thread/15-pytorch-llm
- Canonical: https://www.zingnex.cn/forum/thread/15-pytorch-llm

---

## [Introduction] Pure PyTorch LLM Inference Engine: Three Core Technologies Behind the 15x Speedup

This article analyzes an open-source LLM inference engine built from scratch in pure PyTorch. It achieves a 15x throughput improvement over naive inference on a T4 GPU through three core technologies: continuous batching, paged KV cache, and dynamic injection. Rather than wrapping a black-box library, the project implements core components such as the scheduler and the KV cache itself, giving developers a chance to learn the underlying mechanisms of modern inference systems.

## Three Core Bottlenecks of Traditional LLM Inference

Traditional LLM inference faces three major challenges:
1. **GPU Idleness and Static Batching Inefficiency**: Static batching requires waiting for all requests in a batch to complete before starting a new one, leading to GPU resource waste;
2. **KV Cache Memory Fragmentation**: Naive implementations pre-allocate memory for the maximum sequence length for each request, causing over-allocation and memory waste;
3. **New Request Queuing Delay**: In static batching architectures, new requests must wait for the current batch to finish, leading to severe delay accumulation under high concurrency.
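
The first bottleneck can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the request lengths are hypothetical, and prefill cost is ignored. It computes the fraction of decode slots that do useful work when a static batch must run until its longest request finishes.

```python
def static_batch_utilization(output_lengths):
    """Fraction of decode slots doing useful work in one static batch.

    The batch runs for max(output_lengths) steps, and every step has
    len(output_lengths) slots; a finished request leaves its slot idle.
    """
    if not output_lengths:
        raise ValueError("need at least one request")
    steps = max(output_lengths)
    total_slots = steps * len(output_lengths)
    useful_slots = sum(output_lengths)
    return useful_slots / total_slots


# Hypothetical batch: three short answers and one long one.
lengths = [20, 35, 50, 400]
print(f"utilization: {static_batch_utilization(lengths):.1%}")
```

With these made-up lengths, roughly a third of the slots produce tokens; the rest sit idle while the 400-token request finishes.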

## Three Technical Pillars: Paged KV Cache, Continuous Batching, and Dynamic Injection

The project proposes three solutions to address these bottlenecks:
1. **Paged KV Cache**: Drawing inspiration from operating system virtual memory management, it divides the KV cache into fixed-size pages and dynamically allocates them on demand, improving memory utilization and reducing fragmentation;
2. **Continuous Batching**: The scheduler maintains a waiting queue and fills freed batch slots with new requests as soon as existing ones complete, breaking batch boundaries and keeping the GPU busy;
3. **Dynamic Request Injection**: Allows injecting new requests during the decode phase, mixing prefill and decode tasks to fully utilize GPU computing power and memory bandwidth.
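
As a rough illustration of the paging idea, here is a minimal sketch of a block pool that hands out fixed-size blocks on demand. It is not the project's `memory.py`: the method names, the block size, and the index-only bookkeeping are assumptions, and a real engine would back each block with a slice of a preallocated tensor.

```python
class BlockPool:
    """Toy pool of fixed-size KV cache blocks (pages), tracked by index only."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # request id -> list of block ids
        self.token_counts = {}                    # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token; allocate a block only when needed.

        Returns (block_id, offset), the place where the token's K/V would go.
        """
        table = self.block_tables.setdefault(request_id, [])
        count = self.token_counts.get(request_id, 0)
        if count == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1
        return table[-1], count % self.block_size

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


pool = BlockPool(num_blocks=4, block_size=16)
for _ in range(17):                               # 17 tokens need two 16-token blocks
    slot = pool.append_token("req-a")
print(slot, len(pool.free_blocks))
```

Because a request only ever holds whole blocks it has actually started filling, the waste per request is bounded by one partially filled block rather than by the maximum sequence length.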

## Performance Test Results: Evidence of 15x Throughput Improvement on T4 GPU

On the T4 GPU in Google Colab, the project achieves significant performance improvements:

| Mode | Throughput |
|------|------------|
| Naive Inference (Single Request) | ~30 tokens/sec |
| This Engine (Continuous Batching, batch=8) | 458 tokens/sec |
| Performance Improvement | ~15x |

This result is achieved through the synergy of the three core technologies.
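
The headline ratio follows directly from the two throughput figures. The short check below uses only the numbers in the table; reading 458 tokens/sec as the aggregate across all eight requests is an assumption, since the table does not say so explicitly.

```python
naive_tps = 30.0      # approximate single-request figure from the table
engine_tps = 458.0    # continuous-batching figure from the table
batch_size = 8        # batch size from the table

speedup = engine_tps / naive_tps
# Assumption: 458 tokens/sec is the total across the batch, so each
# request sees an average share of it.
per_request_tps = engine_tps / batch_size

print(f"speedup: {speedup:.2f}x")
print(f"average share per request: {per_request_tps:.2f} tokens/sec")
```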

## System Architecture: Complete Flow from Request Arrival to Completion

Request processing flow:
1. A request enters the scheduler's waiting queue; the scheduler decides whether to add it to the current batch based on system load;
2. The memory manager dynamically allocates KV cache pages from the BlockPool;
3. The inference engine executes the prefill phase to generate the first token, then enters the decode loop, generating one new token per step while checking request status and injecting new requests.

The project has a clear code structure: `request.py` manages the request lifecycle, `scheduler.py` implements scheduling logic, `memory.py` handles paged KV cache, `continuous_engine.py` implements core inference, and `benchmark.py` is used for throughput testing.
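
The flow above can be mimicked with a toy simulation that counts decode steps. This is a sketch under simplifying assumptions, not the logic in `scheduler.py`: the function names are hypothetical, every request has a positive output length, prefill is ignored, and each step yields exactly one token per running request.

```python
from collections import deque


def run_continuous(output_lengths, max_batch):
    """Count decode steps when free slots are refilled as soon as requests finish."""
    waiting = deque(enumerate(output_lengths))    # (request id, tokens to generate)
    running = {}                                  # request id -> tokens remaining
    steps = 0
    while waiting or running:
        # Injection: admit waiting requests into any free slots before the step.
        while waiting and len(running) < max_batch:
            request_id, length = waiting.popleft()
            running[request_id] = length
        # One decode step produces one token for every running request.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                del running[request_id]           # slot is free for the next step
        steps += 1
    return steps


def run_static(output_lengths, max_batch):
    """Count decode steps when each batch must fully finish before the next starts."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[start:start + max_batch])
    return steps


# Hypothetical workload: one long request among several short ones.
lengths = [400, 20, 35, 50, 30, 25, 40, 60]
print(run_static(lengths, max_batch=4), run_continuous(lengths, max_batch=4))
```

For this workload the static schedule needs 460 steps and the continuous one 400, because the short requests of the second batch are served in slots freed while the 400-token request is still running.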

## Engineering Insights: Key Understandings of GPU Utilization, Scheduler, and KV Cache

Core insights from the project author:
1. **GPU Utilization**: High memory usage does not necessarily improve performance. If memory bandwidth is already a bottleneck, adding more KV cache blocks will exacerbate resource competition;
2. **Scheduler Importance**: The correctness of scheduling logic takes precedence over micro-optimizations; a well-designed scheduler keeps the system stable;
3. **KV Cache Paging Strategy**: It is not an optional optimization but a necessity for large-scale deployment, directly affecting the number of concurrent requests.
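
The third point, that paging directly affects the number of concurrent requests, can be illustrated with a back-of-the-envelope calculation. The model shape, memory budget, and typical sequence length below are hypothetical values chosen for illustration, not measurements from the project.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V stored per token across all layers (factor 2 = K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent(budget_bytes, bytes_per_token, tokens_reserved_per_request):
    """How many requests fit if each one reserves the given number of token slots."""
    return budget_bytes // (bytes_per_token * tokens_reserved_per_request)


# Hypothetical model shape and memory budget, for illustration only.
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=16, head_dim=64)
budget = 4 * 1024**3              # 4 GiB set aside for the KV cache
max_seq_len, block_size = 2048, 16
typical_len = 300                 # assumed typical prompt + output length

# Paging reserves whole blocks, so round the typical length up to a block multiple.
reserved_paged = math.ceil(typical_len / block_size) * block_size

print(max_concurrent(budget, per_token, max_seq_len))     # reserve max length up front
print(max_concurrent(budget, per_token, reserved_paged))  # reserve only blocks in use
```

Under these assumptions, reserving the full 2048-token window fits 21 requests, while block-granular reservation fits 143 requests of typical length in the same budget.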

## Future Outlook and Community Value: From Student Project to Open-Source Learning Model

Future plans include: integrating the Flash Attention CUDA kernel, implementing speculative decoding, introducing INT8/FP16 quantization, and developing a streaming output API.

This project was completed independently by a BCA student who wanted to understand how vLLM works under the hood, reflecting the open-source community's spirit of learning from first principles.

## Conclusion: The Path from API Caller to System Understander

In today's rapidly evolving LLM technology landscape, engineers who deeply understand underlying systems have a competitive edge. This project provides clear code and documentation to help developers advance from 'API callers' to 'system understanders'. The 15x performance improvement is the result of deep problem understanding and careful engineering implementation.
