Section 01
Tiny-Infer Project Guide: 60 Days of Practice Building a CUDA Inference Engine from Scratch
Tiny-Infer is a 60-day educational project for building a large language model (LLM) inference engine in CUDA/C++. The goal is to build, from scratch, a lightweight engine that runs the Llama 3.2 1B model and integrates core optimization techniques such as Flash Attention, a paged KV cache, speculative decoding, and INT8 quantization. The project follows the principle of "correctness before speed" and uses a structured learning path to help learners master the underlying principles of LLM inference optimization. Its quantifiable goals are to raise greedy decoding throughput above 40 tokens/s and to cut memory usage by 50%.
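The two quantifiable goals can be made concrete with a little arithmetic. The following is a minimal C++ sketch, not code from the project: the function names (weight_bytes, memory_reduction, tokens_per_second, meets_targets) are hypothetical, and it assumes the memory baseline is FP16 weight storage, which the overview does not state. It ignores quantization scales, activations, and the KV cache.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store `num_params` weights at `bytes_per_param` each.
// Assumption: only weight storage is counted (no scales, activations, KV cache).
inline std::uint64_t weight_bytes(std::uint64_t num_params,
                                  std::uint64_t bytes_per_param) {
    return num_params * bytes_per_param;
}

// Fractional memory reduction relative to a baseline footprint,
// e.g. 0.5 means the optimized footprint is half the baseline.
inline double memory_reduction(std::uint64_t baseline_bytes,
                               std::uint64_t optimized_bytes) {
    assert(baseline_bytes > 0);
    return 1.0 - static_cast<double>(optimized_bytes) /
                     static_cast<double>(baseline_bytes);
}

// Decoding throughput: generated tokens divided by wall-clock seconds.
// Returns 0 for a non-positive duration instead of dividing by zero.
inline double tokens_per_second(std::uint64_t generated_tokens,
                                double elapsed_seconds) {
    return elapsed_seconds > 0.0
               ? static_cast<double>(generated_tokens) / elapsed_seconds
               : 0.0;
}

// True when both targets from the overview are met:
// throughput above 40 tokens/s and at least a 50% memory reduction.
inline bool meets_targets(double tps, double reduction) {
    return tps > 40.0 && reduction >= 0.5;
}
```

As a rough illustration, for a model of about 1.2 billion parameters, weight_bytes(1200000000, 2) gives about 2.4 GB for FP16 weights and weight_bytes(1200000000, 1) gives about 1.2 GB for INT8 weights, so weight storage alone would account for a 50% reduction under these assumptions. How the project actually defines and measures its memory baseline is not specified here.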