The course adopts a 'depth-first, breadth-progressive' design philosophy: its 28-month learning cycle is divided into four sequential stages:
Stage 1: Foundation Building (Months 1-6)
Builds foundational understanding, covering GPU architecture and CUDA programming, linear algebra and numerical computing, and deep learning basics. Each topic pairs theoretical explanation with code implementation and performance-analysis assignments.
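To give a flavor of the kind of performance-analysis exercise this stage might assign (the specific exercise, sizes, and helper names below are illustrative assumptions, not taken from the course materials), the following Python sketch times a naive triple-loop matrix multiply against NumPy's BLAS-backed np.matmul and reports achieved GFLOP/s:

```python
import time
import numpy as np

def naive_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Textbook triple-loop matrix multiply: correct, but cache-unfriendly and slow."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m), dtype=a.dtype)
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

def bench(fn, *args, repeats=3):
    """Best wall-clock time over several runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 128  # kept small: the pure-Python loop is very slow
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t_naive = bench(naive_matmul, a, b, repeats=1)
t_blas = bench(np.matmul, a, b)

flops = 2 * n**3  # one multiply and one add per inner-loop step
print(f"naive: {t_naive:.3f} s ({flops / t_naive / 1e9:.3f} GFLOP/s)")
print(f"BLAS : {t_blas:.6f} s ({flops / t_blas / 1e9:.1f} GFLOP/s)")
```

The multiple-orders-of-magnitude gap between the two is exactly the motivation for the stage's emphasis on hardware-aware numerical computing.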
Stage 2: Inference Engine (Months 7-14)
Focuses on full-stack optimization of LLM inference: model compilation and graph optimization, operator optimization and kernel development, memory management and KV Cache optimization, and quantization and compression. Students implement a simplified inference engine hands-on.
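To make the KV Cache idea concrete, here is a minimal Python sketch (the class name KVCache and the shapes and dtypes are illustrative assumptions): the cache stores each decoded token's keys and values once, so attention over the prefix never has to be recomputed at every step:

```python
import numpy as np

class KVCache:
    """Minimal per-layer KV cache: preallocate to max_len, append one step at a time."""
    def __init__(self, max_len: int, n_heads: int, head_dim: int):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=np.float16)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=np.float16)
        self.len = 0

    def append(self, k_step: np.ndarray, v_step: np.ndarray):
        """Store K/V for the newly decoded token instead of recomputing history."""
        self.k[self.len] = k_step
        self.v[self.len] = v_step
        self.len += 1

    def view(self):
        """Return all cached K/V up to the current sequence length."""
        return self.k[: self.len], self.v[: self.len]

# Decoding loop: each step attends over the cached prefix plus the new token.
cache = KVCache(max_len=2048, n_heads=8, head_dim=64)
for step in range(4):
    k_new = np.random.rand(8, 64).astype(np.float16)
    v_new = np.random.rand(8, 64).astype(np.float16)
    cache.append(k_new, v_new)
    k_all, v_all = cache.view()  # shapes grow: (step+1, 8, 64)
print(k_all.shape)
```

A production engine adds paging, eviction, and batch dimensions on top of this, but the core trade of memory for recomputation is the same.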
Stage 3: Distributed Systems (Months 15-22)
Covers data and model parallelism, service orchestration and scheduling, and inference serving. The assignment is to build a multi-GPU parallel inference service cluster and stress-test it.
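As one small illustration of the model-parallelism ideas covered here (a single-process simulation; the shard count and matrix sizes are arbitrary assumptions), this sketch shows column-wise tensor parallelism for a linear layer: the weight matrix is split across "devices", each computes a partial output locally, and concatenation plays the role of the all-gather:

```python
import numpy as np

def column_parallel_linear(x, w, n_shards):
    """Split the weight matrix column-wise across shards (one per 'device'),
    compute each partial output locally, then all-gather by concatenation."""
    shards = np.split(w, n_shards, axis=1)        # each 'device' holds a contiguous column block
    partials = [x @ w_i for w_i in shards]        # local matmul per device
    return np.concatenate(partials, axis=-1)      # all-gather along the feature axis

x = np.random.rand(4, 512).astype(np.float32)     # (batch, d_in)
w = np.random.rand(512, 2048).astype(np.float32)  # (d_in, d_out)

y_parallel = column_parallel_linear(x, w, n_shards=4)
y_reference = x @ w
print(np.allclose(y_parallel, y_reference, atol=1e-4))  # True: the sharding is exact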
Stage 4: Production Practice (Months 23-28)
Integrates full-stack performance tuning, observability and debugging, and cost and energy-efficiency optimization. The capstone project requires contributing to an open-source inference framework or implementing an innovative optimization feature.
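On the observability side, one metric a stress test would typically report is tail latency. The sketch below (the lognormal latency distribution and request size are simulated assumptions, not measurements) computes the p50/p95/p99 percentiles and mean throughput that a serving dashboard might track:

```python
import random
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: serving dashboards report p50/p95/p99."""
    ordered = sorted(samples)
    idx = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[idx]

# Simulated per-request latencies (seconds) from a stress test.
random.seed(0)
latencies = [random.lognormvariate(-1.5, 0.6) for _ in range(10_000)]
tokens_per_request = 128

print(f"p50: {percentile(latencies, 50) * 1000:.1f} ms")
print(f"p95: {percentile(latencies, 95) * 1000:.1f} ms")
print(f"p99: {percentile(latencies, 99) * 1000:.1f} ms")
# Throughput if requests ran back-to-back on a single replica:
print(f"mean throughput: {tokens_per_request / statistics.mean(latencies):.0f} tok/s")
```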