Zing Forum


CUDA 90-Day Intensive Challenge: Building Production-Grade LLM Inference Infrastructure with Rust and C++

A systematic 90-day learning plan exploring how to write native GPU kernel functions using Rust and CUDA C++, and build memory-safe, high-concurrency AI inference systems.

Tags: CUDA, GPU programming, Rust, AI infrastructure, LLM inference, high-performance computing, cuda-oxide, SGLang, Candle, PyTorch
Published 2026-06-10 15:15 | Recent activity 2026-06-10 15:23 | Estimated read 7 min

Section 01

Introduction to the CUDA 90-Day Intensive Challenge Project

This project is a 90-day AI infrastructure challenge started by wenfeizou, who is moving from systems development into AI infrastructure and high-performance inference engines. It is practice-driven: through runnable code, benchmarks, and performance analysis, it explores how to build memory-safe, high-concurrency, production-grade LLM inference systems with Rust and CUDA C++. The sections below cover the project background, technical approach, 90-day roadmap, repository structure, tooling, and key insights.


Section 02

Project Background: Transition from Systems Development to AI Infrastructure

As LLMs have developed rapidly, AI infrastructure has become a popular field, yet engineers who can build high-performance inference systems remain scarce. This project documents the author's transition from systems development to AI infrastructure and puts practice first: fewer vague notes, more runnable code, benchmarks, and profiling records. It is meant to be a record of engineering experiments, not just a set of study notes.


Section 03

Core Technical Roadmap: Rust and CUDA C++ on Parallel Tracks

Reasons for Choosing Rust: Memory safety (many memory errors are caught at compile time), zero-cost abstractions (performance close to C++), a modern toolchain (Cargo), and FFI (interoperability with C++ through C-compatible interfaces). The core experiments use the cuda-oxide crate to write GPU kernels natively in Rust.

Importance of CUDA C++: Understanding GPU architecture and reusing existing code requires mastering the thread hierarchy, the memory hierarchy, the warp execution model, and performance optimization techniques such as coalescing global memory accesses and avoiding shared-memory bank conflicts.
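
To make the terms "coalesced access" and "bank conflict" concrete, here is a minimal CPU-side sketch in Rust. For one warp in which thread t reads the 4-byte word at index t * stride, it counts how many global-memory segments the warp touches and how many distinct words land in the same shared-memory bank. This is not code from the project repository: the function names are hypothetical, and the model parameters (32 banks, 4-byte bank words, 128-byte segments) are assumptions for illustration, since actual behavior depends on the GPU generation.

```rust
use std::collections::HashSet;

// Assumed model parameters (illustrative, not queried from a device).
const NUM_BANKS: usize = 32; // shared-memory banks
const WORD_BYTES: usize = 4; // bytes per bank word
const SEGMENT_BYTES: usize = 128; // global-memory segment size

/// Worst-case number of distinct words that map to one bank when thread `t`
/// of a warp accesses word index `t * stride_words`.
/// 1 means conflict-free; N means an N-way conflict under this model.
/// Threads reading the same word count once (treated as a broadcast).
fn max_bank_conflict(warp_size: usize, stride_words: usize) -> usize {
    let mut words_per_bank: Vec<HashSet<usize>> = vec![HashSet::new(); NUM_BANKS];
    for t in 0..warp_size {
        let word = t * stride_words;
        words_per_bank[word % NUM_BANKS].insert(word);
    }
    words_per_bank.iter().map(|w| w.len()).max().unwrap_or(0)
}

/// Number of distinct segments a warp touches in global memory for the same
/// strided access pattern. Fewer segments means better coalescing.
fn segments_touched(warp_size: usize, stride_words: usize) -> usize {
    let segments: HashSet<usize> = (0..warp_size)
        .map(|t| t * stride_words * WORD_BYTES / SEGMENT_BYTES)
        .collect();
    segments.len()
}
```

Under this model, a stride of 32 words places all 32 threads in one bank, while a stride of 33 words spreads them across all banks. That is the idea behind the common trick of padding shared-memory tiles by one column.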


Section 04

90-Day Roadmap: From Kernels to an End-to-End Inference Pipeline

The roadmap is divided into three phases:

  1. CUDA Kernel Basics: Vector addition, matrix multiplication, memory optimization (shared memory/coalesced access), reduction algorithms, convolution operations.
  2. Rust GPU Programming: cuda-oxide basics, GPU memory management, Rust-C++ CUDA interoperability, asynchronous execution (async/await + CUDA streams).
  3. LLM Inference Infrastructure: Transformer operator optimization, KV Cache management, dynamic batch scheduling, distributed inference architecture.
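
Of the Phase 1 topics, reduction shows clearly how a GPU algorithm differs from its serial form. The sketch below simulates the usual shared-memory tree reduction on the CPU in plain Rust: each "block" halves its active range at every step, and the per-block partial sums are combined at the end. It is neither a GPU kernel nor code from the repository; the names block_reduce_sum and reduce_sum and the block size are hypothetical.

```rust
/// CPU simulation of a per-block tree reduction. At each step, "thread" `tid`
/// adds the element `stride` positions away, and the stride is halved.
fn block_reduce_sum(block: &[f32]) -> f32 {
    // Pad to a power of two with zeros so every halving step is uniform.
    let n = block.len().next_power_of_two();
    let mut shared = vec![0.0f32; n];
    shared[..block.len()].copy_from_slice(block);

    let mut stride = n / 2;
    while stride > 0 {
        for tid in 0..stride {
            shared[tid] += shared[tid + stride];
        }
        // In a real CUDA kernel, a block-wide barrier would be needed here
        // before the next step reads the values written in this one.
        stride /= 2;
    }
    shared[0]
}

/// Splits the input into blocks, reduces each one independently, then sums
/// the per-block partial results (the second pass of a two-pass reduction).
fn reduce_sum(data: &[f32], block_size: usize) -> f32 {
    assert!(block_size > 0, "block_size must be positive");
    let partials: Vec<f32> = data.chunks(block_size).map(block_reduce_sum).collect();
    partials.iter().sum()
}
```

Because the additions happen in a different order than a left-to-right loop, floating-point results can differ slightly from a serial sum on general data, which is one reason reduction kernels are usually checked against a baseline with a tolerance.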

Section 05

Repository Structure and Supporting Skill Layers

Repository Structure: Organized by concern, with directories such as days (daily experiments), kernels (C++/Rust kernels), frameworks (PyTorch/Candle), runtime (SGLang), infra (support layer), and benchmarks (performance tests).

Support Layers:

  • Linux: Driver installation, Nsight tools, dynamic library management, performance observation.
  • C++: CMake build, memory model, template programming, Host/Device code organization.
  • Rust: Unsafe code, ownership management, FFI, asynchronous runtime.
  • Python: PyTorch baseline verification, data generation, correctness checking.
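
The correctness-checking step mentioned in the list above comes down to comparing a kernel's output against a trusted baseline within a tolerance. The following is a minimal sketch of that workflow in plain Rust, with a naive matrix multiply standing in for the baseline and a loop-tiled version standing in for the optimized kernel. All names here are hypothetical and nothing is taken from the repository. The tolerance rule is modeled on the one used by torch.allclose, |actual - expected| <= atol + rtol * |expected|.

```rust
/// Reference implementation: row-major C (m x n) = A (m x k) * B (k x n).
fn matmul_naive(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Candidate implementation: the same product computed tile by tile,
/// mimicking the loop structure of a blocked kernel.
fn matmul_tiled(a: &[f32], b: &[f32], m: usize, k: usize, n: usize, tile: usize) -> Vec<f32> {
    assert!(tile > 0, "tile must be positive");
    let mut c = vec![0.0f32; m * n];
    for i0 in (0..m).step_by(tile) {
        for p0 in (0..k).step_by(tile) {
            for j0 in (0..n).step_by(tile) {
                for i in i0..(i0 + tile).min(m) {
                    for p in p0..(p0 + tile).min(k) {
                        let a_ip = a[i * k + p];
                        for j in j0..(j0 + tile).min(n) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}

/// Element-wise tolerance check: |actual - expected| <= atol + rtol * |expected|.
/// Returns false on a length mismatch or if any comparison involves NaN.
fn allclose(actual: &[f32], expected: &[f32], rtol: f32, atol: f32) -> bool {
    actual.len() == expected.len()
        && actual
            .iter()
            .zip(expected)
            .all(|(a, e)| (a - e).abs() <= atol + rtol * e.abs())
}
```

A typical check would compute both results on the same inputs and assert allclose(&candidate, &reference, 1e-5, 1e-6); the tolerances are placeholders and would need tuning per data type and problem size.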

Section 06

Key Tools and Experimental Environment Configuration

Key Tools:

  • SGLang: High-performance inference runtime with features like structured generation, RadixAttention, request scheduling; learning value includes mastering serving system design, KV Cache management, etc.
  • PyTorch: Serves as the baseline for correctness verification and performance comparison; also a way to learn CUDA extensions and compiler technologies.
  • Candle: Hugging Face's Rust-native ML framework; useful for learning tensor operations, model loading, and CUDA backend integration.

Experimental Environment: Ubuntu 26.04 LTS, CUDA 13.3, Rust 1.98+, with Nsight Systems and Nsight Compute for profiling.
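
RadixAttention, listed among SGLang's features above, is built on the idea that requests sharing a token prefix can reuse the KV Cache entries already computed for that prefix. The sketch below illustrates only that idea with a plain per-token trie in Rust. It is not SGLang's implementation, which uses a radix tree and also has to manage eviction and the actual cache memory; PrefixCache and match_and_insert are hypothetical names.

```rust
use std::collections::HashMap;

/// One trie node per cached token position.
#[derive(Default)]
struct PrefixNode {
    children: HashMap<u32, PrefixNode>,
}

/// Tracks which token prefixes have been seen, as a stand-in for
/// "KV Cache entries exist for this prefix".
#[derive(Default)]
struct PrefixCache {
    root: PrefixNode,
}

impl PrefixCache {
    /// Returns how many leading tokens of `tokens` were already cached,
    /// then records the whole sequence so later requests can reuse it.
    fn match_and_insert(&mut self, tokens: &[u32]) -> usize {
        let mut node = &mut self.root;
        let mut reused = 0;
        for &tok in tokens {
            // After the first miss we walk freshly created nodes, which have
            // no children, so this only counts the shared leading prefix.
            if node.children.contains_key(&tok) {
                reused += 1;
            }
            node = node.children.entry(tok).or_default();
        }
        reused
    }
}
```

In a serving system, the returned count would correspond to the tokens whose attention keys and values do not need to be recomputed during prefill; only the remaining suffix has to be processed.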

Section 07

Learning Insights and Project Summary

Learning Insights:

  1. Practice First: Write code, run experiments, do analysis, and understand performance through benchmarks and profiling.
  2. Systems Thinking: Need to master full-stack knowledge, focus on performance, engineering quality, and continuous learning.
  3. Rust's Potential: Memory safety, high performance, and concurrency support; combined with frameworks like Candle, Rust has broad prospects in AI infrastructure.

Summary: The project offers a clear roadmap, an engineering-oriented learning method, and a complete technology stack, making it a useful reference for anyone learning AI infrastructure. It is worth following to see what Rust and CUDA can do together over the course of the challenge.