Zing Forum

GPU Direct Storage Cold Start Optimization: LLM Serverless Inference Acceleration Solution

This project explores using NVIDIA GPUDirect Storage, CRIU container snapshots, and CUDA Checkpoint/Restore technologies to optimize cold start and inference performance for LLM serverless inference, aiming to achieve sub-second GPU state initialization.

Tags: GPU Direct Storage, GDS, CRIU, CUDA Checkpoint, Cold Start Optimization, Serverless, LLM Inference, vLLM, Container Snapshot, GPU State Recovery
Published 2026-06-04 15:11 | Recent activity 2026-06-04 15:30 | Estimated read 9 min

Section 01

[Introduction] GPU Direct Storage Cold Start Optimization: LLM Serverless Inference Acceleration Solution

This project aims to reduce cold start latency for LLM serverless inference by combining NVIDIA GPUDirect Storage (GDS), CRIU container snapshots, and CUDA Checkpoint/Restore, with the goal of sub-second GPU state initialization. It is maintained by avaneesh1830, is open-sourced on GitHub (https://github.com/avaneesh1830/gpu-direct-storage-coldstarts), and was released on June 4, 2026. The project is currently in Week 1, which is devoted to researching the NVIDIA (NV) technology stack.

Section 02

Background: Cold Start Challenges in Serverless LLM Inference

Serverless computing offers pay-as-you-go pricing, auto-scaling, and no operational overhead for LLM inference, but cold start latency is a key bottleneck. When a function has not been called for a long time and its resources have been reclaimed, the next request must wait for container startup, model loading (gigabytes of weights), GPU initialization, and inference preparation, which together can take tens of seconds or even minutes.
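To see which of these phases dominates on a given machine, each one can be timed separately. The following is a minimal sketch, not code from the project: the `ColdStartTimer` class and the stage names are hypothetical, and the stage callables are placeholders that a real measurement would replace with actual container, model, and GPU setup.

```python
import time
from contextlib import contextmanager


class ColdStartTimer:
    """Records the wall-clock duration of each named cold start phase."""

    def __init__(self):
        self.durations = {}  # phase name -> seconds, in execution order

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the phase raises.
            self.durations[name] = time.perf_counter() - start

    def total(self):
        return sum(self.durations.values())

    def report(self):
        """Return one line per phase with its duration and share of the total."""
        total = self.total()
        lines = []
        for name, seconds in self.durations.items():
            share = 100.0 * seconds / total if total > 0 else 0.0
            lines.append(f"{name:<20} {seconds:8.3f} s  {share:5.1f} %")
        return "\n".join(lines)


def run_cold_start(timer, stages):
    """Run each (name, callable) stage in order under the timer."""
    for name, fn in stages:
        with timer.phase(name):
            fn()


if __name__ == "__main__":
    # Placeholder stages (hypothetical): substitute the real startup steps.
    stages = [
        ("container_startup", lambda: time.sleep(0.01)),
        ("model_loading", lambda: time.sleep(0.03)),
        ("gpu_init", lambda: time.sleep(0.01)),
        ("inference_prep", lambda: time.sleep(0.01)),
    ]
    timer = ColdStartTimer()
    run_cold_start(timer, stages)
    print(timer.report())
```

A per-phase breakdown like this shows whether weight loading or process and GPU initialization is the larger cost, which in turn indicates whether faster loading or snapshot restore is likely to help more.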

Existing solutions each have limitations: pre-provisioned concurrency increases cost, model quantization may reduce accuracy, layered loading is complex to implement, and CRIU snapshot recovery on its own struggles with GPU state, because the CUDA context is coupled to the hardware.

Section 03

Technical Route: Three Core Technologies and Project Plan

The project combines three technologies that are optimized together:

  1. NVIDIA GDS: Provides a direct DMA path between NVMe storage and GPU memory, avoiding the bounce buffer in host memory and reducing CPU involvement, which accelerates model weight loading;
  2. CRIU: A user-space process snapshot tool that supports container state saving and fast recovery;
  3. CUDA Checkpoint/Restore: Captures GPU states (context, memory content), supports partial cross-GPU recovery, and integrates with CRIU.
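The way the last two pieces fit together can be sketched as an ordered command sequence. This is a hedged illustration rather than the project's implementation: it assumes the `criu` binary and NVIDIA's `cuda-checkpoint` utility are installed, and it follows the toggle, dump, restore, toggle pattern that utility documents. The function names are hypothetical, and the required privileges, driver version, and exact flags depend on the environment.

```python
import shlex
import subprocess


def checkpoint_commands(pid, images_dir):
    """Commands to snapshot a CUDA process: suspend CUDA state, then dump with CRIU."""
    return [
        # Move GPU state into host memory and release the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the now CPU-only process tree to an image directory.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir, "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir):
    """Commands to bring the process back: CRIU restore, then resume CUDA state."""
    return [
        ["criu", "restore", "--shell-job", "--restore-detached", "--images-dir", images_dir],
        # CRIU restores the process under its original PID by default,
        # so the same PID is used to toggle CUDA state back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 12345 and /tmp/ckpt are placeholder values for illustration.
    run_all(checkpoint_commands(12345, "/tmp/ckpt"))
    run_all(restore_commands(12345, "/tmp/ckpt"))
```

The ordering matters: CRIU only sees host-side process state, so the GPU state has to be moved into host memory before the dump and moved back after the restore.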

8-week iteration plan:

| Week | Topic | Status | Description |
|------|-------|--------|-------------|
| 1 | NV Stack Overview | 🚧 In progress | Research the NVIDIA technology stack |
| 2 | LLM & Diffusion Model Baseline | Not started | Benchmarking for 8B/30B/120B models |
| 3 | InstantTensor Cross-GPU Benchmark | Not started | Testing across different GPU SKUs and PCIe generations |
| 4 | Container Checkpoint/Restore Ecosystem | Not started | Research container snapshot solutions |
| 5 | CRIU & CUDA Checkpoint | Not started | Implement GPU state snapshot |
| 6 | Dynamo Snapshot | Not started | PyTorch Dynamo integration |
| 7 | InstantTensor & vLLM Integration | Not started | SafeTensor loader/Omni integration |
| 8 | CuML/CuDF Exploration | Not started | Out-of-core execution and acceleration |

Key milestones include InstantTensor (fast tensor serialization, GDS integration) and vLLM integration (SafeTensor optimization, continuous batching combined with snapshot recovery).
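A loader that reads weights straight into GPU memory first needs to know where each tensor lives in the file. As a minimal sketch of that first step, the code below parses the header of a safetensors file using only the standard library. It relies on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes); the function names are hypothetical and this is not the project's InstantTensor code.

```python
import json
import os
import struct
import tempfile


def read_safetensors_index(path):
    """Map each tensor name to its dtype, shape, absolute file offset, and size."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    data_start = 8 + header_len  # tensor bytes begin right after the header
    index = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]  # relative to data_start
        index[name] = {
            "dtype": info["dtype"],
            "shape": info["shape"],
            "file_offset": data_start + begin,
            "nbytes": end - begin,
        }
    return index


def write_demo_file(path):
    """Write a tiny file in the safetensors layout (zero-filled demo data only)."""
    header = {
        "__metadata__": {"format": "pt"},
        "a": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]},
        "b": {"dtype": "F32", "shape": [3], "data_offsets": [8, 20]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        f.write(bytes(20))  # 8 bytes for "a" plus 12 bytes for "b"
    return len(header_bytes)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.safetensors")
        write_demo_file(demo)
        for name, entry in read_safetensors_index(demo).items():
            print(name, entry)
```

The resulting offset and byte count per tensor are what a direct-storage read would be issued against, so the header can be parsed on the CPU while the bulk data takes the direct path.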

Section 04

Technical Challenges and Solutions

The project faces four major challenges, each with candidate solutions:

  1. GPU State Portability: The CUDA context is bound to the hardware → Use CUDA virtual memory management APIs to abstract hardware details, and reinitialize the hardware-specific parts during recovery;
  2. Large Model Weight Loading: Models of 70B+ parameters need 140GB+ for weights in FP16 (2 bytes per parameter) → Layered loading (load the layers needed first for inference), asynchronous preloading, memory mapping;
  3. Balance Between Snapshot Size and Recovery Speed: Full snapshots are too large → Incremental snapshots, memory deduplication, compression;
  4. Framework Integration: Seamless integration with vLLM/TensorRT-LLM is required → A common interface layer, upstream contributions, compatible branches.
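The 140GB figure in item 2 follows from simple arithmetic, and the same arithmetic gives a lower bound on load time. The sketch below works it through; the 7 GB/s read bandwidth is an assumed placeholder rather than a measured value, and the function names are hypothetical.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_bytes(num_params, dtype="fp16"):
    """Size of the raw weights in bytes, ignoring any file format overhead."""
    return num_params * BYTES_PER_PARAM[dtype]


def load_seconds(num_bytes, bandwidth_gb_per_s):
    """Lower bound on load time at a sustained bandwidth given in GB/s (10^9 bytes/s)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    size = weight_bytes(70_000_000_000, "fp16")
    print(f"70B parameters in FP16: {size / 1e9:.0f} GB")
    # Assumed sustained read bandwidth, for illustration only.
    print(f"Load time at 7 GB/s: {load_seconds(size, 7.0):.0f} s")
```

Under these assumptions the weights alone take about 20 seconds to read, so faster storage paths by themselves cannot reach sub-second startup for a model of this size; that is the motivation for restoring from snapshots and loading layers on demand.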

Section 05

Application Scenarios: Four Practical Value Directions

The project can be applied to:

  1. Serverless LLM API Services: On-demand instance startup, sub-second startup as the target, and a potential cost reduction of over 10x;
  2. Edge Inference Devices: Fast model switching, on-demand loading of task-specific models, reduced resident memory;
  3. Multi-tenant Inference Platforms: Fast context switching, isolated per-user state, improved GPU utilization;
  4. Elastic Scaling Clusters: Kubernetes auto-scaling, fast instance startup to absorb load, and state saved when scaling down.

Section 06

Competitor Analysis and Project Innovations

Related projects and how they relate to this one:

| Project/Technology | Features | Relation to This Project |
|--------------------|----------|--------------------------|
| vLLM | High-performance LLM inference engine | Integration target |
| TensorRT-LLM | NVIDIA-optimized inference library | Potential integration |
| CRIU | Process checkpoint/restore | Core technology |
| NVIDIA GDS | GPU direct storage | Core technology |
| RunPod Serverless | Commercial serverless LLM platform | Application scenario |
| Banana.dev | Serverless GPU inference | Application scenario |

Project innovations:

  1. A systematic combination of GDS, CRIU, and CUDA Checkpoint, which the project presents as the first of its kind;
  2. An open-source, reproducible solution;
  3. Integration with the popular open-source inference engine vLLM;
  4. Comprehensive benchmarking across model scales, GPU SKUs, and PCIe generations.

Section 07

Current Status and Participation Methods

Project Status: Week 1 (NV Stack Overview), under active development.

Participation Methods:

  1. Follow the GitHub repository for updates;
  2. Participate in technical route discussions in Issues;
  3. Submit PRs to help implement components;
  4. Provide benchmark results from different hardware environments.

Expected Outcomes:

  • Open-source cold start optimization toolchain;
  • Detailed performance benchmark report;
  • vLLM integration patch;
  • Technical documentation and best practice guide.