# GPU Direct Storage Cold Start Optimization: LLM Serverless Inference Acceleration Solution

> This project explores using NVIDIA GPUDirect Storage, CRIU container snapshots, and CUDA Checkpoint/Restore technologies to optimize cold start and inference performance for LLM serverless inference, aiming to achieve sub-second GPU state initialization.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-04T07:11:48.000Z
- Last activity: 2026-06-04T07:30:03.527Z
- Popularity: 154.7
- Keywords: GPU Direct Storage, GDS, CRIU, CUDA Checkpoint, cold start optimization, serverless, LLM inference, vLLM, container snapshots, GPU state restore
- Page link: https://www.zingnex.cn/en/forum/thread/gpu-direct-storage-llm
- Canonical: https://www.zingnex.cn/forum/thread/gpu-direct-storage-llm

---

## Introduction

This project aims to cut cold start latency for serverless LLM inference by combining NVIDIA GPUDirect Storage (GDS), CRIU container snapshots, and CUDA Checkpoint/Restore, with the goal of sub-second GPU state initialization. It is maintained by avaneesh1830, open-sourced on GitHub (https://github.com/avaneesh1830/gpu-direct-storage-coldstarts), and was released on June 4, 2026. The project is currently in Week 1, surveying the NVIDIA (NV) technology stack.

## Background: Cold Start Challenges in Serverless LLM Inference

Serverless computing offers pay-as-you-go pricing, auto-scaling, and no infrastructure to operate for LLM inference, but cold start latency is a key bottleneck. When a function has been idle long enough for its resources to be reclaimed, the next request must wait for container startup, model loading (gigabytes of weights), GPU initialization, and inference preparation, which can take tens of seconds or even minutes.
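
To see why model loading alone can account for tens of seconds, a rough lower bound on load time is the weight size divided by the bandwidth of the slowest link. The sketch below assumes 2 bytes per parameter (16-bit weights) and round-number bandwidths; the bandwidth figures and function names are illustrative assumptions, not measurements from this project.

```python
def weight_bytes(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in bytes (16-bit weights = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param


def min_load_seconds(params_billion: float, bandwidth_gb_s: float,
                     bytes_per_param: int = 2) -> float:
    """Lower bound on load time: size / bandwidth, ignoring every overhead."""
    return weight_bytes(params_billion, bytes_per_param) / (bandwidth_gb_s * 1e9)


# Assumed round-number bandwidths in GB/s, for the arithmetic only.
ASSUMED_BANDWIDTH_GB_S = {"single NVMe SSD": 7.0, "PCIe Gen4 x16 link": 32.0}

if __name__ == "__main__":
    for size_b in (8, 70):
        for link, bandwidth in ASSUMED_BANDWIDTH_GB_S.items():
            seconds = min_load_seconds(size_b, bandwidth)
            print(f"{size_b}B parameters over {link}: at least {seconds:.1f} s")
```

Under these assumptions, 70B parameters at 2 bytes each come to 140 GB, which needs at least 20 s from a 7 GB/s drive and about 4.4 s over a 32 GB/s link, while an 8B model (16 GB) has a floor of 0.5 s over the same link. Real loads are slower, since the bound ignores container startup, deserialization, and GPU initialization.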

Existing solutions each have limitations: provisioned concurrency increases cost, model quantization may reduce accuracy, layered loading is complex to implement, and CRIU snapshot restore on its own struggles with GPU state because the CUDA context is coupled to the hardware.

## Technical Route: Three Core Technologies and Project Plan

The project combines three technologies, one per layer:
1. **NVIDIA GDS**: Provides a direct DMA path between NVMe storage and GPU memory, avoiding the bounce buffer in CPU memory and accelerating model weight loading;
2. **CRIU**: A user-space checkpoint/restore tool that saves the state of a process or container and restores it quickly;
3. **CUDA Checkpoint/Restore**: Captures GPU state (context and memory contents), supports partial cross-GPU recovery, and integrates with CRIU.
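
How the second and third layers fit together can be sketched as a command sequence. This is a minimal sketch, assuming NVIDIA's `cuda-checkpoint` utility and the `criu` CLI are installed; the helper names `checkpoint_commands` and `restore_commands`, the PID, and the image directory are hypothetical, and the functions only build argument lists without running anything.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build (but do not run) the commands that checkpoint a CUDA process."""
    return [
        # 1. Suspend CUDA: device state is copied to host memory, GPU released.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # 2. Dump the process tree, which now holds no GPU resources.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the commands that restore the process and resume CUDA."""
    return [
        # 3. Restore the process from the image directory (CRIU keeps the PID).
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # 4. Resume CUDA: device state is copied back onto a GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # PID 12345 and /tmp/llm-ckpt are placeholders for illustration.
    steps = (checkpoint_commands(12345, "/tmp/llm-ckpt")
             + restore_commands(12345, "/tmp/llm-ckpt"))
    for cmd in steps:
        print(shlex.join(cmd))
```

The ordering is the point of the sketch: GPU state has to be moved into host memory before CRIU dumps the process, and CUDA is resumed only after CRIU has restored it. `--shell-job` is only needed when the target process was started from an interactive shell.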

8-week iteration plan:
| Week | Topic | Status | Description |
|------|------|------|------|
| 1 | NV Stack Overview | 🚧 In Progress | Research NVIDIA technology stack |
| 2 | LLM & Diffusion Model Baseline | To Be Started | Benchmarking for 8B/30B/120B models |
| 3 | InstantTensor Cross-GPU Benchmark | To Be Started | Testing across different GPU SKUs and PCIe generations |
| 4 | Container Checkpoint/Restore Ecosystem | To Be Started | Research container snapshot solutions |
| 5 | CRIU & CUDA Checkpoint | To Be Started | Implement GPU state snapshot |
| 6 | Dynamo Snapshot | To Be Started | PyTorch Dynamo integration |
| 7 | InstantTensor & vLLM Integration | To Be Started | SafeTensor loader/Omni integration |
| 8 | cuML/cuDF Exploration | To Be Started | Out-of-core execution and acceleration |

Key milestones include InstantTensor (fast tensor serialization, GDS integration) and vLLM integration (SafeTensor optimization, continuous batching combined with snapshot recovery).
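
For the baseline benchmarks in the plan, one simple way to attribute cold-start latency is to time each phase separately. The following is a minimal sketch; `PhaseTimer`, the phase names, and the `time.sleep` stand-ins are placeholders for illustration, not the project's actual harness.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Records wall-clock duration per named phase, in insertion order."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the phase raises, so failures are still timed.
            self.durations[name] = time.perf_counter() - start

    def report(self):
        """Return (name, seconds, share_of_total) for each phase."""
        total = sum(self.durations.values())
        return [(name, secs, secs / total if total else 0.0)
                for name, secs in self.durations.items()]


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("container_start"):
        time.sleep(0.01)   # stand-in for real work
    with timer.phase("weight_load"):
        time.sleep(0.03)   # stand-in for real work
    for name, secs, share in timer.report():
        print(f"{name}: {secs * 1000:.1f} ms ({share:.0%})")
```

Reporting each phase as a share of the total makes it easy to see whether a given optimization (faster weight loading versus snapshot restore) targets the phase that actually dominates on a particular model and GPU.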

## Technical Challenges and Solutions

The project identifies four major challenges, each with candidate solutions:
1. **GPU State Portability**: The CUDA context is bound to the hardware → use the CUDA virtual memory management APIs to abstract hardware details, and reinitialize the hardware-specific parts during restore;
2. **Large Model Weight Loading**: 70B+ models reach 140 GB+ of weights at 16-bit precision (2 bytes per parameter) → layered loading (prioritizing inference layers), asynchronous preloading, and memory mapping;
3. **Balance Between Snapshot Size and Recovery Speed**: Full snapshots are too large → incremental snapshots, memory deduplication, and compression;
4. **Framework Integration**: vLLM and TensorRT-LLM must be supported seamlessly → a common interface layer, upstream contributions, and compatibility branches.
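
The third challenge can be illustrated with a content-addressed chunk store: each snapshot is split into fixed-size chunks, each chunk is keyed by its hash, and a later snapshot only stores chunks that are not already present. This is a generic sketch of incremental snapshots with deduplication and compression, not the project's design; `ChunkStore`, the 4 MiB default chunk size, and the choice of SHA-256 with zlib are all illustrative assumptions.

```python
import hashlib
import zlib


class ChunkStore:
    """Content-addressed store: identical chunks are kept once, compressed."""

    def __init__(self, chunk_size: int = 4 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.chunks = {}  # SHA-256 hex digest -> zlib-compressed chunk

    def save(self, data: bytes) -> list[str]:
        """Store a snapshot and return its manifest (ordered chunk digests)."""
        manifest = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # only unseen content is stored
                self.chunks[digest] = zlib.compress(chunk)
            manifest.append(digest)
        return manifest

    def load(self, manifest: list[str]) -> bytes:
        """Rebuild a snapshot from its manifest."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in manifest)


if __name__ == "__main__":
    store = ChunkStore(chunk_size=4)          # tiny chunks for the demo
    base = store.save(b"AAAABBBBCCCC")
    chunks_after_base = len(store.chunks)
    delta = store.save(b"AAAABBBBDDDD")       # only the last chunk differs
    print("chunks stored:", chunks_after_base, "then", len(store.chunks))
    assert store.load(delta) == b"AAAABBBBDDDD"
```

The trade-off named in the challenge shows up directly: smaller chunks deduplicate better but produce longer manifests and more lookups at restore time, and compression saves space at the cost of decompression work on the restore path.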

## Application Scenarios: Four Practical Value Directions

The project targets four scenarios:
1. **Serverless LLM API Services**: On-demand instance startup, sub-second response, and a projected cost reduction of over 10x;
2. **Edge Inference Devices**: Fast model switching, on-demand loading of task-specific models, and reduced resident memory;
3. **Multi-tenant Inference Platforms**: Fast context switching, isolated user state, and improved GPU utilization;
4. **Elastic Scaling Clusters**: Kubernetes auto-scaling, fast instance startup to absorb load, and state saved during scale-down.

## Competitor Analysis and Project Innovations

Related projects and technologies, and how each relates to this project:
| Project/Technology | Features | Relation to This Project |
|-----------|------|--------------|
| vLLM | High-performance LLM inference engine | Integration target |
| TensorRT-LLM | NVIDIA-optimized inference library | Potential integration |
| CRIU | Process checkpoint/restore | Core technology |
| NVIDIA GDS | GPU direct storage | Core technology |
| RunPod Serverless | Commercial serverless LLM platform | Application scenario |
| Banana.dev | Serverless GPU inference | Application scenario |

What the project presents as its innovations:
1. A systematic combination of GDS, CRIU, and CUDA Checkpoint, which it describes as the first of its kind;
2. An open-source, reproducible solution;
3. Integration with vLLM, a popular open-source inference engine;
4. Comprehensive benchmarking across model scales, GPU SKUs, and PCIe generations.

## Current Status and How to Participate

**Project Status**: Currently in the Week 1 phase (NV Stack Overview), under active development.

**How to Participate**:
1. Follow the GitHub repository for updates;
2. Participate in technical route discussions in Issues;
3. Submit PRs to help implement components;
4. Provide benchmark results from different hardware environments.

**Expected Outcomes**:
- Open-source cold start optimization toolchain;
- Detailed performance benchmark report;
- vLLM integration patch;
- Technical documentation and best practice guide.
