
GPU Direct Storage Cold Start Optimization: LLM Serverless Inference Acceleration Solution

This project explores using NVIDIA GPUDirect Storage, CRIU container snapshots, and CUDA Checkpoint/Restore technologies to optimize cold start and inference performance for LLM serverless inference, aiming to achieve sub-second GPU state initialization.

Tags: GPU Direct Storage, GDS, CRIU, CUDA Checkpoint, cold start optimization, serverless, LLM inference, vLLM, container snapshots, GPU state recovery

Published 2026-06-04 15:11 | Recent activity 2026-06-04 15:30 | Estimated read 9 min

Section 01

[Introduction] GPU Direct Storage Cold Start Optimization: LLM Serverless Inference Acceleration Solution

This project aims to reduce cold start latency for LLM serverless inference by combining NVIDIA GPUDirect Storage (GDS), CRIU container snapshots, and CUDA Checkpoint/Restore, with the goal of sub-second GPU state initialization. It is maintained by avaneesh1830, open-sourced on GitHub (https://github.com/avaneesh1830/gpu-direct-storage-coldstarts), and was released on June 4, 2026. The project is currently in Week 1, which is devoted to surveying the NVIDIA technology stack.

Section 02

Background: Cold Start Challenges in Serverless LLM Inference

Serverless computing offers pay-as-you-go pricing, auto-scaling, and no infrastructure to operate for LLM inference, but cold start latency is a key bottleneck. When a function has not been called for a long time and its resources have been reclaimed, the next request must wait for container startup, model loading (gigabytes of weights), GPU initialization, and inference preparation, which together can take tens of seconds or even minutes.
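To see which of these steps dominates on a given deployment, each one can be timed separately. The following is a minimal sketch of such a per-phase timer. The class name `ColdStartProfile` and the phase names are illustrative assumptions and are not taken from the project repository.

```python
import time
from contextlib import contextmanager


class ColdStartProfile:
    """Collects wall-clock durations for named cold start phases.

    Hypothetical helper for illustration; not part of the project.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the timer can be tested
        self.durations = {}      # phase name -> accumulated seconds

    @contextmanager
    def phase(self, name):
        """Time the enclosed block and add the result to `name`."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        """Sum of all recorded phase durations, in seconds."""
        return sum(self.durations.values())

    def dominant_phase(self):
        """Name of the phase with the largest accumulated duration."""
        return max(self.durations, key=self.durations.get)
```

In use, each step would be wrapped in `with profile.phase("model_loading"): ...`, after which `profile.dominant_phase()` points to the step most worth optimizing.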

Existing mitigations each have limits: provisioned concurrency increases cost, model quantization may reduce accuracy, layered loading is complex to implement, and CRIU snapshot recovery on its own struggles with GPU state, because the CUDA context is coupled to the hardware.

Section 03

Technical Route: Three Core Technologies and Project Plan

The project uses a three-layer technology stack for collaborative optimization:

NVIDIA GDS: Data moves by DMA directly between NVMe SSDs and GPU memory, avoiding a bounce buffer in CPU memory, which accelerates model weight loading;
CRIU: A user-space process checkpoint tool that saves container state and restores it quickly;
CUDA Checkpoint/Restore: Captures GPU state (context and memory contents), supports partial cross-GPU recovery, and integrates with CRIU.
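The way the last two layers fit together can be sketched as a command sequence: GPU state is first moved out of the device with the `cuda-checkpoint` utility, CRIU then dumps the process like any CPU-only process, and the order is reversed on restore. The sketch below only builds the command lines and does not run them. The function names are hypothetical, and the flags follow NVIDIA's published `cuda-checkpoint` plus CRIU example, so they should be checked against the installed driver and CRIU versions.

```python
import shlex


def checkpoint_commands(pid, images_dir):
    """Commands to checkpoint the CUDA process tree rooted at `pid`."""
    return [
        # 1. Suspend CUDA in the process and move device state to host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # 2. With GPU state held in host memory, dump the process with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir):
    """Commands to restore the process and then resume its CUDA state."""
    return [
        # 1. Recreate the process from the image directory (same PID).
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # 2. Toggle again to copy state back to the GPU and resume CUDA.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def render(commands):
    """Format command lists as shell lines for review or logging."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Keeping the commands as data makes the ordering explicit and lets a wrapper decide how to execute them, for example with `subprocess.run(cmd, check=True)` under the privileges CRIU requires.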

8-week iteration plan:

Week	Topic	Status	Description
1	NV Stack Overview	🚧 In Progress	Research NVIDIA technology stack
2	LLM & Diffusion Model Baseline	To Be Started	Benchmarking for 8B/30B/120B models
3	InstantTensor Cross-GPU Benchmark	To Be Started	Testing across different GPU SKUs and PCIe generations
4	Container Checkpoint/Restore Ecosystem	To Be Started	Research container snapshot solutions
5	CRIU & CUDA Checkpoint	To Be Started	Implement GPU state snapshot
6	Dynamo Snapshot	To Be Started	PyTorch Dynamo integration
7	InstantTensor & vLLM Integration	To Be Started	SafeTensor loader/Omni integration
8	CuML/CuDF Exploration	To Be Started	Out-of-core execution and acceleration

Key milestones include InstantTensor (fast tensor serialization, GDS integration) and vLLM integration (SafeTensor optimization, continuous batching combined with snapshot recovery).
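A loader that issues direct storage reads needs to know where each tensor's bytes sit in the file. The safetensors format makes this cheap to find: the file starts with an 8-byte little-endian header length, followed by a JSON header that records each tensor's dtype, shape, and byte offsets. The sketch below builds such an index with the standard library only. It is an illustration of the file layout, not the project's InstantTensor code, and the function name `read_safetensors_index` is an assumption.

```python
import json
import struct


def read_safetensors_index(path):
    """Map tensor name -> (absolute_offset, nbytes, dtype, shape).

    Reads only the header; tensor data is never loaded into memory.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    # "data_offsets" are relative to the first byte after the header.
    data_start = 8 + header_len
    index = {}
    for name, info in header.items():
        if name == "__metadata__":   # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        index[name] = (
            data_start + begin,       # absolute file offset of the tensor
            end - begin,              # size in bytes
            info["dtype"],
            tuple(info["shape"]),
        )
    return index
```

With absolute offsets and sizes in hand, a loader can schedule one read per tensor, or coalesce neighboring tensors into larger reads, without parsing the file again.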

Section 04

Technical Challenges and Solutions

The project faces four major challenges and potential solutions:

GPU State Portability: CUDA context is bound to hardware → Use CUDA virtual memory management APIs to abstract hardware details and reinitialize hardware-related parts during recovery;
Large Model Weight Loading: Weights for 70B+ models reach 140GB+ at 16-bit precision → Layered loading (prioritize the layers needed first for inference), asynchronous preloading, memory mapping;
Balance Between Snapshot Size and Recovery Speed: Full snapshots are too large → Incremental snapshots, memory deduplication, compression algorithms;
Framework Integration: Need seamless integration with vLLM/TensorRT-LLM → Common interface layer, upstream contributions, compatible branches.
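The 140GB figure in the weight loading challenge follows from simple arithmetic: parameter count times bytes per parameter. The same arithmetic shows why raw bandwidth alone does not reach the sub-second goal. The sketch below works through it. The 7 GB/s read bandwidth is an illustrative assumption, not a measurement from the project.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_bytes(num_params, dtype="fp16"):
    """Total size of the model weights in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


def load_seconds(num_bytes, bandwidth_gb_per_s):
    """Lower bound on load time at a sustained bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


size = weight_bytes(70_000_000_000, "fp16")   # 140_000_000_000 bytes = 140 GB
seconds = load_seconds(size, 7.0)             # assumed 7 GB/s -> 20.0 s
```

Loading all 140 GB in under one second would require more than 140 GB/s of sustained bandwidth, which is why the listed solutions focus on loading less up front (layered loading), overlapping I/O with startup (asynchronous preloading), and shrinking what must be restored (incremental snapshots, deduplication, compression).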

Section 05

Application Scenarios: Four Practical Value Directions

The project can be applied to:

Serverless LLM API Services: On-demand instance startup, sub-second response, and a targeted cost reduction of over 10x;
Edge Inference Devices: Fast model switching, on-demand loading of task models, reduced resident memory;
Multi-tenant Inference Platforms: Fast context switching, isolated user states, improved GPU utilization;
Elastic Scaling Clusters: K8s auto-scaling, fast instance startup to absorb load when scaling up, and state saved to snapshots when scaling down.

Section 06

Competitor Analysis and Project Innovations

Similar projects and their relations:

Project/Technology	Features	Relation to This Project
vLLM	High-performance LLM inference engine	Integration target
TensorRT-LLM	NVIDIA-optimized inference library	Potential integration
CRIU	Process checkpoint/restore	Core technology
NVIDIA GDS	GPU direct storage	Core technology
RunPod Serverless	Commercial serverless LLM platform	Application scenario
Banana.dev	Serverless GPU inference	Application scenario

Project innovations:

First systematic combination of GDS+CRIU+CUDA Checkpoint technologies;
Provides open-source reproducible solutions;
Integration with popular open-source inference engine vLLM;
Comprehensive benchmarking across model scales, GPU SKUs, and PCIe generations.

Section 07

Current Status and Participation Methods

Project Status: In Week 1 (NV Stack Overview) phase, under active development.

Participation Methods:

Follow the GitHub repository for updates;
Participate in technical route discussions in Issues;
Submit PRs to help implement components;
Provide benchmark results from different hardware environments.

Expected Outcomes:

Open-source cold start optimization toolchain;
Detailed performance benchmark report;
vLLM integration patch;
Technical documentation and best practice guide.
