Thaw's restoration process (thaw) uses a pipelined architecture to maximize hardware bandwidth utilization:
Step 1: Virtual Initialization . The system first quickly initializes vLLM with virtual weights, skipping time-consuming disk I/O. This step is almost instantaneous, allowing the service framework to enter the ready state immediately.
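The idea behind Step 1 can be sketched in a few lines. This is a minimal illustration and not Thaw's actual code: the `VirtualWeight` class, the manifest layout, and the `materialize` helper are all hypothetical names. It shows only the principle that shape metadata is recorded up front while the bytes arrive later.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional


@dataclass
class VirtualWeight:
    """Hypothetical placeholder: shape metadata only, no bytes loaded yet."""
    name: str
    shape: tuple
    itemsize: int                      # bytes per element
    data: Optional[bytearray] = None   # stays None until restoration fills it

    @property
    def nbytes(self) -> int:
        return prod(self.shape) * self.itemsize


def virtual_init(manifest):
    """Build placeholders from (name, shape, itemsize) entries; touches no disk."""
    return {name: VirtualWeight(name, tuple(shape), itemsize)
            for name, shape, itemsize in manifest}


def materialize(weights, name, payload):
    """Attach restored bytes to a placeholder, checking the expected size."""
    weight = weights[name]
    if len(payload) != weight.nbytes:
        raise ValueError(f"{name}: expected {weight.nbytes} bytes, got {len(payload)}")
    weight.data = bytearray(payload)
```

In this sketch the framework could report itself ready as soon as `virtual_init` returns, and `materialize` would be called as the restored bytes arrive.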
Step 2: Double-Buffered Pipeline DMA. Thaw uses two CUDA streams for pipelined transfer:
- One stream reads snapshot data from NVMe to pinned host memory
- The other stream asynchronously transfers data from host memory to the GPU
The two streams run in parallel, overlapping disk reads with PCIe transfers and eliminating the idle time of a traditional serial load. The O_DIRECT flag bypasses the kernel page cache, further reducing memory-copy overhead.
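The double-buffering pattern described above can be illustrated without a GPU. The following is a CPU-only sketch under stated assumptions: two `bytearray` staging buffers stand in for pinned host memory, a second `bytearray` stands in for GPU memory, and a plain copy stands in for the asynchronous host-to-device transfer. The function name `restore_snapshot` is hypothetical. O_DIRECT is deliberately left out, since it requires aligned buffers and filesystem support.

```python
import queue
import threading


def restore_snapshot(path, chunk_size=1 << 20):
    """Double-buffered copy: a reader thread fills one staging buffer
    while the main thread drains the other."""
    free = queue.Queue()     # staging buffers ready to be filled
    filled = queue.Queue()   # (buffer, valid_bytes) pairs ready to be drained
    for _ in range(2):
        free.put(bytearray(chunk_size))   # stand-in for pinned host buffers
    device = bytearray()                  # stand-in for GPU memory

    def reader():
        # Stage 1: disk -> staging buffer.
        with open(path, "rb", buffering=0) as snapshot:
            while True:
                buf = free.get()          # blocks until a buffer is recycled
                n = snapshot.readinto(buf)
                if not n:                 # end of file
                    filled.put((None, 0))
                    return
                filled.put((buf, n))

    thread = threading.Thread(target=reader)
    thread.start()

    # Stage 2: staging buffer -> "device". While this copy runs, the reader
    # is already filling the other buffer.
    while True:
        buf, n = filled.get()
        if buf is None:
            break
        device += memoryview(buf)[:n]     # stand-in for the host-to-device copy
        free.put(buf)                     # hand the buffer back to the reader

    thread.join()
    return bytes(device)
```

With only two buffers in circulation, neither stage can run more than one chunk ahead of the other, which bounds staging memory at twice the chunk size.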
Step 3: KV Cache Reconstruction. After the weights are restored, KV cache blocks are restored to the GPU via an independent DMA channel while the prefix cache's hash table is rebuilt. New requests can then hit the cache immediately, skipping expensive prefill computation.
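To make the hash-table rebuild concrete, here is a minimal sketch of a chained block-hash prefix table. It assumes each block's key is derived from its parent block's key plus its own token IDs. The snapshot layout, `BLOCK_SIZE`, and all function names are hypothetical and not taken from Thaw or vLLM.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per KV cache block


def block_hash(parent, tokens):
    """Key for a block: hash of the parent block's key plus this block's tokens."""
    digest = hashlib.sha256(parent)
    digest.update(",".join(map(str, tokens)).encode())
    return digest.digest()


def rebuild_prefix_table(snapshot):
    """snapshot: iterable of (token_ids, block_ids) pairs, one per cached sequence."""
    table = {}
    for tokens, block_ids in snapshot:
        parent = b""
        for i, block_id in enumerate(block_ids):
            chunk = tuple(tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
            parent = block_hash(parent, chunk)
            table[parent] = block_id
    return table


def cached_prefix_blocks(table, prompt):
    """Return the block IDs covering the longest cached prefix of prompt."""
    hits = []
    parent = b""
    # Only full blocks can match; a trailing partial block is ignored.
    for start in range(0, len(prompt) - BLOCK_SIZE + 1, BLOCK_SIZE):
        parent = block_hash(parent, tuple(prompt[start:start + BLOCK_SIZE]))
        block_id = table.get(parent)
        if block_id is None:
            break
        hits.append(block_id)
    return hits
```

For example, after rebuilding from a snapshot that cached tokens 1 through 8 in blocks 10 and 11, a prompt starting with those eight tokens would reuse both blocks and need prefill only for the tokens that follow.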