DAK: A Direct-Access GPU Memory Offloading Framework for LLM Inference

The DAK framework replaces prefetching strategies with direct GPU access to remote memory, uses the Tensor Memory Accelerator to load weights and KV caches asynchronously, and achieves up to a 3x speedup on NVLink-C2C.

Tags: GPU memory offloading · LLM inference · Tensor Memory Accelerator · NVLink-C2C · tiered memory · KV cache · direct memory access · inference optimization
Published 2026-04-29 03:30 · Recent activity 2026-04-30 10:52 · Estimated read 4 min

Section 01

DAK Framework Guide: Direct-Access GPU Memory Offloading Solution for LLM Inference

The DAK framework replaces prefetching strategies with direct GPU access to remote memory, reuses the Tensor Memory Accelerator (TMA) to load weights and KV caches asynchronously, achieves up to a 3x speedup on NVLink-C2C, and addresses the memory bottleneck in LLM inference.


Section 02

Memory Bottlenecks in LLM Inference and the Shortcomings of Prefetching Strategies

Large language model inference is constrained by both GPU memory capacity and bandwidth. Tiered memory architectures offload part of the data to a remote memory tier, but existing prefetching strategies have three major drawbacks: contention on HBM that fragments its bandwidth, wasted memory capacity that limits sequence length and concurrency, and serialization of prefetching and computation that introduces pipeline bubbles.
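
To see why the third drawback matters, a back-of-envelope latency model helps: if a layer's data must finish loading before its compute starts, per-layer time is the sum of the two phases; with full overlap it is only the maximum. A minimal sketch, with purely illustrative per-layer times (not figures from the paper):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Illustrative per-layer costs in ms; assumptions, not measurements.
    double t_load = 1.2, t_compute = 1.0;
    double serial  = t_load + t_compute;           // prefetch, then compute
    double overlap = std::max(t_load, t_compute);  // fully overlapped access
    std::printf("serial %.1f ms vs overlapped %.1f ms (%.0f%% of time is bubble)\n",
                serial, overlap, 100.0 * (serial - overlap) / serial);
    return 0;
}
```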


Section 03

Core Innovations of DAK: Direct Remote Memory Access and TMA Reuse

DAK proposes an architectural shift in which the GPU accesses remote memory directly, reusing the NVIDIA Hopper architecture's TMA hardware unit: remote weights and KV caches are loaded asynchronously into SMEM, bypassing the HBM staging step to avoid contention, and loads are fully overlapped with computation to eliminate pipeline bubbles.
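
DAK itself programs Hopper's TMA unit; as a portable stand-in, the sketch below uses libcu++'s cuda::pipeline with cuda::memcpy_async to show the same double-buffered overlap: tiles of an offloaded weight row stream from remote memory (e.g. host-pinned or C2C-attached) straight into shared memory while the SM computes on the previous tile. The kernel name, tile size, and dot-product workload are illustrative assumptions, not the paper's code.

```cuda
// Compile with: nvcc -arch=sm_90 dak_overlap.cu (sm_80+ suffices for cp.async)
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

constexpr int TILE = 1024;  // floats per tile (illustrative)

// Each block dots x against one weight row living in remote memory,
// streaming the row tile by tile into SMEM with a 2-stage pipeline.
__global__ void dot_offloaded(const float* __restrict__ w_remote,
                              const float* __restrict__ x,
                              float* __restrict__ y, int n_tiles) {
    __shared__ float w_smem[2][TILE];  // double buffer in shared memory
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto block = cg::this_thread_block();
    auto pipe = cuda::make_pipeline(block, &state);
    const float* w_row = w_remote + (size_t)blockIdx.x * n_tiles * TILE;

    float acc = 0.f;
    for (int compute = 0, fetch = 0; compute < n_tiles; ++compute) {
        // Keep up to two remote->SMEM tile copies in flight, so the
        // interconnect streams while the SM computes (no pipeline bubble).
        for (; fetch < n_tiles && fetch < compute + 2; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, w_smem[fetch % 2],
                               w_row + (size_t)fetch * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();  // current tile is now resident in SMEM
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            acc += w_smem[compute % 2][i] * x[compute * TILE + i];
        pipe.consumer_release();
    }
    atomicAdd(&y[blockIdx.x], acc);  // crude cross-thread reduction
}
```

On Hopper, the memcpy_async stand-in would be replaced by TMA bulk-tensor copies synchronized with SMEM barriers; the double-buffered overlap structure is the point of the sketch.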


Section 04

DAK's Optimization Strategies: Offload-Ratio Selection and Congestion Control

DAK uses a greedy algorithm to decide each operator's offload ratio (weighing computational intensity, data-reuse patterns, and interconnect bandwidth), dynamically throttles access rates through active congestion control, and uses TMA multicast to eliminate the bandwidth wasted on repeated reads in data-parallel scenarios.
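
The paper's exact decision procedure isn't reproduced in this digest; the host-side sketch below only illustrates the greedy idea under a simple cost model: offload bytes from the most compute-bound operators first (their long compute phases can hide remote-read latency), capping each operator's remote share so link traffic stays hidden under compute. The struct fields, function names, and cost model are all assumptions.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    const char* name;
    double flops;        // work per invocation
    double bytes;        // weight/KV bytes read per invocation
    double offload = 0;  // fraction of `bytes` placed in remote memory
};

// Greedily offload until the resident working set fits the HBM budget.
// A fuller model would also weigh data reuse: a tile reused many times
// amortizes a single remote read, making it even cheaper to offload.
void plan_offload(std::vector<Op>& ops, double hbm_budget_bytes,
                  double gpu_flops, double link_bw /* bytes/s */) {
    double resident = 0;
    for (const auto& op : ops) resident += op.bytes;

    // Most compute-bound (highest FLOPs/byte) first: easiest to hide.
    std::sort(ops.begin(), ops.end(), [](const Op& a, const Op& b) {
        return a.flops / a.bytes > b.flops / b.bytes;
    });

    for (auto& op : ops) {
        if (resident <= hbm_budget_bytes) break;
        double t_compute = op.flops / gpu_flops;
        // Largest remote fraction whose link time still fits under compute.
        double hideable = std::min(1.0, t_compute * link_bw / op.bytes);
        double needed = (resident - hbm_budget_bytes) / op.bytes;
        op.offload = std::min(hideable, needed);
        resident -= op.offload * op.bytes;
    }
    for (const auto& op : ops)
        std::printf("%-10s offload %5.1f%%\n", op.name, 100 * op.offload);
}
```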


Section 05

DAK Performance Evaluation: Gains Approaching the Theoretical Optimum

DAK achieves up to a 3x speedup on NVLink-C2C systems and a 1.8x speedup on PCIe systems; its aggregated system bandwidth utilization approaches the theoretical upper limit, far above the sub-50% utilization of prefetching strategies.
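
What "aggregated bandwidth" means here: because direct remote reads bypass HBM staging, the interconnect and HBM can stream in parallel, so the ceiling is roughly their sum rather than HBM alone. A quick illustration with rough public specs (assumptions for illustration, not the paper's measured numbers):

```cpp
#include <cstdio>

int main() {
    // Rough public figures; assumptions, not from the paper.
    double hbm_bw = 3.35e12;           // H100 SXM HBM3, ~3.35 TB/s
    double c2c_bw = 0.45e12;           // NVLink-C2C, ~450 GB/s per direction
    double ceiling = hbm_bw + c2c_bw;  // both channels stream concurrently
    // One plausible reason prefetching falls short: each prefetched byte
    // crosses the link, is written to HBM, then re-read, so it consumes
    // HBM bandwidth twice instead of zero times.
    std::printf("aggregate ceiling: %.2f TB/s\n", ceiling / 1e12);
    return 0;
}
```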


Section 06

Implications and Significance of DAK for LLM Inference Deployment

DAK offers a new paradigm for memory capacity expansion (using remote memory to reduce the need for high-end GPUs), breaks through long-sequence processing bottlenecks (by dynamically loading KV caches), and demonstrates the potential of heterogeneous computing; it challenges the entrenched assumption that prefetching is required and opens an optimization path for cost-sensitive inference services.


Section 07

Limitations of DAK and Future Research Directions

Limitations: reliance on the Hopper architecture's TMA, a complex software stack, and power-consumption characteristics that remain to be studied. Future directions: extending to multi-node RDMA scenarios, combining predictive loading to optimize latency, and exploring co-optimization with CXL 3.0 memory pooling.