# DAK: A Direct-Access GPU Memory Offloading Framework for LLM Inference

> The DAK framework replaces prefetching strategies with direct GPU access to remote memory, uses Tensor Memory Accelerator to enable asynchronous loading of weights and KV caches, and achieves a 3x performance improvement on NVLink-C2C.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-28T19:30:47.000Z
- Last activity: 2026-04-30T02:52:35.037Z
- Hotness: 119.6
- Keywords: GPU memory offloading, LLM inference, Tensor Memory Accelerator, NVLink-C2C, tiered memory, KV cache, direct memory access, inference optimization
- Page link: https://www.zingnex.cn/en/forum/thread/dak-llmgpu
- Canonical: https://www.zingnex.cn/forum/thread/dak-llmgpu
- Markdown source: floors_fallback

---

## DAK Framework Guide: Direct-Access GPU Memory Offloading Solution for LLM Inference

The DAK framework replaces prefetching strategies with direct GPU access to remote memory, uses Tensor Memory Accelerator (TMA) to enable asynchronous loading of weights and KV caches, achieves a 3x performance improvement on NVLink-C2C, and solves memory bottleneck issues in LLM inference.

## Memory Bottlenecks in LLM Inference and Defects of Prefetching Strategies

Large language model inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures offload part of the data to remote memory layers, but existing prefetching strategies have three major drawbacks: HBM contention that fragments bandwidth, wasted memory capacity that limits sequence length and concurrency, and serialization of prefetching and computation that introduces pipeline bubbles.
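The pipeline-bubble drawback can be made concrete with a back-of-the-envelope timing model. The sketch below is illustrative only (the function names and numbers are assumptions, not from the DAK paper): serial prefetch puts load time on the critical path of every layer, while a fully overlapped pipeline pays it only once.

```python
# Hypothetical timing model contrasting serial prefetch-then-compute with a
# fully overlapped pipeline. All parameters are illustrative, not measured.

def serial_time(n_layers, t_load, t_compute):
    # Prefetching and computation run back to back: each layer waits for its
    # data, so every load appears as a pipeline bubble.
    return n_layers * (t_load + t_compute)

def overlapped_time(n_layers, t_load, t_compute):
    # Loading layer i+1 overlaps computing layer i: steady-state cost per
    # layer is the slower of the two stages, plus one warm-up load.
    return t_load + n_layers * max(t_load, t_compute)

print(serial_time(32, 2.0, 3.0))      # 160.0 time units
print(overlapped_time(32, 2.0, 3.0))  # 2.0 + 32 * 3.0 = 98.0 time units
```

With these toy numbers, eliminating the bubbles cuts end-to-end time by roughly 40%, which is the kind of headroom direct-access overlap targets.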

## Core Innovations of DAK: Direct Remote Memory Access and TMA Reuse

DAK proposes an architectural shift in which GPUs directly access remote memory, repurposing the TMA hardware unit of NVIDIA's Hopper architecture to asynchronously load remote weights and KV caches into shared memory (SMEM), bypass HBM staging to avoid contention, and fully overlap loading with computation to eliminate pipeline bubbles.
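The load/compute overlap can be sketched as classic double buffering: while the GPU computes on one buffer, the next layer's data streams into the other. This is a minimal Python model of the control flow only; the real mechanism uses TMA bulk copies into SMEM on Hopper GPUs, and the `load`/`compute` callables here are hypothetical stand-ins.

```python
# Double-buffering sketch of DAK-style load/compute overlap. Threads stand in
# for asynchronous TMA copies; `load` and `compute` are caller-supplied.
import threading

def process_layers(layers, load, compute):
    buffers = [None, None]
    # Warm up: start loading the first layer's data.
    pending = threading.Thread(
        target=lambda: buffers.__setitem__(0, load(layers[0])))
    pending.start()
    outputs = []
    for i in range(len(layers)):
        pending.join()  # wait until this layer's buffer is filled
        cur = i % 2
        if i + 1 < len(layers):
            # Kick off the next layer's load into the other buffer; it
            # overlaps with the compute below.
            nxt = (i + 1) % 2
            pending = threading.Thread(
                target=lambda j=i + 1, b=nxt: buffers.__setitem__(b, load(layers[j])))
            pending.start()
        outputs.append(compute(buffers[cur]))
    return outputs
```

Usage: `process_layers([1, 2, 3], load=lambda x: x * 2, compute=lambda b: b + 1)` returns `[3, 5, 7]`, with each load (except the first) hidden behind the previous layer's compute.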

## Optimization Strategies of DAK: Offloading Ratio Decision and Congestion Control

DAK uses a greedy algorithm to decide each operator's optimal offloading ratio (weighing computational intensity, data-reuse patterns, and interconnect bandwidth), dynamically adjusts access rates through active congestion control, and uses TMA multicast to eliminate the bandwidth waste of repeated reads in data-parallel scenarios.
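The greedy offloading decision can be sketched as a ranking-and-budget loop. This is a simplification under stated assumptions: the scoring function (compute intensity times reuse) and the single remote-memory budget are hypothetical stand-ins for the paper's cost model, which also accounts for interconnect bandwidth.

```python
# Greedy offload-placement sketch (hypothetical scoring). Operators whose
# arithmetic intensity and reuse best tolerate the slower remote link are
# offloaded first, until the remote-memory budget is exhausted.

def choose_offload(ops, remote_budget_gb):
    """Return a {name: "remote" | "hbm"} placement plan.

    Each op is a dict with "name", "flops_per_byte", "reuse", "size_gb".
    """
    # Higher compute per byte and higher reuse hide remote latency better.
    ranked = sorted(ops, key=lambda op: op["flops_per_byte"] * op["reuse"],
                    reverse=True)
    plan, used = {}, 0.0
    for op in ranked:
        if used + op["size_gb"] <= remote_budget_gb:
            plan[op["name"]] = "remote"
            used += op["size_gb"]
        else:
            plan[op["name"]] = "hbm"  # keep bandwidth-hungry data local
    return plan
```

For example, a compute-bound FFN operator would be placed in remote memory ahead of a bandwidth-bound attention operator, keeping scarce HBM bandwidth for the data that needs it most.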

## DAK Performance Evaluation: Near-Theoretical Optimal Improvement Effects

DAK achieves up to a 3x performance improvement on NVLink-C2C systems and a 1.8x speedup on PCIe systems; its aggregate system bandwidth utilization approaches the theoretical ceiling, far above the sub-50% utilization of prefetching strategies.

## Implications and Significance of DAK for LLM Inference Deployment

DAK offers a new paradigm for memory capacity expansion (using remote memory to reduce the need for high-end GPUs), breaks through long-sequence processing bottlenecks (by dynamically loading KV caches), and demonstrates the potential of heterogeneous computing; it challenges entrenched assumptions about prefetching and charts an optimization path for cost-sensitive inference services.

## Limitations of DAK and Future Research Directions

Limitations: DAK depends on the Hopper architecture's TMA, its software stack is complex, and its power-consumption characteristics remain to be studied. Future directions include extending to multi-node RDMA scenarios, combining predictive loading to further reduce latency, and exploring co-optimization with CXL 3.0 memory pooling.
