Chiplet-Contiguous Layout: A New Scheme for Optimizing Multi-Chiplet GPU Memory Layout for LLM Inference

This article introduces Chiplet-Contiguous Layout, a memory layout that stores each chiplet's local data contiguously and thereby resolves the incompatibility between locality-aware data placement and fixed page-granularity interleaving on multi-chiplet GPUs. It significantly reduces remote HBM traffic for the GEMM workloads of Qwen 3 30B and Llama 3.1 70B.

Tags: multi-chiplet GPU, GEMM optimization, memory layout, LLM inference, HBM, data locality, Chiplet-Contiguous Layout
Published 2026-06-10 14:47 | Recent activity 2026-06-11 10:19

Section 01

[Introduction] Chiplet-Contiguous Layout: A New Scheme for Optimizing Multi-Chiplet GPU Memory Layout

Original Author and Source:

  • Original Author/Maintainer: arXiv authors
  • Source Platform: arXiv
  • Original Title: Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs
  • Original Link: http://arxiv.org/abs/2606.11718v1
  • Source Publication/Update Time: 2026-06-10T06:47:27Z

Section 02

Background: Memory Challenges of Multi-Chiplet GPUs

As LLMs grow, multi-chiplet GPU architectures scale up compute throughput and HBM capacity, but they also introduce non-uniform memory access (NUMA) behavior: accessing HBM attached to a remote chiplet costs more latency and energy than accessing local HBM.

GEMM is a core operator in LLM inference and training, and its efficiency depends on data locality: each chiplet should read mostly from its local HBM. The conventional strategy of interleaving 4 KB pages across chiplets fixes the placement granularity at one page, whereas the granularity that is optimal for a GEMM varies widely with its shape. As a result, page interleaving cannot deliver the data placement that locality-aware GEMM needs.
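
To make the mismatch concrete, here is a minimal sketch of how much of a matrix ends up in remote HBM under page interleaving. It is a toy model, not the paper's evaluation setup. The assumptions are mine: a row-major matrix, pages assigned to chiplets round-robin (page p goes to chiplet p mod N), and chiplet k consuming the k-th contiguous block of rows. The names `page_owner` and `remote_fraction` are hypothetical.

```python
PAGE_BYTES = 4096  # 4 KB pages, the baseline granularity named above


def page_owner(byte_offset, num_chiplets, page_bytes=PAGE_BYTES):
    """Chiplet holding this byte under round-robin page interleaving (assumed policy)."""
    return (byte_offset // page_bytes) % num_chiplets


def remote_fraction(rows, cols, elem_bytes, num_chiplets, page_bytes=PAGE_BYTES):
    """Fraction of a row-major matrix read from remote HBM when chiplet k
    consumes the k-th contiguous block of rows (assumed work partition)."""
    rows_per_chiplet = max(1, rows // num_chiplets)
    row_bytes = cols * elem_bytes
    remote = 0
    for r in range(rows):
        # Leftover rows, if any, are assigned to the last chiplet.
        consumer = min(r // rows_per_chiplet, num_chiplets - 1)
        offset, end = r * row_bytes, (r + 1) * row_bytes
        while offset < end:
            # Advance to the next page boundary or to the end of the row.
            next_boundary = (offset // page_bytes + 1) * page_bytes
            chunk = min(end, next_boundary) - offset
            if page_owner(offset, num_chiplets, page_bytes) != consumer:
                remote += chunk
            offset += chunk
    return remote / (rows * row_bytes)


if __name__ == "__main__":
    for shape in [(64, 4096), (4096, 64), (512, 512)]:
        print(shape, remote_fraction(*shape, elem_bytes=2, num_chiplets=4))
```

In this toy model with four chiplets and 2-byte elements, each of the three shapes yields a remote fraction of 0.75: round-robin interleaving ignores which chiplet consumes a page, so only about one page in four happens to be local. The contiguous block each chiplet wants also differs per shape, which is why a single fixed granularity does not fit.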

Section 03

Method: Core Ideas and Implementation of Chiplet-Contiguous Layout

Core Innovation: Store each chiplet's local data contiguously in the physical address space, whereas a conventional interleaved layout scatters that data across chiplets.

Advantages:

  1. Compatibility: Requires no OS or hardware modifications
  2. Flexibility: Applies to the varied GEMM shapes found in LLMs
  3. Locality Awareness: Places data in the HBM of the chiplet that computes on it

Implementation Mechanisms:

  • Data Partitioning: Logically partition the matrix data by the number of chiplets and store each subset contiguously
  • Address Mapping: Arrange the virtual-to-physical mapping so that each contiguous subset resides in the HBM of the chiplet that accesses it
  • Integration: A pure software optimization that can, in principle, be integrated into frameworks such as PyTorch or TensorFlow
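
The partitioning and mapping steps above can be sketched as an index calculation. This is an illustration under my own assumptions, not the paper's implementation: the matrix is split into equal column slabs (one per chiplet), each slab is stored row-major in its own region padded to whole pages, and page-granularity placement then assigns every page of region k to chiplet k. The class `ChipletContiguousLayout` and the helper `count_remote_elements` are hypothetical names.

```python
PAGE_BYTES = 4096  # placement granularity assumed for this sketch


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


class ChipletContiguousLayout:
    """Hypothetical layout: column slab k of a rows x cols matrix is stored
    contiguously in its own page-aligned region."""

    def __init__(self, rows, cols, elem_bytes, num_chiplets, page_bytes=PAGE_BYTES):
        if cols % num_chiplets != 0:
            raise ValueError("cols must be divisible by num_chiplets in this sketch")
        self.rows, self.cols = rows, cols
        self.elem_bytes = elem_bytes
        self.slab_cols = cols // num_chiplets
        # Pad each slab to whole pages so that no page is shared by two chiplets.
        self.slab_bytes = round_up(rows * self.slab_cols * elem_bytes, page_bytes)

    def consumer(self, col):
        """Chiplet that computes on this column (assumed work partition)."""
        return col // self.slab_cols

    def offset(self, row, col):
        """Byte offset of element (row, col) in the chiplet-contiguous buffer."""
        slab, local_col = divmod(col, self.slab_cols)
        return slab * self.slab_bytes + (row * self.slab_cols + local_col) * self.elem_bytes

    def page_owner(self, byte_offset):
        """Chiplet whose HBM holds this byte when region k is placed on chiplet k."""
        return byte_offset // self.slab_bytes


def count_remote_elements(layout):
    """Number of elements stored on a chiplet other than the one consuming them."""
    return sum(
        layout.page_owner(layout.offset(r, c)) != layout.consumer(c)
        for r in range(layout.rows)
        for c in range(layout.cols)
    )


if __name__ == "__main__":
    layout = ChipletContiguousLayout(rows=48, cols=96, elem_bytes=2, num_chiplets=4)
    print("remote elements:", count_remote_elements(layout))
```

Because each slab occupies whole pages of its own, page-granularity placement alone is enough to keep every element local in this model; the padding costs at most one page per slab. A kernel would need to index the buffer through something like `offset` rather than assuming a plain row-major layout.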

Section 04

Evidence: Experimental Results and Performance Analysis

Workloads Evaluated: GEMM workloads of the Qwen 3 30B and Llama 3.1 70B models

Reduction in Remote HBM Traffic:

  • Versus 4 KB page interleaving: 24.7x lower for Qwen 3 30B and 19.2x lower for Llama 3.1 70B
  • Versus coarse-grained locality-aware placement: 4.1x lower for Qwen 3 30B and 2.1x lower for Llama 3.1 70B

Explanation: Keeping each chiplet's data in its local HBM sharply reduces cross-chiplet memory traffic and improves memory access efficiency.
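
As a quick reading aid, the reported factors can be turned into the share of remote traffic that remains and into the gap they imply between the two baselines. This is only arithmetic on the numbers quoted above; it assumes both factors for a model were measured on the same workload against the same Chiplet-Contiguous result, which the summary does not state explicitly. The dictionary keys and function names are hypothetical.

```python
# Reduction factors exactly as reported in the bullets above.
REPORTED = {
    "Qwen 3 30B": {"vs_interleaving": 24.7, "vs_coarse": 4.1},
    "Llama 3.1 70B": {"vs_interleaving": 19.2, "vs_coarse": 2.1},
}


def remaining_percent(factor):
    """Remote traffic left, as a percentage of the baseline, after an N-fold reduction."""
    return 100.0 / factor


def implied_coarse_vs_interleaving(factors):
    """How much less remote traffic coarse-grained placement would have than
    4 KB interleaving, if both factors share the same reference measurement."""
    return factors["vs_interleaving"] / factors["vs_coarse"]


if __name__ == "__main__":
    for model, factors in REPORTED.items():
        print(
            f"{model}: {remaining_percent(factors['vs_interleaving']):.1f}% of the "
            f"interleaving baseline remains; coarse-grained placement is implied to be "
            f"{implied_coarse_vs_interleaving(factors):.1f}x better than interleaving"
        )
```

Under that assumption, roughly 4% (Qwen 3 30B) and 5% (Llama 3.1 70B) of the interleaved baseline's remote traffic remains, and coarse-grained locality-aware placement is implied to be about 6x and 9x better than 4 KB interleaving, respectively.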

Section 05

Conclusion: Practical Significance and Core Insights

Practical Significance:

  1. AI Infrastructure: Offers a key optimization for efficient inference on multi-chiplet GPUs, which can lower serving cost and response latency
  2. Deployment-Friendly: Requires no hardware or OS modifications, so it can in principle be adopted on existing multi-chiplet GPU systems (e.g., multi-die NVIDIA architectures after Hopper)
  3. Cross-Model Generalization: Effective on both Qwen 3 30B and Llama 3.1 70B, which suggests it carries over to Transformer models of different architectures and scales

Core Insight: Data layout optimization is a key lever for improving performance on non-uniform memory systems, and it sometimes yields greater benefits than algorithmic optimization.

Section 06

Suggestions: Limitations and Future Directions

Limitations:

  1. Generality: Validated only for GEMM operations so far; applicability to other operators (e.g., sparse attention computation) remains to be verified
  2. Dynamic Workloads: Adapting the layout to dynamic batching and varying sequence lengths is an open problem
  3. Compiler Collaboration: How the layout interacts with automatic GPU compiler optimizations (operator fusion, memory reuse) needs further study

Future Directions: Address the limitations above, namely operators beyond GEMM, dynamic workloads, and compiler integration, to broaden the technique's applicability and effectiveness.