
Chiplet-Contiguous Layout: A New Scheme for Optimizing Multi-Chiplet GPU Memory Layout for LLM Inference

This article introduces Chiplet-Contiguous Layout, a technique that resolves the incompatibility between locality-aware data placement and fixed page-granularity interleaving on multi-chiplet GPUs by storing each chiplet's local data contiguously. On GEMM workloads from Qwen 3 30B and Llama 3.1 70B, it reduces remote HBM traffic by up to 24.7x relative to 4KB page interleaving.

Tags: Multi-chiplet GPU, GEMM optimization, memory layout, LLM inference, HBM, data locality, Chiplet-Contiguous Layout

Published 2026-06-10 14:47 | Recent activity 2026-06-11 10:19 | Estimated read 6 min


Section 01

[Introduction] Chiplet-Contiguous Layout: A New Scheme for Optimizing Multi-Chiplet GPU Memory Layout

Core Point: Chiplet-Contiguous Layout stores each chiplet's local data contiguously, which makes locality-aware data placement compatible with the fixed page-granularity interleaving used on multi-chiplet GPUs. The result is a large reduction in remote HBM traffic for GEMM workloads from Qwen 3 30B and Llama 3.1 70B.

Original Author and Source:

Original Author/Maintainer: arXiv authors
Source Platform: arXiv
Original Title: Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs
Original Link: http://arxiv.org/abs/2606.11718v1
Source Publication/Update Time: 2026-06-10T06:47:27Z

Section 02

Background: Memory Challenges of Multi-Chiplet GPUs

As LLMs grow in scale, multi-chiplet GPU architectures increase compute throughput and HBM capacity, but they also introduce NUMA behavior: accessing remote HBM costs more latency and energy than accessing local HBM.

GEMM is a core operator in LLM inference and training, and its performance depends on data locality: each chiplet should read mostly from its local HBM. However, the conventional 4KB page interleaving strategy fixes the placement granularity, while the optimal granularity varies widely across GEMM shapes, so interleaving cannot deliver the placement that GEMM needs.
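To see why a fixed interleaving granularity cannot suit every GEMM shape, here is a toy model that is not taken from the paper. It assumes a row-major fp16 matrix starting at physical address 0, 4 chiplets, 4 KB pages assigned round-robin (page p lives on chiplet p mod 4), and a GEMM that hands each chiplet one contiguous block of columns. The function name and every parameter are illustrative assumptions; real hardware may use a different interleaving function.

```python
PAGE_BYTES = 4096  # assumed interleaving granularity (4 KB pages)


def remote_fraction_interleaved(rows, cols, num_chiplets,
                                elem_bytes=2, page_bytes=PAGE_BYTES):
    """Fraction of elements read from remote HBM in a toy model.

    Assumptions (illustrative only):
      - the matrix is row-major and starts at physical address 0;
      - page p is placed on chiplet p % num_chiplets (round-robin);
      - chiplet i consumes columns [i * cols/num_chiplets, (i+1) * cols/num_chiplets).
    """
    assert cols % num_chiplets == 0, "toy model needs an even column split"
    cols_per_chiplet = cols // num_chiplets
    remote = 0
    for r in range(rows):
        for c in range(cols):
            consumer = c // cols_per_chiplet           # chiplet that needs (r, c)
            byte_offset = (r * cols + c) * elem_bytes  # row-major address
            owner = (byte_offset // page_bytes) % num_chiplets
            if owner != consumer:
                remote += 1
    return remote / (rows * cols)


for cols in (4096, 8192, 16384):
    print(cols, remote_fraction_interleaved(rows=8, cols=cols, num_chiplets=4))
# 4096 0.75
# 8192 0.0
# 16384 0.75
```

In this model only the 8192-column shape is fully local, because each chiplet's row segment is then exactly one 4 KB page; the other two shapes send three quarters of their accesses to remote HBM.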

Section 03

Method: Core Ideas and Implementation of Chiplet-Contiguous Layout

Core Innovation: Store each chiplet's local data contiguously in the physical address space, whereas a traditional interleaved layout scatters that data across chiplets.

Advantages:

Compatibility: No need to modify OS or hardware
Flexibility: Applicable to various LLM GEMM shapes
Locality Awareness: Naturally matches data with compute chiplets

Implementation Mechanisms:

Data Partitioning: Logically partition matrix data by the number of chiplets, with subsets stored contiguously
Address Mapping: Adjust virtual-to-physical address mapping to ensure local access
Integration: Pure software optimization, can be seamlessly integrated into frameworks like PyTorch/TensorFlow
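The partitioning and address-mapping steps above can be sketched as follows. This is one possible reading, not the paper's implementation: the article does not give the actual mapping, so the formula below is an assumption. It packs each chiplet's column block into its own contiguous stream, cuts that stream into page-sized chunks, and places chunk k of chiplet i at page k * num_chiplets + i, so that an assumed round-robin 4 KB interleaving (page p on chiplet p mod num_chiplets) puts every chunk in its consumer's local HBM. The names `packed_offset` and `remote_fraction_packed` are hypothetical.

```python
PAGE_BYTES = 4096  # assumed interleaving granularity (4 KB pages)


def packed_offset(r, c, cols, num_chiplets,
                  elem_bytes=2, page_bytes=PAGE_BYTES):
    """Byte offset of logical element (r, c) in a hypothetical packed buffer."""
    cols_per_chiplet = cols // num_chiplets
    chiplet = c // cols_per_chiplet  # consumer of this column block
    # Position of the element inside its chiplet's own contiguous stream.
    local_byte = (r * cols_per_chiplet + c % cols_per_chiplet) * elem_bytes
    chunk, within = divmod(local_byte, page_bytes)
    # Chunk k of chiplet i goes to page k * num_chiplets + i, which the
    # assumed round-robin interleaving maps back to chiplet i.
    page = chunk * num_chiplets + chiplet
    return page * page_bytes + within


def remote_fraction_packed(rows, cols, num_chiplets,
                           elem_bytes=2, page_bytes=PAGE_BYTES):
    """Remote-access fraction under the packed layout (same toy model)."""
    assert cols % num_chiplets == 0, "toy model needs an even column split"
    assert page_bytes % elem_bytes == 0, "elements must not straddle pages"
    cols_per_chiplet = cols // num_chiplets
    remote = 0
    seen = set()
    for r in range(rows):
        for c in range(cols):
            off = packed_offset(r, c, cols, num_chiplets, elem_bytes, page_bytes)
            seen.add(off)
            owner = (off // page_bytes) % num_chiplets
            if owner != c // cols_per_chiplet:
                remote += 1
    # Sanity check: no two logical elements share a byte offset.
    assert len(seen) == rows * cols
    return remote / (rows * cols)


for cols in (4096, 8192, 16384):
    print(cols, remote_fraction_packed(rows=8, cols=cols, num_chiplets=4))
# 4096 0.0
# 8192 0.0
# 16384 0.0
```

Under these assumptions the remote fraction is zero for every shape, at the cost of padding when a chiplet's stream does not fill its last page. A kernel using such a layout would also have to index the buffer through the same mapping.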

Section 04

Evidence: Experimental Results and Performance Analysis

Experimental Objects: GEMM workloads of Qwen 3 30B and Llama 3.1 70B models

Reduction in Remote HBM Traffic:

Compared to 4KB page interleaving: 24.7x lower for Qwen 3 30B and 19.2x lower for Llama 3.1 70B
Compared to coarse-grained locality-aware placement: 4.1x lower for Qwen 3 30B and 2.1x lower for Llama 3.1 70B

Explanation: The layout significantly reduces cross-chiplet data traffic, which improves memory access efficiency.
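A short calculation helps put the reported factors in perspective. It converts each reduction factor into the share of baseline remote traffic that remains, and derives how the two baselines compare with each other. The derived ratio assumes both factors were measured on the same workloads with the same traffic metric, which the article does not state explicitly. The dictionary layout and function names are illustrative.

```python
# Reduction factors as reported in the article.
reported = {
    "Qwen 3 30B": {"vs_interleave": 24.7, "vs_coarse": 4.1},
    "Llama 3.1 70B": {"vs_interleave": 19.2, "vs_coarse": 2.1},
}


def remaining_percent(reduction_factor):
    """Share of the baseline's remote traffic left after an N-fold reduction."""
    return 100.0 / reduction_factor


def implied_coarse_vs_interleave(vs_interleave, vs_coarse):
    """How much less remote traffic coarse-grained placement has than 4 KB
    interleaving, assuming both factors share one workload and one metric."""
    return vs_interleave / vs_coarse


for model, f in reported.items():
    left = remaining_percent(f["vs_interleave"])
    ratio = implied_coarse_vs_interleave(f["vs_interleave"], f["vs_coarse"])
    print(f"{model}: {left:.1f}% of interleaved remote traffic remains; "
          f"coarse-grained placement is about {ratio:.1f}x below interleaving")
# Qwen 3 30B: 4.0% of interleaved remote traffic remains; coarse-grained placement is about 6.0x below interleaving
# Llama 3.1 70B: 5.2% of interleaved remote traffic remains; coarse-grained placement is about 9.1x below interleaving
```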

Section 05

Conclusion: Practical Significance and Core Insights

Practical Significance:

AI Infrastructure: Provides a key optimization for efficient inference on multi-chiplet GPUs, which can lower serving cost and improve response time
Deployment-Friendly: Requires no hardware or OS modifications, so it can be applied quickly to existing multi-chiplet GPU clusters
Cross-Model Generalization: Effective on both Qwen 3 30B and Llama 3.1 70B, which suggests it carries over to Transformer models of different architectures and scales

Core Insight: Data layout optimization is a key lever for improving performance on non-uniform memory systems, and it can sometimes yield larger gains than algorithmic optimization.

Section 06

Suggestions: Limitations and Future Directions

Limitations:

Generality: Validated only for GEMM so far; applicability to other operators (e.g., sparse attention computation) still needs verification
Dynamic Workloads: Adaptive layout under dynamic batching/sequence lengths is an open problem
Compiler Collaboration: In-depth research is needed on collaboration with GPU compiler automatic optimizations (operator fusion, memory reuse)

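Future Directions: Address each limitation above by extending the layout to non-GEMM operators, adapting it to dynamic batch sizes and sequence lengths, and coordinating it with compiler optimizations such as operator fusion and memory reuse.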
