RH+ Scheduling: A New Breakthrough in Row-Hit Optimization for LLM Inference on PIM Architectures

This article shows that the real bottleneck of LLM inference on PIM architectures is the DRAM row cycle time (nRC), not nCCDAB as previously thought. It proposes the RH+ scheduling strategy, which keeps 32 consecutive multiply-accumulate (MAC) operations within the same row through a simple stride adjustment, yielding an 8-12x speedup and an energy reduction of over 74%.

Tags: PIM architecture, in-memory computing, LLM inference, DRAM optimization, address mapping, energy efficiency optimization

Published 2026-06-04 07:33
Recent activity 2026-06-05 14:53
Estimated read 5 min

Section 01

RH+ Scheduling: A New Breakthrough in Row-Hit Optimization for LLM Inference on PIM Architectures (Introduction)

The article identifies DRAM row cycle time (nRC), rather than the previously assumed nCCDAB, as the real bottleneck of LLM inference on PIM architectures. Its RH+ scheduling strategy uses a simple stride adjustment so that 32 consecutive MAC operations execute within the same row. The reported results are an 8-12x speedup, an energy reduction of over 74%, and a 52x improvement in EDP (Energy Delay Product). RH+ is compatible with the existing HBM3 specification and requires no hardware modifications.

Section 02

Background: The Memory Wall Problem and the Rise of PIM Architectures

The exponential growth of parameter counts in large language models has made the "memory wall" a central bottleneck of traditional von Neumann architectures. Processing-in-Memory (PIM) architectures address it by performing MAC operations inside DRAM. HBM3 already supports PIM functionality, but previous studies identified nCCDAB as the main bottleneck. This article finds that the real bottleneck is nRC, which is 10 to 11 times as large as nCCDAB.

Section 03

Root Cause of the Problem: Drawbacks of Host-Centric Address Interleaving

Existing PIM systems use host-centric address interleaving, which scatters consecutive MAC operations across different DRAM rows. As a result, each full-bank MAC command triggers an expensive row switch (a precharge followed by an activation) whose latency far exceeds that of the computation itself.
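To make the scattering effect concrete, here is a minimal Python sketch. The mapping function `scattered_rows` and the parameters `num_macs` and `num_rows` are hypothetical and chosen only for illustration; they do not describe the address mapping of any real HBM3 controller or of the article's simulator. The sketch simply counts how often the open row changes across a sequence of MAC commands.

```python
def scattered_rows(num_macs, num_rows):
    """Hypothetical host-style interleaving: consecutive MACs hit consecutive rows."""
    return [i % num_rows for i in range(num_macs)]


def count_row_activations(rows):
    """Count how many times the open row changes over a command sequence."""
    activations = 0
    open_row = None  # no row is open before the first command
    for row in rows:
        if row != open_row:
            activations += 1  # row miss: precharge + activate needed
            open_row = row
    return activations


rows = scattered_rows(num_macs=64, num_rows=16)
print(count_row_activations(rows))  # 64: every MAC opens a different row
```

Under this assumed mapping, none of the 64 MAC commands reuses the open row, which is the pattern the article describes as the root cause.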

Section 04

RH+ Scheduling Strategy: Core Design with Simple Stride Adjustment

RH+ scheduling adjusts the access stride so that 32 consecutive MAC operations stay within the same DRAM row, matching the 32 MAC units per bank that the article attributes to HBM3. It requires no hardware modifications or additional storage and is compatible with the HBM3 specification. Once a row has been activated, subsequent accesses to it are row hits that incur no further activation delay, so parallelism is preserved.
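The effect of keeping 32 MACs in one row can be sketched with a toy cost model. Everything beyond the 32-MACs-per-row figure and the roughly 10-11x ratio between nRC and nCCDAB is an assumption: the cycle values `N_CCDAB = 4` and `N_RC = 42`, the two mapping helpers, and the rule that a row miss costs one nRC while a row hit costs one nCCDAB. This is not the article's simulator and is not meant to reproduce its results.

```python
MACS_PER_ROW = 32  # from the article: 32 consecutive MACs per DRAM row
N_CCDAB = 4        # assumed cycles between back-to-back MAC commands (illustrative)
N_RC = 42          # assumed row cycle time, about 10.5x N_CCDAB (illustrative)


def scattered_rows(num_macs, num_rows=16):
    """Hypothetical host-style interleaving: each MAC lands on a new row."""
    return [i % num_rows for i in range(num_macs)]


def row_hit_rows(num_macs, macs_per_row=MACS_PER_ROW):
    """Hypothetical RH+-style stride: fill one row before moving to the next."""
    return [i // macs_per_row for i in range(num_macs)]


def estimate_cycles(rows, n_rc=N_RC, n_ccdab=N_CCDAB):
    """Toy cost model: a row miss costs n_rc, a row hit costs n_ccdab."""
    cycles = 0
    open_row = None
    for row in rows:
        if row == open_row:
            cycles += n_ccdab  # row hit: no new activation
        else:
            cycles += n_rc     # row miss: pay the full row cycle
            open_row = row
    return cycles


baseline = estimate_cycles(scattered_rows(64))
row_hit = estimate_cycles(row_hit_rows(64))
print(baseline, row_hit)  # 2688 332
```

In this toy model the row-hit ordering is roughly 8x cheaper, but that figure depends entirely on the assumed cycle counts and should not be read as a reproduction of the reported speedup.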

Section 05

Experimental Validation: Performance and Energy Efficiency Improvement Data of RH+

Validation on a cycle-accurate simulator shows:

8-12x execution speedup;
Over 74% energy reduction;
52x improvement in EDP (Energy Delay Product).
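As a quick consistency check on how these three figures relate, EDP is the product of energy and delay, so an EDP improvement factor equals the speedup divided by the fraction of energy that remains. The helper name `edp_improvement` is hypothetical, and pairing the reported speedup and energy numbers this way is an assumption; the article may report them for different workloads or configurations.

```python
def edp_improvement(speedup, energy_reduction):
    """Return baseline EDP divided by optimized EDP.

    EDP = energy * delay, so the improvement factor is
    speedup / (1 - energy_reduction), with energy_reduction in [0, 1).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0 <= energy_reduction < 1:
        raise ValueError("energy_reduction must be in [0, 1)")
    return speedup / (1.0 - energy_reduction)


for speedup in (8, 12):
    # 0.74 is the lower bound quoted in the article ("over 74%")
    print(speedup, round(edp_improvement(speedup, 0.74), 1))
```

With an energy reduction of exactly 74%, speedups of 8x and 12x correspond to EDP improvements of roughly 31x and 46x under this relation. A 52x improvement would therefore need an energy reduction somewhat above 74%, which fits the "over 74%" wording.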

Section 06

Practical Insights: Key Directions for PIM System Design

Insights from RH+:

Address mapping needs to be customized based on workload access patterns;
Hardware-software co-design is crucial;
Cycle-accurate simulation is necessary for identifying core bottlenecks.

Section 07

Limitations and Future Research Directions

RH+ has not yet been evaluated in the following areas, which are left for future work:

Extending to multi-bank parallel scenarios;
Adapting to other operations like attention computation in LLM inference;
Validation on actual HBM3 PIM hardware.

Section 08

Conclusion: Value and Core Insights of RH+

RH+ achieves its gains by correctly identifying the real bottleneck (nRC) and applying a simple stride adjustment. Its success suggests that understanding a system's core bottleneck matters more than building complex optimizations, and that a simple solution can often unlock the hardware's potential. This offers a useful direction for optimizing LLM inference on PIM architectures.
