Zing Forum

RH+ Scheduling: A New Breakthrough in Row-Hit Optimization for LLM Inference on PIM Architectures

This article argues that the real bottleneck of LLM inference on PIM architectures is the DRAM row cycle time (nRC), not nCCDAB as previously thought. It proposes the RH+ scheduling strategy, which keeps 32 consecutive MAC operations within the same row through a simple adjustment of the access step (stride), yielding an 8-12x speedup and an energy reduction of over 74%.

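Tags: PIM Architecture, In-Memory Computing, LLM Inference, DRAM Optimization, Address Mapping, Energy Efficiency Optimization
Published 2026-06-04 07:33 | Recent activity 2026-06-05 14:53 | Estimated read 5 min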

Section 01

RH+ Scheduling: A New Breakthrough in Row-Hit Optimization for LLM Inference on PIM Architectures (Introduction)

This article argues that the real bottleneck of LLM inference on PIM architectures is the DRAM row cycle time (nRC), not nCCDAB as previously assumed. It proposes the RH+ scheduling strategy, which keeps 32 consecutive MAC operations within the same row through a simple adjustment of the access step (stride). The reported results are an 8-12x speedup, an energy reduction of over 74%, and a 52x improvement in EDP (Energy Delay Product). The strategy is compatible with existing HBM3 specifications and requires no hardware modifications.

Section 02

Background: The Memory Wall Problem and the Rise of PIM Architectures

The exponential growth in the parameter counts of large language models has exposed the "memory wall" of traditional von Neumann architectures, where moving data between memory and processor dominates the cost. Processing-in-Memory (PIM) architectures ease this bottleneck by performing multiply-accumulate (MAC) operations inside DRAM. HBM3 already supports PIM functionality, but previous studies identified nCCDAB as the main bottleneck. This article finds that the real bottleneck is nRC, which is 10-11 times larger than nCCDAB.
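
To see why the larger of the two timing parameters matters, here is a minimal sketch of the average cost per MAC command as a function of the row-hit rate. It is a toy model, not the article's simulator: it assumes a row hit costs nCCDAB and a row miss costs nRC, works in units of nCCDAB, and takes nRC = 10.5 x nCCDAB as the midpoint of the 10-11x ratio quoted above. The function name and parameters are hypothetical.

```python
def avg_cost_per_mac(row_hit_rate: float,
                     n_ccdab: float = 1.0,
                     n_rc: float = 10.5) -> float:
    """Average issue interval per MAC command in a toy two-cost model.

    Assumptions (not from the article): a row hit costs n_ccdab and a
    row miss costs n_rc. The default n_rc is the midpoint of the 10-11x
    ratio quoted in the text, expressed in units of n_ccdab.
    """
    if not 0.0 <= row_hit_rate <= 1.0:
        raise ValueError("row_hit_rate must be in [0, 1]")
    return row_hit_rate * n_ccdab + (1.0 - row_hit_rate) * n_rc


# Sweep a few hit rates to see which parameter dominates.
for rate in (0.0, 0.5, 1.0):
    cost = avg_cost_per_mac(rate)
    print(f"hit rate {rate:.2f}: {cost:.2f} x nCCDAB per MAC")
```

In this model the cost per MAC is set almost entirely by nRC whenever the row-hit rate is low, and only approaches nCCDAB as the hit rate approaches 1.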

Section 03

Root Cause of the Problem: Drawbacks of Host-Centric Address Interleaving

Existing PIM systems use host-centric address interleaving, which scatters consecutive MAC operations across different DRAM rows. As a result, each all-bank MAC command triggers an expensive row switch (a precharge followed by an activation), whose time overhead far exceeds that of the computation itself.
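
The effect of scattering can be illustrated by expanding a list of target rows into a simplified DRAM command stream. This is a sketch under stated assumptions: one open row per bank, a row change requires PRE (precharge) then ACT (activate), and the two row sequences are hypothetical examples rather than traces from the article.

```python
from typing import Iterable, List, Optional


def expand_commands(rows: Iterable[int]) -> List[str]:
    """Expand per-MAC target rows into a simplified command stream.

    Assumption: a bank keeps one row open at a time, so switching rows
    needs PRE and then ACT before the next MAC can be issued.
    """
    commands: List[str] = []
    open_row: Optional[int] = None
    for row in rows:
        if row != open_row:
            if open_row is not None:
                commands.append("PRE")      # close the currently open row
            commands.append(f"ACT r{row}")  # open the target row
            open_row = row
        commands.append("MAC")
    return commands


scattered = [0, 1, 2, 3]  # hypothetical: every MAC lands in a new row
grouped = [0, 0, 0, 0]    # hypothetical: all MACs share one row

print(expand_commands(scattered))
print(expand_commands(grouped))
```

In the scattered case every MAC after the first is preceded by a PRE/ACT pair, while the grouped case pays for a single activation.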

Section 04

RH+ Scheduling Strategy: Core Design with Simple Step Adjustment

RH+ scheduling adjusts the access step (stride) so that 32 consecutive MAC operations stay within the same DRAM row, matching the 32 MAC units per bank in HBM3. It requires no hardware modifications or additional storage and is compatible with HBM3 specifications. It preserves parallelism while exploiting row hits: once a row has been activated, subsequent accesses to that row incur no further activation delay.
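
The idea of a step adjustment can be sketched as follows. The linear layout, the `SLOTS_PER_ROW` constant, and both function names are assumptions made for illustration; the article does not spell out the actual mapping. The sketch only shows how the choice of step changes the row-hit rate when a row holds operands for 32 MACs.

```python
SLOTS_PER_ROW = 32  # assumed: one row holds the operands of 32 MACs


def row_of(mac_index: int, step: int) -> int:
    """Row touched by the mac_index-th MAC in an assumed linear layout,
    where consecutive MACs are `step` operand slots apart."""
    return (mac_index * step) // SLOTS_PER_ROW


def row_hit_rate(num_macs: int, step: int) -> float:
    """Fraction of MACs that find their row already open."""
    if num_macs <= 0:
        raise ValueError("num_macs must be positive")
    hits = 0
    open_row = None
    for i in range(num_macs):
        row = row_of(i, step)
        if row == open_row:
            hits += 1
        open_row = row
    return hits / num_macs


# A step of one full row puts every MAC in a new row (no row hits).
print(row_hit_rate(1024, step=SLOTS_PER_ROW))
# A step of one slot keeps 32 consecutive MACs in each row.
print(row_hit_rate(1024, step=1))
```

With a step of one slot, only the first MAC in each group of 32 misses, so 31 out of every 32 MACs are row hits in this model.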

Section 05

Experimental Validation: Performance and Energy Efficiency Gains of RH+

Results from validation on a cycle-accurate simulator:

  1. 8-12x execution speedup;
  2. Over 74% energy reduction;
  3. 52x improvement in EDP (Energy Delay Product).
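
The three figures are related through the definition EDP = energy x delay. The short calculation below is a sketch of that arithmetic only; which workloads or configurations produced each reported number is not stated in the source, so the inputs here are illustrative.

```python
def edp_improvement(speedup: float, energy_reduction: float) -> float:
    """Improvement factor in EDP (energy x delay), old EDP / new EDP.

    speedup:          old delay / new delay, e.g. 8.0 for 8x
    energy_reduction: fraction of energy saved, e.g. 0.74 for 74%
    """
    if speedup <= 0 or not 0.0 <= energy_reduction < 1.0:
        raise ValueError("need speedup > 0 and 0 <= energy_reduction < 1")
    delay_ratio = 1.0 / speedup            # new delay / old delay
    energy_ratio = 1.0 - energy_reduction  # new energy / old energy
    return 1.0 / (delay_ratio * energy_ratio)


# Both ends of the reported speedup range at exactly 74% energy reduction.
for speedup in (8.0, 12.0):
    print(f"{speedup:.0f}x speedup -> {edp_improvement(speedup, 0.74):.1f}x EDP")
```

At exactly 74% energy reduction, 8x and 12x speedups correspond to roughly 31x and 46x EDP improvement. A 52x figure would follow from, for example, a 12x speedup combined with about 77% energy reduction, which is consistent with the "over 74%" wording.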

Section 06

Practical Insights: Key Directions for PIM System Design

Insights from RH+:

  1. Address mapping should be tailored to the workload's access patterns;
  2. Hardware-software co-design is crucial;
  3. Cycle-accurate simulation is necessary for identifying the core bottlenecks.

Section 07

Limitations and Future Research Directions

Limitations of RH+ and directions for future work:

  1. Extending to multi-bank parallel scenarios;
  2. Adapting to other operations like attention computation in LLM inference;
  3. Validation on actual HBM3 PIM hardware.

Section 08

Conclusion: Value and Core Insights of RH+

RH+ achieves its gains by precisely identifying the real bottleneck (nRC) and applying a simple step adjustment. Its results suggest that understanding a system's core bottleneck matters more than adding complex optimizations, and that simple solutions can often unlock the hardware's potential. This offers a key optimization idea for LLM inference on PIM architectures.