Zing Forum


TokenStack: Heterogeneous HBM-PIM Architecture Breaks Through KV Cache Bottleneck in LLM Inference

TokenStack leverages HBM4's logic substrate to split the storage stack into high-density capacity layers and PIM compute layers. Through topology-aware KV placement and load-aware eviction strategies, it achieves a 1.62x throughput improvement and a 30-47% reduction in energy consumption.

Tags: TokenStack · HBM-PIM · KV Cache · In-Memory Computing · Heterogeneous Architecture · LLM Inference · HBM4 · Attention Computation
Published 2026-05-07 11:47 · Recent activity 2026-05-08 11:50 · Estimated read 7 min
Section 01

TokenStack: A Heterogeneous HBM-PIM Architecture to Break the LLM Inference KV Cache Bottleneck

TokenStack addresses the KV cache bottleneck in LLM inference using a vertical heterogeneous HBM-PIM architecture based on HBM4's logic substrate. It splits storage stacks into high-density capacity layers and PIM compute layers, with topology-aware KV placement and load-aware eviction strategies. Key benefits include 1.62x throughput improvement and 30-47% energy reduction compared to existing solutions.


Section 02

KV Cache Bottleneck & Limitations of Current HBM-PIM Solutions

The KV cache is a major bottleneck in LLM inference: decoding each new token requires reading all previous KV states, making attention both bandwidth- and capacity-intensive. HBM-PIM offers a path forward, but existing designs have drawbacks:

  • Unified PIM stacks: Every layer pays the area and power cost of PIM logic, even when it sits idle.
  • Dedicated PIM designs: Separating PIM and storage layers cuts the HBM bandwidth available to GPU-side tasks (such as weight access), creating new bottlenecks.
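To see why capacity and bandwidth both matter, here is a back-of-the-envelope sketch. The model dimensions below are illustrative assumptions for a 7B-class model, not figures from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Total KV cache footprint: a K and a V tensor per layer, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class config (assumed): 32 layers, 32 KV heads,
# head_dim 128, fp16 values, 4K context, batch of 16 concurrent requests.
total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=4096, batch=16)
print(f"{total / 2**30:.0f} GiB")  # 32 GiB -- and attention re-reads it for every decoded token
```

At this scale the cache alone rivals a GPU's entire HBM capacity, and the per-token re-read is what makes decoding bandwidth-bound.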

Section 03

TokenStack's Vertical Heterogeneous HBM-PIM Design

TokenStack leverages HBM4's logic substrate to build a vertical heterogeneous architecture:

  1. Layer division:
    • High-density capacity layers: For weights, activations, cold KV (no PIM logic, cost-effective, high GPU bandwidth).
    • PIM compute layers: For hot KV attention (integrated PIM, low latency/energy).
  2. Logic substrate controller: Manages cross-layer DMA, hierarchical address translation, attention data coordination, and inline quantization (transparent to upper software).
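A minimal sketch of how such a logic-substrate controller might route accesses across the two layer types. All class and method names here are hypothetical; the article does not specify an interface:

```python
from enum import Enum, auto

class Layer(Enum):
    CAPACITY = auto()  # high-density dies: weights, activations, cold KV
    PIM = auto()       # compute dies: hot KV served by near-memory attention units

class SubstrateController:
    """Toy hierarchical address translation: logical block id -> physical layer."""
    def __init__(self):
        self.placement = {}  # block_id -> Layer

    def place(self, block_id, hot):
        # Hot KV goes to the PIM dies; everything else to capacity dies
        self.placement[block_id] = Layer.PIM if hot else Layer.CAPACITY

    def route(self, block_id):
        # Anything not explicitly marked hot defaults to the capacity dies
        return self.placement.get(block_id, Layer.CAPACITY)

ctrl = SubstrateController()
ctrl.place("kv:req7:blk0", hot=True)
print(ctrl.route("kv:req7:blk0"))    # Layer.PIM
print(ctrl.route("weights:layer0"))  # Layer.CAPACITY
```

The point of keeping this mapping in the substrate is transparency: the GPU and the inference framework issue ordinary addresses, and the controller decides which die services them.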

Section 04

Runtime Smart Data Management for TokenStack

TokenStack's runtime system optimizes data handling:

  • Topology-aware KV placement: Hot KV → PIM layers; warm KV → dynamic migration based on future access prediction; cold KV → compressed in capacity layers.
  • Load-aware eviction: Preferentially evicts least-recently-used blocks, retains blocks with larger attention spans, and uses request-pattern prediction to anticipate future accesses.
  • Bounded replication: Allows limited copies of hot KV in both layers to balance access efficiency and storage overhead.
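The eviction policy above can be sketched roughly as follows. The two-key ranking is an assumption for illustration; this summary does not describe the paper's actual predictor:

```python
from dataclasses import dataclass

@dataclass
class KVBlock:
    block_id: str
    last_access: int     # logical timestamp of the most recent read
    attention_span: int  # how many future tokens are expected to attend to this block

def pick_victims(blocks, n_evict):
    """Load-aware eviction sketch: prefer stale blocks, but keep blocks with a
    wide attention span resident even if they were not touched recently."""
    # Lower rank = evicted first: narrow span, then older access time
    ranked = sorted(blocks, key=lambda b: (b.attention_span, b.last_access))
    return [b.block_id for b in ranked[:n_evict]]

blocks = [
    KVBlock("a", last_access=10, attention_span=2),
    KVBlock("b", last_access=50, attention_span=2),
    KVBlock("c", last_access=5, attention_span=90),  # wide span: retained
]
print(pick_victims(blocks, 1))  # ['a']
```

Note that block "c" survives despite being the oldest, which is the behavior the load-aware policy aims for: attention span, not recency alone, decides residency.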

Section 05

Experimental Results: Performance & Energy Efficiency Gains

Evaluations on production traces with 4 mainstream models show:

  • Throughput: 1.62x geometric mean token throughput vs AttAcc; 1.70x SLO-compliant service capacity.
  • Energy: 30-47% per-token energy reduction.
  • High QPS: Maintains its advantage under high concurrency, as the heterogeneous layers disperse bandwidth pressure.

Section 06

HBM4's Role & Deployment Considerations

TokenStack relies on HBM4's key features:

  • Logic substrate: HBM4's integrated logic die (traditionally for interfaces) is repurposed as a smart controller.
  • Vertical stack: Natural for heterogeneous layers, more energy-efficient than planar designs.

Deployment considerations:

  • Hardware: Requires HBM4 (upgrade existing GPU infrastructure or adopt in new data centers).
  • Software: Needs integration with inference frameworks (vLLM, TensorRT-LLM) for transparency.
  • Workload: Most beneficial for KV-intensive tasks (long context, document generation); less for short queries.
  • Scalability: Supports multi-GPU but needs careful cross-GPU KV management.

Section 07

Limitations & Future Improvements of TokenStack

Current limitations and future work:

  • Static layers: Fixed layer roles; future could explore dynamic reconfiguration based on workload.
  • Prediction accuracy: Improve KV access prediction with advanced ML models.
  • Sparse attention synergy: Optimize with sparse attention (sliding window, local attention) to reduce KV needs.
  • Multi-modal extension: Adapt to handle KV cache for image tokens in multi-modal models.

Section 08

Conclusion & Industry Implications of TokenStack

TokenStack provides an elegant solution to LLM KV cache bottlenecks via heterogeneous HBM-PIM architecture, with significant throughput and energy gains. It demonstrates hardware-software co-design for AI workloads.

Industry impact:

  • Hardware: Pushes HBM-PIM innovation for AI-optimized memory.
  • Cloud providers: Reduces LLM service costs.
  • End users: Faster, cheaper AI services.

As HBM4 becomes widely adopted, similar innovations will drive LLM efficiency further.