Zing Forum


TokenStack: Heterogeneous HBM-PIM Architecture Breaks Through KV Cache Bottleneck in LLM Inference

TokenStack leverages HBM4's logic substrate to split the storage stack into high-density capacity layers and PIM compute layers. Through topology-aware KV placement and load-aware eviction strategies, it achieves a 1.62x throughput improvement and a 30-47% reduction in energy consumption.

Tags: TokenStack · HBM-PIM · KV Cache · In-Memory Computing · Heterogeneous Architecture · LLM Inference · HBM4 · Attention Computation
Published 2026-05-07 11:47 · Recent activity 2026-05-08 11:50 · Estimated read 7 min
Section 01

TokenStack: A Heterogeneous HBM-PIM Architecture to Break the LLM Inference KV Cache Bottleneck

TokenStack addresses the KV cache bottleneck in LLM inference using a vertical heterogeneous HBM-PIM architecture based on HBM4's logic substrate. It splits storage stacks into high-density capacity layers and PIM compute layers, with topology-aware KV placement and load-aware eviction strategies. Key benefits include 1.62x throughput improvement and 30-47% energy reduction compared to existing solutions.


Section 02

KV Cache Bottleneck & Limitations of Current HBM-PIM Solutions

The KV cache is a major bottleneck in LLM inference: decoding each new token requires reading all previous KV states, making attention both bandwidth- and capacity-intensive. HBM-PIM offers a path forward, but existing designs have drawbacks:

  • Unified PIM stacks: Every layer pays the area and power cost of PIM logic, even when it sits idle.
  • Dedicated PIM designs: Separating PIM and storage layers cuts the HBM bandwidth available to GPU-side tasks (such as weight access), creating new bottlenecks.
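To see why capacity and bandwidth both matter, here is a back-of-the-envelope sketch. The model dimensions below are illustrative assumptions for a 7B-class model, not figures from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Total KV cache footprint: a K and a V tensor per layer, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class config (assumed): 32 layers, 32 KV heads,
# head_dim 128, fp16 values, 4K context, batch of 16 concurrent requests.
total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=4096, batch=16)
print(f"{total / 2**30:.0f} GiB")  # 32 GiB -- and attention re-reads it for every decoded token
```

At this scale the cache alone rivals a GPU's entire HBM capacity, and the per-token re-read is what makes decoding bandwidth-bound.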

Section 03

TokenStack's Vertical Heterogeneous HBM-PIM Design

TokenStack leverages HBM4's logic substrate to build a vertical heterogeneous architecture:

  1. Layer division:
    • High-density capacity layers: For weights, activations, cold KV (no PIM logic, cost-effective, high GPU bandwidth).
    • PIM compute layers: For hot KV attention (integrated PIM, low latency/energy).
  2. Logic substrate controller: Manages cross-layer DMA, hierarchical address translation, attention data coordination, and inline quantization (transparent to upper software).
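A minimal sketch of how such a logic-substrate controller might route accesses across the two layer types. All class and method names here are hypothetical; the article does not specify an interface:

```python
from enum import Enum, auto

class Layer(Enum):
    CAPACITY = auto()  # high-density dies: weights, activations, cold KV
    PIM = auto()       # compute dies: hot KV served by near-memory attention units

class SubstrateController:
    """Toy hierarchical address translation: logical block id -> physical layer."""
    def __init__(self):
        self.placement = {}  # block_id -> Layer

    def place(self, block_id, hot):
        # Hot KV goes to the PIM dies; everything else to capacity dies
        self.placement[block_id] = Layer.PIM if hot else Layer.CAPACITY

    def route(self, block_id):
        # Anything not explicitly marked hot defaults to the capacity dies
        return self.placement.get(block_id, Layer.CAPACITY)

ctrl = SubstrateController()
ctrl.place("kv:req7:blk0", hot=True)
print(ctrl.route("kv:req7:blk0"))    # Layer.PIM
print(ctrl.route("weights:layer0"))  # Layer.CAPACITY
```

The point of keeping this mapping in the substrate is transparency: the GPU and the inference framework issue ordinary addresses, and the controller decides which die services them.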

Section 04

Runtime Smart Data Management for TokenStack

TokenStack's runtime system optimizes data handling:

  • Topology-aware KV placement: Hot KV → PIM layers; warm KV → dynamic migration based on future access prediction; cold KV → compressed in capacity layers.
  • Load-aware eviction: Preferentially evicts least-recently-used blocks, retains blocks with larger attention spans, and uses request-pattern prediction to anticipate future accesses.
  • Bounded replication: Allows limited copies of hot KV in both layers to balance access efficiency and storage overhead.
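The eviction policy above can be sketched roughly as follows. The two-key ranking is an assumption for illustration; this summary does not describe the paper's actual predictor:

```python
from dataclasses import dataclass

@dataclass
class KVBlock:
    block_id: str
    last_access: int     # logical timestamp of the most recent read
    attention_span: int  # how many future tokens are expected to attend to this block

def pick_victims(blocks, n_evict):
    """Load-aware eviction sketch: prefer stale blocks, but keep blocks with a
    wide attention span resident even if they were not touched recently."""
    # Lower rank = evicted first: narrow span, then older access time
    ranked = sorted(blocks, key=lambda b: (b.attention_span, b.last_access))
    return [b.block_id for b in ranked[:n_evict]]

blocks = [
    KVBlock("a", last_access=10, attention_span=2),
    KVBlock("b", last_access=50, attention_span=2),
    KVBlock("c", last_access=5, attention_span=90),  # wide span: retained
]
print(pick_victims(blocks, 1))  # ['a']
```

Note that block "c" survives despite being the oldest, which is the behavior the load-aware policy aims for: attention span, not recency alone, decides residency.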

Section 05

Experimental Results: Performance & Energy Efficiency Gains

Evaluations on production traces with 4 mainstream models show:

  • Throughput: 1.62x geometric mean token throughput vs AttAcc; 1.70x SLO-compliant service capacity.
  • Energy: 30-47% per-token energy reduction.
  • High QPS: Maintains its advantage under high concurrency, as the heterogeneous layers disperse bandwidth pressure.

Section 06

HBM4's Role & Deployment Considerations

TokenStack relies on HBM4's key features:

  • Logic substrate: HBM4's integrated logic die (traditionally for interfaces) is repurposed as a smart controller.
  • Vertical stack: Natural for heterogeneous layers, more energy-efficient than planar designs.

Deployment considerations:

  • Hardware: Requires HBM4 (upgrade existing GPU infrastructure or adopt in new data centers).
  • Software: Needs integration with inference frameworks (vLLM, TensorRT-LLM) for transparency.
  • Workload: Most beneficial for KV-intensive tasks (long context, document generation); less for short queries.
  • Scalability: Supports multi-GPU but needs careful cross-GPU KV management.

Section 07

Limitations & Future Improvements of TokenStack

Current limitations and future work:

  • Static layers: Fixed layer roles; future could explore dynamic reconfiguration based on workload.
  • Prediction accuracy: Improve KV access prediction with advanced ML models.
  • Sparse attention synergy: Optimize with sparse attention (sliding window, local attention) to reduce KV needs.
  • Multi-modal extension: Adapt to handle KV cache for image tokens in multi-modal models.

Section 08

Conclusion & Industry Implications of TokenStack

TokenStack provides an elegant solution to LLM KV cache bottlenecks via heterogeneous HBM-PIM architecture, with significant throughput and energy gains. It demonstrates hardware-software co-design for AI workloads.

Industry impact:

  • Hardware: Pushes HBM-PIM innovation for AI-optimized memory.
  • Cloud providers: Reduces LLM service costs.
  • End users: Faster, cheaper AI services.

As HBM4 becomes widely adopted, similar innovations will drive LLM efficiency further.