# TokenStack: Heterogeneous HBM-PIM Architecture Breaks Through KV Cache Bottleneck in LLM Inference

> TokenStack leverages HBM4's logic substrate to split the storage stack into high-density capacity layers and PIM compute layers. Through topology-aware KV placement and load-aware eviction strategies, it achieves a 1.62x throughput improvement and a 30-47% reduction in energy consumption.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-07T03:47:18.000Z
- Last activity: 2026-05-08T03:50:55.406Z
- Heat: 135.9
- Keywords: TokenStack, HBM-PIM, KV cache, processing-in-memory, heterogeneous architecture, LLM inference, HBM4, attention computation
- Page link: https://www.zingnex.cn/en/forum/thread/tokenstack-hbm-pimllmkv
- Canonical: https://www.zingnex.cn/forum/thread/tokenstack-hbm-pimllmkv
- Markdown source: floors_fallback

---

## TokenStack: Heterogeneous HBM-PIM Architecture to Break LLM Inference KV Cache Bottleneck

TokenStack addresses the KV cache bottleneck in LLM inference with a vertically heterogeneous HBM-PIM architecture built on HBM4's logic substrate. It splits the storage stack into high-density capacity layers and PIM compute layers, and manages data with topology-aware KV placement and load-aware eviction. Compared to existing solutions, it delivers a 1.62x throughput improvement and a 30-47% reduction in per-token energy.

## KV Cache Bottleneck & Limitations of Current HBM-PIM Solutions

The KV cache is a major bottleneck in LLM inference: decoding each new token requires reading all previous KV states, making attention both bandwidth- and capacity-intensive. HBM-PIM (processing-in-memory inside the HBM stack) is a promising direction, but existing designs have flaws:
- **Unified PIM stacks**: Every layer pays the area and power cost of PIM logic, even when it goes unused.
- **Dedicated PIM designs**: Separating PIM and storage into distinct stacks reduces the HBM bandwidth available to GPU-side tasks (such as weight access), creating a new bottleneck.
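
To make the pressure concrete, here is a back-of-envelope sizing for a hypothetical 70B-class model with grouped-query attention. All dimensions below are illustrative assumptions, not figures from the article:

```python
# Rough KV cache sizing for a hypothetical 70B-class model with
# grouped-query attention. All dimensions are illustrative assumptions.
layers = 80        # transformer layers
kv_heads = 8       # KV heads (grouped-query attention)
head_dim = 128     # per-head dimension
dtype_bytes = 2    # fp16

# Each token stores one K and one V vector per layer.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"KV per token: {bytes_per_token / 1024:.0f} KiB")

# A 32K-token context with 16 concurrent requests.
seq_len, batch = 32_768, 16
total = bytes_per_token * seq_len * batch
print(f"KV cache total: {total / 2**30:.0f} GiB")

# Every decode step streams the full per-request cache through the
# attention units, so read-bandwidth demand grows with context length.
per_step_read = bytes_per_token * seq_len
print(f"Read per decode step, per request: {per_step_read / 2**30:.0f} GiB")
```

Even under these modest assumptions the cache reaches hundreds of GiB and each decode step re-reads gigabytes per request, which is exactly the pressure TokenStack's split stack targets.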

## TokenStack's Vertical Heterogeneous HBM-PIM Design

TokenStack leverages HBM4's logic substrate to build a vertical heterogeneous architecture:
1. **Layer division**:
   - **High-density capacity layers**: For weights, activations, cold KV (no PIM logic, cost-effective, high GPU bandwidth).
   - **PIM compute layers**: For hot KV attention (integrated PIM, low latency/energy).
2. **Logic substrate controller**: Manages cross-layer DMA, hierarchical address translation, attention data coordination, and inline quantization (transparent to upper software).
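
The article does not specify the inline quantization scheme. As one plausible form, a per-block symmetric int8 scheme that the controller could apply transparently to cold KV blocks might look like this (scheme and function names are assumptions):

```python
# Sketch of per-block symmetric int8 quantization, one plausible form of
# the "inline quantization" the logic-substrate controller could apply.
# Illustrative only; the article does not specify a scheme.

def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Map float values to int8 codes with one scale per block."""
    scale = max(abs(v) for v in block) / 127 or 1.0  # 1.0 guards all-zero blocks
    q = [round(v / scale) for v in block]
    return q, scale

def dequantize_block(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from codes and the block scale."""
    return [v * scale for v in q]

kv = [0.5, -1.27, 0.3, 0.75]          # a tiny stand-in for a KV block
q, s = quantize_block(kv)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(kv, restored))
print(q, err)
```

Storing 1 byte per value instead of 2 (fp16) halves the footprint of cold KV in the capacity layers at the cost of a small, bounded reconstruction error.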

## Runtime Smart Data Management for TokenStack

TokenStack's runtime system optimizes data handling:
- **Topology-aware KV placement**: Hot KV → PIM layers; warm KV → dynamic migration based on future access prediction; cold KV → compressed in capacity layers.
- **Load-aware eviction**: Prioritizes evicting least recently used blocks, retains blocks with larger attention spans, uses request pattern prediction.
- **Bounded replication**: Allows limited copies of hot KV in both layers to balance access efficiency and storage overhead.
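
The three policies above can be sketched as a toy placement/eviction routine. Tier names, thresholds, and the scoring function are illustrative assumptions, not the paper's actual algorithm:

```python
# Toy sketch of topology-aware placement and load-aware eviction for KV
# blocks. All names, thresholds, and scores are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class KVBlock:
    block_id: int
    last_access: int   # logical timestamp of the most recent read
    attn_span: int     # how many future tokens still attend to this block
    hotness: float     # predicted accesses per decode step

def place(block: KVBlock, hot_thresh=0.5, cold_thresh=0.05) -> str:
    """Route a block to a stack tier by predicted access rate."""
    if block.hotness >= hot_thresh:
        return "pim_layer"        # served by in-stack attention units
    if block.hotness <= cold_thresh:
        return "capacity_layer"   # stored compressed, no PIM logic
    return "migratable"           # warm: moved on demand

def eviction_order(blocks, now: int):
    """LRU-first, but blocks with wide attention spans score lower
    (they are kept longer, since refetching them is expensive)."""
    score = lambda b: (now - b.last_access) / (1 + b.attn_span)
    return sorted(blocks, key=score, reverse=True)

blocks = [
    KVBlock(0, last_access=90, attn_span=2,  hotness=0.8),
    KVBlock(1, last_access=10, attn_span=50, hotness=0.2),
    KVBlock(2, last_access=5,  attn_span=1,  hotness=0.01),
]
print([place(b) for b in blocks])
print([b.block_id for b in eviction_order(blocks, now=100)])
```

Note how block 1 is old but survives eviction longest because of its wide attention span, matching the retention rule described above.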

## Experimental Results: Performance & Energy Efficiency Gains

Evaluations on production traces with 4 mainstream models show:
- **Throughput**: 1.62x geometric-mean token throughput vs. AttAcc; 1.70x SLO-compliant serving capacity.
- **Energy**: 30-47% reduction in per-token energy.
- **High QPS**: Better performance under high concurrency, since the heterogeneous architecture spreads bandwidth pressure across the capacity and PIM layers.

## HBM4's Role & Deployment Considerations

TokenStack relies on HBM4's key features:
- **Logic substrate**: HBM4's integrated logic die (traditionally for interfaces) is repurposed as a smart controller.
- **Vertical stack**: A natural fit for heterogeneous layers; short vertical data paths make it more energy-efficient than planar designs.

Deployment considerations:
- **Hardware**: Requires HBM4 (upgrade existing GPU infrastructure or adopt in new data centers).
- **Software**: Needs integration with inference frameworks (vLLM, TensorRT-LLM) for transparency.
- **Workload**: Most beneficial for KV-intensive tasks (long context, document generation); less for short queries.
- **Scalability**: Supports multi-GPU but needs careful cross-GPU KV management.

## Limitations & Future Improvements of TokenStack

Current limitations and future work:
- **Static layers**: Fixed layer roles; future could explore dynamic reconfiguration based on workload.
- **Prediction accuracy**: Improve KV access prediction with advanced ML models.
- **Sparse attention synergy**: Optimize with sparse attention (sliding window, local attention) to reduce KV needs.
- **Multi-modal extension**: Adapt to handle KV cache for image tokens in multi-modal models.
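
As a concrete example of the sparse-attention synergy, a sliding window bounds the KV footprint that must stay resident in the PIM layers. A minimal sketch (the window size is an arbitrary assumption):

```python
# Minimal sliding-window KV retention, the kind of sparse-attention
# pattern the article suggests pairing with TokenStack. Illustrative only.
from collections import deque

class SlidingWindowKV:
    """Keep KV entries only for the most recent `window` tokens."""
    def __init__(self, window: int):
        self.window = window
        self.cache = deque(maxlen=window)  # old entries fall off automatically

    def append(self, kv_entry):
        self.cache.append(kv_entry)

    def __len__(self):
        return len(self.cache)

kv = SlidingWindowKV(window=4096)
for pos in range(10_000):
    kv.append(("k", "v", pos))       # (key, value, token position)
print(len(kv))                        # capped at the window size
print(kv.cache[0][2])                 # oldest retained token position
```

With a bounded window, the hot KV working set stays constant regardless of generation length, so the PIM layers never overflow into migration traffic.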

## Conclusion & Industry Implications of TokenStack

TokenStack provides an elegant solution to LLM KV cache bottlenecks via heterogeneous HBM-PIM architecture, with significant throughput and energy gains. It demonstrates hardware-software co-design for AI workloads.

Industry impact:
- **Hardware**: Pushes HBM-PIM innovation for AI-optimized memory.
- **Cloud providers**: Reduces LLM service costs.
- **End users**: Faster, cheaper AI services.

As HBM4 becomes widely adopted, similar innovations will drive LLM efficiency further.
