Section 01
TokenStack: Heterogeneous HBM-PIM Architecture to Break LLM Inference KV Cache Bottleneck
TokenStack targets the KV cache bottleneck in LLM inference with a vertically heterogeneous HBM-PIM architecture built on HBM4's logic substrate. It divides the memory stacks into high-density capacity layers and PIM compute layers, and pairs this hardware split with topology-aware KV placement and load-aware eviction strategies. Reported benefits include a 1.62x throughput improvement and a 30-47% energy reduction compared with existing solutions.
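To give a sense of why the KV cache becomes a bottleneck, the back-of-envelope calculation below estimates its footprint for a standard transformer decoder. It is a minimal sketch and is not taken from TokenStack: the function name and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 values) are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # The leading factor of 2 accounts for the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape and workload (assumptions, not from the source).
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

The footprint grows linearly with both sequence length and batch size, so under these assumptions doubling either one doubles the memory that must be held and streamed on every decoding step.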