VitaLLM: A Mixed-Precision Accelerator for Ternary-Weight LLMs on Edge Devices

VitaLLM is a mixed-precision accelerator for ternary-weight LLMs. With its dual-core design (TINT and BoothFlex) and a predictive sparse attention mechanism, it achieves a decoding speed of 72.46 tokens/s and a prefill time of 0.88 seconds in a 16nm process, while occupying only 0.214 mm² of area and 120 KB of on-chip memory.

Edge AI · LLM Accelerator · Ternary Weights · Mixed Precision · Sparse Attention · BitNet · Transformer · Hardware · Low-Power Design
Published 2026-05-01 08:59 · Recent activity 2026-05-04 10:48 · Estimated read 5 min

Section 01

VitaLLM Overview: Key Highlights of the Ternary-Weight LLM Mixed-Precision Accelerator for Edge Devices

VitaLLM is a mixed-precision accelerator for ternary-weight LLMs on edge devices. It adopts a dual-core design (TINT, a multiplier-free ternary-integer projection core, and BoothFlex, a reusable radix-4 Booth datapath) combined with a predictive sparse attention mechanism. In a 16nm process it achieves a decoding speed of 72.46 tokens/s and a prefill time of 0.88 seconds while occupying 0.214 mm² of area and 120 KB of on-chip memory, addressing the precision-efficiency trade-off in edge LLM deployment.


Section 02

Background: Challenges of Edge LLM Deployment

Deploying LLMs on edge devices faces a fundamental tension between precision and computational efficiency: traditional FP16/INT8 quantization still falls short of edge resource constraints, while ternary-weight models (e.g., BitNet b1.58) can preserve output quality at a fraction of the compute, but the lack of dedicated hardware accelerators has limited their practical adoption.


Section 03

Dual-Core Computing Architecture Design

TINT Core: handles the matrix multiplication between ternary weights and integer activations. Because the weights are drawn from {-1, 0, +1}, multipliers can be eliminated outright: computation reduces to look-up tables and sign-selection circuits, cutting both area and power. BoothFlex Core: built on radix-4 Booth encoding, it supports standard INT8×INT8 attention computation as well as ternary-integer computation. Dynamic configuration switching lets the two modes share one array rather than duplicating it.
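
To make the two datapaths concrete, here is a minimal Python sketch of the underlying arithmetic. It is illustrative only: the function names (tint_dot, booth_radix4_digits, booth_mul) are my assumptions, and the real design is a hardware circuit, not software.

```python
def tint_dot(w_ternary, a_int):
    """Multiplier-free ternary dot product (TINT-style sketch).

    Weights come from {-1, 0, +1}, so each term is just +a, 0, or -a:
    hardware can realize this with sign-selection muxes and adders
    instead of multipliers.
    """
    acc = 0
    for w, a in zip(w_ternary, a_int):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0: the term contributes nothing and is skipped
    return acc


def booth_radix4_digits(x, bits=8):
    """Recode a two's-complement integer into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, +1, +2}, halving the number of
    partial products versus bit-serial multiplication. This is the
    textbook recoding a BoothFlex-style datapath would build on.
    """
    mask = (1 << bits) - 1
    # b[0] is the implicit bit below the LSB; b[1:] are the LSB-first bits of x.
    b = [0] + [((x & mask) >> i) & 1 for i in range(bits)]
    return [-2 * b[i + 2] + b[i + 1] + b[i] for i in range(0, bits, 2)]


def booth_mul(a, x, bits=8):
    """Shift-and-add product a*x: each Booth digit d at position k
    contributes d * a scaled by 4**k (a left shift by 2k bits)."""
    return sum(d * a * (4 ** k)
               for k, d in enumerate(booth_radix4_digits(x, bits)))


assert tint_dot([1, 0, -1, 1], [5, 9, 3, -2]) == 0
assert booth_mul(13, -7) == -91
```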


Section 04

Predictive Sparse Attention Mechanism

To address the O(n²) complexity of Transformer attention, the following methods are adopted (see the sketch after this list):

  1. Leading-One Proxy: uses the position of the leading one in the score values as a cheap proxy for attention magnitude;
  2. Comparison-Free Top-K Selector: locates the top-K candidates through bit-pattern analysis alone, reducing KV cache reads and thereby memory bandwidth and compute load.
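
As a rough software analogue of both ideas (the bucketing scheme and function names below are my assumptions, not the paper's circuit; scores are assumed to be non-negative integers after quantization):

```python
def leading_one_pos(x):
    """Bit position of the most significant 1; -1 if x == 0.
    Acts as a cheap floor(log2(x)) proxy for score magnitude."""
    return x.bit_length() - 1 if x > 0 else -1


def proxy_top_k(scores, k):
    """Approximate top-k without pairwise comparisons: bucket score
    indices by leading-one position, then drain the buckets from the
    highest exponent down (a priority scan in hardware)."""
    buckets = [[] for _ in range(max((s.bit_length() for s in scores), default=1) + 1)]
    for i, s in enumerate(scores):
        buckets[leading_one_pos(s) + 1].append(i)  # +1 puts the s == 0 bucket at index 0
    selected = []
    for bucket in reversed(buckets):
        selected.extend(bucket)
        if len(selected) >= k:
            break
    return selected[:k]


# Only the selected keys' cache lines would then be fetched for attention.
print(proxy_top_k([3, 40, 7, 129, 68, 1], k=3))  # [3, 4, 1]: indices of 129, 68, 40
```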

Section 05

System Integration Optimization Strategies

Head-Level Pipelining: time-multiplexes the heads of multi-head attention, maintaining throughput while keeping hardware cost in check. Absmax Quantization Barrier: unifies the cross-core interface by computing and propagating dynamic quantization parameters, simplifying data exchange and bounding communication overhead (a sketch follows).
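
The post does not spell out the barrier's exact arithmetic, but absmax quantization itself is standard in BitNet-style pipelines. Here is a minimal numpy sketch (function name assumed) of computing the scale that such a barrier would pass along with the integer tensor:

```python
import numpy as np

def absmax_quantize_int8(x):
    """Absmax INT8 quantization: map the largest magnitude in x to 127.
    Returns the integer tensor plus the scale needed to dequantize,
    which is the parameter a quantization barrier would propagate."""
    scale = 127.0 / max(np.max(np.abs(x)), 1e-8)  # guard against all-zero input
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, scale

x = np.array([0.31, -1.7, 0.05, 0.9], dtype=np.float32)
q, scale = absmax_quantize_int8(x)
x_hat = q.astype(np.float32) / scale  # dequantized on the consuming core
```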


Section 06

Silicon Implementation and Performance Verification

Implemented in a 16nm process at 1 GHz and 0.8 V. Key metrics on the BitNet b1.58 model: 72.46 tokens/s decoding, 0.88 s prefill (64 tokens), 0.214 mm² area, and 120 KB of on-chip memory. Ablation experiments verify the individual contributions of the sparse attention, the dual-core architecture, and the quantization barrier.


Section 07

Technical Insights and Industry Impact

Mixed precision is the inevitable path for edge AI; in edge scenarios, dedicated architectures are more efficient than general-purpose accelerators; and unlocking their full potential requires deep hardware-algorithm co-design (e.g., exploiting sparsity).


Section 08

Limitations and Future Directions

Limitations: model support is narrow (mainly BitNet b1.58), sparsity behavior on long sequences (>4K tokens) remains to be verified, and training-phase optimization is limited. Future directions: broader quantization schemes (binary/quaternary), dynamic precision switching, and multi-task resource-sharing strategies.