Zing Forum

TernaryLLM: An Inference Acceleration Scheme for Ternary Large Language Models on Edge Devices Based on Additive Sparse GEMM


Tags: ternary quantization · LLM inference acceleration · sparse GEMM · edge computing · FPGA acceleration · model compression · 2-bit quantization
Published 2026-04-18 04:40 · Recent activity 2026-04-18 04:46 · Estimated read: 1 min

Section 01

Introduction / Main Floor: TernaryLLM: An Inference Acceleration Scheme for Ternary Large Language Models on Edge Devices Based on Additive Sparse GEMM

The TernaryLLM project, open-sourced by the FPGA Systems Team at ETH Zurich, quantizes weights to 2-bit ternary values {-1, 0, +1}, which yields 50-90% weight sparsity while maintaining model accuracy. Because every non-zero weight is ±1, the dense multiply-accumulate in GEMM reduces to additions and subtractions, and the Sparse Segment Reduction (SSR) algorithm skips the zeros entirely. The project provides a complete acceleration stack across CPU, GPU, and FPGA for efficient LLM inference on edge devices.
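As a rough illustration of the idea (the function names and the threshold-based quantization scheme below are my own simplifications, not TernaryLLM's actual API or its SSR kernel), the following NumPy sketch ternarizes a weight matrix and then computes a matrix-vector product using only additions and subtractions over the non-zero weights:

```python
import numpy as np

def ternarize(W, thresh_ratio=0.7):
    """Quantize a float weight matrix to {-1, 0, +1} with one shared scale.
    Weights below a magnitude threshold are zeroed, producing sparsity.
    (Generic threshold scheme for illustration; TernaryLLM's exact
    quantizer may differ.)"""
    delta = thresh_ratio * np.mean(np.abs(W))      # zero-threshold
    T = np.zeros(W.shape, dtype=np.int8)
    T[W > delta] = 1
    T[W < -delta] = -1
    nonzeros = np.abs(W[T != 0])
    alpha = nonzeros.mean() if nonzeros.size else 0.0  # scale for +/-1 weights
    return T, alpha

def additive_sparse_matvec(T, alpha, x):
    """Compute y = alpha * (T @ x) without per-weight multiplications:
    for each output row, add x[j] where T[i, j] == +1 and subtract x[j]
    where T[i, j] == -1, skipping zero weights entirely."""
    y = np.zeros(T.shape[0], dtype=x.dtype)
    for i in range(T.shape[0]):
        pos = np.flatnonzero(T[i] == 1)   # columns with weight +1
        neg = np.flatnonzero(T[i] == -1)  # columns with weight -1
        y[i] = x[pos].sum() - x[neg].sum()
    return alpha * y                      # one multiply per output, not per weight

# Example: ternarize a random layer and check sparsity and correctness.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
x = rng.normal(size=16)
T, alpha = ternarize(W)
sparsity = np.mean(T == 0)
y = additive_sparse_matvec(T, alpha, x)
```

On hardware, the same structure is what makes ternary models attractive: the inner loop needs no multipliers, only adders and sign handling, which maps cheaply onto FPGA logic and narrow SIMD on CPUs.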