Section 01
Introduction / Main post: TernaryLLM: An Inference Acceleration Scheme for Ternary Large Language Models on Edge Devices Based on Additive Sparse GEMM
The TernaryLLM project, open-sourced by the FPGA Systems Team at ETH Zurich, combines 2-bit ternary quantization ({-1, 0, +1}) with the Sparse Segment Reduction (SSR) algorithm to reach 50-90% weight sparsity while preserving model accuracy. Because ternary weights reduce every multiplication to an addition, subtraction, or skip, the resulting additive sparse GEMM enables a complete CPU, GPU, and FPGA acceleration stack for efficient LLM inference on edge devices.
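To make the core idea concrete, here is a minimal NumPy sketch of ternary quantization and a multiplication-free matrix-vector product. This is an illustrative assumption of how such a scheme can work, not the project's actual implementation; the function names (`ternary_quantize`, `additive_sparse_gemv`) and the threshold heuristic are hypothetical.

```python
import numpy as np

def ternary_quantize(W, threshold=0.5):
    # Hypothetical quantizer: map full-precision weights to {-1, 0, +1}
    # using a magnitude threshold relative to the mean absolute weight.
    t = threshold * np.mean(np.abs(W))
    Q = np.zeros_like(W, dtype=np.int8)
    Q[W > t] = 1
    Q[W < -t] = -1
    return Q

def additive_sparse_gemv(Q, x):
    # With ternary weights, each output element is just a sum of +x and -x
    # terms: no multiplications are needed, and zero weights are skipped
    # entirely, which is where the 50-90% sparsity pays off.
    y = np.zeros(Q.shape[0], dtype=x.dtype)
    for i in range(Q.shape[0]):
        pos = x[Q[i] == 1].sum()   # gather-and-add for +1 weights
        neg = x[Q[i] == -1].sum()  # gather-and-add for -1 weights
        y[i] = pos - neg
    return y
```

The result matches an ordinary dense `Q @ x`, but the inner loop only gathers and accumulates, which is the property that additive sparse GEMM kernels exploit on CPUs, GPUs, and FPGAs.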