Section 01
Introduction: An Empirical Study on Optimizing LLM Inference via Algorithm-Hardware Co-Design
This study examines algorithm-hardware co-design for LLM inference. It systematically evaluates how low-precision quantization (e.g., INT8, INT4, and AWQ) and structured sparsity affect inference performance, and validates the results across multiple models on mainstream GPUs such as the T4, L4, and A100. The results show how strongly the benefit of each optimization technique depends on the characteristics of the target hardware, and they provide empirical data to support LLM deployment decisions.
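For readers unfamiliar with the two technique families named above, the following is a minimal, illustrative sketch in plain Python. It is not the study's implementation. It assumes symmetric per-tensor INT8 quantization and a 2:4 pattern (keep the two largest-magnitude values in every group of four) as one common example of structured sparsity. The function names and the sample weights are hypothetical, and this section does not specify which schemes or patterns the study actually uses.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (assumed scheme, for illustration).

    Returns the integer codes and the scale needed to map them back to floats.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


def prune_2_of_4(values):
    """2:4 structured sparsity: in each group of 4, zero the 2 smallest magnitudes."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(values), 4):
        group = values[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weights, chosen only to make the effect easy to see.
weights = [0.5, -1.27, 0.02, 0.9, -0.3, 0.1, 0.75, -0.05]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
sparse = prune_2_of_4(weights)

print("INT8 codes:", codes)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("2:4 pruned:", sparse)
```

Both transformations trade a small amount of numerical fidelity for a smaller, more regular representation. Whether that trade yields a speedup depends on whether the target GPU has kernels and hardware units that can exploit the format, which is the kind of dependence the study sets out to measure.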