Section 01
Heterogeneous Computing for Accelerating Large Model Inference: GPU-FPGA Collaborative Optimization of the Memory-Processing Pipeline
This article presents a method for accelerating large language model (LLM) inference on a GPU-FPGA heterogeneous system. Sparse, irregular, and memory-intensive operations in the memory-processing pipeline are offloaded to the FPGA, while compute-intensive operations remain on the GPU, yielding a 1.04x to 2.2x performance improvement and a 1.11x to 4.7x reduction in energy consumption. The core goal is to relieve the memory bottleneck in large-model inference, offering a new direction for efficient AI infrastructure.
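To make the partitioning idea concrete, here is a minimal, hypothetical Python sketch of the dispatch logic described above: each operation is classified by arithmetic intensity (FLOPs per byte of memory traffic), and compute-bound kernels are routed to the GPU while memory-bound, sparse, or irregular kernels are routed to the FPGA. The `Op` class, the threshold value, and the `run_on_gpu`/`run_on_fpga` stubs are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the GPU/FPGA partitioning scheme (not the
# article's actual code). Ops are routed by arithmetic intensity:
# dense, compute-bound kernels stay on the GPU; sparse, irregular,
# memory-bound steps go to the FPGA.

from dataclasses import dataclass

@dataclass
class Op:
    name: str
    flops: float          # floating-point work of the kernel
    bytes_moved: float    # DRAM traffic of the kernel

# FLOPs-per-byte cutoff below which an op counts as memory-bound.
# The real cutoff would be tuned per device; 10.0 is a placeholder.
MEMORY_BOUND_THRESHOLD = 10.0

def run_on_gpu(op: Op) -> None:
    print(f"GPU  <- {op.name} (intensity={op.flops / op.bytes_moved:.2f})")

def run_on_fpga(op: Op) -> None:
    print(f"FPGA <- {op.name} (intensity={op.flops / op.bytes_moved:.2f})")

def dispatch(op: Op) -> None:
    """Route compute-bound ops to the GPU, memory-bound ops to the FPGA."""
    intensity = op.flops / op.bytes_moved
    if intensity >= MEMORY_BOUND_THRESHOLD:
        run_on_gpu(op)
    else:
        run_on_fpga(op)

if __name__ == "__main__":
    # Example per-token pipeline with made-up FLOP/byte figures.
    pipeline = [
        Op("qkv_projection (dense GEMM)", flops=2e12, bytes_moved=4e9),
        Op("kv_cache_gather (sparse)",    flops=1e9,  bytes_moved=8e9),
        Op("ffn (dense GEMM)",            flops=4e12, bytes_moved=6e9),
        Op("token_sampling (irregular)",  flops=5e8,  bytes_moved=2e9),
    ]
    for op in pipeline:
        dispatch(op)
```

Under this (assumed) classification, the dense projection and feed-forward GEMMs land on the GPU, while the KV-cache gather and sampling steps, whose intensity falls below the threshold, land on the FPGA, which matches the division of labor the article describes.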