Zing Forum


Heterogeneous Computing Accelerates Large Model Inference: GPU-FPGA Collaborative Optimization of Memory Processing Pipeline

This article introduces an innovative method for accelerating large language model (LLM) inference on a GPU-FPGA heterogeneous system. It offloads sparse, irregular, memory-intensive memory-processing operations to FPGAs while keeping compute-intensive operations on GPUs, achieving a 1.04x to 2.2x speedup and a 1.11x to 4.7x reduction in energy consumption.

Tags: Heterogeneous Computing · GPU-FPGA Collaboration · Large Model Inference Acceleration · Memory Processing Optimization · Sparse Attention · Energy Efficiency Optimization
Published 2026-03-31 05:03 · Recent activity 2026-04-01 10:17 · Estimated read: 5 min

Section 01

[Main Floor] Introduction to Heterogeneous Computing Accelerating Large Model Inference: GPU-FPGA Collaborative Optimization of Memory Processing Pipeline

This article proposes an innovative method for accelerating large language model (LLM) inference on a GPU-FPGA heterogeneous system. It offloads sparse, irregular, memory-intensive memory-processing operations to FPGAs while keeping compute-intensive operations on GPUs, achieving a 1.04x to 2.2x speedup and a 1.11x to 4.7x reduction in energy consumption. The core goal is to remove the memory bottleneck in large-model inference, offering a new direction for efficient AI infrastructure.


Section 02

Background: Memory Bottleneck in Large Model Inference

As large language model (LLM) capabilities improve and demand for long-context processing grows, techniques such as sparse attention and retrieval-augmented generation (RAG) introduce substantial memory-processing overhead. Studies show this overhead accounts for 22% to 97% of modern LLM inference time, making it a key bottleneck. Traditional GPUs excel at regular, compute-intensive tensor operations but handle sparse, irregular, memory-intensive operations inefficiently, motivating the exploration of flexible heterogeneous architectures.


Section 03

Method Framework: Four-Step Memory Processing Pipeline and Heterogeneous Design Philosophy

The research unifies LLM optimization techniques into a four-step memory-processing framework:

1. Prepare memory: organize preprocessed context into memory entries.
2. Calculate relevance: score each memory entry against the query.
3. Retrieve: fetch the most relevant memory.
4. Apply to inference: integrate the retrieved results into generation.

The core insight is that memory-processing operations are sparse, memory-intensive, and control-intensive, making them a good fit for FPGAs, whereas GPUs suit dense, regular computations such as matrix multiplication. Memory processing is therefore offloaded to FPGAs, while the core Transformer computations stay on the GPUs.
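The four steps above can be sketched in plain Python. This is a toy illustration, not the paper's implementation: the function names are invented for exposition, and relevance is reduced to a simple dot product over small vectors.

```python
# Hypothetical sketch of the four-step memory-processing pipeline.
# All names are illustrative; the real system runs these stages as
# hardware kernels, not Python functions.

def prepare_memory(contexts):
    """Step 1: organize preprocessed context into memory entries."""
    return [{"id": i, "vec": v} for i, v in enumerate(contexts)]

def score_relevance(memory, query):
    """Step 2: score each memory entry against the query (toy dot product)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return [(entry["id"], dot(entry["vec"], query)) for entry in memory]

def retrieve(scores, k):
    """Step 3: keep the k most relevant entries (sparse, irregular access)."""
    return [i for i, _ in sorted(scores, key=lambda s: -s[1])[:k]]

def apply_to_inference(memory, selected):
    """Step 4: hand only the selected entries to the generation step."""
    return [memory[i]["vec"] for i in selected]

memory = prepare_memory([[1, 0], [0, 1], [1, 1]])
scores = score_relevance(memory, query=[1, 0])
top = retrieve(scores, k=2)
print(apply_to_inference(memory, top))  # the two most query-aligned vectors
```

Note how steps 2 and 3 touch every memory entry but do little arithmetic per entry; that memory-bound, control-heavy profile is exactly what the paper argues suits FPGAs rather than GPUs.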


Section 04

System Implementation: AMD MI210 + Alveo U55C Heterogeneous Architecture

The team implemented the architecture on AMD MI210 GPUs and Alveo U55C FPGAs. The FPGA side handles sparse-attention indexing, Top-K retrieval, memory compression/decompression, and similar operations; the GPU side focuses on dense computations such as attention and feed-forward networks. A high-speed interconnect schedules data and tasks between the two devices, combining the FPGA's flexibility and low latency with the GPU's parallel-computing strength.
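The division of labor can be illustrated with a minimal sketch. The function names and the software dispatch here are assumptions for exposition only: in the real system these stages run as hardware kernels on the U55C and MI210, not as Python functions.

```python
# Illustrative GPU/FPGA work split; names are hypothetical.
import heapq

def fpga_topk_retrieval(scores, k):
    """Top-K selection: the sparse, control-heavy, memory-bound kind of
    work the paper offloads to the FPGA."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def gpu_dense_attention(q, kv_rows):
    """Regular dense compute (toy dot-product attention) of the kind
    kept on the GPU."""
    weights = [sum(a * b for a, b in zip(q, row)) for row in kv_rows]
    total = sum(weights) or 1.0
    return [w / total for w in weights]

# Pipeline: score -> Top-K on the "FPGA" -> dense attention on the "GPU".
scores = [0.1, 0.9, 0.3, 0.7]
kv = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
idx = fpga_topk_retrieval(scores, k=2)                    # indices 1 and 3
attn = gpu_dense_attention([1.0, 1.0], [kv[i] for i in idx])
print(idx, attn)
```

The point of the split: the Top-K stage does almost no arithmetic but many data-dependent reads, while the attention stage is pure regular arithmetic; each maps to the device that handles that profile best.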


Section 05

Experimental Evidence: Dual Improvement in Performance and Energy Efficiency

Multi-scenario evaluations show that, compared with a pure-GPU baseline, the heterogeneous system achieves a 1.04x to 2.2x speedup (largest in sparse-attention scenarios) and reduces energy consumption by 1.11x to 4.7x (savings are most prominent in memory-intensive tasks), all without any loss of model accuracy. The results also hold on NVIDIA A100 GPUs, confirming the approach's generality.


Section 06

Conclusion and Outlook: Future Directions of Heterogeneous Architectures

This work argues that: 1. general-purpose GPUs struggle to handle all LLM workloads efficiently, so heterogeneous architectures will become mainstream; 2. future AI accelerators must be co-designed with algorithm characteristics; 3. energy-efficiency optimization matters as much as raw performance. This direction will shape the design paradigm of heterogeneous hardware and lay the foundation for efficient, sustainable AI infrastructure.