
HeteroInfer-Lab: A Research Framework for Large Model Inference Optimization on Edge Devices

A systematic research project on large model inference at the edge, focusing on KV cache management, heterogeneous acceleration, and performance bottleneck analysis

Tags: Edge Inference · KV Cache Optimization · Heterogeneous Computing · LLM Performance Analysis · FPGA Acceleration · CUDA Optimization
Published 2026-05-02 00:45 · Recent activity 2026-05-02 00:52 · Estimated read 5 min

Section 01

Introduction / Main Floor

A systematic research project on large model inference at the edge, focusing on KV cache management, heterogeneous acceleration, and performance bottleneck analysis


Section 02

Project Background and Research Motivation

As Large Language Models (LLMs) spread into more and more application scenarios, achieving efficient inference on resource-constrained edge devices has become a key challenge. Traditional cloud-based inference suffers from high latency, privacy risks, and heavy dependence on the network, while deploying large models directly on edge devices is limited by available compute and memory.

HeteroInfer-Lab is a research framework created to address this pain point. Initiated by TianyiLan, the project aims to systematically study and optimize large model inference performance across heterogeneous hardware environments, from single GPUs, edge servers, and small workstations to FPGAs and NPUs.


Section 03

Core Research Directions

The project focuses on the real performance bottlenecks of LLM inference and lays out a clear research path:


Section 04

1. Prefill and Decode Performance Analysis

Large model inference consists of two phases: Prefill (processing the input prompt) and Decode (generating output tokens one by one). The two phases have very different performance characteristics: Prefill is compute-bound, while Decode is limited by memory bandwidth. The project establishes a profiling framework to accurately measure key metrics such as TTFT (Time to First Token) and TPOT (Time Per Output Token).
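As a minimal sketch of how such measurements can be taken on the GPU side, the snippet below times a prefill launch followed by token-by-token decode launches using CUDA events. The stub kernels only stand in for the real attention/MLP work; their names, shapes, and launch configuration are illustrative assumptions, not part of the actual framework.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernels: in a real run these would be the full prefill / decode passes.
__global__ void prefill_stub(const float* in, float* out, int n) { /* ... */ }
__global__ void decode_step_stub(const float* in, float* out, int n) { /* ... */ }

int main() {
    const int n_output_tokens = 64;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, 1024 * sizeof(float));
    cudaMalloc(&out, 1024 * sizeof(float));

    cudaEvent_t start, first_token, done;
    cudaEventCreate(&start);
    cudaEventCreate(&first_token);
    cudaEventCreate(&done);

    cudaEventRecord(start);
    prefill_stub<<<32, 256>>>(in, out, 1024);       // whole prompt processed at once
    decode_step_stub<<<1, 256>>>(in, out, 1024);    // first generated token
    cudaEventRecord(first_token);
    for (int t = 1; t < n_output_tokens; ++t)       // remaining tokens, one per launch
        decode_step_stub<<<1, 256>>>(in, out, 1024);
    cudaEventRecord(done);
    cudaEventSynchronize(done);

    float ttft_ms = 0.f, total_ms = 0.f;
    cudaEventElapsedTime(&ttft_ms, start, first_token);              // TTFT
    cudaEventElapsedTime(&total_ms, start, done);
    float tpot_ms = (total_ms - ttft_ms) / (n_output_tokens - 1);    // average TPOT

    printf("TTFT %.3f ms, TPOT %.3f ms\n", ttft_ms, tpot_ms);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Measured this way, TTFT reflects the compute-bound prefill phase, while TPOT exposes the bandwidth-bound per-token cost of decode.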


Section 05

2. KV Cache Management and Optimization

The KV Cache is a core data structure in Transformer inference: it stores the key and value tensors of the attention mechanism so they do not have to be recomputed for every new token. During autoregressive generation, its memory usage grows linearly with sequence length and often becomes the bottleneck for edge deployment. The project studies the memory overhead characteristics of the KV Cache in depth and explores optimization strategies such as compression, quantization, and recomputation.
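The linear growth is easy to quantify. Assuming, for illustration, a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, FP16 cache), each token adds 2 × layers × kv_heads × head_dim × 2 bytes ≈ 512 KiB of cache, so a 4096-token context already occupies about 2 GiB:

```cpp
#include <cstdio>
#include <cstdint>

// Per-sequence KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim
// * seq_len * bytes_per_element * batch.
uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads, uint64_t head_dim,
                        uint64_t seq_len, uint64_t bytes_per_elem, uint64_t batch) {
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch;
}

int main() {
    // Assumed Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128, FP16.
    uint64_t bytes = kv_cache_bytes(32, 32, 128, 4096, 2, 1);
    printf("KV cache @ 4096 tokens: %.2f GiB\n",
           bytes / (1024.0 * 1024.0 * 1024.0));   // prints 2.00 GiB
    return 0;
}
```

Quantizing the cache to lower precision or recomputing part of it scales this figure down directly, which is why these strategies matter on memory-limited edge devices.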


Section 06

3. CUDA Decode Kernel Optimization

To address the memory-bandwidth bottleneck of the Decode phase, the project plans to develop custom CUDA kernels that improve decoding efficiency by optimizing memory access patterns, exploiting shared memory, and fusing operators.
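The sketch below illustrates the flavor of such a kernel on one simplified step: computing the attention scores of the current query against the cached keys with coalesced reads and a shared-memory reduction. It is not the project's actual kernel; a production version would also fuse the softmax and value aggregation into the same launch.

```cuda
#include <cuda_runtime.h>

// One thread block per cached position t: threads cooperatively compute the
// dot product q · K[t] with coalesced reads of K and a shared-memory reduction.
__global__ void decode_scores(const float* __restrict__ q,        // [head_dim]
                              const float* __restrict__ k_cache,  // [seq_len, head_dim]
                              float* __restrict__ scores,         // [seq_len]
                              int head_dim) {
    extern __shared__ float partial[];
    int t = blockIdx.x;                       // cached position handled by this block
    const float* k = k_cache + (size_t)t * head_dim;

    float acc = 0.f;
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x)
        acc += q[d] * k[d];                   // consecutive threads read consecutive floats
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        scores[t] = partial[0];
}

// Example launch for a cache of seq_len positions:
//   decode_scores<<<seq_len, 128, 128 * sizeof(float)>>>(d_q, d_k, d_scores, head_dim);
```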


Section 07

4. Heterogeneous Execution and Hardware Collaboration

The long-term vision of the project is collaborative computing across GPUs, FPGAs, and NPUs: by allocating each computing task to the hardware unit best suited to it, the strengths of the different accelerators can be fully exploited. For example, compute-intensive matrix operations are assigned to GPUs, while low-latency control logic is handled by FPGAs.
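A purely illustrative sketch of the kind of static placement table such a scheduler might start from is shown below. Only the GPU-for-matrix-operations and FPGA-for-control-logic assignments come from the text above; the remaining task kinds and placements are assumptions made for the example.

```cpp
// Illustrative task taxonomy and placement table for a heterogeneous scheduler.
enum class TaskKind { MatMul, Attention, ControlLogic, PrePostProcessing };
enum class Device   { GPU, FPGA, NPU, CPU };

Device place(TaskKind kind) {
    switch (kind) {
        case TaskKind::MatMul:            return Device::GPU;   // compute-intensive matrix ops (from the text)
        case TaskKind::ControlLogic:      return Device::FPGA;  // low-latency control paths (from the text)
        case TaskKind::Attention:         return Device::NPU;   // assumption: offload to a dedicated accelerator
        case TaskKind::PrePostProcessing: return Device::CPU;   // assumption: tokenization and glue logic
    }
    return Device::CPU;
}
```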


Section 08

5. FPGA HLS and Dataflow Generation

The project plans to explore High-Level Synthesis (HLS) technology to automatically generate FPGA dataflow architectures from algorithm descriptions. This involves optimizing MLIR intermediate representations and developing dedicated compiler passes.