As Large Language Models (LLMs) see adoption across a widening range of applications, achieving efficient inference on resource-constrained edge devices has become a key challenge. Traditional cloud-based inference suffers from high latency, privacy risks, and heavy dependence on network connectivity, while deploying large models directly on edge devices is constrained by limited compute and memory.
HeteroInfer-Lab is a research framework created to address this problem. Initiated by TianyiLan, the project systematically studies and optimizes large-model inference performance across heterogeneous hardware environments, from single-GPU machines, edge servers, and small workstations to FPGAs and NPUs.