# EdgeLLM-Systems: A Research Framework for Large Model Inference Systems on Edge Devices

> EdgeLLM-Systems is a research project focused on large model inference systems in resource-constrained edge environments. It provides a complete toolchain for performance profiling, memory footprint analysis, and inference efficiency evaluation, supporting deployment optimization of models like LLaMA on edge devices.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-13T13:47:25.000Z
- Last activity: 2026-06-13T13:58:21.867Z
- Popularity: 159.8
- Keywords: edge computing, large model inference, LLaMA, KV cache optimization, performance profiling, edge AI, memory optimization, inference efficiency
- Page link: https://www.zingnex.cn/en/forum/thread/edgellm-systems
- Canonical: https://www.zingnex.cn/forum/thread/edgellm-systems

---

## EdgeLLM-Systems: Introduction to the Research Framework for Large Model Inference Systems on Edge Devices

EdgeLLM-Systems is a GitHub project maintained by TianyiLan (repository: https://github.com/TianyiLan/EdgeLLM-Systems, last updated 2026-06-13T13:47:25Z) that studies large model inference systems in resource-constrained edge environments. It provides a toolchain for performance profiling, memory footprint analysis, and inference efficiency evaluation, supporting deployment optimization of models such as LLaMA on edge devices. This write-up covers the target edge platform classes, the three-dimensional measurement framework, the experimental results, the technical toolchain, and future directions, and is intended as a data-driven reference for edge AI deployment.

## Project Background and Motivation

As large language models (LLMs) spread across application scenarios, deploying and running them efficiently on resource-constrained edge devices has become a key challenge. Edge devices typically have limited memory and bandwidth while also needing low latency, so deployment approaches designed for the cloud are hard to carry over directly. EdgeLLM-Systems addresses this by providing a systematic framework for performance profiling, optimization, and heterogeneous hardware acceleration of large model inference in edge environments.

## Core Research Objectives and Target Edge Platforms

EdgeLLM-Systems focuses on two typical classes of edge computing platform:

### Host-centric Edge Platforms
These platforms are built around an x86 or ARM host paired with a discrete GPU or FPGA accelerator card. Typical examples include personal computers, small workstations, and edge servers. The challenge is to load and run large models within a limited memory budget while keeping inference latency acceptable.

### SoC-integrated Edge Platforms
These platforms integrate compute units such as the CPU, GPU, and NPU into a single System-on-Chip (SoC). They are common in smartphones, robots, and embedded AI devices such as Jetson and Orin. Resources are more tightly constrained than on host-centric platforms, so finer-grained optimization strategies are required.

## Three-dimensional Measurement Framework

The project adopts a three-category measurement system aligned with mainstream benchmarks and prior work (MLPerf Inference, MobileLLM, LLM-in-a-Flash):

### Memory Footprint
This category addresses whether a model can be deployed at all. Core metrics include model load memory (`model_load_mem_mb`), peak memory (`peak_mem_mb`), KV cache size (`kv_pkv_final_mb`), and KV payload ratio (`kv_payload_ratio`), among others. Together they show how memory requirements change with context length.

### Inference Efficiency
This category measures inference speed. Core metrics include Time to First Token (TTFT), Time per Output Token (TPOT), total latency (`total_latency_ms`), and throughput (tokens/s), all of which directly affect the user experience of interactive applications.
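As a minimal sketch of how these metrics relate to one another, the snippet below derives them from per-token arrival timestamps. The function name `latency_metrics`, its arguments, and the `ttft_ms`, `tpot_ms`, and `tokens_per_s` keys are illustrative assumptions rather than the project's API (only `total_latency_ms` appears in the source). TPOT is taken here as the mean gap between output tokens after the first, which is a common convention but not necessarily the one the project uses.

```python
from typing import Dict, List


def latency_metrics(request_start_s: float, token_times_s: List[float]) -> Dict[str, float]:
    """Derive TTFT, TPOT, total latency, and throughput from token timestamps.

    request_start_s: time at which the prompt was submitted (seconds).
    token_times_s:   time at which each output token arrived (seconds).
    """
    if not token_times_s:
        raise ValueError("at least one output token is required")

    n_out = len(token_times_s)
    ttft_s = token_times_s[0] - request_start_s     # prefill plus first decode step
    total_s = token_times_s[-1] - request_start_s   # end-to-end latency

    # Mean time per output token over the decode phase (tokens after the first).
    tpot_s = (total_s - ttft_s) / (n_out - 1) if n_out > 1 else 0.0

    return {
        "ttft_ms": ttft_s * 1000.0,
        "tpot_ms": tpot_s * 1000.0,
        "total_latency_ms": total_s * 1000.0,
        "tokens_per_s": n_out / total_s,  # end-to-end throughput, includes TTFT
    }


if __name__ == "__main__":
    # Synthetic timestamps: first token after 0.5 s, then one token every 0.02 s.
    times = [0.5 + 0.02 * i for i in range(51)]
    for key, value in latency_metrics(0.0, times).items():
        print(f"{key}: {value:.2f}")
```

Note that throughput computed end to end falls as TTFT grows even when the decode rate is unchanged, so it is worth stating which definition a reported tokens/s figure uses.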

### Model Quality
This category evaluates how well accuracy is preserved. It uses standard text benchmarks such as MMLU-Pro, GSM8K, HellaSwag, WinoGrande, and TruthfulQA MC1 to check that optimization does not significantly impair model capabilities.
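For orientation, a quality run over these benchmarks could be launched through the lm-evaluation-harness command line roughly as sketched below. This is an assumption-laden illustration, not the project's runner: the helper `build_lm_eval_command`, the output directory, the batch size, and the mapping from benchmark names to harness task identifiers are all hypothetical and should be checked against the installed harness version. Few-shot settings, which affect scores, are left at the harness defaults here.

```python
from typing import List, Sequence

# Assumed harness task identifiers for the benchmarks named in the text.
DEFAULT_TASKS = ("mmlu_pro", "gsm8k", "hellaswag", "winogrande", "truthfulqa_mc1")


def build_lm_eval_command(
    model_id: str,
    tasks: Sequence[str] = DEFAULT_TASKS,
    output_dir: str = "results/quality",   # hypothetical output location
    batch_size: int = 8,                   # hypothetical; tune to available memory
) -> List[str]:
    """Build an argv list for the `lm_eval` CLI using the Hugging Face backend."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id},dtype=float16",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--output_path", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_lm_eval_command("meta-llama/Llama-3.2-1B-Instruct")
    # Printed rather than executed; pass the list to subprocess.run(cmd, check=True)
    # once lm-evaluation-harness is installed and the model is accessible.
    print(" ".join(cmd))
```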

## Experimental Results and Key Findings

The project ran comprehensive tests of LLaMA-3.2-1B-Instruct and LLaMA-3.2-3B-Instruct, establishing FP16 baseline data on an L4 GPU in Google Colab:

### Memory Footprint Analysis
The 1B model runs stably with a 32768-token context at a peak memory of approximately 11.5 GB; under the same conditions the 3B model peaks at 18 GB, close to the capacity limit of the L4 GPU. Key finding: at short context lengths, memory is dominated by model weights, whereas at long context lengths the KV cache and prefill-phase peaks grow significantly and become the main constraint on the deployment boundary.
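The weights-versus-KV-cache trend can be illustrated with a back-of-the-envelope estimate. The sketch below assumes architecture values from the public Llama 3.2 model configs (layer count, KV heads, head dimension) and approximate parameter counts; these figures and all names in the code are assumptions, not values taken from the project. The estimate ignores activations, prefill temporaries, and allocator overhead, so it will undershoot measured peak memory.

```python
from dataclasses import dataclass

BYTES_FP16 = 2
MIB = 1024 ** 2


@dataclass(frozen=True)
class ModelShape:
    name: str
    n_params: float   # approximate parameter count (assumption)
    n_layers: int
    n_kv_heads: int   # grouped-query attention: KV heads, not query heads
    head_dim: int


# Assumed from the public Llama 3.2 configs; verify before relying on them.
LLAMA_1B = ModelShape("Llama-3.2-1B", 1.23e9, 16, 8, 64)
LLAMA_3B = ModelShape("Llama-3.2-3B", 3.21e9, 28, 8, 128)


def weights_mib(model: ModelShape) -> float:
    """FP16 weight footprint in MiB."""
    return model.n_params * BYTES_FP16 / MIB


def kv_cache_mib(model: ModelShape, n_tokens: int, batch: int = 1) -> float:
    """FP16 KV cache footprint in MiB.

    Two tensors (K and V) per layer, each of shape
    [batch, n_kv_heads, n_tokens, head_dim].
    """
    elements = 2 * model.n_layers * model.n_kv_heads * model.head_dim * n_tokens * batch
    return elements * BYTES_FP16 / MIB


if __name__ == "__main__":
    for model in (LLAMA_1B, LLAMA_3B):
        w = weights_mib(model)
        for ctx in (512, 4096, 32768):
            kv = kv_cache_mib(model, ctx)
            print(f"{model.name} ctx={ctx:>6}: weights={w:8.0f} MiB  "
                  f"kv={kv:7.0f} MiB  kv/weights={kv / w:.2f}")
```

Under these assumptions the KV cache is negligible next to the weights at short contexts and grows to about 1 GiB (1B) and 3.5 GiB (3B) at 32768 tokens. The remaining gap to the measured peaks is consistent with the observation that prefill-phase peaks, not the KV cache alone, drive memory pressure at long contexts.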

### Inference Efficiency Performance
The 1B model achieves about 50 tokens/s for short inputs and maintains 39.6 tokens/s for the boundary input (a 32768-token prompt), with a TTFT of 3.46 seconds. The 3B model reaches about 29.6 tokens/s for short inputs, dropping to 13.0 tokens/s for the boundary input, with TTFT rising to 9.04 seconds. The impact of long contexts on inference efficiency is non-linear: both the prefill and decode phases come under more pronounced bandwidth and capacity pressure.
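The boundary-input figures above allow a rough derived comparison. The sketch below treats TTFT as the prefill time for the 32768-token prompt and treats the reported tokens/s as decode throughput; both are simplifying assumptions, and the helper names are illustrative.

```python
PROMPT_TOKENS = 32768

# Boundary-input figures quoted in the text: (TTFT in seconds, tokens/s).
REPORTED = {
    "LLaMA-3.2-1B-Instruct": (3.46, 39.6),
    "LLaMA-3.2-3B-Instruct": (9.04, 13.0),
}


def prefill_tokens_per_s(prompt_tokens: int, ttft_s: float) -> float:
    """Prompt tokens processed per second, assuming TTFT is all prefill."""
    return prompt_tokens / ttft_s


def ms_per_output_token(tokens_per_s: float) -> float:
    """Mean milliseconds per generated token implied by a throughput figure."""
    return 1000.0 / tokens_per_s


if __name__ == "__main__":
    for name, (ttft_s, tps) in REPORTED.items():
        print(f"{name}: prefill ~ {prefill_tokens_per_s(PROMPT_TOKENS, ttft_s):.0f} tok/s, "
              f"decode ~ {ms_per_output_token(tps):.1f} ms/token")
```

Read this way, the 1B model processes roughly 9,470 prompt tokens per second during prefill against roughly 3,625 for the 3B model, while the implied per-token decode time rises from about 25 ms to about 77 ms.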

### Model Quality Verification
Compared with the 1B model, the 3B model shows significant improvements in knowledge reasoning (MMLU-Pro: 33.33% vs 19.25%), mathematical reasoning (GSM8K: 67.40% vs 36.80%), and common sense reasoning (WinoGrande: 73.20% vs 61.40%), consistent with the expected positive correlation between model scale and capability.

## Technical Architecture and Toolchain

The project provides a complete Python toolkit:
- **profiling_core.py**: Core profiling engine that coordinates the collection of various performance metrics
- **memory_profiler.py**: Memory footprint analysis API that tracks memory behaviors such as model loading and KV cache
- **efficiency_profiler.py**: Inference efficiency analysis API that measures latency and throughput metrics
- **kv_cache.py**: Specialized analysis tool for KV cache
- **lm_eval_runner.py**: Model quality evaluation runner based on lm-evaluation-harness

All measurement results are written as CSV, split into raw data and summary data, which simplifies subsequent analysis and visualization.
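The raw-to-summary step can be sketched with the standard library alone. In the example below, the metric columns `peak_mem_mb` and `total_latency_ms` come from the text, while the `model` and `context_len` columns, the `summarize` helper, and the choice of mean and max as summary statistics are assumptions made for illustration. The sample rows are synthetic placeholders, not project results.

```python
import csv
import io
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize(
    raw_csv_text: str,
    metrics: Tuple[str, ...] = ("peak_mem_mb", "total_latency_ms"),
) -> List[Dict[str, object]]:
    """Aggregate per-run rows into one summary row per (model, context_len)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        groups[(row["model"], int(row["context_len"]))].append(row)

    summary = []
    for (model, ctx), rows in sorted(groups.items()):
        out: Dict[str, object] = {"model": model, "context_len": ctx, "n_runs": len(rows)}
        for name in metrics:
            values = [float(r[name]) for r in rows]
            out[f"{name}_mean"] = statistics.mean(values)
            out[f"{name}_max"] = max(values)
        summary.append(out)
    return summary


if __name__ == "__main__":
    # Synthetic placeholder rows in an assumed raw-data layout.
    raw = (
        "model,context_len,peak_mem_mb,total_latency_ms\n"
        "model-a,128,100,10\n"
        "model-a,128,120,14\n"
        "model-b,128,200,30\n"
    )
    rows = summarize(raw)
    writer = csv.DictWriter(io.StringIO(), fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    for row in rows:
        print(row)
```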

## Future Research Directions and Practical Value

### Future Directions
The planned experiment exp002 will extend the work to multimodal models, profiling LLaMA-3.2-11B-Vision in vision-language scenarios and adding detailed metrics for image preprocessing, the vision encoder, the projector, and image tokens.

### Practical Value
- **Edge AI Product Selection**: Estimate how models of different scales will perform on target hardware using public benchmark data
- **Deployment Boundary Evaluation**: Determine the maximum context length and concurrency supported by a specific hardware configuration
- **Optimization Strategy Verification**: Provide a standardized method for evaluating the effectiveness of techniques such as quantization, pruning, and KV cache optimization
- **Hardware Selection Decision**: Compare the performance of different platforms to guide hardware selection for edge devices

## Project Summary

EdgeLLM-Systems represents a pragmatic research path for deploying large models at the edge. Rather than pursuing theoretically optimal solutions, it produces measured performance data through systematic measurement and analysis. As edge AI grows in importance, this data-driven approach offers a foundation for practical applications.
