Implementing Edge-side Large Model Inference on RK3588 NPU: A Complete Pipeline from HuggingFace to rkllama

This project demonstrates how to implement a complete edge-side LLM inference solution on Rockchip RK3588/RK3588S NPU, covering model conversion, quantization deployment, and Ollama-compatible API services, providing a reproducible technical path for edge AI devices to run large language models.

Tags: RK3588, NPU, edge inference, llama, Ollama, quantization, w8a8, rkllm, on-device AI, Orange Pi
Published 2026-04-17 05:17 · Recent activity 2026-04-17 05:25 · Estimated read: 7 min

Section 01

Introduction: Full-process Solution for Edge-side Large Model Inference on RK3588 NPU

The project aims to run open-source models such as Google Gemma4 E2B on the RK3588 NPU. It covers model conversion, quantized deployment, and an Ollama-compatible API service in a layered architecture, giving edge AI devices a reproducible path to running large language models, and forms a sister repository with the kernel-driver project rknpu-rk3588.

Section 02

Background: Needs and Challenges of Large Model Inference on Edge Devices

As LLM capabilities improve, demand for edge deployment is growing: it reduces latency, protects privacy, and enables offline service. But running large models on resource-constrained devices is challenging. The RK3588/RK3588S is a high-performance AIoT SoC with a built-in 3-core NPU delivering 6 TOPS, widely used in development boards such as the Orange Pi 5 Pro. Running LLMs efficiently on it is an important topic in edge AI.

Section 03

Technical Architecture: Model Conversion Pipeline and Core Components

The project uses Rockchip's official rkllm-toolkit to convert HuggingFace models into the RK3588-specific .rkllm format. The pipeline has three parts: weight quantization (w8a8, cutting model size and memory use), calibration (representative prompts to limit precision loss), and CI/CD integration (automatic conversion via GitHub Actions, about 16 minutes per run). The converted files deploy directly, with no PyTorch environment needed on the device. The split with the sister repository rknpu-rk3588: that project handles the kernel driver and hardware support, while this one covers the upper-layer toolchain and inference service.
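As an illustration of what w8a8 means numerically, here is a minimal pure-Python sketch of symmetric int8 quantization. It is a toy model of the idea, not the rkllm-toolkit implementation (which also quantizes activations at runtime and operates on real tensors):

```python
def quantize_w8a8(xs):
    """Symmetric int8 quantization: map floats onto [-127, 127] with one shared scale."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error per element is bounded by scale / 2."""
    return [v * scale for v in q]

# A toy "weight row": int8 storage is 4x smaller than float32.
w = [0.82, -1.30, 0.05, 2.54, -0.67]
q, scale = quantize_w8a8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The same trade-off the article describes shows up here: storage drops 4x, at the cost of a bounded rounding error that calibration then tries to keep harmless for the model's outputs.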

Section 04

Inference Service Deployment: Ollama Compatibility and Lightweight Solutions

The project supports two serving options:
1. rkllama (recommended): built on a community project, it exposes an Ollama-compatible HTTP API, so existing Ollama ecosystem tools migrate seamlessly.
2. Lightweight self-developed server: calls the librkllmrt.so runtime directly, suited to scenarios where resources are extremely limited.
Either way, the inference service runs as a systemd unit with auto-start on boot, crash restart, resource isolation, and log rotation.
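The service traits listed above map naturally onto systemd directives. A minimal sketch of such a unit file; the paths, port, user, and memory limit are assumptions for illustration, not the project's actual configuration:

```ini
# /etc/systemd/system/rkllama.service -- illustrative sketch, not the shipped unit
[Unit]
Description=rkllama Ollama-compatible inference server
After=network-online.target

[Service]
ExecStart=/opt/rkllama/server --port 8080   ; hypothetical install path
Restart=on-failure                           ; crash restart
RestartSec=3
User=rkllama                                 ; run as a dedicated user for isolation
MemoryMax=6G                                 ; cap memory on the board
StandardOutput=journal                       ; journald handles log rotation

[Install]
WantedBy=multi-user.target                   ; auto-start on boot once enabled
```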

Section 05

Actual Performance and Operating Environment Requirements

Verified on an Orange Pi 5 Pro (RK3588S, 6 TOPS NPU): Qwen2.5-0.5B-Instruct (w8a8 quantization) decodes at about 9 tok/s and serves the Ollama API. Hardware requirements: Orange Pi 5 Pro with the 3-core NPU. Software dependencies: NPU driver loaded (rknpu 0.9.8), rkllm-toolkit on an x86 workstation, and rkllama or the custom server on the ARM device. Precondition: complete the Quick Start of the rknpu-rk3588 project to ensure the driver is installed correctly.
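As a quick sanity check on what roughly 9 tok/s feels like in practice, a back-of-envelope latency estimate; the prefill_s term is a hypothetical prompt-processing cost, not a measured figure:

```python
def response_time_s(n_tokens: int, tok_per_s: float = 9.0, prefill_s: float = 0.0) -> float:
    """Rough decode-time estimate: prompt-processing cost plus token count over throughput."""
    return prefill_s + n_tokens / tok_per_s

# A 120-token answer at ~9 tok/s takes a bit over 13 seconds of decode time.
t = response_time_s(120)
```

That pace is workable for chat-style streaming output, which is why a small instruct model is a sensible first target on this class of NPU.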

Section 06

Quick Start Guide: Conversion, Deployment, and Verification

Model conversion (x86 workstation): cd conversion → pip install -r requirements.txt → python convert.py --model Qwen2.5-0.5B-Instruct --output model.rkllm; alternatively, trigger conversion via GitHub Actions (requires GITHUB_TOKEN).
Board-side deployment: cd serving → sudo ./install.sh → sudo systemctl enable --now rkllama.
Verification: call the localhost:8080/api/generate endpoint with curl to test generation.
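For the verification step, a small Python sketch of the Ollama-style request a client would send to /api/generate. The model name is an assumption, and the HTTP call itself is defined but only meant to be run on the board:

```python
import json
from urllib import request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """JSON body in the Ollama /api/generate shape: model, prompt, stream flag."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

# "qwen2.5-0.5b-instruct" is a hypothetical model tag for illustration.
payload = build_generate_payload("qwen2.5-0.5b-instruct", "Hello from the edge!")

def generate(url: str, body: bytes) -> dict:
    """POST the payload and decode the JSON response."""
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# On the board: generate("http://localhost:8080/api/generate", payload)
```

Because the endpoint follows the Ollama API shape, any existing Ollama client library should work the same way once pointed at the board's address.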

Section 07

Technical Challenges and Solutions

1. Conversion resource limits: converting large models (e.g., Gemma4 E2B) takes more resources than GitHub's free runners provide → use a local workstation or paid CI.
2. Quantization precision loss: INT8 quantization can reduce accuracy → balance speed against precision with calibration datasets and parameter tuning; Qwen2.5-0.5B at w8a8 has tested stable on dialogue tasks.
3. Ecosystem compatibility: edge NPU ecosystems are fragmented → staying compatible with the Ollama API reuses the existing ecosystem, and a clear architecture keeps migration easy.
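On the calibration point, the starting material is a set of prompts representative of real traffic. A hedged sketch of preparing such a set as a JSON file; the file name and the "input"-record layout are assumptions for illustration, not rkllm-toolkit's documented format:

```python
import json
import os
import tempfile

# Hypothetical calibration set: prompts resembling expected deployment traffic.
calib_prompts = [
    "Summarize the advantages of on-device inference.",
    "Translate 'edge computing' into plain language.",
    "List three uses of a 6 TOPS NPU.",
]

# Write one record per prompt so the converter can replay them during calibration.
path = os.path.join(tempfile.gettempdir(), "calib_dataset.json")
with open(path, "w") as f:
    json.dump([{"input": p} for p in calib_prompts], f, indent=2)

with open(path) as f:
    loaded = json.load(f)
```

The closer these prompts are to the model's real workload, the better the quantizer's activation statistics, which is what keeps w8a8 stable on dialogue tasks.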

Section 08

Application Scenarios and Project Summary

Application scenarios: offline intelligent assistants (remote areas or restricted sites), low-latency interaction (real-time applications), privacy protection (medical/financial compliance), and cost optimization (replacing cloud API fees). Summary: the gemma-rk3588 project demonstrates the complete path from a HuggingFace model to RK3588 NPU deployment, giving edge AI developers a reproducible reference. As NPU compute and quantization techniques improve, running more capable LLMs at the edge becomes increasingly practical, and this project's open-source practice offers valuable engineering experience.