Implementing Edge-side Large Model Inference on RK3588 NPU: A Complete Pipeline from HuggingFace to rkllama

This project demonstrates how to implement a complete edge-side LLM inference solution on Rockchip RK3588/RK3588S NPU, covering model conversion, quantization deployment, and Ollama-compatible API services, providing a reproducible technical path for edge AI devices to run large language models.

Tags: RK3588, NPU, edge inference, llama, Ollama, quantization, w8a8, rkllm, on-device AI, Orange Pi
Published 2026-04-17 05:17 · Recent activity 2026-04-17 05:25 · Estimated read: 7 min

Section 01

Introduction: Full-process Solution for Edge-side Large Model Inference on RK3588 NPU

The project aims to run open-source models such as Google Gemma4 E2B on the RK3588 NPU. It covers model conversion, quantized deployment, and an Ollama-compatible API service in a layered architecture, giving edge AI devices a reproducible path to running large language models, and forms a sister repository with the kernel-driver project rknpu-rk3588.

Section 02

Background: Needs and Challenges of Large Model Inference on Edge Devices

As LLM capabilities improve, demand for edge deployment is growing: it reduces latency, protects privacy, and enables offline service. But running large models on resource-constrained devices is challenging. The RK3588/RK3588S is a high-performance AIoT SoC with a built-in 3-core NPU delivering 6 TOPS, widely used in development boards such as the Orange Pi 5 Pro. Running LLMs efficiently on it is an important topic in edge AI.

Section 03

Technical Architecture: Model Conversion Pipeline and Core Components

The project uses Rockchip's official rkllm-toolkit to convert HuggingFace models into the RK3588-specific .rkllm format. The pipeline has three parts: weight quantization (w8a8, cutting model size and memory use), calibration (representative prompts to limit precision loss), and CI/CD integration (automatic conversion via GitHub Actions, about 16 minutes per run). The converted files deploy directly, with no PyTorch environment needed on the device. The split with the sister repository rknpu-rk3588: that project handles the kernel driver and hardware support, while this one covers the upper-layer toolchain and inference service.
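As an illustration of what w8a8 means numerically, here is a minimal pure-Python sketch of symmetric int8 quantization. It is a toy model of the idea, not the rkllm-toolkit implementation (which also quantizes activations at runtime and operates on real tensors):

```python
def quantize_w8a8(xs):
    """Symmetric int8 quantization: map floats onto [-127, 127] with one shared scale."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error per element is bounded by scale / 2."""
    return [v * scale for v in q]

# A toy "weight row": int8 storage is 4x smaller than float32.
w = [0.82, -1.30, 0.05, 2.54, -0.67]
q, scale = quantize_w8a8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The same trade-off the article describes shows up here: storage drops 4x, at the cost of a bounded rounding error that calibration then tries to keep harmless for the model's outputs.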

Section 04

Inference Service Deployment: Ollama Compatibility and Lightweight Solutions

The project supports two serving options:
1. rkllama (recommended): built on a community project, it exposes an Ollama-compatible HTTP API, so existing Ollama ecosystem tools migrate seamlessly.
2. Lightweight self-developed server: calls the librkllmrt.so runtime directly, suited to scenarios where resources are extremely limited.
Either way, the inference service runs as a systemd unit with auto-start on boot, crash restart, resource isolation, and log rotation.
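The service traits listed above map naturally onto systemd directives. A minimal sketch of such a unit file; the paths, port, user, and memory limit are assumptions for illustration, not the project's actual configuration:

```ini
# /etc/systemd/system/rkllama.service -- illustrative sketch, not the shipped unit
[Unit]
Description=rkllama Ollama-compatible inference server
After=network-online.target

[Service]
ExecStart=/opt/rkllama/server --port 8080   ; hypothetical install path
Restart=on-failure                           ; crash restart
RestartSec=3
User=rkllama                                 ; run as a dedicated user for isolation
MemoryMax=6G                                 ; cap memory on the board
StandardOutput=journal                       ; journald handles log rotation

[Install]
WantedBy=multi-user.target                   ; auto-start on boot once enabled
```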

Section 05

Actual Performance and Operating Environment Requirements

Verified on an Orange Pi 5 Pro (RK3588S, 6 TOPS NPU): Qwen2.5-0.5B-Instruct (w8a8 quantization) decodes at about 9 tok/s and serves the Ollama API. Hardware requirements: Orange Pi 5 Pro with the 3-core NPU. Software dependencies: NPU driver loaded (rknpu 0.9.8), rkllm-toolkit on an x86 workstation, and rkllama or the custom server on the ARM device. Precondition: complete the Quick Start of the rknpu-rk3588 project to ensure the driver is installed correctly.
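As a quick sanity check on what roughly 9 tok/s feels like in practice, a back-of-envelope latency estimate; the prefill_s term is a hypothetical prompt-processing cost, not a measured figure:

```python
def response_time_s(n_tokens: int, tok_per_s: float = 9.0, prefill_s: float = 0.0) -> float:
    """Rough decode-time estimate: prompt-processing cost plus token count over throughput."""
    return prefill_s + n_tokens / tok_per_s

# A 120-token answer at ~9 tok/s takes a bit over 13 seconds of decode time.
t = response_time_s(120)
```

That pace is workable for chat-style streaming output, which is why a small instruct model is a sensible first target on this class of NPU.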

Section 06

Quick Start Guide: Conversion, Deployment, and Verification

Model conversion (x86 workstation): cd conversion → pip install -r requirements.txt → python convert.py --model Qwen2.5-0.5B-Instruct --output model.rkllm; alternatively, trigger conversion via GitHub Actions (requires GITHUB_TOKEN).
Board-side deployment: cd serving → sudo ./install.sh → sudo systemctl enable --now rkllama.
Verification: call the localhost:8080/api/generate endpoint with curl to test generation.
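For the verification step, a small Python sketch of the Ollama-style request a client would send to /api/generate. The model name is an assumption, and the HTTP call itself is defined but only meant to be run on the board:

```python
import json
from urllib import request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """JSON body in the Ollama /api/generate shape: model, prompt, stream flag."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

# "qwen2.5-0.5b-instruct" is a hypothetical model tag for illustration.
payload = build_generate_payload("qwen2.5-0.5b-instruct", "Hello from the edge!")

def generate(url: str, body: bytes) -> dict:
    """POST the payload and decode the JSON response."""
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# On the board: generate("http://localhost:8080/api/generate", payload)
```

Because the endpoint follows the Ollama API shape, any existing Ollama client library should work the same way once pointed at the board's address.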

Section 07

Technical Challenges and Solutions

1. Conversion resource limits: converting large models (e.g., Gemma4 E2B) takes more resources than GitHub's free runners provide → use a local workstation or paid CI.
2. Quantization precision loss: INT8 quantization can reduce accuracy → balance speed against precision with calibration datasets and parameter tuning; Qwen2.5-0.5B at w8a8 has tested stable on dialogue tasks.
3. Ecosystem compatibility: edge NPU ecosystems are fragmented → staying compatible with the Ollama API reuses the existing ecosystem, and a clear architecture keeps migration easy.
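On the calibration point, the starting material is a set of prompts representative of real traffic. A hedged sketch of preparing such a set as a JSON file; the file name and the "input"-record layout are assumptions for illustration, not rkllm-toolkit's documented format:

```python
import json
import os
import tempfile

# Hypothetical calibration set: prompts resembling expected deployment traffic.
calib_prompts = [
    "Summarize the advantages of on-device inference.",
    "Translate 'edge computing' into plain language.",
    "List three uses of a 6 TOPS NPU.",
]

# Write one record per prompt so the converter can replay them during calibration.
path = os.path.join(tempfile.gettempdir(), "calib_dataset.json")
with open(path, "w") as f:
    json.dump([{"input": p} for p in calib_prompts], f, indent=2)

with open(path) as f:
    loaded = json.load(f)
```

The closer these prompts are to the model's real workload, the better the quantizer's activation statistics, which is what keeps w8a8 stable on dialogue tasks.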

Section 08

Application Scenarios and Project Summary

Application scenarios: offline intelligent assistants (remote areas or restricted sites), low-latency interaction (real-time applications), privacy protection (medical/financial compliance), and cost optimization (replacing cloud API fees). Summary: the gemma-rk3588 project demonstrates the complete path from a HuggingFace model to RK3588 NPU deployment, giving edge AI developers a reproducible reference. As NPU compute and quantization techniques improve, running more capable LLMs at the edge becomes increasingly practical, and this project's open-source practice offers valuable engineering experience.