Running Large Language Models on FPGA Bare-Metal: Analysis of pccx NPU's KV260 Implementation

Exploring how the pccx FPGA NPU project achieves efficient LLM inference on the AMD Kria KV260 development board, covering key technical details such as W4A8 quantization, GEMM/GEMV data path design, and KV cache scheduling.

Tags: FPGA · LLM inference · NPU · quantization · Kria KV260 · edge AI · SystemVerilog · GEMM · KV cache
Published 2026-05-02 12:13 · Recent activity 2026-05-02 12:22 · Estimated read 6 min

Section 01

[Introduction] Analysis of pccx NPU's Bare-Metal LLM Execution on KV260

The pccx-FPGA-NPU-LLM-kv260 project is an open-source attempt to implement a dedicated Neural Processing Unit (NPU) running bare-metal on the AMD Kria KV260 development board to support efficient Large Language Model (LLM) inference. It covers key technologies such as W4A8 quantization, GEMM/GEMV data path design, and KV cache scheduling, offering a reference design for edge AI deployment.


Section 02

Project Background and Motivation

As LLMs have become widespread, inference efficiency and hardware cost have become the bottlenecks for deployment: GPU-based solutions suffer from high power consumption, high cost, and unstable supply chains. FPGAs, with their low latency, high energy efficiency, and reprogrammability, have emerged as an important option for edge AI inference. The pccx project was born in this context, exploring a bare-metal NPU implementation on the KV260 to support LLM inference.


Section 03

Overview of the Kria KV260 Hardware Platform

The KV260 is based on the Zynq UltraScale+ MPSoC architecture, integrating a quad-core ARM Cortex-A53, a dual-core Cortex-R5, and a Programmable Logic (PL) fabric (roughly 256K system logic cells plus abundant DSP resources). This heterogeneous architecture lets the PL implement a custom matrix-computation acceleration engine while the ARM cores run lightweight scheduling software, achieving hardware-software co-optimization.
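
The article doesn't show how the ARM side talks to the PL, but on Zynq-class parts a bare-metal program typically drives a PL accelerator through memory-mapped registers. The sketch below is purely illustrative: the base address, register offsets, and npu_run are hypothetical placeholders, not the project's actual register map.

```c
#include <stdint.h>

/* Hypothetical AXI-Lite register map for a PL matrix engine.
 * Base address and offsets are placeholders; real values come
 * from the address map of the actual design.                   */
#define NPU_BASE      0xA0000000UL
#define REG_CTRL      0x00  /* bit0 = start            */
#define REG_STATUS    0x04  /* bit0 = done             */
#define REG_SRC_ADDR  0x08  /* DDR address of operands */
#define REG_DST_ADDR  0x0C  /* DDR address of result   */

static inline void reg_write(uint32_t off, uint32_t val) {
    *(volatile uint32_t *)(NPU_BASE + off) = val;
}
static inline uint32_t reg_read(uint32_t off) {
    return *(volatile uint32_t *)(NPU_BASE + off);
}

/* Kick one job on the PL engine and spin until it completes. */
void npu_run(uint32_t src, uint32_t dst) {
    reg_write(REG_SRC_ADDR, src);
    reg_write(REG_DST_ADDR, dst);
    reg_write(REG_CTRL, 1);                  /* start       */
    while ((reg_read(REG_STATUS) & 1) == 0)  /* poll "done" */
        ;
}
```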


Section 04

Analysis of Core Technical Architecture

1. W4A8 Quantization Strategy: 4-bit weights compress the model to roughly 1/4 to 1/8 of its original size, while 8-bit activations preserve numerical stability; dequantization/re-quantization logic bridges the two formats (a minimal sketch follows this list).
2. GEMM/GEMV Data Path Design: GEMM uses a systolic array to accelerate the batched matrix multiplications in the FFN layers, while GEMV is optimized for the vector operations in the attention mechanism.
3. KV Cache Scheduling: block-based management, hierarchical on-chip/off-chip storage (the active cache resides in BRAM, historical data in DDR), and pipeline parallelism (overlapping computation with memory access).
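
The article does not give the project's actual quantization code, so the W4A8 data path is illustrated here with a minimal C sketch under assumed conventions: symmetric signed 4-bit weights packed two per byte, per-group weight scales, and a per-tensor activation scale. The names GROUP, w4a8_dot, and requant_a8 are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <math.h>

#define GROUP 32  /* assumed quantization group size */

/* Unpack the i-th signed 4-bit weight (range -8..7) from the
 * packed buffer: low nibble first, then sign-extend to 8 bits. */
static inline int8_t unpack_w4(const uint8_t *packed, size_t i) {
    uint8_t nib = (i & 1) ? (packed[i >> 1] >> 4) : (packed[i >> 1] & 0x0F);
    return (int8_t)(nib << 4) >> 4;
}

/* One output element: int8 activations x packed int4 weights.
 * Accumulate each group in int32, then dequantize with the
 * group's weight scale. n is assumed a multiple of GROUP.      */
float w4a8_dot(const int8_t *act, float act_scale,
               const uint8_t *w_packed, const float *w_scales,
               size_t n) {
    float out = 0.0f;
    for (size_t g = 0; g < n / GROUP; g++) {
        int32_t acc = 0;
        for (size_t j = 0; j < GROUP; j++) {
            size_t i = g * GROUP + j;
            acc += (int32_t)act[i] * (int32_t)unpack_w4(w_packed, i);
        }
        out += (float)acc * act_scale * w_scales[g];
    }
    return out;
}

/* Re-quantize a float result back to int8 for the next layer. */
int8_t requant_a8(float x, float next_scale) {
    long v = lrintf(x / next_scale);
    if (v > 127)  v = 127;
    if (v < -128) v = -128;
    return (int8_t)v;
}
```

Calling w4a8_dot once per output row is exactly the GEMV pattern the decode phase relies on; in an RTL implementation, the inner multiply-accumulate loop is what the DSP/systolic fabric would parallelize, with the scaling applied once per group.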

Section 05

Implementation Details and System Integration

  • SystemVerilog RTL Design: modular decomposition (compute core, memory controller, etc.), parameterized configuration for easy porting, and synthesis-friendly coding with optimized timing and resource utilization.
  • Driver Software: responsible for model loading (reading quantized weights from SD card or network), inference scheduling (coordinating CPU and NPU tasks), and performance monitoring (latency, throughput, and power measurement); supports bare-metal deployment (a scheduling sketch follows this list).
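
The article describes the driver only at this level, so the following is a rough, hypothetical sketch of what a bare-metal inference-scheduling loop with latency monitoring might look like. decode_tokens, npu_gemv, and kv_cache_append are invented stand-ins; only the cntvct_el0/cntfrq_el0 reads are real AArch64 system registers available on the Cortex-A53.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in stubs for the real driver calls (hypothetical names). */
static void npu_gemv(int layer)    { (void)layer; /* dispatch one GEMV job to the PL */ }
static void kv_cache_append(int t) { (void)t;     /* commit this step's K/V blocks   */ }

/* Read the AArch64 generic-timer counter and its frequency. */
static inline uint64_t cycles(void) {
    uint64_t t;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
}
static inline uint64_t timer_hz(void) {
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
}

/* Decode n_tokens autoregressively, timing each step. */
void decode_tokens(int n_tokens, int n_layers) {
    for (int t = 0; t < n_tokens; t++) {
        uint64_t t0 = cycles();
        for (int l = 0; l < n_layers; l++)
            npu_gemv(l);          /* attention + FFN GEMVs for layer l */
        kv_cache_append(t);       /* grow the KV cache by one position */
        uint64_t us = (cycles() - t0) * 1000000ULL / timer_hz();
        printf("token %d: %llu us\n", t, (unsigned long long)us);
    }
}
```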

Section 06

Application Prospects and Significance

  • Technical Validation: proves the feasibility of running modern LLMs on edge FPGAs.
  • Cost Advantage: the KV260 is affordable, giving small and medium-sized enterprises and research institutions a low-cost experimental platform.
  • Customization Potential: the open-source RTL allows deep customization for specific models or quantization strategies.
  • Energy Efficiency Benchmark: FPGAs offer better energy efficiency than general-purpose GPUs, suiting battery-powered or thermally constrained scenarios.

Section 07

Conclusion

The pccx project represents the open-source hardware community's exploration of AI acceleration, offering full-stack engineering practice from algorithm quantization to hardware architecture, and from RTL design to software drivers, making it a high-quality resource for learning AI chip design. As LLMs continue to become lighter-weight, such open-source solutions are expected to play an important role in IoT, industrial intelligence, edge computing, and other fields.