# Running Large Language Models on FPGA Bare-Metal: Analysis of pccx NPU's KV260 Implementation

> Exploring how the pccx FPGA NPU project achieves efficient LLM inference on the AMD Kria KV260 development board, covering key technical details such as W4A8 quantization, GEMM/GEMV data path design, and KV cache scheduling.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-02T04:13:45.000Z
- Last activity: 2026-05-02T04:22:30.829Z
- Popularity: 152.8
- Keywords: FPGA, LLM inference, NPU, quantization, Kria KV260, edge AI, SystemVerilog, GEMM, KV cache
- Page URL: https://www.zingnex.cn/en/forum/thread/fpga-pccx-npukv260
- Canonical: https://www.zingnex.cn/forum/thread/fpga-pccx-npukv260
- Markdown source: floors_fallback

---

## [Introduction] Analysis of pccx NPU's Bare-Metal LLM Execution on KV260

The pccx-FPGA-NPU-LLM-kv260 project is an open-source effort to implement a dedicated Neural Processing Unit (NPU) that runs bare-metal on the AMD Kria KV260 development board, supporting efficient Large Language Model (LLM) inference. It covers key techniques such as W4A8 quantization, GEMM/GEMV data path design, and KV cache scheduling, providing a reference design for edge AI deployment.

## Project Background and Motivation

As LLMs have become widespread, inference efficiency and hardware cost have become bottlenecks for deployment: traditional GPU solutions suffer from high power consumption, high cost, and unstable supply chains. FPGAs, with their low latency, high energy efficiency, and reprogrammability, have emerged as an important option for edge AI inference. The pccx project was born in this context, exploring a bare-metal NPU implementation on the KV260 to support LLM inference.

## Overview of the Kria KV260 Hardware Platform

The KV260 is based on the Zynq UltraScale+ MPSoC architecture, integrating a quad-core Arm Cortex-A53, a dual-core Cortex-R5F, and a Programmable Logic (PL) fabric (roughly 256K system logic cells and about 1,248 DSP slices on the K26 SoM). This heterogeneous architecture lets the PL host custom matrix-compute engines while the Arm cores run lightweight scheduling software, achieving hardware-software co-optimization.

## Analysis of Core Technical Architecture

1. **W4A8 Quantization Strategy**: 4-bit weights compress the model size (to roughly 1/4 of FP16 or 1/8 of FP32), while 8-bit activations preserve numerical stability; dequantization/re-quantization logic bridges the two.
2. **GEMM/GEMV Design**: GEMM uses systolic arrays to accelerate the batched matrix multiplications in FFN layers, while GEMV is optimized for the vector operations in the attention mechanism.
3. **KV Cache Scheduling**: block-based management, on-chip/off-chip hierarchical storage (the active cache resides in BRAM, historical data in DDR), and pipeline parallelism that overlaps computation with memory access.
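To make the W4A8 idea concrete, here is a minimal sketch of symmetric per-group 4-bit weight quantization with 8-bit activation quantization. This is an illustrative scheme only: the actual pccx scaling, grouping, and bit-packing details are not documented here, and all function names are hypothetical.

```python
def quantize_w4(weights, group_size=32):
    """Symmetric per-group 4-bit quantization (illustrative; the real
    pccx packing/scaling may differ). Each group of `group_size`
    weights shares one float scale; integer codes lie in [-8, 7]."""
    assert len(weights) % group_size == 0
    codes, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def quantize_a8(acts):
    """Symmetric per-tensor 8-bit activation quantization."""
    scale = max(abs(a) for a in acts) / 127.0 or 1.0
    return [max(-128, min(127, round(a / scale))) for a in acts], scale

def dequantize(codes, scales, group_size=32):
    """Recover approximate float weights from codes and group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [0.5, -1.0, 0.25, 0.75] * 8          # 32 toy weights, one group
codes, scales = quantize_w4(weights)
recovered = dequantize(codes, scales)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
# rounding error is bounded by half a quantization step
assert max_err <= 0.5 * scales[0] + 1e-12
```

On hardware, the integer codes would be bit-packed (eight 4-bit codes per 32-bit word) and the per-group scales applied by dedicated dequantization logic in the data path.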

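The GEMM/GEMV data path described above boils down to wide integer multiply-accumulate with one float rescale per output element. A toy GEMV in that style, with per-row weight scales and a per-tensor activation scale (all names and the scaling scheme are hypothetical, not pccx's actual design), might look like:

```python
def quant_sym(vals, max_code):
    """Symmetric quantization helper (illustrative)."""
    scale = max(abs(v) for v in vals) / max_code or 1.0
    return [max(-max_code - 1, min(max_code, round(v / scale)))
            for v in vals], scale

def gemv_int(w_codes, w_scales, x_codes, x_scale):
    """Integer GEMV as an accelerator data path might evaluate it:
    int4 weight codes times int8 activation codes accumulate in a wide
    integer, with a single float rescale per output element."""
    out = []
    for row, ws in zip(w_codes, w_scales):
        acc = sum(w * x for w, x in zip(row, x_codes))  # int32 MAC chain in HW
        out.append(acc * ws * x_scale)
    return out

# Toy 2x4 layer: per-row 4-bit weight scales, per-tensor 8-bit activations.
W = [[0.1, -0.2, 0.3, 0.4], [-0.5, 0.25, 0.0, 0.125]]
x = [1.0, -1.0, 0.5, 2.0]
w_codes, w_scales = zip(*(quant_sym(row, 7) for row in W))
x_codes, x_scale = quant_sym(x, 127)
y = gemv_int(w_codes, w_scales, x_codes, x_scale)
y_ref = [sum(w, start=0.0) if False else sum(w * a for w, a in zip(row, x))
         for row, w in zip(W, W)]  # float reference
```

In a systolic-array GEMM, many such MAC chains run in parallel with operands streamed between neighboring processing elements; the GEMV case degenerates to a single row of that array, which is why attention-side vector operations get their own tuned path.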
## Implementation Details and System Integration

- **SystemVerilog RTL Design**: modular partitioning (compute core, memory controller, etc.), parameterized configuration for easy porting, and synthesis-friendly coding for timing and resource utilization.
- **Driver Software**: handles model loading (reading quantized weights from an SD card or the network), inference scheduling (coordinating CPU and NPU tasks), and performance monitoring (latency, throughput, and power measurement), with support for bare-metal deployment.
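The KV-cache scheduling described earlier (active blocks in on-chip BRAM, older blocks spilled to DDR) is exactly the kind of bookkeeping the driver software would handle. A toy two-tier block manager is sketched below; the class, tier sizes, and eviction policy are all hypothetical, since the real pccx driver's data structures are not documented here.

```python
from collections import OrderedDict

class KVCacheManager:
    """Toy two-tier KV-cache block manager (hypothetical). Recently used
    blocks stay in a small 'BRAM' tier; on overflow, the least recently
    used block is evicted to a larger 'DDR' tier and promoted back on
    access."""

    def __init__(self, bram_blocks=4):
        self.bram_blocks = bram_blocks
        self.bram = OrderedDict()   # block_id -> KV data, in LRU order
        self.ddr = {}               # evicted (off-chip) blocks

    def put(self, block_id, kv):
        self.bram[block_id] = kv
        self.bram.move_to_end(block_id)
        if len(self.bram) > self.bram_blocks:
            victim, data = self.bram.popitem(last=False)  # evict LRU block
            self.ddr[victim] = data

    def get(self, block_id):
        if block_id in self.bram:                  # on-chip hit
            self.bram.move_to_end(block_id)
            return self.bram[block_id]
        kv = self.ddr.pop(block_id)                # fetch from DDR...
        self.put(block_id, kv)                     # ...and promote to BRAM
        return kv

# Fill 6 blocks into a 4-block "BRAM": blocks 0 and 1 spill to "DDR".
mgr = KVCacheManager(bram_blocks=4)
for i in range(6):
    mgr.put(i, f"kv-block-{i}")
in_bram = set(mgr.bram)
```

In hardware, the "promotion" step would be a DMA transfer overlapped with computation on other blocks, which is where the pipeline-parallelism point from the architecture section comes in.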

## Application Prospects and Significance

- **Technical Validation**: demonstrates that modern LLMs can feasibly run on edge FPGAs.
- **Cost Advantage**: the KV260 is affordable, giving small and medium-sized enterprises and research institutions a low-cost experimental platform.
- **Customization Potential**: the open-source RTL allows deep customization for specific models or quantization strategies.
- **Energy Efficiency Benchmark**: FPGAs offer better energy efficiency than general-purpose GPUs, suiting battery-powered or thermally constrained scenarios.

## Conclusion

The pccx project exemplifies the open-source hardware community's exploration of AI acceleration, offering full-stack engineering practice spanning algorithm quantization, hardware architecture, RTL design, and software drivers, which makes it a valuable resource for learning AI chip design. As lightweight LLMs mature, such open-source solutions are expected to play an important role in IoT, industrial intelligence, edge computing, and related fields.
