
PCCX: An Open-Source NPU Architecture for Transformer Inference on Edge FPGAs

A hardware-software co-optimization framework designed specifically for Transformer large language model inference on edge devices. Targeting the KV260 development board, it addresses memory bandwidth bottlenecks via W4A8 quantization, a custom VLIW instruction set, and a split data path.

NPU architecture · FPGA acceleration · Transformer inference · Edge computing · KV cache · Quantized inference · VLIW instruction set · GEMV optimization · Xilinx KV260 · Open-source hardware
Published 2026-04-30 15:41 · Recent activity 2026-04-30 15:53 · Estimated read: 5 min

Section 01

[Introduction] PCCX: An Open-Source NPU Architecture for Transformer Inference on Edge FPGAs

PCCX is a hardware-software co-optimization framework designed specifically for Transformer large language model inference on edge devices. Targeting the Xilinx KV260 development board, it addresses memory bandwidth bottlenecks via W4A8 quantization, a custom VLIW instruction set, and a split data path. Its core goal is to accelerate autoregressive decoding inference on resource-constrained edge devices.
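
As a software-side illustration of the W4A8 scheme named above, here is a minimal Python sketch: 4-bit weights with per-output-channel scales, 8-bit activations with a per-tensor scale, integer accumulation, then rescaling. The `quantize_w4a8` helper and its scaling choices are assumptions for illustration, not the project's actual quantizer.

```python
import numpy as np

def quantize_w4a8(weights: np.ndarray, activation: np.ndarray) -> np.ndarray:
    """Toy W4A8 linear layer: INT4 weights (per-output-channel scale), INT8 activations."""
    # Per-output-channel scale so the largest |weight| in each row maps to the INT4 limit (7).
    w_scale = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    w_q = np.clip(np.round(weights / w_scale), -8, 7).astype(np.int8)  # packed to 4 bits on-chip

    # Per-tensor scale for the activation vector, INT8 range [-128, 127].
    a_scale = np.abs(activation).max() / 127.0
    a_q = np.clip(np.round(activation / a_scale), -128, 127).astype(np.int8)

    # Integer GEMV with a wide accumulator, then rescale back to floating point.
    acc = w_q.astype(np.int32) @ a_q.astype(np.int32)
    return acc * (w_scale.squeeze(1) * a_scale)

# Example: one 4096x4096 projection applied to a single decode-step activation vector.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
x = rng.standard_normal(4096).astype(np.float32)
err = np.linalg.norm(quantize_w4a8(W, x) - W @ x) / np.linalg.norm(W @ x)
print(f"relative error: {err:.3f}")
```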


Section 02

[Background] Challenges of Edge Transformer Inference and PCCX's Positioning

Edge devices offer limited compute, memory, and bandwidth. During the autoregressive decoding phase of a Transformer, only one token is processed at a time, so the performance bottleneck becomes memory-bandwidth-bound matrix-vector multiplication (GEMV) rather than compute-bound matrix-matrix multiplication (GEMM). PCCX targets the Xilinx Kria KV260 SOM, and its design philosophy centers on solving this GEMV bottleneck, which distinguishes it from general-purpose matrix accelerators.
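
To make the bottleneck concrete: during decoding every weight byte must be streamed from DRAM once per generated token, so the token rate is capped by bandwidth rather than by available MACs. A back-of-the-envelope sketch, where the model size and the ~19 GB/s bandwidth figure are illustrative assumptions, not measured KV260 numbers:

```python
def decode_tokens_per_sec(weight_bytes: float, dram_gb_per_s: float) -> float:
    """Upper bound on decode rate when every weight byte is streamed once per token."""
    return dram_gb_per_s * 1e9 / weight_bytes

# Illustrative assumptions: a 2B-parameter model packed to 4-bit weights (~1 GB)
# and ~19 GB/s of effective DRAM bandwidth.
weight_bytes = 2e9 * 0.5
bandwidth_gb_s = 19.0
print(f"bandwidth-bound ceiling: {decode_tokens_per_sec(weight_bytes, bandwidth_gb_s):.1f} tokens/s")
# ~19 tokens/s. Adding MACs does not raise this ceiling; only smaller weights
# (quantization) or more bandwidth do, which is why decode performance hinges on
# the GEMV path rather than on GEMM throughput.
```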


Section 03

[Methodology] PCCX's Architectural Design and Core Components

PCCX uses a split data path to optimize matrix and vector operations:

  1. Three Core Units: GEMM (32×32 systolic array, 819 GMAC/s @ 400 MHz), GEMV (4 cores × 32-MAC pipelines + reduction tree, 51.2 GMAC/s @ 400 MHz), SFU/CVO (handles non-linear operations such as Softmax);
  2. Key Decisions: W4A8 mixed-precision quantization (1 DSP = 2 MACs), a custom 64-bit VLIW instruction set, a 1.75 MB shared URAM L2 cache, and dual clock domains (control: 250 MHz / computation: 400 MHz). The quoted peak throughputs follow directly from these parameters (see the sketch below).
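
A quick sketch reproducing the quoted peak-throughput figures from the listed parameters, assuming each of the 32×32 PEs issues two MACs per cycle via the dual-MAC DSP packing:

```python
CLOCK_HZ = 400e6  # computation clock domain

# GEMM unit: 32x32 systolic array, assuming 2 MACs per PE via the dual-MAC DSP packing.
gemm_macs_per_cycle = 32 * 32 * 2
print(f"GEMM peak: {gemm_macs_per_cycle * CLOCK_HZ / 1e9:.1f} GMAC/s")  # 819.2

# GEMV unit: 4 cores x 32-MAC pipelines feeding a reduction tree.
gemv_macs_per_cycle = 4 * 32
print(f"GEMV peak: {gemv_macs_per_cycle * CLOCK_HZ / 1e9:.1f} GMAC/s")  # 51.2
```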

Section 04

[Evidence] Memory Optimization and Performance Improvement Details

  1. Memory Hierarchy: L1 (Block RAM), L2 (1.75 MB shared URAM cache), weight streaming over 4 HP AXI ports, KV cache held off-chip;
  2. KV Cache Optimization: mitigates the bandwidth bottleneck of a 1.31 GB cache at a 32K context via INT8/INT4 quantization, attention-based eviction, and a hard cache limit (see the sketch after this list);
  3. Version Evolution: v002 addresses v001's pain points (e.g., core separation, distributed HP ports, dual-MAC DSPs), achieving a 3.125x total throughput improvement.
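
For context on the KV-cache numbers, a small sketch of the cache-size accounting and of what INT8/INT4 quantization buys back; the decoder shape used here (`layers`, `kv_heads`, `head_dim`) is a hypothetical configuration for illustration, not the project's actual model:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> float:
    """Bytes needed to hold K and V for every layer at a given context length."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical decoder shape, chosen only to show the scaling; not the
# project's actual model configuration.
cfg = dict(layers=32, kv_heads=2, head_dim=256, context_len=32 * 1024)

for label, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{label}: {kv_cache_bytes(**cfg, bytes_per_elem=nbytes) / 2**30:.2f} GiB")
# FP16 -> INT4 cuts the off-chip cache 4x; combined with attention-based eviction
# and a hard cache limit, this is what keeps 32K-context KV traffic within the
# available DRAM bandwidth.
```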


Section 05

[Development & Ecosystem] Dual-Track Parallel Roadmap and Supporting Toolchain

  1. Dual-Track Development: v002 (Gemma3N E4B, 20 tokens/s, Weeks 1-49) and v003 (Gemma4 E4B, 12-15 tokens/s, Weeks 16-52) are developed in parallel;
  2. Toolchain: pccx-FPGA-NPU-LLM-kv260 (RTL source code) and pccx-lab (simulator and analyzer);
  3. Documentation: available in English and Korean, covering an architecture overview, an ISA reference, the RTL source code, and more.


Section 06

[Conclusion] Project Significance and Open-Source Value of PCCX

PCCX is an important open-source contribution in the field of edge AI inference. It demonstrates a hardware-software co-design approach to solving deployment bottlenecks, providing a learning and reference platform for researchers/engineers. The dual-track development strategy is practical and efficient, offering valuable insights for complex hardware projects.