# PCCX: An Open-Source NPU Architecture for Transformer Inference on Edge FPGAs

> A hardware-software co-optimization framework designed specifically for Transformer large language model inference on edge devices. Targeting the KV260 development board, it addresses memory bandwidth bottlenecks via W4A8 quantization, a custom VLIW instruction set, and a split data path.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-04-30T07:41:54.000Z
- Last activity: 2026-04-30T07:53:15.520Z
- Heat score: 145.8
- Keywords: NPU architecture, FPGA acceleration, Transformer inference, edge computing, KV cache, quantized inference, VLIW instruction set, GEMV optimization, Xilinx KV260, open-source hardware
- Page URL: https://www.zingnex.cn/en/forum/thread/pccx-fpgatransformernpu
- Canonical: https://www.zingnex.cn/forum/thread/pccx-fpgatransformernpu
- Markdown source: floors_fallback

---

## [Introduction] PCCX: An Open-Source NPU Architecture for Transformer Inference on Edge FPGAs

PCCX is a hardware-software co-optimization framework designed specifically for Transformer large language model inference on edge devices. Targeting the Xilinx KV260 development board, it addresses memory bandwidth bottlenecks via W4A8 quantization, a custom VLIW instruction set, and a split data path. Its core goal is to accelerate autoregressive decoding inference on resource-constrained edge devices.

## [Background] Challenges of Edge Transformer Inference and PCCX's Positioning

Edge devices offer limited compute and memory resources. During the autoregressive decoding phase of a Transformer, only one token is processed at a time, so the performance bottleneck shifts from compute-bound matrix-matrix multiplication (GEMM) to memory-bandwidth-bound matrix-vector multiplication (GEMV). PCCX selects the Xilinx Kria KV260 SOM as its target platform, with a design philosophy focused on solving the GEMV bottleneck, distinguishing it from general-purpose matrix accelerators.
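A quick way to see why decode is bandwidth-bound is arithmetic intensity: with batch size 1, each weight element is read from memory and used exactly once. The sketch below is illustrative (not taken from the PCCX repository); the byte widths correspond to FP16 and the W4 weight format mentioned later in the post.

```python
# Illustrative sketch: why autoregressive decode is GEMV- and bandwidth-bound.
# Arithmetic intensity = MACs performed per byte of weight traffic for a
# (batch x K) @ (K x N) matmul, where each weight is reused `batch` times.

def arithmetic_intensity(batch: int, bytes_per_weight: float) -> float:
    """MACs per byte of weight traffic; higher means less bandwidth-bound."""
    return batch / bytes_per_weight

# Decode processes one token at a time (batch = 1) -> GEMV.
ai_gemv_fp16 = arithmetic_intensity(batch=1, bytes_per_weight=2.0)   # FP16 weights
ai_gemv_w4   = arithmetic_intensity(batch=1, bytes_per_weight=0.5)   # 4-bit weights
ai_gemm_b32  = arithmetic_intensity(batch=32, bytes_per_weight=2.0)  # prefill-like GEMM

# Batched prefill reuses each weight 32x more than single-token decode,
# which is why decode, not prefill, hits the memory wall first.
assert ai_gemm_b32 / ai_gemv_fp16 == 32
```

This also shows the motivation for W4 weights: halving or quartering bytes-per-weight directly multiplies the achievable MACs per byte of DRAM traffic.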

## [Methodology] PCCX's Architectural Design and Core Components

PCCX uses a split data path to optimize matrix and vector operations:
1. **Three Core Units**: GEMM (32×32 systolic array, 819 GMAC/s @ 400 MHz), GEMV (4 cores × 32-MAC pipelines + reduction tree, 51.2 GMAC/s @ 400 MHz), and SFU/CVO (non-linear operations such as Softmax);
2. **Key Decisions**: W4A8 mixed-precision quantization (1 DSP = 2 MACs), a custom 64-bit VLIW instruction set, a 1.75 MB shared URAM L2 cache, and dual clock domains (control: 250 MHz / computation: 400 MHz).
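The peak-throughput figures above follow directly from the stated unit sizes and the 400 MHz compute clock; the factor of two on the systolic array comes from the W4A8 dual-MAC DSP packing. A quick sanity check using only numbers given in the text:

```python
# Sanity-check of the quoted peak-throughput figures, derived solely from the
# parameters stated in the post: 32x32 systolic array, dual-MAC DSPs (W4A8
# packing), 4 GEMV cores x 32 MACs each, 400 MHz computation clock domain.

F_COMPUTE_GHZ = 0.4   # 400 MHz compute clock
MACS_PER_DSP = 2      # W4A8 packing: 1 DSP issues 2 MACs per cycle

gemm_peak = 32 * 32 * MACS_PER_DSP * F_COMPUTE_GHZ  # 819.2 GMAC/s
gemv_peak = 4 * 32 * F_COMPUTE_GHZ                  # 51.2 GMAC/s

assert abs(gemm_peak - 819.2) < 1e-6
assert abs(gemv_peak - 51.2) < 1e-6
```

Note the 16× gap between the two units: the GEMV cores are deliberately sized for the bandwidth-bound decode path rather than for raw compute.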

## [Evidence] Memory Optimization and Performance Improvement Details

**Memory Hierarchy**: L1 (Block RAM), L2 (1.75 MB shared URAM cache), weight streaming (4 HP AXI ports), KV cache (off-chip);
**KV Cache Optimization**: mitigates the bandwidth bottleneck of a 1.31 GB cache at 32K context via INT8/INT4 quantization, attention-based eviction, and a hard size limit;
**Version Evolution**: v002 addresses v001's pain points (e.g., core separation, distributed HP ports, dual-MAC DSPs), achieving a 3.125× total throughput improvement.
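The KV cache grows linearly with context length, layers, and KV heads, which is why quantization and eviction matter at 32K context. The sketch below uses hypothetical model dimensions (the post does not give the Gemma3N shape behind the 1.31 GB figure); it only illustrates the scaling and the 2×/4× savings from INT8/INT4.

```python
# Back-of-envelope KV-cache sizing. The model shape here is HYPOTHETICAL;
# the post's 1.31 GB figure depends on the actual Gemma3N dimensions,
# which are not stated. The point is the linear scaling and the savings
# from quantizing cache entries.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float) -> float:
    """K and V each store layers * kv_heads * head_dim values per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

SHAPE = dict(layers=30, kv_heads=8, head_dim=256, seq_len=32 * 1024)

fp16 = kv_cache_bytes(**SHAPE, bytes_per_elem=2)    # baseline
int8 = kv_cache_bytes(**SHAPE, bytes_per_elem=1)    # 2x smaller
int4 = kv_cache_bytes(**SHAPE, bytes_per_elem=0.5)  # 4x smaller

assert int8 * 2 == fp16 and int4 * 4 == fp16
```

Since the cache lives off-chip, these reductions cut both footprint and, more importantly for decode, the per-token DRAM traffic.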

## [Development & Ecosystem] Dual-Track Parallel Roadmap and Supporting Toolchain

**Dual-Track Development**: v002 (Gemma3N E4B, 20 tokens/s, Weeks 1-49) and v003 (Gemma4 E4B, 12-15 tokens/s, Weeks 16-52) are developed in parallel;
**Toolchain**: pccx-FPGA-NPU-LLM-kv260 (RTL source code), pccx-lab (simulator and analyzer);
**Documentation**: available in English and Korean, covering an architecture overview, ISA reference, RTL source code, etc.
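For a bandwidth-bound design, targets like 20 tokens/s can be framed with a simple traffic model: every generated token must stream the full weight set plus that step's KV-cache traffic. All numbers below are hypothetical placeholders (the post specifies neither the weight footprint nor the KV260's achievable DDR bandwidth); the function only shows the relationship the targets rest on.

```python
# Rough decode-throughput model for a bandwidth-bound accelerator.
# HYPOTHETICAL inputs: real values depend on the Gemma3N weight layout
# and the KV260's achievable DDR bandwidth, neither given in the post.

def decode_tokens_per_s(weight_bytes: float, kv_bytes_per_token: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Throughput = bandwidth / bytes that must be streamed per token."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes_per_token)

# Example: 2 GB of packed W4 weights, 8 MB of KV traffic per step, 10 GB/s.
tps = decode_tokens_per_s(2e9, 8e6, 10e9)

# Doubling effective bandwidth doubles decode throughput, while halving
# weight bytes (e.g., W4 vs. W8) nearly doubles it - the two levers the
# architecture pulls with HP-port streaming and quantization.
assert decode_tokens_per_s(2e9, 8e6, 20e9) == 2 * tps
```

This is why the roadmap's tokens/s targets track quantization and memory-path changes rather than raw MAC counts.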

## [Conclusion] Project Significance and Open-Source Value of PCCX

PCCX is a notable open-source contribution to edge AI inference. It demonstrates a hardware-software co-design approach to solving deployment bottlenecks and provides a learning and reference platform for researchers and engineers. Its dual-track development strategy is practical and efficient, offering useful lessons for other complex hardware projects.
