Zing Forum


Pipelined Binary-Weight LeNet-5 Accelerator on Pynq-Z2: An XNOR-POPcount Optimization Scheme

This article introduces a pipelined binary-weight LeNet-5 convolutional neural network accelerator implemented on the Pynq-Z2 FPGA board. The project uses XNOR-POPcount arithmetic to reduce the cost of convolution, targets the MNIST handwritten digit recognition task, and demonstrates a hardware acceleration approach for edge AI inference.

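Tags: FPGA, Pynq-Z2, LeNet-5, binary neural network, XNOR-POPcount, hardware acceleration, MNIST, edge computing, convolutional neural network, pipelined architecture
Published 2026-06-06 18:43 | Recent activity 2026-06-06 18:51 | Estimated read 9 min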

Section 01

Introduction: Core Overview of the Pipelined Binary LeNet-5 Accelerator on Pynq-Z2

This project was developed by bharathkrishna0 and released on GitHub (link: https://github.com/bharathkrishna0/Pipelined-Binary-Weight-LeNet-5-Accelerator-on-Pynq-Z2-with-XNOR-POPcount, release date: 2026-06-06). It is a pipelined binary-weight LeNet-5 convolutional neural network accelerator implemented on the Pynq-Z2 FPGA platform that uses XNOR-POPcount to reduce the cost of convolution. It targets the MNIST handwritten digit recognition task and demonstrates a hardware acceleration approach for edge AI inference.


Section 02

Project Background and Motivation

As AI applications spread, running neural networks efficiently on resource-constrained edge devices has become a challenge. Conventional floating-point networks are accurate but have large computational and memory footprints, which makes real-time operation on embedded FPGAs difficult. Binary Neural Networks (BNNs) quantize weights and activations to ±1, which sharply reduces computational complexity and enables efficient inference on FPGAs. Against this background, the project implements a pipelined LeNet-5 accelerator on the Pynq-Z2 development board and uses XNOR-POPcount to speed up convolution.


Section 03

LeNet-5 Architecture and Binary Transformation

LeNet-5 is a classic CNN architecture proposed by Yann LeCun and colleagues in 1998 for handwritten digit recognition. It is commonly described as 2 convolutional layers, 2 pooling layers, and 3 fully connected layers, and its small parameter count makes it a good starting point for FPGA acceleration. This project binarizes the network: 32-bit floating-point weights are compressed to single-bit weights, cutting weight storage by a factor of 32, and multiplications are replaced by XNOR operations, which are combined with POPcount to compute the convolutions.
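The 32x storage figure can be checked with a small sketch. The sign-based binarization rule, the helper names `binarize` and `pack_weights`, and the 16×6×5×5 layer shape (a fully connected second convolution layer, as in many LeNet-5 reimplementations) are assumptions for illustration and are not taken from the project's repository.

```python
import numpy as np

def binarize(weights):
    """Map real-valued weights to ±1 by sign (zero is treated as +1)."""
    return np.where(weights >= 0, 1, -1).astype(np.int8)

def pack_weights(binary):
    """Store ±1 weights as one bit each: +1 -> 1, -1 -> 0, eight per byte."""
    bits = (binary.ravel() > 0).astype(np.uint8)
    return np.packbits(bits)

# Hypothetical second conv layer: 16 filters, 6 input channels, 5x5 kernels.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 6, 5, 5)).astype(np.float32)

packed = pack_weights(binarize(w))
ratio = w.nbytes / packed.nbytes
print(w.nbytes, packed.nbytes, ratio)
```

With 2,400 weights the float32 tensor occupies 9,600 bytes and the packed form 300 bytes, which is the factor of 32 quoted above. In hardware the packed bits would typically sit in on-chip block RAM or LUT RAM.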


Section 04

XNOR-POPcount Optimization Principle

XNOR-POPcount is the core optimization for BNNs. Conventional convolution needs a large number of multiply-accumulate (MAC) operations, which consume considerable logic and DSP resources. When both weights and activations are restricted to ±1 and encoded as single bits (+1 as 1, -1 as 0), multiplication is equivalent to XNOR: the result is 1 if the signs match and 0 otherwise. An XNOR gate occupies only a fraction of a LUT on an FPGA, far less than a multiplier. Counting the 1s in the XNOR output (POPcount) then yields the accumulation: for N inputs with a count of p, the ±1 dot product is 2p - N. FPGAs generally have no dedicated POPcount primitive; the count is built from LUT-based adder or compressor trees, which can be pipelined so that a new multi-bit count is produced every cycle.
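The equivalence between a ±1 dot product and XNOR followed by POPcount can be illustrated in software. This is a minimal sketch, assuming the encoding +1 as bit 1 and -1 as bit 0; the function names `pack` and `xnor_popcount_dot` are hypothetical and the code models the arithmetic only, not the project's HDL.

```python
def pack(values):
    """Pack a sequence of ±1 values into an integer, one bit per value."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_word, w_word, n):
    """Dot product of two n-element ±1 vectors given in packed form."""
    mask = (1 << n) - 1
    xnor = ~(a_word ^ w_word) & mask   # bit is 1 where the signs agree
    ones = bin(xnor).count("1")        # POPcount
    return 2 * ones - n                # agreements minus disagreements

activations = [1, -1, 1, 1, -1]
weights = [1, 1, -1, 1, -1]

reference = sum(a * w for a, w in zip(activations, weights))
result = xnor_popcount_dot(pack(activations), pack(weights), len(weights))
print(reference, result)
```

Both values agree because each matching sign contributes +1 and each mismatch contributes -1, so p matches out of N give p - (N - p) = 2p - N.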


Section 05

Advantages of Pynq-Z2 Platform and Pipelined Architecture Design

The Pynq-Z2 is based on the Xilinx Zynq-7020 SoC, which integrates a dual-core ARM Cortex-A9 processing system (PS) and programmable logic (PL), making it well suited to software-hardware co-design. The PS side loads inputs and weights, runs the Python code, and handles preprocessing and postprocessing. The PL side implements the pipelined convolution engine, executes the XNOR-POPcount operations, and provides the parallel compute. Pynq's Python Overlay mechanism lowers the barrier to development.

The project uses a pipelined design in which different layers process different samples at the same time (for example, the first layer processes sample N while the second layer processes sample N-1). This raises hardware utilization and lets throughput approach the rate set by the slowest stage. To get there, the latencies of the stages must be balanced, data dependencies and timing constraints must be handled, and a quantization recovery mechanism is used to limit accuracy loss.
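The effect of layer-level pipelining on latency and throughput can be sketched with a simple timing model. The stage cycle counts below are made-up numbers, `pipeline_finish_times` is a hypothetical helper, and the model assumes each stage holds one sample at a time with buffering between stages; none of this is measured from the project.

```python
def pipeline_finish_times(stage_cycles, n_samples):
    """Cycle at which each sample leaves the last stage of a linear pipeline."""
    n_stages = len(stage_cycles)
    finish = [[0] * n_stages for _ in range(n_samples)]
    for s in range(n_samples):
        for k in range(n_stages):
            # A stage starts once its input has arrived from the previous
            # stage and the stage itself has finished the previous sample.
            input_ready = finish[s][k - 1] if k > 0 else 0
            stage_free = finish[s - 1][k] if s > 0 else 0
            finish[s][k] = max(input_ready, stage_free) + stage_cycles[k]
    return [row[-1] for row in finish]

stage_cycles = [4, 6, 3]   # hypothetical per-layer latencies in cycles
times = pipeline_finish_times(stage_cycles, 4)
sequential_total = sum(stage_cycles) * 4
print(times, sequential_total)
```

In this model the first result appears after 13 cycles and later results follow every 6 cycles, the latency of the slowest stage, for a total of 31 cycles against 52 without overlap. This is why balancing stage latencies matters: shortening the 6-cycle stage is the only change that raises steady-state throughput.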


Section 06

MNIST Dataset Validation and Performance

The project is validated on the MNIST handwritten digit dataset (60,000 training images and 10,000 test images, each 28×28 grayscale). A binary LeNet-5 is less accurate on MNIST than its floating-point counterpart, but it usually stays above 95%, which is sufficient for many edge scenarios. On the Pynq-Z2 it reaches millisecond-level inference latency, meeting real-time requirements.


Section 07

Practical Application Value and Development Insights

The project is a useful reference for several scenarios: industrial visual inspection (defect identification and part classification on production lines at lower hardware cost), intelligent security (face and behavior recognition on edge cameras, reducing bandwidth use and privacy risk), and IoT devices (local inference on battery-powered nodes, extending battery life). For newcomers to FPGA acceleration, it provides a complete reference implementation that covers the whole hardware acceleration flow: writing the HDL code, wrapping it in a Pynq Python interface, and moving from simulation to on-board testing.


Section 08

Project Summary and Technical Key Points

Key technical implementation points:

1. Binary training requires special strategies, such as the straight-through estimator (STE) for gradient propagation.
2. Data flow optimization: on-chip caching reduces external memory access and maximizes data reuse.
3. Bit-width design: intermediate feature maps and accumulation results need enough bits to prevent overflow.
4. Software-hardware collaboration: the ARM cores handle control-intensive tasks and the FPGA fabric handles computation-intensive tasks.

The project combines a BNN with an FPGA platform and, through XNOR-POPcount and a pipelined design, achieves efficient computation while keeping accuracy at a usable level. It offers a cost-effective path for edge AI and a practical case study for hardware acceleration research and teaching. Future AI chip development is likely to apply this kind of co-optimization more widely, accelerating the shift of AI from the cloud to the edge.
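The bit-width point (item 3 above) can be made concrete with a short calculation. This sketch assumes both operands are ±1, so a dot product over N inputs lies in [-N, N], and it uses fan-in values from a common LeNet-5 layout (5×5 kernels, 6 and 16 feature maps, 400-120-84-10 fully connected layers); the project's actual layer sizes and activation widths may differ.

```python
def popcount_bits(fan_in):
    """Unsigned bits needed to hold a popcount in the range [0, fan_in]."""
    return fan_in.bit_length()

def signed_acc_bits(fan_in):
    """Two's-complement bits needed for a ±1 dot product in [-fan_in, fan_in]."""
    return fan_in.bit_length() + 1

# Assumed number of inputs summed per output in each layer.
ASSUMED_FAN_IN = {"conv1": 25, "conv2": 150, "fc1": 400, "fc2": 120, "fc3": 84}

for name, n in ASSUMED_FAN_IN.items():
    print(name, n, popcount_bits(n), signed_acc_bits(n))
```

Under these assumptions the widest accumulator is 10 bits (fc1), far narrower than a 32-bit MAC datapath. If activations are kept at more than one bit, as a binary-weight design may do, the range grows with the activation width and the accumulators must be widened accordingly.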