Zing Forum

PyStacks: A Modular CUDA Neural Network Library Built from Scratch

This article introduces PyStacks, a CUDA neural network library written entirely from scratch. The author uses TensorFlow and CuPy for GPU acceleration, adopts a modular design similar to Keras, and supports features like YOLO-style object detection, custom Concat graph optimization, and full training state saving. It is an excellent learning project for understanding the underlying principles of deep learning.

Tags: neural networks, CUDA, CuPy, deep learning, YOLO, object detection, backpropagation, GPU acceleration, TensorFlow, modular design
Published 2026-05-26 06:15 | Recent activity 2026-05-26 06:23 | Estimated read 5 min

Section 02

Original Author and Source


Section 03

Project Motivation and Background

Over the past year, the author has continuously developed neural network projects, the earliest of which dates back to April 2024. The work proceeded largely by trial and error, with the end goal of building a bounding box regression model similar to YOLOv5.

The early code had many limitations and efficiency issues. Therefore, the author completely rewrote it using a modular design approach, similar to Keras' Sequential API. This architecture allows developers to quickly add new features (such as batch normalization, CSP blocks, multi-scale detection heads) without having to rewrite the entire system every time.

Important Note: This is not a replacement for TensorFlow or PyTorch. The author created it to understand what actually happens inside the "black box". If you value your time, do not use it in serious projects.


Section 04

Fully Modular Layer System

Models are built by combining layers, as in Keras or PyTorch, but every forward pass, backward pass, and weight update is written by hand, with no automatic differentiation.
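To make "written by hand" concrete, here is a minimal sketch of what a layer with a manual forward pass, backward pass, and weight update can look like. It is not taken from PyStacks: the class names (Dense, ReLU, Sequential), the train_step helper, and the use of NumPy as a CPU stand-in for CuPy are all assumptions made for illustration.

```python
import numpy as np  # assumption: NumPy stands in for CuPy so the sketch runs on CPU


class Dense:
    """Fully connected layer with hand-written forward, backward, and update."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                       # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out    # gradient of the loss w.r.t. W
        self.db = grad_out.sum(axis=0)   # gradient of the loss w.r.t. b
        return grad_out @ self.W.T       # gradient passed to the previous layer

    def update(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db


class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask

    def update(self, lr):
        pass                             # no trainable parameters


class Sequential:
    """Runs layers in order; backward walks the same list in reverse."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def update(self, lr):
        for layer in self.layers:
            layer.update(lr)


def train_step(model, x, y, lr=0.01):
    """One step of plain gradient descent on a mean squared error loss."""
    pred = model.forward(x)
    loss = np.mean((pred - y) ** 2)
    grad = 2.0 * (pred - y) / pred.size  # derivative of the mean squared error
    model.backward(grad)
    model.update(lr)
    return loss
```

Because every layer exposes the same three methods, adding a new layer type only means writing its own forward, backward, and update, which is the property the modular design is after.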

Section 05

YOLO-style Object Detection

Supports multi-scale anchor-based detection, including custom loss functions:

  • DIoU, CIoU, SIoU: Improved bounding box regression losses
  • Focal Loss: Addresses foreground-background class imbalance
  • BCE: Binary cross-entropy classification loss
  • Contraction Loss: A penalty applied to inactive anchors to keep their predicted boxes from exploding during early training
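Two of the losses listed above have widely used textbook forms, sketched below with NumPy. The function names, the (x1, y1, x2, y2) box layout, and the default alpha and gamma values are assumptions for illustration, and PyStacks may implement these losses differently.

```python
import numpy as np


def diou_loss(pred, target, eps=1e-9):
    """DIoU loss for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    # Intersection area
    ix1 = np.maximum(pred[:, 0], target[:, 0])
    iy1 = np.maximum(pred[:, 1], target[:, 1])
    ix2 = np.minimum(pred[:, 2], target[:, 2])
    iy2 = np.minimum(pred[:, 3], target[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)

    # Union area and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between the two box centers
    dx = (pred[:, 0] + pred[:, 2]) / 2 - (target[:, 0] + target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3]) / 2 - (target[:, 1] + target[:, 3]) / 2
    center_dist2 = dx ** 2 + dy ** 2

    # Squared diagonal of the smallest box enclosing both boxes
    ew = np.maximum(pred[:, 2], target[:, 2]) - np.minimum(pred[:, 0], target[:, 0])
    eh = np.maximum(pred[:, 3], target[:, 3]) - np.minimum(pred[:, 1], target[:, 1])
    diag2 = ew ** 2 + eh ** 2

    return 1.0 - iou + center_dist2 / (diag2 + eps)


def focal_loss(prob, label, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss on predicted probabilities; label is 0 or 1."""
    p_t = np.where(label == 1, prob, 1.0 - prob)        # probability of the true class
    alpha_t = np.where(label == 1, alpha, 1.0 - alpha)  # class balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)
```

The center-distance term is what distinguishes DIoU from plain IoU loss: boxes that do not overlap still receive a gradient that pulls them toward the target. With gamma set to 0 and alpha to 0.5, the focal loss reduces to half of the ordinary binary cross-entropy.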

Section 06

Concat Graph Optimization

Skip connections use the ConcatStartPoint / ConcatResidualStartPoint / ConcatEndPoint system to avoid storing redundant intermediate activation values. This is an optimization scheme independently designed by the author after the naive implementation consumed too much memory.
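The post does not describe how the start and end points work internally, so the following is only one plausible sketch of the general idea. It assumes the start point keeps a reference to its activation rather than a copy, and the end point routes the skip branch's gradient back to it. SkipLink, SkipStart, SkipEnd, and Scale are hypothetical names, not the PyStacks classes.

```python
import numpy as np


class SkipLink:
    """Shared slot holding one activation reference and one pending gradient."""

    def __init__(self):
        self.activation = None
        self.grad = None


class SkipStart:
    """Marks where a skip connection branches off."""

    def __init__(self, link):
        self.link = link

    def forward(self, x):
        self.link.activation = x          # store a reference, not a copy
        return x

    def backward(self, grad_out):
        # Sum the gradient from the main path with the one from the skip branch
        grad_in = grad_out + self.link.grad
        self.link.grad = None
        return grad_in


class SkipEnd:
    """Concatenates the main path with the activation saved at the start point."""

    def __init__(self, link, axis=1):
        self.link = link
        self.axis = axis

    def forward(self, x):
        self.split = x.shape[self.axis]   # remember where the main path ends
        out = np.concatenate([x, self.link.activation], axis=self.axis)
        self.link.activation = None       # drop the reference once it is used
        return out

    def backward(self, grad_out):
        grad_main, grad_skip = np.split(grad_out, [self.split], axis=self.axis)
        self.link.grad = grad_skip        # delivered when SkipStart.backward runs
        return grad_main


class Scale:
    """Trivial stand-in for the layers between the start and end points."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return x * self.factor

    def backward(self, grad_out):
        return grad_out * self.factor
```

In this sketch the layers still live in one flat list, so the Sequential-style forward and backward loops need no special handling for branches. The only extra state is the small link object shared by each start and end pair.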

Section 07

Custom Training Loop

  • Gradient accumulation
  • AutoClipper: Adaptive gradient clipping (from original paper)
  • Learning rate scheduler
  • Full optimizer state saving and restoration
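Three of the training-loop pieces listed above can be sketched in a few lines each. Everything here is illustrative: the names accumulate, PercentileClipper, and MomentumSGD are hypothetical, and the clipping rule assumes AutoClipper follows the percentile-of-history idea used in adaptive clipping schemes, which the post does not confirm.

```python
import numpy as np


def accumulate(micro_batch_grads):
    """Average per-parameter gradients over several micro-batches."""
    n = len(micro_batch_grads)
    return [sum(gs) / n for gs in zip(*micro_batch_grads)]


class PercentileClipper:
    """Clips the global gradient norm to a percentile of the norms seen so far."""

    def __init__(self, percentile=10.0):
        self.percentile = percentile
        self.history = []

    def clip(self, grads):
        norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
        self.history.append(norm)
        threshold = np.percentile(self.history, self.percentile)
        scale = min(1.0, threshold / (norm + 1e-12))  # never scale gradients up
        return [g * scale for g in grads]


class MomentumSGD:
    """SGD with momentum whose full state can be saved and restored."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = params
        self.lr = lr
        self.momentum = momentum
        self.velocity = [np.zeros_like(p) for p in params]
        self.step_count = 0

    def step(self, grads):
        for p, v, g in zip(self.params, self.velocity, grads):
            v *= self.momentum
            v -= self.lr * g
            p += v                        # in-place update of the parameter array
        self.step_count += 1

    def state_dict(self):
        return {
            "step_count": self.step_count,
            "velocity": [v.copy() for v in self.velocity],
        }

    def load_state_dict(self, state):
        self.step_count = state["step_count"]
        self.velocity = [v.copy() for v in state["velocity"]]
```

Saving the velocity buffers and step count along with the weights is what lets a resumed run continue on the same trajectory. Restoring the weights alone would restart the momentum from zero.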

Section 08

GPU Acceleration

Uses TensorFlow (v1 compatibility mode) and CuPy for GPU computing, with manual memory management.
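As a rough illustration of the CuPy side only, the sketch below moves arrays to the device, computes on them, and then explicitly returns cached memory blocks to the driver. The helper names are hypothetical, the NumPy fallback exists only so the example runs without a GPU, and the library's actual memory strategy is not described in the post.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()      # raises if no usable CUDA device
    xp = cp
except Exception:
    cp = None
    xp = np                               # CPU fallback so the sketch still runs


def to_host(a):
    """Copy an array back to host memory as a NumPy array."""
    return cp.asnumpy(a) if cp is not None else np.asarray(a)


def release_cached_blocks():
    """Free unused blocks in CuPy's memory pools; returns bytes still in use."""
    if cp is None:
        return 0
    pool = cp.get_default_memory_pool()
    pool.free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    return pool.used_bytes()


def matmul_then_free(a, b):
    """Run a matrix product on the device, then release the temporary buffers."""
    a_dev = xp.asarray(a)
    b_dev = xp.asarray(b)
    out = to_host(a_dev @ b_dev)
    del a_dev, b_dev                      # drop references so the blocks can be freed
    release_cached_blocks()
    return out
```

CuPy normally keeps freed device memory in a pool for reuse, so deleting an array does not by itself shrink the process's GPU footprint. Calling free_all_blocks at chosen points is one way to manage that by hand.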