Writing a Convolutional Neural Network from Scratch with CUDA: A Deep Dive into GPU Parallel Computing and the Low-Level Implementation of Deep Learning

A convolutional neural network implemented entirely from scratch in CUDA, without any deep learning framework, demonstrating core principles of GPU parallel computing and performance optimization techniques

Tags: CUDA · Convolutional Neural Network · GPU Parallel Computing · Deep Learning Internals · Performance Optimization · Handwritten Neural Network · MNIST · Backpropagation
Published 2026-05-13 17:26 · Last activity 2026-05-13 17:29 · Estimated read: 7 min

Section 01

Introduction: Writing a CNN from Scratch with CUDA, a Deep Dive into GPU Parallelism and the Internals of Deep Learning

This article introduces the open-source project CUDA-CNN-from-scratch, created by developer claudiocamolese. It implements a convolutional neural network entirely from scratch using CUDA without relying on any deep learning frameworks. The project demonstrates core principles of GPU parallel computing and performance optimization techniques, supports MNIST and Fashion-MNIST datasets, and achieves a test accuracy of 98.08% after 5 training epochs. It is a practical resource for understanding the underlying implementation of deep learning.


Section 02

Project Background and Significance

The high level of abstraction in deep learning frameworks such as PyTorch and TensorFlow often lets developers lose sight of what the underlying computation actually does. After calling model.forward(), few people have an intuitive picture of how gradients flow on the GPU or how memory is managed. This project sets out to answer those questions by implementing a CNN from scratch in CUDA, serving as a practical course in GPU parallel computing and the low-level principles of deep learning.


Section 03

Project Architecture and Network Design

The project implements a classic LeNet-style CNN:

  • First convolution layer: 1→16 channels, 3×3 kernel + ReLU
  • Second convolution layer: 16→32 channels, 3×3 kernel + ReLU
  • 2×2 max pooling layer
  • Flatten operation
  • Fully connected layer
  • Softmax layer

It supports MNIST and Fashion-MNIST classification tasks, achieving a test accuracy of 98.08% after 5 training epochs.

Section 04

Core Implementation: Forward and Backward Propagation

All of the network's computation steps are implemented by hand: the forward pass covers convolution, activation, pooling, and fully connected layers; the backward pass computes each layer's gradients and backpropagates them to update the parameters. A stochastic gradient descent (SGD) optimizer is used, and all computation runs on the GPU, which helps build an understanding of core concepts such as tensor operations, memory access patterns, and thread-block partitioning.
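
To make the shape of these hand-written kernels concrete, here is a minimal sketch (not the project's actual code) of a naive 3×3 convolution with the ReLU fused in, plus a plain SGD parameter update. The single input channel, row-major layout, stride 1, "valid" padding, and all names are assumptions for illustration.

    // Illustrative sketch only: one input channel, row-major layout, stride 1,
    // no padding; launched with one thread per output pixel and
    // blockIdx.z = output channel.
    __global__ void conv3x3_relu_forward(const float* __restrict__ input,   // H x W
                                         const float* __restrict__ weights, // C_out x 3 x 3
                                         const float* __restrict__ bias,    // C_out
                                         float* __restrict__ output,        // C_out x (H-2) x (W-2)
                                         int H, int W, int C_out)
    {
        const int H_out = H - 2, W_out = W - 2;
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
        int c = blockIdx.z;                              // output channel
        if (x >= W_out || y >= H_out || c >= C_out) return;

        float acc = bias[c];
        for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx)
                acc += input[(y + ky) * W + (x + kx)] * weights[(c * 3 + ky) * 3 + kx];

        // ReLU applied before the write-back (the kernel fusion of Section 05).
        output[(c * H_out + y) * W_out + x] = acc > 0.0f ? acc : 0.0f;
    }

    // Plain SGD step, one thread per parameter: w <- w - lr * grad.
    __global__ void sgd_step(float* __restrict__ w, const float* __restrict__ grad,
                             float lr, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[i] -= lr * grad[i];
    }

A typical launch for the convolution would tile a 16×16 thread block over the output plane and set the grid's z dimension to the number of output channels.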


Section 05

Performance Optimization Techniques

The project explores various GPU optimization techniques:

  1. Kernel Fusion: Fuse convolution and ReLU into a single kernel to cut global memory reads and writes (a fused conv + ReLU kernel is sketched in Section 04 above)
  2. Shared Memory Tiling: Use shared memory for the matrix multiplication in the fully connected layers to reduce repeated accesses to global memory (see the tiled matrix-multiply sketch after this list)
  3. Warp-Level Reduction: Use the __shfl_down_sync instruction in the backward-pass convolution kernel to speed up gradient accumulation (see the warp-reduction sketch after this list)
  4. Triple Buffering and CUDA Streams: Overlap asynchronous transfers with computation to hide latency
  5. Pinned Memory: Enable cudaMemcpyAsync for asynchronous transfers to improve data-movement efficiency (a combined streams and pinned-memory sketch follows below)
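
For item 2, here is a minimal sketch of the classic shared-memory tiled matrix multiply (C = A·B with A of size M×K and B of size K×N), as it might be used for a fully connected layer. The 16×16 tile size, row-major layout, and all names are illustrative assumptions rather than the project's code.

    #define TILE 16

    __global__ void matmul_tiled(const float* __restrict__ A,   // M x K
                                 const float* __restrict__ B,   // K x N
                                 float* __restrict__ C,         // M x N
                                 int M, int N, int K)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
            int a_col = t * TILE + threadIdx.x;
            int b_row = t * TILE + threadIdx.y;
            // Each element of A and B is read from global memory once per tile,
            // then reused TILE times from shared memory.
            As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
            __syncthreads();                  // tile fully loaded before use

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                  // done with this tile before overwriting it
        }
        if (row < M && col < N) C[row * N + col] = acc;
    }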
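
For item 3, a minimal sketch of the warp-shuffle reduction pattern built on __shfl_down_sync. It assumes a block size that is a multiple of 32 and that each thread holds one partial contribution to the same scalar gradient; the surrounding kernel and all names are illustrative, not taken from the repository.

    // Sum a value across the 32 lanes of a warp; lane 0 ends up with the total.
    __inline__ __device__ float warp_reduce_sum(float val)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffffu, val, offset);
        return val;
    }

    // Accumulate n partial products into a single gradient value with one
    // atomicAdd per warp instead of one per thread.
    __global__ void accumulate_weight_grad(const float* __restrict__ partials,
                                           float* __restrict__ grad, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? partials[i] : 0.0f;   // no early return: every lane shuffles
        v = warp_reduce_sum(v);
        if ((threadIdx.x & 31) == 0) atomicAdd(grad, v);
    }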
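
For items 4 and 5, a minimal host-side sketch of how pinned staging memory and CUDA streams let the upload of the next batch overlap with the kernels of the current one. It is simplified to two buffers rather than the project's triple buffering, and every name and size here is an assumption for illustration.

    #include <cstring>
    #include <cuda_runtime.h>

    void train_batches(const float* batches, int num_batches, size_t batch_bytes)
    {
        const size_t elems = batch_bytes / sizeof(float);
        float *h_pinned, *d_buf[2];
        cudaMallocHost(&h_pinned, 2 * batch_bytes);   // pinned (page-locked) staging area
        cudaMalloc(&d_buf[0], batch_bytes);
        cudaMalloc(&d_buf[1], batch_bytes);

        cudaStream_t stream[2];
        cudaStreamCreate(&stream[0]);
        cudaStreamCreate(&stream[1]);

        for (int b = 0; b < num_batches; ++b) {
            int s = b & 1;                            // alternate buffer/stream
            // Before reusing slot s, wait for the batch issued two iterations ago.
            cudaStreamSynchronize(stream[s]);

            float* stage = h_pinned + s * elems;
            memcpy(stage, batches + (size_t)b * elems, batch_bytes);

            // cudaMemcpyAsync only overlaps with computation when the host
            // buffer is pinned; the other stream keeps running meanwhile.
            cudaMemcpyAsync(d_buf[s], stage, batch_bytes, cudaMemcpyHostToDevice, stream[s]);
            // ... enqueue forward/backward kernels for this batch on stream[s] ...
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(stream[0]);
        cudaStreamDestroy(stream[1]);
        cudaFree(d_buf[0]);
        cudaFree(d_buf[1]);
        cudaFreeHost(h_pinned);
    }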

Section 06

Performance Comparison with PyTorch

The project includes comparison experiments against PyTorch, analyzing the impact of data-loading parameters such as num_workers (the number of data-loading worker processes) and pin_memory (pinned memory). Although PyTorch's internals are highly optimized, the hand-written CUDA version remains competitive once fully optimized, and it gives full control over the computation, allowing aggressive optimization for specific scenarios.


Section 07

Practical Value and Learning Significance

For anyone studying the internals of deep learning, the project is a rare kind of resource:

  • Understand how CUDA kernels process convolutions in parallel
  • Master efficient GPU implementation of the chain rule in backward propagation
  • Comprehend the impact of memory management strategies on performance
  • Learn the effects and trade-offs of the optimization techniques

The code structure is clear, includes a complete CMake configuration, supports compilation in multi-GPU environments, and is suitable for academic research, teaching, or getting started with high-performance computing.

Section 08

Conclusion

CUDA-CNN-from-scratch shows that a fully functional, high-performance deep learning system can be built in CUDA C++ without PyTorch or TensorFlow. Building it from scratch deepens one's understanding of the underlying algorithms and develops engineering problem-solving skills. In an era of mature AI frameworks, understanding the low-level implementation remains a worthwhile investment in technical depth.