Implementing CNN and RNN from Scratch: A Complete Deep Learning Project for Image Captioning

This article introduces the Tubes2ML-17-k01 project, which implements Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN/LSTM) from scratch to build a complete Image Captioning system. All core components—including convolutional layers, pooling layers, LSTM units, and embedding layers—are built from scratch using NumPy, then verified and compared with equivalent Keras models.

Tags: Convolutional Neural Networks · Recurrent Neural Networks · LSTM · Image Captioning · Deep Learning · From-Scratch Implementation · NumPy · Keras · Computer Vision · Natural Language Processing
Published 2026-05-16 16:59 · Recent activity 2026-05-16 17:08 · Estimated read: 9 min


Section 02

Background: Challenges in Deep Learning from Theory to Practice

Deep learning courses usually start with mathematical formulas and theoretical derivations. Students understand the principles of backpropagation, gradient descent, and activation functions, but often lack the opportunity to translate these theories into practical code. Most practical projects directly use high-level APIs from TensorFlow or PyTorch, allowing a neural network to be built in just a few lines of code. While efficient, this approach masks the complexity of the underlying mechanisms.

The Tubes2ML-17-k01 project takes a different path: it requires the team to implement the core components of CNN and RNN from scratch, without relying on any built-in layers from deep learning frameworks. This means convolution operations, backpropagation, LSTM gating mechanisms, gradient checks—all of these need to be manually implemented using NumPy. Only after completing the scratch implementation does the team build an equivalent model using Keras for verification and comparison.

The educational value of this approach lies in forcing learners to truly understand the mathematical essence of each operation, rather than treating neural networks as a black box. This article will deeply analyze the technical implementation, architecture design, and experimental results of this project.


Section 03

Project Overview: End-to-End Image Captioning System

Image Captioning is an interdisciplinary task between computer vision and natural language processing—given an image, the model needs to generate a natural language description. This task requires understanding both visual content and language structure, making it a classic challenge in deep learning.

The Tubes2ML-17-k01 project builds a complete image captioning pipeline:

  • CNN Encoder: Uses a convolutional neural network to extract visual features from images
  • RNN/LSTM Decoder: Uses a recurrent neural network to generate descriptive text based on visual features
  • Scratch Implementation + Keras Verification: Each component has both a scratch implementation and an equivalent Keras version

The project uses two datasets:

  • Intel Image Dataset: For CNN image classification training
  • Flickr8k Dataset: For training and evaluation of image captioning

Evaluation metrics include Macro F1 Score (for classification), BLEU-4, and METEOR (for captioning quality).
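As an illustration of the classification metric, Macro F1 averages the per-class F1 scores without weighting by class frequency. The sketch below is a minimal NumPy version (the function name and test labels are illustrative, not from the project):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro F1: compute F1 per class, then take the unweighted mean.
    A minimal sketch; a class with no predictions and no positives gets F1 = 0."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))   # false positives
        fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
score = macro_f1(y_true, y_pred, 3)   # per-class F1: 0.5, 0.8, 0.667
```

Unlike accuracy, this treats rare and frequent classes equally, which is why it is a common choice for the Intel Image classification task.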


Section 04

Convolutional Layer (Conv2D)

The convolutional layer is the core component of CNN. In the scratch implementation, the team needed to manually implement the following operations:

  1. Forward Propagation: Slide the convolution kernel over the input image, compute the dot product of local regions, and generate feature maps. This involves techniques like im2col to organize data efficiently.

  2. Weight Sharing Mechanism: A key feature of convolutional layers is weight sharing—the same convolution kernel slides over the entire image. The project compared Conv2D (with weight sharing) and LocallyConnected2D (without weight sharing), analyzing the differences in parameter count and performance.

  3. Backpropagation: Calculate the gradient of the loss with respect to the convolution kernel weights and input. This requires careful handling of the mathematical properties of convolution operations, including rotating the kernel 180 degrees for gradient convolution.
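The im2col idea from step 1 can be sketched as follows: unfold every kernel-sized patch into a row of a matrix, so the whole forward pass reduces to one matrix multiply. This is a simplified single-channel, stride-1, no-padding sketch, not the project's actual code:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold all valid (kh, kw) patches of a 2-D image into rows of a matrix."""
    H, W = x.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    cols = np.empty((Ho * Wo, kh * kw))
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, (Ho, Wo)

def conv2d_forward(x, kernel):
    """Cross-correlation (deep learning's 'convolution') via im2col:
    each output pixel is the dot product of a patch with the kernel."""
    kh, kw = kernel.shape
    cols, (Ho, Wo) = im2col(x, kh, kw)
    return (cols @ kernel.ravel()).reshape(Ho, Wo)

x = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2))             # 2x2 sum filter
out = conv2d_forward(x, k)      # shape (3, 3); out[0, 0] == 0+1+4+5 == 10
```

The same matrix layout also helps in the backward pass, since the gradient with respect to the kernel becomes a matrix product with the transposed patch matrix.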


Section 05

ReLU Activation Layer

ReLU (Rectified Linear Unit) is the most commonly used activation function, with the formula f(x) = max(0, x). While forward propagation is simple, backpropagation requires correct gradient handling: for positive inputs, the gradient is 1; for negative inputs, the gradient is 0. The team implemented a complete gradient propagation mechanism.
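The whole layer fits in a few lines of NumPy; a minimal sketch of the forward/backward pair described above (class and method names are illustrative):

```python
import numpy as np

class ReLU:
    """ReLU layer: forward is max(0, x); backward passes the upstream
    gradient only where the input was positive."""

    def forward(self, x):
        self.mask = x > 0            # remember where inputs were positive
        return x * self.mask

    def backward(self, grad_out):
        # Gradient is 1 for positive inputs, 0 for negative inputs
        return grad_out * self.mask

relu = ReLU()
x = np.array([[-2.0, 3.0], [0.5, -1.0]])
out = relu.forward(x)                    # [[0., 3.], [0.5, 0.]]
grad = relu.backward(np.ones_like(x))    # [[0., 1.], [1., 0.]]
```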


Section 06

Max Pooling Layer

The max pooling layer reduces the spatial dimension of feature maps by selecting the maximum value within each local window. The scratch implementation requires:

  • Recording the position of each maximum value (mask) during forward propagation
  • Passing gradients only to the positions of maximum values during backpropagation, with gradients for other positions set to 0
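The two requirements above can be sketched for a single-channel map with a 2x2 window and stride 2 (a simplified sketch; the project's implementation also handles batches and channels):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pooling, stride 2, on an (H, W) map with even H and W.
    Returns the pooled map plus a boolean mask recording each max position."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x, dtype=bool)
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i:2*i+2, 2*j:2*j+2]
            out[i, j] = window.max()
            r, c = np.unravel_index(window.argmax(), (2, 2))
            mask[2*i + r, 2*j + c] = True    # remember where the max came from
    return out, mask

def maxpool2x2_backward(grad_out, mask):
    """Route each upstream gradient to its recorded max position; zeros elsewhere."""
    grad_in = np.zeros(mask.shape)
    Ho, Wo = grad_out.shape
    for i in range(Ho):
        for j in range(Wo):
            win = mask[2*i:2*i+2, 2*j:2*j+2]
            grad_in[2*i:2*i+2, 2*j:2*j+2][win] = grad_out[i, j]
    return grad_in

x = np.array([[1.0, 2.0],
              [4.0, 3.0]])
out, mask = maxpool2x2_forward(x)                      # out == [[4.]]
grad_in = maxpool2x2_backward(np.array([[5.0]]), mask) # 5 lands where the 4 was
```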

Section 07

Flatten Layer

The Flatten layer converts multi-dimensional feature maps into one-dimensional vectors, serving as the input to fully connected layers. While simple to implement, it requires correct handling of the batch dimension and gradient shapes.
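A minimal sketch of that batch-aware reshaping: the forward pass keeps the batch axis and collapses the rest, and the backward pass simply restores the cached input shape (names are illustrative):

```python
import numpy as np

class Flatten:
    """Flatten layer: (batch, H, W, C) -> (batch, H*W*C) and back."""

    def forward(self, x):
        self.input_shape = x.shape           # cache for the backward pass
        return x.reshape(x.shape[0], -1)     # keep batch axis, collapse the rest

    def backward(self, grad_out):
        # Gradients just flow back in the original shape
        return grad_out.reshape(self.input_shape)

flat = Flatten()
x = np.zeros((2, 4, 4, 3))
out = flat.forward(x)        # shape (2, 48)
grad = flat.backward(out)    # shape (2, 4, 4, 3)
```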


Section 08

SimpleRNN Unit

Recurrent Neural Networks (RNN) pass information between time steps via hidden states. The forward propagation formula for a SimpleRNN unit is:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

Where W_hh is the weight matrix from hidden state to hidden state, and W_xh is the weight matrix from input to hidden state. Backpropagation requires the Backpropagation Through Time (BPTT) algorithm to accumulate gradients across all time steps.
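Rolling this recurrence over a sequence, while keeping every hidden state for later use by BPTT, can be sketched as follows (shapes and the random initialization here are illustrative assumptions, not the project's settings):

```python
import numpy as np

def simple_rnn_forward(xs, W_xh, W_hh, b, h0):
    """Forward pass of a SimpleRNN over a sequence xs of shape (T, input_dim):
    h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b).
    Returns all hidden states, shape (T, hidden_dim), as BPTT needs them all."""
    h = h0
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                        # sequence length, input dim, hidden dim
hs = simple_rnn_forward(
    rng.normal(size=(T, D)),             # input sequence
    rng.normal(size=(H, D)) * 0.1,       # W_xh: input -> hidden
    rng.normal(size=(H, H)) * 0.1,       # W_hh: hidden -> hidden
    np.zeros(H),                         # bias b
    np.zeros(H),                         # initial hidden state h_0
)
# hs[t] is h_{t+1}; every entry lies in (-1, 1) because of tanh
```

Because tanh squashes values into (-1, 1), repeated multiplication by W_hh during BPTT is what makes SimpleRNN prone to vanishing gradients on long sequences, motivating the LSTM gating discussed later.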