# Implementing CNN and RNN from Scratch: A Complete Deep Learning Project for Image Captioning

> This article introduces the Tubes2ML-17-k01 project, which implements Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN/LSTM) from scratch to build a complete Image Captioning system. All core components—including convolutional layers, pooling layers, LSTM units, and embedding layers—are built from scratch using NumPy, then verified and compared with equivalent Keras models.

- 板块: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- 发布时间: 2026-05-16T08:59:57.000Z
- 最近活动: 2026-05-16T09:08:44.937Z
- 热度: 163.8
- 关键词: 卷积神经网络, 循环神经网络, LSTM, 图像描述生成, 深度学习, 从底层实现, NumPy, Keras, 计算机视觉, 自然语言处理
- 页面链接: https://www.zingnex.cn/en/forum/thread/cnn-rnn-48d81301
- Canonical: https://www.zingnex.cn/forum/thread/cnn-rnn-48d81301
- Markdown 来源: floors_fallback

---

## Introduction / Main Post: Implementing CNN and RNN from Scratch: A Complete Deep Learning Project for Image Captioning

This article introduces the Tubes2ML-17-k01 project, which implements Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN/LSTM) from scratch to build a complete Image Captioning system. All core components—including convolutional layers, pooling layers, LSTM units, and embedding layers—are built from scratch using NumPy, then verified and compared with equivalent Keras models.

## Background: Challenges in Deep Learning from Theory to Practice

Deep learning courses usually start with mathematical formulas and theoretical derivations. Students understand the principles of backpropagation, gradient descent, and activation functions, but often lack the opportunity to translate these theories into practical code. Most practical projects directly use high-level APIs from TensorFlow or PyTorch, allowing a neural network to be built in just a few lines of code. While efficient, this approach masks the complexity of the underlying mechanisms.

The Tubes2ML-17-k01 project takes a different path: it requires the team to implement the core components of CNN and RNN from scratch, without relying on any built-in layers from deep learning frameworks. This means convolution operations, backpropagation, LSTM gating mechanisms, gradient checks—all of these need to be manually implemented using NumPy. Only after completing the scratch implementation does the team build an equivalent model using Keras for verification and comparison.

The educational value of this approach lies in forcing learners to truly understand the mathematical essence of each operation, rather than treating neural networks as a black box. This article will deeply analyze the technical implementation, architecture design, and experimental results of this project.

## Project Overview: End-to-End Image Captioning System

Image Captioning is an interdisciplinary task between computer vision and natural language processing—given an image, the model needs to generate a natural language description. This task requires understanding both visual content and language structure, making it a classic challenge in deep learning.

The Tubes2ML-17-k01 project builds a complete image captioning pipeline:

- **CNN Encoder**: Uses a convolutional neural network to extract visual features from images
- **RNN/LSTM Decoder**: Uses a recurrent neural network to generate descriptive text based on visual features
- **Scratch Implementation + Keras Verification**: Each component has both a scratch implementation and an equivalent Keras version

The project uses two datasets:
- **Intel Image Dataset**: For CNN image classification training
- **Flickr8k Dataset**: For training and evaluation of image captioning

Evaluation metrics include Macro F1 Score (for classification), BLEU-4, and METEOR (for captioning quality).

## Convolutional Layer (Conv2D)

The convolutional layer is the core component of CNN. In the scratch implementation, the team needed to manually implement the following operations:

1. **Forward Propagation**: Slide the convolution kernel over the input image, compute the dot product of local regions, and generate feature maps. This involves techniques like im2col to organize data efficiently.

2. **Weight Sharing Mechanism**: A key feature of convolutional layers is weight sharing—the same convolution kernel slides over the entire image. The project compared Conv2D (with weight sharing) and LocallyConnected2D (without weight sharing), analyzing the differences in parameter count and performance.

3. **Backpropagation**: Calculate the gradient of the loss with respect to the convolution kernel weights and input. This requires careful handling of the mathematical properties of convolution operations, including rotating the kernel 180 degrees for gradient convolution.

## ReLU Activation Layer

ReLU (Rectified Linear Unit) is the most commonly used activation function, with the formula f(x) = max(0, x). While forward propagation is simple, backpropagation requires correct gradient handling: for positive inputs, the gradient is 1; for negative inputs, the gradient is 0. The team implemented a complete gradient propagation mechanism.

## Max Pooling Layer

The max pooling layer reduces the spatial dimension of feature maps by selecting the maximum value within each local window. The scratch implementation requires:

- Recording the position of each maximum value (mask) during forward propagation
- Passing gradients only to the positions of maximum values during backpropagation, with gradients for other positions set to 0

## Flatten Layer

The Flatten layer converts multi-dimensional feature maps into one-dimensional vectors, serving as input for fully connected layers. While simple to implement, it requires correct handling of batch dimensions and gradient shapes.

## SimpleRNN Unit

Recurrent Neural Networks (RNN) pass information between time steps via hidden states. The forward propagation formula for a SimpleRNN unit is:

h_t = tanh(W_hh · h_{t-１} + W_xh · x_t + b)

Where W_hh is the weight matrix from hidden state to hidden state, and W_xh is the weight matrix from input to hidden state. Backpropagation requires the Backpropagation Through Time (BPTT) algorithm to accumulate gradients across all time steps.