Implementing CNN and RNN from Scratch: A Complete Deep Learning Project for Image Captioning

This article introduces the Tubes2ML-17-k01 project, which implements Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN/LSTM) from scratch to build a complete Image Captioning system. All core components—including convolutional layers, pooling layers, LSTM units, and embedding layers—are built from scratch using NumPy, then verified and compared with equivalent Keras models.

Tags: Convolutional Neural Networks · Recurrent Neural Networks · LSTM · Image Captioning · Deep Learning · From-Scratch Implementation · NumPy · Keras · Computer Vision · Natural Language Processing
Published 2026-05-16 16:59 · Recent activity 2026-05-16 17:08 · Estimated read: 9 min


Section 02

Background: Challenges in Deep Learning from Theory to Practice

Deep learning courses usually start with mathematical formulas and theoretical derivations. Students understand the principles of backpropagation, gradient descent, and activation functions, but often lack the opportunity to translate these theories into practical code. Most practical projects directly use high-level APIs from TensorFlow or PyTorch, allowing a neural network to be built in just a few lines of code. While efficient, this approach masks the complexity of the underlying mechanisms.

The Tubes2ML-17-k01 project takes a different path: it requires the team to implement the core components of CNN and RNN from scratch, without relying on any built-in layers from deep learning frameworks. This means convolution operations, backpropagation, LSTM gating mechanisms, gradient checks—all of these need to be manually implemented using NumPy. Only after completing the scratch implementation does the team build an equivalent model using Keras for verification and comparison.

The educational value of this approach lies in forcing learners to truly understand the mathematical essence of each operation, rather than treating neural networks as a black box. This article will deeply analyze the technical implementation, architecture design, and experimental results of this project.


Section 03

Project Overview: End-to-End Image Captioning System

Image Captioning is an interdisciplinary task between computer vision and natural language processing—given an image, the model needs to generate a natural language description. This task requires understanding both visual content and language structure, making it a classic challenge in deep learning.

The Tubes2ML-17-k01 project builds a complete image captioning pipeline:

  • CNN Encoder: Uses a convolutional neural network to extract visual features from images
  • RNN/LSTM Decoder: Uses a recurrent neural network to generate descriptive text based on visual features
  • Scratch Implementation + Keras Verification: Each component has both a scratch implementation and an equivalent Keras version

The project uses two datasets:

  • Intel Image Dataset: For CNN image classification training
  • Flickr8k Dataset: For training and evaluation of image captioning

Evaluation metrics include Macro F1 Score (for classification), BLEU-4, and METEOR (for captioning quality).
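As an illustration of the classification metric, Macro F1 averages the per-class F1 scores without weighting by class frequency. The sketch below is a minimal NumPy version (the function name and test labels are illustrative, not from the project):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro F1: compute F1 per class, then take the unweighted mean.
    A minimal sketch; a class with no predictions and no positives gets F1 = 0."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))   # false positives
        fn = np.sum((y_pred != c) & (y_true == c))   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
score = macro_f1(y_true, y_pred, 3)   # per-class F1: 0.5, 0.8, 0.667
```

Unlike accuracy, this treats rare and frequent classes equally, which is why it is a common choice for the Intel Image classification task.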


Section 04

Convolutional Layer (Conv2D)

The convolutional layer is the core component of CNN. In the scratch implementation, the team needed to manually implement the following operations:

  1. Forward Propagation: Slide the convolution kernel over the input image, compute the dot product of local regions, and generate feature maps. This involves techniques like im2col to organize data efficiently.

  2. Weight Sharing Mechanism: A key feature of convolutional layers is weight sharing—the same convolution kernel slides over the entire image. The project compared Conv2D (with weight sharing) and LocallyConnected2D (without weight sharing), analyzing the differences in parameter count and performance.

  3. Backpropagation: Calculate the gradient of the loss with respect to the convolution kernel weights and input. This requires careful handling of the mathematical properties of convolution operations, including rotating the kernel 180 degrees for gradient convolution.
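The im2col idea from step 1 can be sketched as follows: unfold every kernel-sized patch into a row of a matrix, so the whole forward pass reduces to one matrix multiply. This is a simplified single-channel, stride-1, no-padding sketch, not the project's actual code:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold all valid (kh, kw) patches of a 2-D image into rows of a matrix."""
    H, W = x.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    cols = np.empty((Ho * Wo, kh * kw))
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, (Ho, Wo)

def conv2d_forward(x, kernel):
    """Cross-correlation (deep learning's 'convolution') via im2col:
    each output pixel is the dot product of a patch with the kernel."""
    kh, kw = kernel.shape
    cols, (Ho, Wo) = im2col(x, kh, kw)
    return (cols @ kernel.ravel()).reshape(Ho, Wo)

x = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2))             # 2x2 sum filter
out = conv2d_forward(x, k)      # shape (3, 3); out[0, 0] == 0+1+4+5 == 10
```

The same matrix layout also helps in the backward pass, since the gradient with respect to the kernel becomes a matrix product with the transposed patch matrix.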


Section 05

ReLU Activation Layer

ReLU (Rectified Linear Unit) is the most commonly used activation function, with the formula f(x) = max(0, x). While forward propagation is simple, backpropagation requires correct gradient handling: for positive inputs, the gradient is 1; for negative inputs, the gradient is 0. The team implemented a complete gradient propagation mechanism.
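The whole layer fits in a few lines of NumPy; a minimal sketch of the forward/backward pair described above (class and method names are illustrative):

```python
import numpy as np

class ReLU:
    """ReLU layer: forward is max(0, x); backward passes the upstream
    gradient only where the input was positive."""

    def forward(self, x):
        self.mask = x > 0            # remember where inputs were positive
        return x * self.mask

    def backward(self, grad_out):
        # Gradient is 1 for positive inputs, 0 for negative inputs
        return grad_out * self.mask

relu = ReLU()
x = np.array([[-2.0, 3.0], [0.5, -1.0]])
out = relu.forward(x)                    # [[0., 3.], [0.5, 0.]]
grad = relu.backward(np.ones_like(x))    # [[0., 1.], [1., 0.]]
```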


Section 06

Max Pooling Layer

The max pooling layer reduces the spatial dimension of feature maps by selecting the maximum value within each local window. The scratch implementation requires:

  • Recording the position of each maximum value (mask) during forward propagation
  • Passing gradients only to the positions of maximum values during backpropagation, with gradients for other positions set to 0
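The two requirements above can be sketched for a single-channel map with a 2x2 window and stride 2 (a simplified sketch; the project's implementation also handles batches and channels):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pooling, stride 2, on an (H, W) map with even H and W.
    Returns the pooled map plus a boolean mask recording each max position."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x, dtype=bool)
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2*i:2*i+2, 2*j:2*j+2]
            out[i, j] = window.max()
            r, c = np.unravel_index(window.argmax(), (2, 2))
            mask[2*i + r, 2*j + c] = True    # remember where the max came from
    return out, mask

def maxpool2x2_backward(grad_out, mask):
    """Route each upstream gradient to its recorded max position; zeros elsewhere."""
    grad_in = np.zeros(mask.shape)
    Ho, Wo = grad_out.shape
    for i in range(Ho):
        for j in range(Wo):
            win = mask[2*i:2*i+2, 2*j:2*j+2]
            grad_in[2*i:2*i+2, 2*j:2*j+2][win] = grad_out[i, j]
    return grad_in

x = np.array([[1.0, 2.0],
              [4.0, 3.0]])
out, mask = maxpool2x2_forward(x)                      # out == [[4.]]
grad_in = maxpool2x2_backward(np.array([[5.0]]), mask) # 5 lands where the 4 was
```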

Section 07

Flatten Layer

The Flatten layer converts multi-dimensional feature maps into one-dimensional vectors, serving as the input to fully connected layers. While simple to implement, it requires correct handling of the batch dimension and gradient shapes.
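A minimal sketch of that batch-aware reshaping: the forward pass keeps the batch axis and collapses the rest, and the backward pass simply restores the cached input shape (names are illustrative):

```python
import numpy as np

class Flatten:
    """Flatten layer: (batch, H, W, C) -> (batch, H*W*C) and back."""

    def forward(self, x):
        self.input_shape = x.shape           # cache for the backward pass
        return x.reshape(x.shape[0], -1)     # keep batch axis, collapse the rest

    def backward(self, grad_out):
        # Gradients just flow back in the original shape
        return grad_out.reshape(self.input_shape)

flat = Flatten()
x = np.zeros((2, 4, 4, 3))
out = flat.forward(x)        # shape (2, 48)
grad = flat.backward(out)    # shape (2, 4, 4, 3)
```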


Section 08

SimpleRNN Unit

Recurrent Neural Networks (RNN) pass information between time steps via hidden states. The forward propagation formula for a SimpleRNN unit is:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

Where W_hh is the weight matrix from hidden state to hidden state, and W_xh is the weight matrix from input to hidden state. Backpropagation requires the Backpropagation Through Time (BPTT) algorithm to accumulate gradients across all time steps.
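Rolling this recurrence over a sequence, while keeping every hidden state for later use by BPTT, can be sketched as follows (shapes and the random initialization here are illustrative assumptions, not the project's settings):

```python
import numpy as np

def simple_rnn_forward(xs, W_xh, W_hh, b, h0):
    """Forward pass of a SimpleRNN over a sequence xs of shape (T, input_dim):
    h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b).
    Returns all hidden states, shape (T, hidden_dim), as BPTT needs them all."""
    h = h0
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                        # sequence length, input dim, hidden dim
hs = simple_rnn_forward(
    rng.normal(size=(T, D)),             # input sequence
    rng.normal(size=(H, D)) * 0.1,       # W_xh: input -> hidden
    rng.normal(size=(H, H)) * 0.1,       # W_hh: hidden -> hidden
    np.zeros(H),                         # bias b
    np.zeros(H),                         # initial hidden state h_0
)
# hs[t] is h_{t+1}; every entry lies in (-1, 1) because of tanh
```

Because tanh squashes values into (-1, 1), repeated multiplication by W_hh during BPTT is what makes SimpleRNN prone to vanishing gradients on long sequences, motivating the LSTM gating discussed later.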