Zing Forum

Implementation of an Image Captioning Model Based on ResNet-50 and LSTM

An image captioning project built on the classic encoder-decoder architecture: a pre-trained ResNet-50 extracts image features and an LSTM generates natural-language descriptions, achieving a BLEU-4 score of approximately 0.21 on the Flickr30k dataset.

Tags: Image Captioning · Multi-modal Learning · ResNet · LSTM · PyTorch · Deep Learning · Computer Vision · Natural Language Processing
Published 2026-04-18 18:56 · Recent activity 2026-04-18 19:22 · Estimated read 5 min

Section 01

Project Overview: ResNet-50 + LSTM Image Captioning Model

This project implements a classic encoder-decoder image captioning model using pre-trained ResNet-50 for image feature extraction and LSTM for text generation. It achieves a BLEU-4 score of ~0.21 on the Flickr30k dataset. The project is an excellent starting point for learning multi-modal AI, covering core processes from data preprocessing to model evaluation.


Section 02

Background: Image Captioning & Its Significance

Image Captioning is a key multi-modal AI task that enables computers to describe images with natural language. It has applications in assisting visually impaired people, image retrieval, and social media content generation. This project uses a classic Seq2Seq architecture (retro but foundational) to demonstrate the core flow of the task.


Section 03

Model Architecture: Encoder-Decoder Design

Encoder: a pre-trained ResNet-50 (with the final classification layer removed) extracts 2048-D image features, which are projected to 512-D to generate the initial LSTM states; feature caching is used to reduce computation. Decoder: a 2-layer LSTM with 256-D word embeddings and a 512-D hidden state. Training uses teacher forcing with a ratio of 70%; inference supports both greedy search and beam search (k=5).


Section 04

Dataset & Training Strategy Details

Dataset: Flickr30k (31k images, 158k captions), split into train (25k images), validation (3k), and test (3k). Text preprocessing: lowercasing, removal of special characters, filtering of low-frequency words (vocabulary size: 7,731), and addition of special tokens. Training: cross-entropy loss (padding ignored), Adam optimizer (lr=3e-4), ReduceLROnPlateau (halve the learning rate if validation loss plateaus for 3 epochs), and gradient clipping (max norm 5). Hyperparameters: batch size 64, 20 epochs, etc. The best checkpoint was reached at epoch 14 (val loss = 2.9270).
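A sketch of one teacher-forced training step with these settings (the model interface and helper names are assumptions for illustration):

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed id of the padding token

def make_training_objects(model):
    """Loss, optimizer, and scheduler matching the settings above."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3)
    return criterion, optimizer, scheduler

def train_step(model, images, captions, criterion, optimizer):
    """One teacher-forced step: predict token t+1 from tokens <= t."""
    logits = model(images, captions[:, :-1])            # (B, T-1, V)
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     captions[:, 1:].reshape(-1))       # padding ignored
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
    optimizer.step()
    return loss.item()
```

After each validation pass, `scheduler.step(val_loss)` halves the learning rate once the loss has plateaued for 3 epochs.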


Section 05

Experimental Results & Analysis

Test-set scores: BLEU-1 = 0.6139, BLEU-2 = 0.4323, BLEU-3 = 0.3049, BLEU-4 = 0.2107. Interpretation: the BLEU-4 score is competitive for a non-attention Seq2Seq model (attention-based SOTA is roughly 0.3+). The model captures high-level semantics but misses fine-grained details (e.g., hair color, clothing): qualitative examples show it identifies people and activities correctly while omitting such specifics.
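These corpus-level BLEU scores can be computed with nltk (one of the project's listed dependencies); this helper is a sketch assuming tokenized captions, with multiple references per image as in Flickr30k:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references, hypotheses):
    """references: one list of reference token lists per image
    (Flickr30k provides 5 captions each); hypotheses: one generated
    token list per image."""
    smooth = SmoothingFunction().method1  # avoids zero n-gram counts
    weights = {
        "BLEU-1": (1.0,),
        "BLEU-2": (0.5, 0.5),
        "BLEU-3": (1/3, 1/3, 1/3),
        "BLEU-4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(references, hypotheses, weights=w,
                              smoothing_function=smooth)
            for name, w in weights.items()}
```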


Section 06

Suggested Improvements for Better Performance

Architecture: add spatial attention (Bahdanau/Luong), use stronger backbones (ResNet-101, ViT), or fine-tune the CNN. Training: scheduled sampling (gradually replace teacher forcing with the model's own predictions) and self-critical sequence training (directly optimize CIDEr/METEOR).
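Scheduled sampling can be sketched as a decaying teacher-forcing ratio; the linear schedule and floor value below are assumptions for illustration, not settings from the project:

```python
import random

def teacher_forcing_ratio(epoch, total_epochs=20, start=0.7, end=0.1):
    """Linearly decay the probability of feeding the ground-truth token.

    start=0.7 matches the 70% teacher forcing used during training;
    the end floor and linear shape are assumed."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * frac

def pick_next_input(gt_token, model_token, ratio, rng=random):
    """At each decoding step, feed the ground truth with probability
    `ratio`; otherwise feed back the model's own prediction."""
    return gt_token if rng.random() < ratio else model_token
```

Early in training the decoder mostly sees gold tokens; by the final epochs it mostly conditions on its own outputs, narrowing the train/inference mismatch.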


Section 07

Project Usage Guide & Final Summary

Usage: upload the notebook to Kaggle, attach the Flickr30k dataset, enable the 2× T4 GPU accelerator, and run the cells (image features are cached on the first run). Dependencies: torch, numpy, nltk, etc. Summary: this project is ideal for beginners in multi-modal learning. It offers a clear architecture, a complete implementation, detailed experiments, and concrete improvement directions, with a focus on education rather than SOTA performance.