
Implementing an Image Captioning Model with ResNet-50 and LSTM

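An image captioning project built on the classic encoder-decoder architecture: a pre-trained ResNet-50 extracts image features and an LSTM generates the natural-language caption, reaching a BLEU-4 score of about 0.21 on the Flickr30k dataset.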

Tags: image captioning, multimodal learning, ResNet, LSTM, PyTorch, deep learning, computer vision, natural language processing

Published 2026/04/18 18:56 | Last activity 2026/04/18 19:22 | Estimated reading time: 5 minutes

Section 01

Project Overview: ResNet-50 + LSTM Image Captioning Model

This project implements a classic encoder-decoder image captioning model: a pre-trained ResNet-50 extracts image features and an LSTM generates the caption. It reaches a BLEU-4 score of about 0.21 on the Flickr30k dataset. Because it covers the whole pipeline, from data preprocessing to model evaluation, it is a good starting point for learning multimodal AI.

Section 02

Background: Image Captioning & Its Significance

Image captioning is a core multimodal AI task: given an image, the model produces a natural-language description of it. Applications include assistive tools for visually impaired people, image retrieval, and social media content generation. This project uses the classic Seq2Seq architecture, which is dated but foundational, to demonstrate the core flow of the task.

Section 03

Model Architecture: Encoder-Decoder Design

Encoder: a pre-trained ResNet-50 with its final classification layer removed extracts a 2048-D feature vector per image. The vector is projected to 512-D and used to generate the initial LSTM states. Features are cached so that the CNN does not have to be re-run during training.

Decoder: a 2-layer LSTM with 256-D word embeddings and a 512-D hidden state. During training, teacher forcing is applied with a ratio of 70%. At inference time the model supports both greedy search and beam search (k=5).
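
The dimensions above can be wired together roughly as follows. This is a minimal PyTorch sketch, not the project's actual code: the class name `CaptionDecoder`, the ReLU after the projection, the two separate linear heads for the initial hidden and cell states, the padding index 0, and the per-step coin flip for teacher forcing are all assumptions made for illustration. It takes cached 2048-D features as input instead of running ResNet-50.

```python
import torch
import torch.nn as nn


class CaptionDecoder(nn.Module):
    """LSTM decoder initialised from a cached 2048-D image feature vector."""

    def __init__(self, vocab_size, feat_dim=2048, proj_dim=512,
                 embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        # Project the pooled CNN feature from 2048-D to 512-D.
        self.project = nn.Sequential(nn.Linear(feat_dim, proj_dim), nn.ReLU())
        # Assumed design: one linear head each for the initial h and c states.
        self.init_h = nn.Linear(proj_dim, num_layers * hidden_dim)
        self.init_c = nn.Linear(proj_dim, num_layers * hidden_dim)
        # Assumption: index 0 is the padding token.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def init_state(self, feats):
        batch = feats.size(0)
        z = self.project(feats)
        # (batch, layers * hidden) -> (layers, batch, hidden), as nn.LSTM expects.
        shape = (batch, self.num_layers, self.hidden_dim)
        h = self.init_h(z).view(shape).transpose(0, 1).contiguous()
        c = self.init_c(z).view(shape).transpose(0, 1).contiguous()
        return h, c

    def forward(self, feats, captions, teacher_forcing=0.7):
        """Return logits of shape (batch, T - 1, vocab) predicting captions[:, 1:]."""
        state = self.init_state(feats)
        token = captions[:, 0]                      # start-of-sequence token
        logits = []
        for t in range(1, captions.size(1)):
            emb = self.embed(token).unsqueeze(1)    # (batch, 1, embed)
            out, state = self.lstm(emb, state)
            step_logits = self.out(out.squeeze(1))  # (batch, vocab)
            logits.append(step_logits)
            # With probability `teacher_forcing` feed the ground-truth token,
            # otherwise feed the model's own greedy prediction.
            if torch.rand(1).item() < teacher_forcing:
                token = captions[:, t]
            else:
                token = step_logits.argmax(dim=1)
        return torch.stack(logits, dim=1)


model = CaptionDecoder(vocab_size=7731)
feats = torch.randn(4, 2048)               # stand-in for cached ResNet-50 features
caps = torch.randint(1, 7731, (4, 12))     # stand-in for tokenised captions
print(model(feats, caps).shape)            # torch.Size([4, 11, 7731])
```

Greedy decoding at inference time corresponds to calling the same loop with `teacher_forcing=0.0` and stopping at the end-of-sequence token; beam search would instead keep the k best partial captions at every step.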

Section 04

Dataset & Training Strategy Details

Dataset: Flickr30k (about 31k images with 158k captions), split into 25k training, 3k validation, and 3k test images.

Text preprocessing: captions are lowercased, special characters are removed, low-frequency words are filtered out (final vocabulary size: 7,731), and special tokens are added.

Training: the loss is cross-entropy with padding positions ignored. The optimizer is Adam (lr = 3e-4) with a ReduceLROnPlateau scheduler that halves the learning rate when validation loss plateaus for 3 epochs. Gradients are clipped to a maximum norm of 5. Other hyperparameters include a batch size of 64 and 20 epochs. The best checkpoint was saved at epoch 14 (validation loss 2.9270).
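
The text preprocessing steps can be sketched in plain Python as below. The special-token names, the regular expression, and the `min_freq` threshold are assumptions for illustration; the source does not state which tokens or which frequency cutoff produced the 7,731-word vocabulary.

```python
import re
from collections import Counter

# Assumed special-token names; <pad> gets id 0 so the loss can ignore it.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def tokenize(caption):
    """Lowercase, replace anything that is not a letter, digit or space, then split."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", caption.lower())
    return cleaned.split()


def build_vocab(captions, min_freq=5):
    """Map every word seen at least `min_freq` times to an integer id."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    words = sorted(w for w, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + words)}


def encode(caption, vocab):
    """Wrap a caption in <start>/<end> and replace rare words with <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(caption)]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]


captions = ["A dog runs.", "A dog sleeps!", "The cat runs"]
vocab = build_vocab(captions, min_freq=2)
print(vocab)
# {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'a': 4, 'dog': 5, 'runs': 6}
print(encode("A cat runs", vocab))
# [1, 4, 3, 6, 2]
```

Words below the cutoff ("cat" in the toy example) collapse to `<unk>`, which is what keeps the vocabulary, and therefore the output layer of the decoder, small.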

Section 05

Experimental Results & Analysis

Test-set scores: BLEU-1 = 0.6139, BLEU-2 = 0.4323, BLEU-3 = 0.3049, BLEU-4 = 0.2107.

Interpretation: a BLEU-4 of about 0.21 is competitive for a Seq2Seq model without attention; state-of-the-art attention-based models reach roughly 0.3 or higher. The model captures the high-level semantics of a scene, and the sample captions show that it identifies people and activities correctly, but it misses fine-grained details such as hair color and clothing.
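
For readers unfamiliar with how BLEU-1 through BLEU-4 are computed, here is a small pure-Python sketch of corpus-level BLEU with uniform weights and no smoothing. It is illustrative only: the source does not say which BLEU implementation produced the scores above (nltk is listed among the dependencies), and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU-max_n with uniform weights and no smoothing.

    references: one entry per image, each a list of tokenised reference captions.
    hypotheses: one tokenised generated caption per image.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # The brevity penalty uses the reference length closest to the hypothesis.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)   # element-wise maximum over references
            # Clip each hypothesis n-gram count by its maximum reference count.
            matches[n - 1] += sum((hyp_counts & max_ref).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


refs = [[["a", "dog", "runs", "on", "the", "grass"]]]
hyps = [["a", "dog", "runs", "on", "grass"]]
for n in range(1, 5):
    print(f"BLEU-{n}: {corpus_bleu(refs, hyps, max_n=n):.4f}")
```

Because the score is a geometric mean of the n-gram precisions, it can only fall as higher orders are added, which is why the reported numbers decrease from BLEU-1 to BLEU-4.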

Section 06

Suggested Improvements for Better Performance

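Architecture: add spatial attention (Bahdanau or Luong style), use a stronger backbone (ResNet-101, ViT), and fine-tune the CNN instead of keeping it frozen.

Training: use scheduled sampling, which gradually replaces teacher forcing with the model's own predictions, and self-critical sequence training, which directly optimizes metrics such as CIDEr or METEOR.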
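
To make the scheduled-sampling suggestion concrete, the sketch below computes a teacher-forcing probability that decays over epochs. It is an illustration under assumptions, not part of the project: the function name, the two schedules shown, and the default values are hypothetical, and the current model trains with a fixed ratio of 70%.

```python
import math


def teacher_forcing_prob(epoch, schedule="linear", start=1.0, floor=0.0,
                         total_epochs=20, k=5.0):
    """Probability of feeding the ground-truth token at a given 0-based epoch."""
    if schedule == "linear":
        # Decay in a straight line from `start` to `floor` over the run.
        frac = epoch / max(total_epochs - 1, 1)
        p = start - (start - floor) * frac
    elif schedule == "inverse_sigmoid":
        # k / (k + exp(epoch / k)): decays slowly at first, then faster.
        p = k / (k + math.exp(epoch / k))
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # Keep the result inside [floor, start].
    return max(floor, min(start, p))


for epoch in (0, 5, 10, 19):
    linear = teacher_forcing_prob(epoch)
    sigmoid = teacher_forcing_prob(epoch, schedule="inverse_sigmoid")
    print(f"epoch {epoch:2d}: linear={linear:.3f} inverse_sigmoid={sigmoid:.3f}")
```

In a training loop, the returned value would replace the fixed teacher-forcing ratio: at each decoding step the ground-truth token is fed with this probability and the model's own prediction otherwise, so the decoder is exposed to its own mistakes more often as training progresses.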

Section 07

Project Usage Guide & Final Summary

Usage: upload the notebook to Kaggle, attach the Flickr30k dataset, enable the 2x T4 GPU accelerator, and run the cells in order. Image features only need to be cached once. Dependencies include torch, numpy, and nltk, among others.

Summary: this project is well suited to beginners in multimodal learning. It offers a clear architecture, a complete implementation, detailed experiments, and concrete directions for improvement. Its focus is education rather than state-of-the-art performance.