# Implementation of an Image Captioning Model Based on ResNet-50 and LSTM

> An image captioning project using the classic encoder-decoder architecture, which employs pre-trained ResNet-50 to extract image features and LSTM to generate natural language descriptions, achieving a BLEU-4 score of approximately 0.21 on the Flickr30k dataset.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-18T10:56:23.000Z
- Last activity: 2026-04-18T11:22:56.151Z
- Popularity: 150.6
- Keywords: image captioning, multimodal learning, ResNet, LSTM, PyTorch, deep learning, computer vision, natural language processing
- Page URL: https://www.zingnex.cn/en/forum/thread/resnet-50lstm
- Canonical: https://www.zingnex.cn/forum/thread/resnet-50lstm
- Markdown source: floors_fallback

---

## Project Overview: ResNet-50 + LSTM Image Captioning Model

This project implements a classic encoder-decoder image captioning model using pre-trained ResNet-50 for image feature extraction and LSTM for text generation. It achieves a BLEU-4 score of ~0.21 on the Flickr30k dataset. The project is an excellent starting point for learning multi-modal AI, covering core processes from data preprocessing to model evaluation.

## Background: Image Captioning & Its Significance

Image Captioning is a key multi-modal AI task that enables computers to describe images with natural language. It has applications in assisting visually impaired people, image retrieval, and social media content generation. This project uses a classic Seq2Seq architecture (retro but foundational) to demonstrate the core flow of the task.

## Model Architecture: Encoder-Decoder Design

**Encoder**: A pre-trained ResNet-50 with its final classification layer removed extracts 2048-D image features, which are projected to 512-D and used to generate the LSTM's initial states. Features are cached once to avoid re-running the CNN every epoch.
**Decoder**: A 2-layer LSTM with 256-D word embeddings and 512-D hidden states. Training uses teacher forcing with probability 0.7; inference supports both greedy search and beam search (k=5).
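The decoder side of this design can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: class and parameter names are assumptions, and the 2048-D image feature is taken as a precomputed (cached) ResNet-50 pooled output rather than produced by an in-graph CNN.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """2-layer LSTM decoder. A 2048-D image feature (e.g. a cached
    ResNet-50 pooled output) is projected to 512-D per layer and used
    to initialize the LSTM hidden and cell states. Names/sizes here
    mirror the post's description but are otherwise illustrative."""

    def __init__(self, vocab_size=7731, feat_dim=2048, embed_dim=256,
                 hidden_dim=512, num_layers=2):
        super().__init__()
        self.num_layers, self.hidden_dim = num_layers, hidden_dim
        # project image feature to initial LSTM states (all layers at once)
        self.init_h = nn.Linear(feat_dim, hidden_dim * num_layers)
        self.init_c = nn.Linear(feat_dim, hidden_dim * num_layers)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, 2048), captions: (B, T) token ids
        B = feats.size(0)
        h = (self.init_h(feats).view(B, self.num_layers, self.hidden_dim)
             .transpose(0, 1).contiguous())  # (num_layers, B, hidden)
        c = (self.init_c(feats).view(B, self.num_layers, self.hidden_dim)
             .transpose(0, 1).contiguous())
        out, _ = self.lstm(self.embed(captions), (h, c))
        return self.fc(out)  # (B, T, vocab_size) logits
```

Initializing the LSTM state from the image (rather than concatenating the feature at every step) is one common way to condition a non-attention decoder on the image.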

## Dataset & Training Strategy Details

**Dataset**: Flickr30k (31k images, 158k captions) split into train (25k images), validation (3k), and test (3k). Text preprocessing: lowercasing, removal of special characters, filtering of low-frequency words (vocabulary size: 7,731), and addition of special tokens.
**Training**: Cross-entropy loss with padding ignored. Optimizer: Adam (lr=3e-4) with ReduceLROnPlateau (halve the learning rate if validation loss plateaus for 3 epochs). Gradient clipping (max norm = 5). Hyperparameters: batch size = 64, 20 epochs. The best checkpoint was reached at epoch 14 (val loss = 2.9270).
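The loss/optimizer setup described above can be wired up in PyTorch roughly like this. It is a hedged sketch: `PAD_IDX`, the function names, and the `model(feats, captions)` calling convention are assumptions, while the numeric settings (lr=3e-4, factor 0.5, patience 3, max norm 5) come from the post.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed index of the padding token

def make_training_objects(model):
    """Loss, optimizer, and scheduler matching the settings above."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # ignore padding
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3)  # halve on plateau
    return criterion, optimizer, scheduler

def train_step(model, feats, captions, criterion, optimizer):
    # Next-token prediction: feed tokens[:-1], predict tokens[1:].
    logits = model(feats, captions[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     captions[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    # clip gradients to max norm 5 before the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
    optimizer.step()
    return loss.item()
```

After each validation pass, `scheduler.step(val_loss)` applies the plateau rule.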

## Experimental Results & Analysis

**Test Set Scores**: BLEU-1 = 0.6139, BLEU-2 = 0.4323, BLEU-3 = 0.3049, BLEU-4 = 0.2107.
**Interpretation**: This BLEU-4 is competitive for attention-free Seq2Seq models (attention-based SOTA is roughly 0.3+). The model captures high-level semantics but misses fine details such as hair color or clothing; qualitative examples show it identifies people and activities while omitting fine-grained attributes.
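Since the post lists nltk among its dependencies, the reported scores can be reproduced with NLTK's corpus-level BLEU. A minimal sketch (the function name and smoothing choice are assumptions; each image's multiple reference captions go in one inner list):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_1_to_4(references, hypotheses):
    """Corpus-level BLEU-1..4.

    references: list (per image) of lists of tokenized reference captions
    hypotheses: list (per image) of one tokenized generated caption
    """
    smooth = SmoothingFunction().method1  # avoid zero counts on short texts
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [corpus_bleu(references, hypotheses, weights=w,
                        smoothing_function=smooth) for w in weights]
```

Corpus-level BLEU (one brevity penalty over the whole test set) is the standard way these numbers are reported, as opposed to averaging per-sentence scores.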

## Suggested Improvements for Better Performance

**Architecture**: Add spatial attention (Bahdanau/Luong), use stronger backbones (ResNet-101, ViT), or fine-tune the CNN end-to-end.
**Training**: Scheduled sampling (gradually replace teacher forcing with the model's own predictions) and self-critical sequence training (directly optimize CIDEr/METEOR).
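The scheduled-sampling idea can be illustrated with a small per-step helper: with probability `p_teacher` the decoder is fed the gold previous token, otherwise its own previous prediction, and `p_teacher` is decayed over training. A hedged sketch (function name and calling convention are assumptions):

```python
import torch

def scheduled_sampling_inputs(gold_prev, pred_prev, p_teacher):
    """Mix gold and model-predicted previous tokens (scheduled sampling).

    gold_prev, pred_prev: (B,) or (B, T) LongTensors of token ids.
    p_teacher: probability of feeding the gold token; decayed over
    training, e.g. linearly from 1.0 toward some floor.
    """
    # per-position coin flip: True -> use the gold token
    use_gold = torch.rand_like(gold_prev, dtype=torch.float) < p_teacher
    return torch.where(use_gold, gold_prev, pred_prev)
```

At `p_teacher=1.0` this reduces to pure teacher forcing; decaying it exposes the decoder to its own errors at training time, narrowing the train/inference mismatch.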

## Project Usage Guide & Final Summary

**Usage**: Upload the notebook to Kaggle, attach the Flickr30k dataset, enable the 2x T4 GPU accelerator, and run the cells in order (features are cached once on the first run). Dependencies: torch, numpy, nltk, etc.
**Summary**: This project is ideal for beginners in multimodal learning. It offers a clear architecture, a full implementation, detailed experiments, and concrete improvement directions, with a focus on education rather than SOTA performance.
