Zing Forum

Attention Mechanism-Based Deep Image Captioning Framework: A Visual Understanding System Combining ResNet and LSTM

An image captioning project for an advanced machine learning course. It uses an encoder-decoder architecture that combines ResNet-50 feature extraction, a Bahdanau attention mechanism, and a two-layer LSTM to generate context-aware image captions automatically.

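Tags: image captioning, attention mechanism, encoder-decoder, ResNet, LSTM, deep learning, computer vision, natural language processing, multimodal learning, TensorFlow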
Published 2026-06-08 05:16 | Recent activity 2026-06-08 05:31 | Estimated read 6 min

Section 01

[Project Introduction] Attention Mechanism-Based Image Captioning Framework (ResNet+LSTM)

This is EyadMHussien's final project for an Advanced Machine Learning course, and it implements a complete image captioning framework. At its core is an encoder-decoder architecture that combines ResNet-50 feature extraction, a Bahdanau attention mechanism, and a two-layer LSTM, so the model can dynamically focus on different regions of the image while generating contextually relevant descriptions. The source code was released on June 7, 2026 and is available on GitHub: https://github.com/EyadMHussien/A-Deep-Learning-Framework-for-Image-Captioning-Course-Advanced-Machine-Learning

Section 02

Project Background and Dataset Description

Image captioning sits at the intersection of computer vision and natural language processing: the goal is for a machine to understand an image and describe it in natural language. This project uses the MS-COCO dataset, an industry-standard benchmark, and samples 50% of the data (41,391 images) to keep training efficient. Preprocessing consists of three steps: captions are converted to lowercase, stripped of special characters, and wrapped in startseq/endseq tokens; a vocabulary is built from the 5,000 most frequent words; and sequences are padded to a length of 35 words so that all inputs share the same dimensions.
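
The preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function names, the `<pad>` and `<unk>` tokens, the choice of id 0 for padding, and the rule that anything other than a letter counts as a "special character" are all assumptions.

```python
import re
from collections import Counter

START, END, PAD, UNK = "startseq", "endseq", "<pad>", "<unk>"  # PAD/UNK are assumed names

def clean_caption(text):
    """Lowercase, drop non-letter characters, and wrap with start/end tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # assumption: keep letters only
    return [START] + text.split() + [END]

def build_vocab(tokenized_captions, max_words=5000):
    """Map the most frequent words to integer ids; 0 is padding, 1 is unknown."""
    counts = Counter(w for caption in tokenized_captions for w in caption)
    vocab = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(max_words):
        vocab[word] = len(vocab)
    return vocab

def encode_and_pad(tokens, vocab, max_len=35):
    """Convert tokens to ids, truncate to max_len, and right-pad with the pad id."""
    ids = [vocab.get(w, vocab[UNK]) for w in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

captions = ["A man riding a wave on top of a surfboard.", "Two dogs play in the grass!"]
tokenized = [clean_caption(c) for c in captions]
vocab = build_vocab(tokenized)
padded = [encode_and_pad(t, vocab) for t in tokenized]
```

Note that simple truncation can cut off the endseq token of a very long caption; whether the 35-word limit in the project counts the start and end tokens is not stated in the source.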

Section 03

Model Architecture and Training Methods

Model Architecture

  • Encoder: A pre-trained ResNet-50 with its top classification layer removed takes 224×224 images and outputs a spatial feature grid, which is then projected through a Dense layer with ReLU activation.
  • Attention mechanism: Bahdanau (additive) attention computes alignment scores between the image features and the decoder hidden state, then forms a context vector as the weighted sum of the features, so the decoder can focus on different image regions at each step.
  • Decoder: A word embedding layer (dimension 256) feeds a two-layer LSTM (512 units per layer). The context vector is concatenated with the word embedding to form the LSTM input, and the output is a probability distribution over the next word.
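
To make the attention step concrete, here is a small NumPy sketch of additive (Bahdanau) attention for a single image and a single decoder step. It is not taken from the project, which uses TensorFlow; the parameter names `W1`, `W2`, `v` and all tensor sizes (49 regions, 256-dimensional features, 128 attention units) are illustrative assumptions.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention over image regions.

    features: (num_regions, feat_dim) projected encoder outputs
    hidden:   (hidden_dim,) previous decoder hidden state
    W1: (feat_dim, attn_dim), W2: (hidden_dim, attn_dim), v: (attn_dim,)
    Returns the context vector (feat_dim,) and attention weights (num_regions,).
    """
    # Alignment score for each region: v^T tanh(W1 f_i + W2 h)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v
    # Softmax over regions (subtract the max for numerical stability)
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()
    # Context vector: attention-weighted sum of the region features
    context = weights @ features
    return context, weights

rng = np.random.default_rng(0)
num_regions, feat_dim, hidden_dim, attn_dim = 49, 256, 512, 128  # assumed sizes
features = rng.normal(size=(num_regions, feat_dim))
hidden = rng.normal(size=hidden_dim)
W1 = rng.normal(scale=0.05, size=(feat_dim, attn_dim))
W2 = rng.normal(scale=0.05, size=(hidden_dim, attn_dim))
v = rng.normal(scale=0.05, size=attn_dim)

context, weights = bahdanau_attention(features, hidden, W1, W2, v)
```

The weights sum to 1 across regions, and reshaping them to the feature grid is what an attention heatmap would display.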

Training Configuration

  • A custom training loop (compiled with @tf.function for speed) uses the Adam optimizer (learning rate 0.001) and a sparse categorical cross-entropy loss, masked so that padding positions do not contribute.
  • Hyperparameters: 10 training epochs and a batch size of 64. Checkpoint resumption is implemented by saving weights to Google Drive, so an interrupted run can continue training.
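
The idea behind the masked loss can be illustrated with a short NumPy sketch. It is a restatement of the concept rather than the project's TensorFlow code; the function name, the use of id 0 for padding, and averaging over non-padding positions only are assumptions.

```python
import numpy as np

def masked_sparse_cce(targets, probs, pad_id=0):
    """Mean negative log-likelihood over non-padding positions.

    targets: (batch, seq_len) integer word ids
    probs:   (batch, seq_len, vocab_size) predicted probabilities
    """
    batch_idx, time_idx = np.indices(targets.shape)
    # Probability the model assigned to each correct word
    picked = probs[batch_idx, time_idx, targets]
    loss = -np.log(picked + 1e-9)
    # 1.0 for real words, 0.0 for padding
    mask = (targets != pad_id).astype(loss.dtype)
    return (loss * mask).sum() / max(mask.sum(), 1.0)

# Two real words followed by two padding positions, uniform predictions
targets = np.array([[1, 2, 0, 0]])
probs = np.full((1, 4, 4), 0.25)
loss = masked_sparse_cce(targets, probs)
```

Without the mask, the model would also be rewarded for predicting padding, which makes up a large share of positions when captions are padded to a fixed length.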

Section 04

Model Evaluation and Inference Implementation

Inference Process: Input image → the ResNet encoder extracts features → the decoder generates the caption one word at a time, applying attention at each step → generation stops when endseq is produced or the maximum length is reached.

Visualization: Original images are displayed side by side with their generated descriptions. This could be extended to attention heatmaps, which are not yet implemented.

Current Evaluation: Relies mainly on qualitative visualization and lacks quantitative metrics such as BLEU and METEOR.
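
The inference loop described above corresponds to greedy decoding, sketched below in plain Python. The `predict_next` callable is hypothetical: it stands in for one step of the trained decoder and is replaced here by a toy function so that the example runs on its own.

```python
def greedy_decode(features, predict_next, max_len=35, start="startseq", end="endseq"):
    """Generate a caption one word at a time, always taking the most likely word.

    predict_next is a hypothetical stand-in for one decoder step: it takes
    (features, previous_word, state) and returns (probs, new_state), where
    probs maps each candidate word to its probability.
    """
    words, prev, state = [], start, None
    for _ in range(max_len):
        probs, state = predict_next(features, prev, state)
        prev = max(probs, key=probs.get)  # greedy choice
        if prev == end:
            break
        words.append(prev)
    return " ".join(words)

# Toy stand-in for the trained decoder: follows a fixed script
def toy_step(features, prev_word, state):
    script = {"startseq": "a", "a": "dog", "dog": "runs", "runs": "endseq"}
    return {script[prev_word]: 0.9, "the": 0.1}, state

caption = greedy_decode(features=None, predict_next=toy_step)
print(caption)  # a dog runs
```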

Section 05

Project Value and Technical Highlights

Teaching Value:

  • End-to-end implementation (data preprocessing → training → inference), helps understand deep learning engineering practices.
  • Manually implements Bahdanau attention, deepens understanding of mechanism principles.
  • Integrates CV (CNN) and NLP (RNN), demonstrates multimodal learning architecture.

Technical Highlights:

  • Transfer learning: Uses ImageNet pre-trained ResNet-50 to improve feature extraction capability.
  • Custom training loop: Flexible control over gradient calculation and loss handling.
  • Memory optimization: Capping the vocabulary and sampling the dataset, among other measures, reduce memory usage.

Section 06

Limitations and Improvement Directions

Current Limitations:

  • Lacks quantitative evaluation metrics (BLEU/METEOR).
  • Attention heatmap visualization not implemented.
  • Only uses 50% of the dataset, which may limit performance.
  • Autoregressive inference is slow, since words are generated one at a time.
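
As an indication of what a quantitative metric would involve, here is a simplified sentence-level BLEU in plain Python. It is a sketch under stated assumptions (uniform n-gram weights, no smoothing, whitespace tokenization); a real evaluation would more likely rely on an established implementation and on corpus-level scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "a dog runs on the grass".split()
score = sentence_bleu("a dog runs on the lawn".split(), [reference], max_n=2)
```

Because short captions often have no matching 4-grams, unsmoothed BLEU-4 is frequently zero at the sentence level, which is one reason scores are usually reported over a whole test set.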

Improvement Directions:

  • Replace the CNN with a Transformer-based encoder such as Vision Transformer, or introduce a pre-trained language model such as BERT.
  • Use CIDEr-D as the reward signal and optimize caption quality with reinforcement learning.
  • Use multimodal pre-trained models such as CLIP to strengthen visual-language alignment.
  • Introduce beam search decoding to improve generation quality.
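
To show what the last item would change, here is a minimal beam search sketch in plain Python. The `predict_next` callable and the toy probability table are hypothetical stand-ins for the trained decoder, and the sketch omits length normalization, which practical implementations often add.

```python
import math

def beam_search(predict_next, beam_width=3, max_len=35, start="startseq", end="endseq"):
    """Keep the beam_width highest-scoring partial captions at every step.

    predict_next is a hypothetical callable: given the list of words so far,
    it returns a dict mapping each candidate next word to its probability.
    """
    beams = [([start], 0.0)]  # (words, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, p in predict_next(words).items():
                if p > 0:
                    candidates.append((words + [word], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_width]:
            (finished if words[-1] == end else beams).append((words, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    if not finished:
        return ""
    best_words, _ = max(finished, key=lambda c: c[1])
    return " ".join(w for w in best_words if w not in (start, end))

# Toy model: the locally best first word ("a") leads to a worse full caption
def toy_model(words):
    table = {
        "startseq": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
    }
    return table.get(words[-1], {"endseq": 1.0})

print(beam_search(toy_model, beam_width=1))  # a dog   (same as greedy, probability 0.30)
print(beam_search(toy_model, beam_width=2))  # the dog (probability 0.36)
```

With a beam width of 1 the search reduces to greedy decoding; a wider beam can recover captions whose first word is not the single most likely one.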