# Attention Mechanism-Based Deep Image Captioning Framework: A Visual Understanding System Combining ResNet and LSTM

> An image captioning project for an Advanced Machine Learning course. It uses an encoder-decoder architecture that combines ResNet-50 feature extraction, Bahdanau attention, and a two-layer LSTM to generate context-aware image captions automatically.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-07T21:16:01.000Z
- Last activity: 2026-06-07T21:31:02.314Z
- Popularity: 145.8
- Keywords: image captioning, attention mechanism, encoder-decoder, ResNet, LSTM, deep learning, computer vision, natural language processing, multimodal learning, TensorFlow
- Page link: https://www.zingnex.cn/en/forum/thread/resnetlstm
- Canonical: https://www.zingnex.cn/forum/thread/resnetlstm

---

## [Project Introduction] Attention Mechanism-Based Image Captioning Framework (ResNet+LSTM)

This project was developed by EyadMHussien as the final project for an Advanced Machine Learning course and implements a complete image captioning framework. It uses an encoder-decoder architecture that combines ResNet-50 feature extraction, Bahdanau attention, and a two-layer LSTM, so the decoder can focus on different image regions as it generates each word of the description. The source code is available on GitHub (https://github.com/EyadMHussien/A-Deep-Learning-Framework-for-Image-Captioning-Course-Advanced-Machine-Learning) and was released on June 7, 2026.

## Project Background and Dataset Description

Image captioning sits at the intersection of computer vision and natural language processing: the goal is for a machine to understand an image and describe it in natural language. This project uses the MS-COCO dataset, an industry-standard benchmark, and samples 50% of the data (41,391 images) to keep training time manageable. Preprocessing converts captions to lowercase, removes special characters, and wraps each caption in startseq/endseq tokens. The vocabulary is limited to the 5,000 most frequent words, and every sequence is padded to 35 words so that all inputs share the same length.
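
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the function names, the `<pad>`/`<unk>` tokens, and the choice to keep digits are all assumptions.

```python
import re
from collections import Counter

PAD, OOV = "<pad>", "<unk>"  # hypothetical special tokens, not taken from the project

def clean_caption(text):
    """Lowercase, replace special characters with spaces, wrap in startseq/endseq."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())  # keeping digits is an assumption
    return "startseq " + " ".join(text.split()) + " endseq"

def build_vocab(captions, max_words=5000):
    """Map the most frequent words to integer ids; 0 is padding, 1 is unknown."""
    counts = Counter(word for cap in captions for word in cap.split())
    vocab = {PAD: 0, OOV: 1}
    for word, _ in counts.most_common(max_words - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(caption, vocab, max_len=35):
    """Convert a cleaned caption to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(w, vocab[OOV]) for w in caption.split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

captions = [clean_caption("A dog runs!"), clean_caption("A dog catches a ball.")]
vocab = build_vocab(captions)
ids = encode(captions[0], vocab)
```

Here `ids` is a list of 35 integers: the ids for `startseq a dog runs endseq` followed by zeros. Reserving id 0 for padding makes it easy to mask padded positions in the loss later.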

## Model Architecture and Training Methods

**Model Architecture**
- **Encoder**: A pre-trained ResNet-50 with the top classification layer removed. It takes 224×224 images as input and outputs a spatial feature grid, which is then projected through a Dense layer with ReLU activation.
- **Attention Mechanism**: Bahdanau attention computes alignment scores between the image features and the decoder hidden state, then forms a context vector from the weighted features, so the model focuses on different image regions at each step.
- **Decoder**: A word embedding layer (dimension 256) followed by a two-layer LSTM (512 units per layer). The context vector is concatenated with the word embedding to form the LSTM input, and the output is a probability distribution over the next word.
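
To make the attention step concrete, here is a minimal sketch of Bahdanau (additive) attention for a single image and a single decoding step. It uses NumPy instead of TensorFlow to stay dependency-light, so it is not the project's layer. The 49 regions correspond to the 7×7 grid that ResNet-50 produces for a 224×224 input; the projected feature size of 256, the 512 attention units, and the weight names `W1`, `W2`, `v` are assumptions.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention over image regions.

    features: (regions, feat_dim) projected encoder outputs
    hidden:   (hid_dim,) previous decoder hidden state
    W1: (feat_dim, units), W2: (hid_dim, units), v: (units,)
    """
    # Alignment score per region: v^T tanh(W1 f_i + W2 h)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (regions,)
    # Softmax over regions, shifted by the max for numerical stability
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the region features
    context = weights @ features                        # (feat_dim,)
    return context, weights

rng = np.random.default_rng(0)
regions, feat_dim, hid_dim, units = 49, 256, 512, 512  # sizes partly assumed
features = rng.normal(size=(regions, feat_dim))
hidden = rng.normal(size=hid_dim)
W1 = rng.normal(size=(feat_dim, units)) * 0.01
W2 = rng.normal(size=(hid_dim, units)) * 0.01
v = rng.normal(size=units)

context, weights = bahdanau_attention(features, hidden, W1, W2, v)
```

The `weights` vector sums to 1 and can be reshaped to 7×7 to see which regions the decoder attends to. In the decoder described above, `context` would be concatenated with the current word embedding before entering the LSTM.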

**Training Configuration**
- A custom training loop (accelerated with @tf.function), the Adam optimizer (learning rate 0.001), and sparse categorical cross-entropy loss with masking so that padded positions do not contribute to the loss.
- Hyperparameters: 10 training epochs and a batch size of 64. Checkpoints are saved to Google Drive so that interrupted training can resume.
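
The masked loss can be illustrated with a short NumPy sketch. It assumes that padding uses id 0 and that the loss is averaged over real (non-padded) tokens; the project may normalize differently, and the function name is hypothetical.

```python
import numpy as np

def masked_sparse_cce(targets, logits, pad_id=0):
    """Mean cross-entropy over non-padding positions.

    targets: (batch, steps) integer word ids
    logits:  (batch, steps, vocab) unnormalized scores
    """
    # Log-softmax over the vocabulary axis, shifted for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true word at every position
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    # Zero out padded positions and average over the real tokens only
    mask = (targets != pad_id).astype(logits.dtype)
    return (nll * mask).sum() / mask.sum()

targets = np.array([[2, 3, 0, 0]])   # two real tokens followed by two pads
logits = np.zeros((1, 4, 5))         # uniform prediction over 5 words
loss = masked_sparse_cce(targets, logits)
```

With uniform logits over five words, `loss` equals ln 5 regardless of what the model predicts at the two padded positions, which is the point of the mask.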

## Model Evaluation and Inference Implementation

- **Inference Process**: Input image → ResNet encoder extracts features → decoder generates the caption one word at a time, applying attention at each step → generation stops when endseq is produced or the maximum length is reached.
- **Visualization**: Supports side-by-side display of original images and generated captions. Attention heatmaps would be a natural extension but are not yet implemented.
- **Current Evaluation**: Relies mainly on qualitative inspection of generated captions; quantitative metrics such as BLEU and METEOR are not computed.
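
The word-by-word inference loop can be sketched as follows. The source does not say how the next word is selected, so greedy selection (taking the highest-scoring word) is an assumption here, and `step_fn` is a hypothetical stand-in for one decoder step (attention, LSTM, and output layer).

```python
def greedy_decode(step_fn, start_id, end_id, max_len=35):
    """Generate word ids one at a time until end_id or max_len is reached.

    step_fn(prev_id, state) -> (scores, new_state) is a hypothetical stand-in
    for one decoder step.
    """
    ids, state, prev = [], None, start_id
    for _ in range(max_len):
        scores, state = step_fn(prev, state)
        # Greedy choice: take the highest-scoring word at this step
        prev = max(range(len(scores)), key=scores.__getitem__)
        if prev == end_id:
            break
        ids.append(prev)
    return ids

# Toy decoder step: always prefers the id after the previous one, capped at 4.
def toy_step(prev_id, state):
    scores = [0.0] * 5
    scores[min(prev_id + 1, 4)] = 1.0
    return scores, state

print(greedy_decode(toy_step, start_id=0, end_id=4))  # prints [1, 2, 3]
```

In a real pipeline the encoder features would be computed once per image and captured by `step_fn`, since only the decoder state changes between steps.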

## Project Value and Technical Highlights

**Teaching Value**:
- End-to-end implementation (data preprocessing → training → inference) that illustrates deep learning engineering practice.
- A hand-written Bahdanau attention layer that makes the mechanism's inner workings explicit.
- Integration of CV (CNN) and NLP (RNN) components, demonstrating a multimodal learning architecture.

**Technical Highlights**:
- Transfer learning: Uses ImageNet pre-trained ResNet-50 to improve feature extraction capability.
- Custom training loop: Flexible control over gradient calculation and loss handling.
- Memory optimization: Capping the vocabulary size and subsampling the dataset reduce memory usage.

## Limitations and Improvement Directions

**Current Limitations**:
- Lacks quantitative evaluation metrics (BLEU/METEOR).
- Attention heatmap visualization not implemented.
- Only 50% of the dataset is used, which may limit performance.
- Autoregressive decoding produces one word per step, so inference is slow.
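
As an illustration of the missing quantitative evaluation, here is a minimal pure-Python sketch of sentence-level BLEU. It is unsmoothed and uses uniform n-gram weights; published results normally use corpus-level BLEU from an established toolkit, so treat this as a teaching aid rather than a reference implementation. All function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most as
    many times as it occurs in any single reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty uses the reference length closest to the candidate length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p1 = modified_precision("the the the the the the the".split(), refs, 1)
```

In this example `p1` is 2/7: the candidate repeats "the" seven times, but no single reference contains it more than twice, so only two occurrences are credited.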

**Improvement Directions**:
- Replace the CNN encoder with a Transformer-based model such as a Vision Transformer, or introduce a pre-trained language model such as BERT.
- Use CIDEr-D as reward to optimize description quality via reinforcement learning.
- Use multimodal pre-trained models like CLIP to enhance visual-language alignment.
- Introduce beam search decoding to improve generation quality.
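
To show what beam search would change relative to picking the single best word at each step, here is a minimal sketch. `step_fn` is a hypothetical stand-in that returns next-word log-probabilities for a given prefix; there is no length normalization, and the toy probability table is invented for illustration.

```python
import math

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=35):
    """Keep the beam_width best partial captions by total log-probability.

    step_fn(prefix) -> list of log-probabilities for the next word id; a
    hypothetical stand-in for running the decoder on the words so far.
    """
    beams = [([start_id], 0.0)]   # (word ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            for word_id, logp in enumerate(step_fn(ids)):
                candidates.append((ids + [word_id], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:beam_width]:
            # Captions that emit end_id are set aside; the rest keep growing
            (finished if ids[-1] == end_id else beams).append((ids, score))
        if not beams:
            break
    best_ids, _ = max(finished or beams, key=lambda c: c[1])
    body = best_ids[1:]           # drop the start token
    return body[:-1] if body and body[-1] == end_id else body

# Toy model with ids 0=start, 1=end, 2 and 3 = words (invented probabilities).
def toy_step(prefix):
    table = {
        (0,): [0.0, 0.0, 0.6, 0.4],
        (0, 2): [0.0, 0.4, 0.3, 0.3],
        (0, 3): [0.0, 0.9, 0.05, 0.05],
    }
    probs = table.get(tuple(prefix), [0.0, 1.0, 0.0, 0.0])
    return [math.log(p) if p > 0 else float("-inf") for p in probs]

greedy = beam_search(toy_step, start_id=0, end_id=1, beam_width=1)
wider = beam_search(toy_step, start_id=0, end_id=1, beam_width=2)
```

With a beam width of 1 the search commits to word 2 (probability 0.6) and finishes with total probability 0.24, giving `[2]`. With a beam width of 2 it also keeps word 3, which leads to a total probability of 0.36, so the result is `[3]`.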
