Image Captioning Technology: Practice of Visual-Language Fusion with CNN-LSTM Architecture

This article introduces an image captioning system based on the CNN-LSTM architecture, explores cross-modal fusion technology between computer vision and natural language processing, analyzes model architecture design, training strategies, and evaluation methods, and discusses the application prospects of this technology in assisting visually impaired individuals, image retrieval, content understanding, and other fields.

Tags: Image Captioning · CNN · LSTM · Computer Vision · Natural Language Processing · Deep Learning · Attention Mechanism · Encoder-Decoder · Multimodal Fusion · BLEU Evaluation
Published 2026-04-01 08:42 · Recent activity 2026-04-01 08:51 · Estimated read: 8 min

Section 01

[Introduction] Image Captioning Technology: Practice of Visual-Language Fusion with CNN-LSTM Architecture

This article focuses on image captioning technology based on the CNN-LSTM architecture, explores cross-modal fusion between computer vision and natural language processing, covers model architecture design, training strategies, evaluation methods, and application prospects, and provides a comprehensive perspective for understanding the fundamentals and development of this field.


Section 02

Technical Background and Core Challenges

Image Captioning is a classic task in the intersection of computer vision and natural language processing, aiming to generate accurate and fluent natural language descriptions for images. The technical challenge lies in simultaneously understanding visual content and linguistic semantic structures, and achieving effective modal alignment.

Its application value is extensive: helping visually impaired individuals understand their environment, improving image retrieval accuracy, reducing the barrier to content creation, and improving user experience.


Section 03

Design Principles of the CNN-LSTM Architecture

Encoder: Convolutional Neural Network (CNN)

The image captioning system adopts an encoder-decoder architecture. The encoder uses a pre-trained CNN (e.g., ResNet, VGG) to extract hierarchical visual features, with the output of the last convolutional layer serving as the image's semantic representation. Freezing the encoder's parameters leverages transfer learning: features learned on large-scale image classification carry over to captioning without retraining.

Decoder: Long Short-Term Memory (LSTM)

The decoder uses an LSTM to generate text, mitigating the vanishing-gradient problem through its gating mechanism. The initial hidden and cell states are obtained by transforming the visual features through a fully connected layer, and each word is generated by integrating the previous word, the current state, and the image features.
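A minimal decoder along these lines might look as follows (PyTorch assumed; the `LSTMDecoder` name and all dimensions are illustrative). It shows the state-initialization and step-by-step generation described above:

```python
import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Generates a caption word-by-word; visual features initialize the LSTM state."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # visual features -> h0
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # visual features -> c0
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, feat_dim) pooled image features; captions: (B, T) word indices
        h, c = self.init_h(feats), self.init_c(feats)
        logits = []
        for t in range(captions.size(1)):
            emb = self.embed(captions[:, t])     # previous (ground-truth) word
            h, c = self.lstm(emb, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)        # (B, T, vocab_size)

decoder = LSTMDecoder(vocab_size=1000)
scores = decoder(torch.randn(2, 512), torch.randint(0, 1000, (2, 5)))
print(scores.shape)  # torch.Size([2, 5, 1000])
```

At inference time the loop would instead feed back the argmax (or beam-search) word from the previous step, since ground-truth captions are unavailable.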


Section 04

Introduction and Optimization of Attention Mechanism

The basic CNN-LSTM has an information bottleneck: the entire image is compressed into a single fixed-length vector. The attention mechanism lets the decoder dynamically focus on different regions of the image. At each decoding step, it scores the relevance of each region's features to the current hidden state, normalizes the scores into a weight distribution, and forms a context vector by weighted summation. This establishes a correspondence between words and regions (e.g., attending to the animal region when generating the word "dog").
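The weighting step can be illustrated in isolation (NumPy assumed; dot-product scoring is used here for simplicity, whereas many implementations learn a small scoring network over the state and region features):

```python
import numpy as np

def attention(query, regions):
    """Soft attention: weight each image region by its relevance to the decoder state.

    query:   (d,)   current decoder hidden state
    regions: (k, d) feature vectors for k image regions
    returns: context vector (d,) and the attention weights (k,)
    """
    scores = regions @ query                         # relevance of each region
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> weight distribution
    context = weights @ regions                      # weighted sum of region features
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 512))  # e.g. a 7x7 grid of region features
query = regions[3]                    # a decoder state aligned with region 3
context, weights = attention(query, regions)
print(weights.argmax())  # region 3 receives the highest weight
```

Because the weights are recomputed at every step, each generated word can draw its context from a different part of the image.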


Section 05

Training Strategies and Loss Functions

Data Preparation and Preprocessing

Image-text paired datasets (e.g., Flickr8k, COCO Captions) are required. Text preprocessing includes building a vocabulary, converting words to indices, and padding sequences to a fixed length; image preprocessing includes resizing and pixel normalization.
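The vocabulary-building and indexing steps could be sketched like this (plain Python; the special tokens and helper names are illustrative conventions, not part of any specific dataset):

```python
from collections import Counter

def build_vocab(captions, min_freq=1):
    """Map words to indices; reserve special tokens for padding, start, end, unknown."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += [w for w, n in counts.most_common() if n >= min_freq]
    return {w: i for i, w in enumerate(vocab)}

def encode(caption, word2idx, max_len=10):
    """Convert a caption to a fixed-length index sequence with start/end/padding."""
    ids = [word2idx["<start>"]]
    ids += [word2idx.get(w, word2idx["<unk>"]) for w in caption.lower().split()]
    ids.append(word2idx["<end>"])
    ids += [word2idx["<pad>"]] * (max_len - len(ids))
    return ids[:max_len]

word2idx = build_vocab(["a dog runs", "a dog sleeps"])
print(encode("a dog jumps", word2idx, max_len=6))  # [1, 4, 5, 3, 2, 0]
```

Words below the frequency threshold or outside the training vocabulary map to `<unk>`, which keeps the output layer of the decoder at a manageable size.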

Loss Function and Optimization

Cross-entropy loss is used to maximize the likelihood of correct sequences. Training techniques include:

  • Teacher Forcing: feeding the ground-truth previous word (rather than the model's own prediction) as input, accelerating convergence
  • Learning rate scheduling: decaying the learning rate late in training for finer adjustment
  • Dropout regularization: preventing overfitting
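A toy training step combining these pieces might look like this (PyTorch assumed; shapes and values are arbitrary). With teacher forcing, the logits at step t were produced from the ground-truth word at step t-1, and padding positions are masked out of the loss:

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 captions, 5 time steps, vocabulary of 10 words.
vocab_size, pad_idx = 10, 0
logits = torch.randn(2, 5, vocab_size, requires_grad=True)  # decoder outputs
targets = torch.tensor([[4, 5, 6, 2, 0],                    # ground-truth next words
                        [4, 7, 2, 0, 0]])                   # (0 = <pad>)

# Cross-entropy over all time steps, maximizing the likelihood of the correct
# sequence; padded positions are ignored so they contribute no gradient.
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       targets.reshape(-1),
                       ignore_index=pad_idx)
loss.backward()
print(loss.item())  # a positive scalar; gradients at padded steps are zero
```

In a full loop this step would be wrapped with an optimizer (e.g., Adam) plus the learning-rate schedule and dropout mentioned above.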

Section 06

Evaluation Metrics and Quality Measurement

Automatic evaluation metrics:

  • BLEU: calculates n-gram overlap; BLEU-1 measures single-word matches, BLEU-4 measures 4-gram matches
  • METEOR: considers synonyms and word stems, correlating better with human judgment than BLEU
  • ROUGE: Focuses on recall
  • CIDEr: Designed for image captions, weights rare n-grams
  • SPICE: Captures semantics based on scene graph matching

Automatic metrics are only approximate estimates; final judgment of accuracy, fluency, and relevance requires human evaluation.
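The core counting step behind BLEU-1 can be sketched as follows (a simplified single-reference version; real evaluations average over multiple references and combine BLEU-1 through BLEU-4 with smoothing):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Modified unigram precision with a brevity penalty (a minimal BLEU-1 sketch)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference, so that
    # repeating a matching word cannot inflate the score.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty discourages trivially short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("a dog runs on grass", "a dog runs on the grass"), 3))  # 0.819
```

The clipping step is what makes the precision "modified": the degenerate candidate "the the the" scores only 1/3 against "the cat", since "the" is credited at most once.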


Section 07

Application Scenarios and Social Value

Assisting Visually Impaired Individuals

Converting camera images to voice descriptions helps visually impaired users understand their environment (e.g., Microsoft Seeing AI, Google Lookout).

Image Retrieval and Content Management

Automatic descriptions serve as semantic indexes, improving the accuracy and recall of text-based image retrieval.

Content Creation Assistance

Generating image captions, alt text, etc. improves efficiency and supports accessibility.


Section 08

Technical Limitations, Future Directions, and Summary

Limitations

  • Descriptions are general and lack details, tending to follow common patterns
  • Insufficient understanding of visual relationships (e.g., the action relationship of "riding")

Future Directions

  • Application of Transformer architecture (combining Vision Transformer with BERT/GPT)
  • Large-scale pre-training transfer (e.g., CLIP)
  • Controllable generation (specifying style, details)
  • Multi-modal fusion (combining audio and video)

Summary

The CNN-LSTM architecture is an important stage in image captioning technology. Its core ideas (encoder-decoder, attention, end-to-end training) remain the basic paradigm of the field and lay the foundation for subsequent models.