Zing Forum

Multimodal Emotion Recognition: A Comparative Study of Deep Learning Methods Fusing Visual and Speech Modalities

Comparative analysis of the performance of CNN, LSTM, GRU, and logistic regression in multimodal emotion recognition tasks, exploring best practices for fusing image and audio data.

Tags: multimodal emotion recognition · CNN · LSTM · GRU · deep learning · facial expression recognition · speech emotion recognition · FER2013 · RAVDESS · FastAPI
Published 2026-04-27 00:08 · Recent activity 2026-04-27 00:20 · Estimated read 3 min

Section 01

Introduction to Multimodal Emotion Recognition Research

This article compares the performance of CNN, LSTM, GRU, and logistic regression on multimodal emotion recognition tasks and explores best practices for fusing image (FER2013) and audio (RAVDESS) data, covering model comparison, engineering implementation, and application scenarios.

Section 02

Technical Background and Dataset Description

Emotion recognition is an important branch of artificial intelligence, and multimodal approaches capture the complexity of emotion more faithfully by fusing cues such as facial expressions and speech. The image modality uses the FER2013 dataset (facial images labeled with 7 basic emotions), and the audio modality uses the RAVDESS dataset (speech recordings covering 8 emotions).
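
To make the two inputs concrete, here is a minimal Python loading sketch. It assumes the common Kaggle CSV export of FER2013 (each row an integer label plus a space-separated 48x48 pixel string) and standard librosa defaults for RAVDESS; the 40-coefficient, 200-frame MFCC shape and all paths are illustrative assumptions, not details from the article.

```python
import numpy as np
import pandas as pd
import librosa

# FER2013 (Kaggle CSV export): each row holds an emotion label (0-6) and
# a 48x48 grayscale face stored as a space-separated pixel string.
def load_fer2013(csv_path):
    df = pd.read_csv(csv_path)
    images = np.stack([
        np.array(p.split(), dtype=np.float32).reshape(48, 48)
        for p in df["pixels"]
    ])
    return images[..., None] / 255.0, df["emotion"].to_numpy()

# RAVDESS: turn one speech clip into a fixed-size MFCC matrix so every
# model in the comparison sees the same input representation.
def audio_to_mfcc(wav_path, n_mfcc=40, max_frames=200):
    signal, sr = librosa.load(wav_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    mfcc = mfcc[:, :max_frames]              # truncate long clips
    pad = max_frames - mfcc.shape[1]
    return np.pad(mfcc, ((0, 0), (0, pad)))  # zero-pad short ones
```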

Section 03

Experimental Results of Model Architecture Comparison

Image modality: logistic regression reaches 18.04% accuracy, while a CNN improves this to 58.00%.
Audio modality: logistic regression 65.79%, LSTM 52.26%, GRU 57.14%; a 1D-CNN achieves the best result at 77.82%.
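
Since the 1D-CNN is the strongest audio model here, a minimal PyTorch sketch of that kind of architecture follows. The layer widths, kernel sizes, and pooling choices are assumptions for illustration, not the configuration behind the 77.82% result.

```python
import torch
import torch.nn as nn

# A small 1D-CNN over MFCC frames: coefficients are channels, and the
# convolutions slide along the time axis.
class AudioCNN1D(nn.Module):
    def __init__(self, n_mfcc=40, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one step
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):  # x: (batch, n_mfcc, frames)
        return self.classifier(self.features(x).squeeze(-1))

logits = AudioCNN1D()(torch.randn(4, 40, 200))  # -> shape (4, 8)
```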

Section 04

Key Findings and Technical Insights

1. Input representation determines model selection (see the layout sketch after this list).
2. The audio modality yields better recognition performance than the image modality.
3. The 1D-CNN outperforms RNN variants on audio tasks.
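
Finding 1 can be made concrete with a small layout example: the same MFCC matrix must be arranged differently for convolutional and recurrent models, so the chosen representation constrains which architectures fit naturally. The shapes below follow the illustrative 40x200 MFCC features from the sketch in Section 02.

```python
import torch

mfcc = torch.randn(4, 40, 200)  # (batch, n_mfcc, frames)

# Conv1d treats the 40 coefficients as input channels and slides kernels
# along time, while LSTM/GRU expect (batch, time, features) instead.
conv = torch.nn.Conv1d(in_channels=40, out_channels=64, kernel_size=5)
conv_out = conv(mfcc)                    # -> (4, 64, 196)

rnn = torch.nn.GRU(input_size=40, hidden_size=64, batch_first=True)
rnn_out, _ = rnn(mfcc.transpose(1, 2))   # -> (4, 200, 64)
```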

Section 05

Engineering Implementation and Application Scenarios

The project uses FastAPI for the backend and Next.js for the frontend, supporting real-time inference and visualization. Application scenarios include customer service, educational assistance, mental health monitoring, and human-computer interaction, though the model's generalization to real-world scenarios still needs to be verified.
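
As a sketch of what the FastAPI side might look like, here is a minimal audio inference endpoint. The route name, MFCC parameters, and the uniform-probability predict_proba stub are all assumptions; a real deployment would load the trained 1D-CNN in place of the stub.

```python
import io
import numpy as np
import librosa
from fastapi import FastAPI, UploadFile

app = FastAPI()

# The 8 RAVDESS emotion labels, in the dataset's documented coding order.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def predict_proba(mfcc):
    """Placeholder for the trained model; returns uniform scores here."""
    return np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))

@app.post("/predict/audio")
async def predict_audio(file: UploadFile):
    # Expects a .wav upload; librosa accepts a file-like object.
    signal, sr = librosa.load(io.BytesIO(await file.read()), sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
    probs = predict_proba(mfcc)
    best = int(np.argmax(probs))
    return {"emotion": EMOTIONS[best],
            "scores": dict(zip(EMOTIONS, map(float, probs)))}
```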

Section 06

Practical Recommendations and Research Limitations

Recommendations: explore modality fusion strategies, data augmentation, pre-trained models, and attention mechanisms. Limitations: the datasets come from controlled laboratory recordings, the model's generalization needs verification, and emotion recognition should serve as an auxiliary tool rather than an absolute basis for judgment.
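
As one concrete instance of the fusion recommendation, here is a minimal late-fusion sketch: each modality's classifier outputs class probabilities and a weighted average combines them. The shared 3-class label space and the 0.6 audio weight are illustrative assumptions (FER2013's 7 labels and RAVDESS's 8 differ, so a real system needs a mapped common label set); weighting audio higher is consistent with the results above.

```python
import numpy as np

# Weighted late fusion of per-modality class probabilities. The weight
# and the toy 3-class outputs are illustrative, not tuned values.
def late_fusion(p_audio, p_image, w_audio=0.6):
    return w_audio * p_audio + (1.0 - w_audio) * p_image

p_audio = np.array([0.1, 0.7, 0.2])
p_image = np.array([0.3, 0.4, 0.3])
fused = late_fusion(p_audio, p_image)
print(fused, fused.argmax())  # fused scores and the winning class index
```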