Zing Forum


MERS Multimodal Emotion Recognition System: A Deep Learning Approach Fusing Speech and Text

A multimodal emotion recognition framework based on the TESS dataset. It evaluates emotion recognition through three experimental setups: Conv1D-BiLSTM audio modeling, BERT text representation, and a late fusion network.

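Tags: multimodal learning, emotion recognition, deep learning, BERT, BiLSTM, speech processing, natural language processing, TESS dataset, artificial intelligence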
Published 2026-05-24 16:10 | Recent activity 2026-05-24 16:25 | Estimated read 7 min

Section 01

Introduction to the MERS Multimodal Emotion Recognition System

Core idea: MERS (Multimodal Emotion Recognition System) fuses speech and text for emotion recognition. Using the TESS dataset, it compares Conv1D-BiLSTM audio modeling, BERT text representation, and a late fusion network to explore the advantages of multimodal methods, with the aim of improving the accuracy and robustness of emotion recognition.

Project Source: Original author Rohan18999, published on GitHub (link: https://github.com/Rohan18999/emotion_detection), release date 2026-05-24.


Section 02

Project Background and Introduction to the TESS Dataset

Project Background and Motivation

Emotion recognition is a key technology for human-computer interaction and mental health monitoring. Traditional single-modal methods (speech only or text only) struggle to capture the full range of human emotional expression, which combines acoustic and semantic cues. The MERS project explores fusing the speech and text modalities to improve recognition performance.

Introduction to the TESS Dataset

The Toronto Emotional Speech Set (TESS) is a benchmark dataset in which two actresses speak a set of 200 target words in the carrier phrase "Say the word ___", covering seven emotions: anger, disgust, fear, happiness, neutral, sadness, and pleasant surprise. With high-quality recordings and accurate annotations, it provides a reliable foundation for model training.
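
As a rough illustration of how labels and text inputs might be derived from TESS, the sketch below parses a file name into speaker, word, and emotion. It assumes the `SPEAKER_word_emotion.wav` naming used in commonly distributed copies of the dataset (for example `OAF_back_angry.wav`, with `ps` for pleasant surprise) and rebuilds the transcript from the carrier phrase. The function name, the label strings, and the transcript reconstruction are illustrative assumptions, not code taken from the repository.

```python
from pathlib import Path
from typing import NamedTuple

# Assumed mapping from file-name suffix to emotion label.
# "ps" is the usual abbreviation for pleasant surprise.
EMOTION_CODES = {
    "angry": "anger",
    "disgust": "disgust",
    "fear": "fear",
    "happy": "happiness",
    "neutral": "neutral",
    "ps": "pleasant_surprise",
    "sad": "sadness",
}


class TessSample(NamedTuple):
    speaker: str
    word: str
    emotion: str
    transcript: str


def parse_tess_filename(path: str) -> TessSample:
    """Split an assumed 'SPEAKER_word_emotion.wav' name into its parts."""
    parts = Path(path).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name layout: {path}")
    speaker, word, code = parts
    try:
        emotion = EMOTION_CODES[code.lower()]
    except KeyError:
        raise ValueError(f"unknown emotion code: {code}") from None
    # Hypothetical transcript for the text branch, rebuilt from the carrier phrase.
    return TessSample(speaker, word, emotion, f"Say the word {word}")


sample = parse_tess_filename("data/OAF_back_angry.wav")
print(sample)
```

Files that do not follow this layout raise a `ValueError` rather than being silently mislabeled, which makes naming irregularities visible early.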


Section 03

Detailed Explanation of Three Experimental Architectures

MERS designs three experimental architectures:

  1. Speech Pipeline: Extracts MFCC features, then uses Conv1D layers to capture local acoustic patterns and a BiLSTM to model temporal dependencies in both directions, a combination well suited to speech sequences.
  2. Text Pipeline: Builds on the pre-trained bert-base-uncased model, captures semantic cues through contextual embeddings, and is fine-tuned end-to-end on TESS labels.
  3. Late Fusion Network: The core of the project. Speech and text are encoded separately, the resulting feature vectors are concatenated, and fully connected layers make the joint decision. This avoids having to align raw features of different shapes and time scales, as early fusion would require.
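
The fusion step described in the list above can be sketched without any framework. The example below concatenates one audio feature vector and one text feature vector, passes the result through two fully connected layers, and applies a softmax over the seven emotions. The vector sizes, the random weights, and all function names are placeholders chosen for illustration. A real implementation would most likely use a deep learning framework with trained weights, and the repository's actual layer sizes are not known.

```python
import math
import random
from typing import List, Sequence, Tuple

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "neutral", "sadness", "pleasant_surprise"]

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def dense(x: Sequence[float], weights: List[List[float]],
          bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output value per weight row."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError("weight rows must match the input length")
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(audio_vec: Sequence[float], text_vec: Sequence[float],
                        hidden: Layer, output: Layer) -> List[float]:
    """Concatenate both encodings, then classify with two dense layers."""
    fused = list(audio_vec) + list(text_vec)
    return softmax(dense(relu(dense(fused, *hidden)), *output))


def random_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Untrained placeholder weights, for demonstration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
audio_vec = [rng.uniform(-1, 1) for _ in range(4)]  # stand-in for the Conv1D-BiLSTM output
text_vec = [rng.uniform(-1, 1) for _ in range(3)]   # stand-in for the BERT embedding
hidden = random_layer(5, len(audio_vec) + len(text_vec), rng)
output = random_layer(len(EMOTIONS), 5, rng)

probs = late_fusion_predict(audio_vec, text_vec, hidden, output)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because each encoder only has to deliver a fixed-length vector, either branch can be swapped out without touching the fusion head, as long as the first dense layer's input size is updated to match.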

Section 04

Technical Highlights and Innovations of the Project

Technical Highlights

  1. Multimodal Complementarity: Speech captures how something is said (tone, rhythm), while text captures what is said (semantics). Combining them can, in principle, help when tone and wording disagree, as in sarcasm, although sarcasm is not among the TESS labels.
  2. Modular Design: The three pipelines are independent yet share a common structure, which makes it easier to evaluate each one separately, replace components, debug, and optimize.
  3. Reproducibility: Provides a requirements.txt so that the software environment can be recreated, which supports reproducing the experiments.
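
A pinned environment is only one part of reproducibility. Fixing random seeds is another common practice, and a minimal helper might look like the sketch below. Whether the repository does anything similar is not stated, so the `set_seed` function and its default value are assumptions. NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators that are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed, nothing to seed


set_seed(123)
first = random.random()
set_seed(123)
second = random.random()
print(first == second)
```

Seeding reduces run-to-run variation but does not by itself guarantee identical results on a GPU, where some operations are non-deterministic.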

Section 05

Experimental Results and Performance Inferences

Experimental Result Inferences

  • Single-modal Baselines: The speech pipeline is expected to perform well on emotions with distinctive acoustic features (e.g., anger, surprise), while the text pipeline is expected to do best on emotions that are clear from the wording.
  • Multimodal Improvement: Late fusion is expected to combine the strengths of both modalities and to perform better on easily confused emotion categories (e.g., happiness vs. surprise).
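
One way to check expectations like these once predictions are available is to compute per-class recall and list the most frequently confused label pairs. The sketch below does this with the standard library. The label lists are made-up toy values for illustration, not results from the repository, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str]) -> Counter:
    """Count (true label, predicted label) pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {label: counts[(label, label)] / n for label, n in totals.items()}


def most_confused_pairs(y_true: Sequence[str], y_pred: Sequence[str],
                        top_k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Misclassified (true, predicted) pairs, most frequent first."""
    counts = confusion_counts(y_true, y_pred)
    errors = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:top_k]


# Toy labels only, chosen to show a happiness/surprise mix-up.
y_true = ["happiness", "happiness", "surprise", "surprise", "anger", "anger"]
y_pred = ["happiness", "surprise", "surprise", "happiness", "anger", "anger"]

print(per_class_recall(y_true, y_pred))
print(most_confused_pairs(y_true, y_pred))
```

Running the same report for the speech-only, text-only, and fused models would show whether fusion actually reduces the confusions that each single modality makes.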

Note: The GitHub repository does not provide detailed performance figures; the above are reasonable inferences based on architecture design.


Section 06

Application Scenarios and Potential Value

Application Scenarios

  1. Customer Service Analysis: Recognize customer emotions in real time, flag emotionally charged calls, help agents adjust their approach, and analyze service quality.
  2. Mental Health Monitoring: Analyze patients' speech and text records, surface possible signs of depression or anxiety, and provide emotional trends to support diagnosis.
  3. Content Moderation and Recommendation: Identify emotionally harmful content, tune recommendation algorithms, and improve the health of the platform community.

Section 07

Current Limitations and Future Improvement Directions

Current Limitations

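  1. Dataset Limitation: TESS covers a single scenario (isolated words read in a fixed carrier phrase), which differs from natural conversation, and because every word is recorded in every emotion, the transcripts alone carry little emotional signal;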
  2. Language Limitation: Only supports English;
  3. Computational Cost: High inference cost for dual models.

Future Directions

  1. Integrate visual modality;
  2. Build lightweight models to reduce deployment costs;
  3. Cross-language transfer;
  4. Optimize real-time processing latency.

Section 08

Project Summary and Insights

The MERS project demonstrates the potential of multimodal deep learning in emotion recognition and offers a clear set of baselines and an extensible framework.

Insights:

  • For Practitioners: People express emotion through several channels at once, so handling complex emotions well generally calls for multimodal fusion;
  • For Researchers: Modular design (starting with single-modal baselines, then fusion) helps understand component contributions and locate problems.