Zing Forum

MELD.Raw: A Multimodal Sentiment Analysis Framework for English and Arabic Dialects

MELD.Raw is a deep learning framework that integrates three modalities—text, audio, and facial video—to support sentiment and emotion recognition for English and Arabic dialects. It implements three distinct architectures and has been evaluated on multiple benchmark datasets.

Tags: multimodal, sentiment analysis, emotion recognition, Arabic NLP, transformer, cross-modal attention, CMU-MOSI, MELD
Published 2026-04-06 05:02 | Recent activity 2026-04-06 05:25 | Estimated read 6 min

Section 01

MELD.Raw: A Multimodal Sentiment Analysis Framework for English and Arabic Dialects (Introduction)

MELD.Raw is a deep learning framework developed by Kareem Waly that integrates three modalities—text, audio, and facial video—to support sentiment and emotion recognition for English and Arabic dialects. The framework implements three complementary architectures and has been evaluated on CMU-MOSI, MELD, and a custom Arabic dataset. It not only provides high-performance English models but also reveals the challenges of low-resource Arabic multimodal research.

Section 02

Project Background and Research Motivation

Sentiment analysis is a key task in natural language processing, but text-only methods struggle to capture the full picture of human emotions. In daily communication, non-verbal cues like tone, speech rate, and facial expressions convey rich emotional information. Multimodal sentiment analysis addresses this by analyzing text, audio, and visual signals simultaneously. MELD.Raw focuses on supporting English and understudied Arabic dialects, aiming to explore effective multimodal fusion solutions.

Section 03

Three Architecture Designs

The project optimizes three architectures for different tasks and datasets:

  1. Enhanced Transformer Encoder (CMU-MOSI): Uses a cross-modal attention mechanism. Text is processed with DeBERTa-v3-base, audio with Whisper-base, and video with ViT-base-patch16. It achieves 80.06% accuracy and an F1 score of 0.8012 on the CMU-MOSI test set.
  2. Dual-Task Projection Fusion Model (MELD): Handles 7-class emotion recognition and 3-class sentiment classification simultaneously. Modality features are mapped through linear projection layers and then concatenated for fusion. Emotion classification accuracy is 62.87% and sentiment classification accuracy is 68.93%.
  3. Arabic Cross-Modal Transformer: Designed for Arabic dialects. Uses 4-head attention, label smoothing, and a class-balanced loss to cope with the small dataset. Text is processed with Arabic BERT, audio with enhanced MFCC features, and video with OpenCV followed by PCA dimensionality reduction.
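The cross-modal attention used by the first and third architectures can be illustrated with a minimal single-head sketch in NumPy. This is not the project's code: the function names, the random weights, and the feature sizes (768 for text, 512 for audio, 64 for the shared attention space) are assumptions chosen only to show how tokens from one modality query frames from another.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head attention: queries come from one modality,
    keys and values from another."""
    q = query_seq @ w_q                       # (T_query, d_model)
    k = context_seq @ w_k                     # (T_context, d_model)
    v = context_seq @ w_v                     # (T_context, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_query, T_context)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_text, d_audio, d_model = 768, 512, 64       # illustrative sizes (assumption)

text = rng.standard_normal((12, d_text))      # 12 text tokens (random stand-in)
audio = rng.standard_normal((50, d_audio))    # 50 audio frames (random stand-in)

# Randomly initialised projections; a trained model would learn these.
w_q = rng.standard_normal((d_text, d_model)) / np.sqrt(d_text)
w_k = rng.standard_normal((d_audio, d_model)) / np.sqrt(d_audio)
w_v = rng.standard_normal((d_audio, d_model)) / np.sqrt(d_audio)

attended, weights = cross_modal_attention(text, audio, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each text token ends up with an audio-informed representation of size `d_model`. A multi-head version, such as the 4-head setup mentioned for the Arabic model, would run several such projections in parallel and concatenate the results.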

Section 04

Datasets and Experimental Results

The framework was tested on three datasets:

| Dataset | Source | Sample Count | Modalities | Language | Best Results |
|---|---|---|---|---|---|
| CMU-MOSI | CMU MultiComp Lab | 2199 | Text/Audio/Video | English | 80.06% accuracy, F1: 0.8012 |
| MELD | SenticNet Lab | 13707 | Text/Audio/Video | English | Emotion: 62.87%, Sentiment: 68.93% (accuracy) |
| AMSAER | Custom | 412 | Text/Audio/Video | Arabic Dialect | 39.68% accuracy, F1: 0.3766 |

Performance on the Arabic experiments is low mainly because the dataset is small (only 288 training samples), which exposes the shortage of Arabic multimodal corpora as the central bottleneck.
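With so few training samples, the Arabic model relies on label smoothing and a class-balanced loss. The sketch below shows one simple way these two ideas can be combined, using only the Python standard library. The exact weighting scheme and smoothing factor used in MELD.Raw are not stated, so the inverse-frequency weights, `epsilon=0.1`, the three-class setup, and the class counts are all assumptions made for illustration (the counts are invented and merely sum to 288).

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count),
    so rarer classes contribute more to the loss."""
    total = sum(class_counts)
    n = len(class_counts)
    return [total / (n * c) for c in class_counts]

def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Move epsilon of the probability mass from the true class
    to a uniform distribution over all classes."""
    base = epsilon / num_classes
    return [base + (1.0 - epsilon if i == true_class else 0.0)
            for i in range(num_classes)]

def weighted_smoothed_cross_entropy(logits, true_class, class_weights, epsilon=0.1):
    """Cross-entropy against smoothed targets, scaled by the true class weight."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    log_probs = [x - log_z for x in logits]
    target = smooth_labels(true_class, len(logits), epsilon)
    loss = -sum(t * lp for t, lp in zip(target, log_probs))
    return class_weights[true_class] * loss

# Hypothetical class counts for a 3-class problem (not the real AMSAER split).
counts = [160, 96, 32]
class_weights = inverse_frequency_weights(counts)

# A sample from the rarest class that the model currently gets wrong.
loss = weighted_smoothed_cross_entropy([2.0, 0.5, -1.0], true_class=2,
                                       class_weights=class_weights)
print(class_weights, loss)
```

Under these assumed counts the rarest class receives five times the weight of the most frequent one, so mistakes on it dominate the gradient. Smoothing keeps the model from becoming overconfident on a few hundred examples.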

Section 05

Key Findings and Research Contributions

Key Findings

  • The cross-modal Transformer outperforms simple feature concatenation (as shown in the CMU-MOSI results);
  • Dual-task learning (emotion + sentiment) is feasible and mutually beneficial;
  • Arabic multimodal NLP faces a severe data shortage, and audio/visual cues are crucial for resolving text ambiguity.

Contributions

  • Provides English-Arabic comparison benchmarks;
  • Validates the feasibility of dual-task learning;
  • Reveals the challenges of low-resource languages;
  • Offers complete, reproducible code.
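The dual-task setup described for the MELD model (linear projection per modality, concatenation, then separate 7-class emotion and 3-class sentiment outputs) can be sketched roughly as follows. This is a forward pass only, with random weights and random features. The projection size of 128, the input feature sizes, the helper names, and the equal loss weighting `alpha = 0.5` are assumptions, not values taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Return a randomly initialised (weight, bias) pair."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim), np.zeros(out_dim)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()

dims = {"text": 768, "audio": 512, "video": 768}   # illustrative sizes (assumption)
proj_dim = 128                                     # assumed projection size

proj = {name: linear(d, proj_dim) for name, d in dims.items()}
emotion_head = linear(len(dims) * proj_dim, 7)     # 7 emotion classes
sentiment_head = linear(len(dims) * proj_dim, 3)   # 3 sentiment classes

def forward(features):
    # Project each modality to a common size, then concatenate.
    projected = [features[m] @ proj[m][0] + proj[m][1] for m in dims]
    fused = np.concatenate(projected, axis=-1)
    # Both heads read the same fused representation.
    emotion_logits = fused @ emotion_head[0] + emotion_head[1]
    sentiment_logits = fused @ sentiment_head[0] + sentiment_head[1]
    return emotion_logits, sentiment_logits

batch = 4
features = {m: rng.standard_normal((batch, d)) for m, d in dims.items()}
emotion_labels = np.array([0, 3, 6, 2])            # made-up labels
sentiment_labels = np.array([1, 0, 2, 1])          # made-up labels

emo_logits, sent_logits = forward(features)
alpha = 0.5                                        # assumed task weighting
loss = (alpha * cross_entropy(emo_logits, emotion_labels)
        + (1 - alpha) * cross_entropy(sent_logits, sentiment_labels))
print(emo_logits.shape, sent_logits.shape, loss)
```

Because both heads share the fused representation, gradients from either task update the same projection layers. That sharing is the usual mechanism by which the two tasks can help each other.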

Section 06

Application Scenarios and Future Directions

Application Scenarios

  • Customer service quality monitoring (analyzing dialogue text, tone, and facial expressions);
  • Content moderation (identifying negative emotions in videos);
  • Mental health screening (detecting signals of depression or anxiety);
  • Arabic social media sentiment analysis.

Future Directions

  • Collect larger Arabic multimodal corpora;
  • Explore semi-supervised and self-supervised learning to make use of unlabeled data;
  • Study English-Arabic cross-lingual transfer;
  • Optimize model efficiency for resource-constrained environments.