# Practical Multimodal Emotion Recognition: How Audio-Text Fusion Achieves 47.92% Accuracy

> An open-source multimodal emotion recognition project demonstrates how to combine an audio CNN, Whisper speech transcription, and a DistilBERT text model to reach 47.92% recognition accuracy on the RAVDESS dataset with a late-fusion strategy, providing a complete engineering reference for speech emotion analysis.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-12T18:53:21.000Z
- Last activity: 2026-05-12T19:20:37.254Z
- Heat: 163.6
- Keywords: multimodal emotion recognition, speech emotion analysis, RAVDESS dataset, audio CNN, Whisper, DistilBERT, late fusion, data augmentation, machine learning, deep learning
- Page URL: https://www.zingnex.cn/en/forum/thread/47-92
- Canonical: https://www.zingnex.cn/forum/thread/47-92
- Markdown source: floors_fallback

---

## Introduction: Audio-Text Fusion at 47.92% Accuracy

This open-source project combines an audio CNN, Whisper speech transcription, and a DistilBERT text model, reaching 47.92% emotion recognition accuracy on the RAVDESS dataset with a late-fusion strategy. It systematically compares unimodal and multimodal methods, evaluates with an actor split, and serves as a complete engineering reference for speech emotion analysis.

## Challenges of Multimodal Emotion Recognition and Characteristics of the RAVDESS Dataset

Human emotion is expressed through multiple modalities at once, so traditional unimodal methods (acoustic-only or text-only) are inherently limited. The RAVDESS dataset is a standard benchmark for emotion recognition, containing recordings of 8 emotions performed by 24 actors. Because every actor speaks the same fixed scripts, the lexical content carries little emotional signal and emotion must be inferred mainly from acoustic features, which makes the dataset a controlled scenario for multimodal research.

## Three-Branch Modular Architecture and Late Fusion Strategy

The project uses three branches:

1. Audio CNN branch: converts speech to a Mel spectrogram and learns time-frequency patterns with a CNN;
2. Text RNN branch: transcribes speech with Whisper tiny.en and processes the transcript with a bidirectional GRU;
3. DistilBERT branch: an additional experiment that replaces the GRU with DistilBERT.

The three branches are trained independently, and their predictions are then combined by late fusion (a weighted average of class probabilities); the sketch below shows the two branch front-ends.
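As a rough illustration of the branch inputs, here is a minimal Python sketch of the two front-ends. The file name, sample rate, and `n_mels` value are illustrative assumptions, not the project's actual settings; only the use of Whisper tiny.en and Mel spectrograms is taken from the post.

```python
# Minimal sketch of the branch front-ends (parameter values are assumptions).
import librosa
import whisper  # openai-whisper

# Audio CNN branch input: a log-Mel spectrogram treated as an image.
y, sr = librosa.load("clip.wav", sr=16000)            # assumed sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                    # shape (n_mels, frames)

# Text branch input: a Whisper tiny.en transcription, later tokenized for the
# bidirectional GRU or for DistilBERT.
asr = whisper.load_model("tiny.en")
text = asr.transcribe("clip.wav")["text"]
```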

## Experimental Results: Data Augmentation Boosts Performance Significantly, Fusion Reaches 47.92% Accuracy

| Configuration | Accuracy |
| --- | --- |
| Audio CNN (baseline) | 38.33% |
| Text GRU only | 16.25% |
| Audio CNN with data augmentation | 46.67% |
| Augmented audio + DistilBERT fusion | 47.92% (Macro F1: 44.38%) |

Data augmentation accounts for most of the improvement: it lifts the audio CNN by 8.34 points on its own, while fusion adds a further 1.25 points.
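The post does not list the exact transforms behind the augmentation gain, so the following is a hedged sketch of typical waveform-level augmentation for an audio CNN of this kind; the noise level, pitch range, and stretch range are all assumptions.

```python
# Hedged sketch of waveform augmentation; all parameter ranges are assumptions.
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    y = y + rng.normal(0.0, 0.005, size=y.shape)       # light additive noise
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    return y

# Usage: y_aug = augment(y, sr, np.random.default_rng(0))
```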

## Actor-Split Evaluation: More Realistically Reflects Model Generalization Ability

The project uses an actor split (no speaker in the test set appears in the training set) rather than a random split. This forces the model to learn speaker-independent emotional features instead of memorizing individual actors' styles, matches real deployment conditions more closely, and makes the evaluation more rigorous; a sketch of such a split follows.
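RAVDESS file names encode the actor ID as the last hyphen-separated field (e.g. `03-01-06-01-02-01-12.wav` is actor 12), so an actor split can be done on file names alone. The data directory and the choice of held-out actors below are illustrative assumptions, not the project's actual split.

```python
# Sketch of an actor split; the held-out actor IDs are an assumption.
from pathlib import Path

TEST_ACTORS = {21, 22, 23, 24}  # speakers never seen during training

def actor_of(path: Path) -> int:
    return int(path.stem.split("-")[-1])  # last field of the RAVDESS name

files = sorted(Path("ravdess").rglob("*.wav"))
train_files = [f for f in files if actor_of(f) not in TEST_ACTORS]
test_files = [f for f in files if actor_of(f) in TEST_ACTORS]
```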

## Fusion Strategy Experiments: Simple Average Performs Well

Three strategies were tried: simple probability averaging, weighted averaging, and maximum confidence. On RAVDESS, the simple average performed stably, with no need for complex dynamic weight adjustment; the three rules are sketched below.
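To make the comparison concrete, here is a sketch of the three fusion rules over per-branch probability matrices of shape `(n_samples, n_classes)`; the `w_audio` value is an illustrative assumption.

```python
# Sketch of the three late-fusion rules; w_audio's value is an assumption.
import numpy as np

def fuse(audio_probs, text_probs, strategy="average", w_audio=0.6):
    if strategy == "average":           # simple mean of the two branches
        fused = (audio_probs + text_probs) / 2.0
    elif strategy == "weighted":        # fixed weighted average
        fused = w_audio * audio_probs + (1.0 - w_audio) * text_probs
    elif strategy == "max_confidence":  # per sample, trust the surer branch
        pick_audio = audio_probs.max(axis=1) >= text_probs.max(axis=1)
        fused = np.where(pick_audio[:, None], audio_probs, text_probs)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return fused.argmax(axis=1)         # predicted class per sample
```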

## Practical Application Insights and Reference Value of the Open-Source Project

- Data augmentation improves the robustness of audio models.
- The value of multimodal fusion depends on how complementary the modalities' information actually is.
- The project ships complete code and documentation, making it easy to reproduce and a useful methodological reference for similar studies.

## Project Summary and Future Outlook for Emotion Recognition

This project demonstrates the potential of multimodal fusion: 47.92% accuracy is respectable given the dataset's difficulty and the actor-split protocol. Open-source work like this accelerates the development of emotion recognition, and future multimodal large models are expected to push accuracy and robustness further.
