# Multimodal Emotion Recognition: Practical Exploration of ResNet-50 and CLIP Fusion

> This article introduces a multimodal emotion recognition framework combining ResNet-50 visual features with CLIP text embeddings, using a late fusion strategy, providing a practical reference for cross-modal learning.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-26T17:11:02.000Z
- Last activity: 2026-05-26T17:22:03.076Z
- Popularity: 148.8
- Keywords: multimodal learning, emotion recognition, ResNet-50, CLIP, late fusion, computer vision, natural language processing
- Page link: https://www.zingnex.cn/en/forum/thread/resnet-50clip
- Canonical: https://www.zingnex.cn/forum/thread/resnet-50clip

---

## [Introduction] Practical Exploration of Multimodal Emotion Recognition with ResNet-50 and CLIP Fusion

The project was built as a course project for HAICAI 2026 and released by makisb on GitHub (https://github.com/makisb/multimodal-emotion-recognition). Its core idea is a dual-branch model: visual and text inputs are processed separately, and the two branches' predictions are then combined by weighted fusion.

## Project Background and Research Significance

Emotion recognition is an important direction in AI, but traditional single-modal methods (images only or text only) are limited because human emotional expression is inherently multimodal: a smile paired with sarcastic text, for example, can only be interpreted by considering both. By processing visual and text information together, multimodal learning can reach a more complete understanding of emotion. This HAICAI 2026 course project is a practical exploration of that idea.

## Technical Architecture: Dual-Branch Late Fusion Design

### Visual Branch: ResNet-50
ResNet-50 serves as the visual branch: it takes 224×224 images as input and outputs emotion classification logits. On its own it reaches 57.0% accuracy and a macro-average F1 of 57.4%, which makes it the stronger of the two branches and the baseline for fusion.

### Text Branch: CLIP ViT-B/32
The text branch uses OpenAI's CLIP model (ViT-B/32 version) to extract text embeddings. On its own it performs poorly, with 23.8% accuracy and a macro-average F1 of only 17.3%, which suggests that text alone carries limited signal for fine-grained emotion classification in this setup.

### Late Fusion Strategy
The framework uses late fusion: each modality makes its prediction independently, and the predictions are then combined with fixed weights of 0.6 (visual) and 0.4 (text). The fused model scores exactly the same as the visual-only model: 57.0% accuracy and 57.4% macro-average F1.
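The weighted combination can be illustrated with a minimal sketch. It assumes that fusion averages the two branches' softmax probabilities; the article does not say whether the project fuses probabilities or raw logits, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence

W_VISUAL, W_TEXT = 0.6, 0.4  # fusion weights reported in the article


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fuse(
    visual_logits: Sequence[float],
    text_logits: Sequence[float],
    w_visual: float = W_VISUAL,
    w_text: float = W_TEXT,
) -> List[float]:
    """Weighted average of the two branches' class probabilities."""
    p_visual = softmax(visual_logits)
    p_text = softmax(text_logits)
    return [w_visual * pv + w_text * pt for pv, pt in zip(p_visual, p_text)]


def predict(probs: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    visual = [2.0, 0.5, 0.1]   # visual branch clearly prefers class 0
    unsure = [0.1, 0.2, 0.1]   # text branch is close to uniform
    certain = [0.0, 5.0, 0.0]  # text branch strongly prefers class 1
    print(predict(late_fuse(visual, unsure)))   # visual decision stands
    print(predict(late_fuse(visual, certain)))  # confident text flips it
```

Under this assumption, a text branch whose probabilities are close to uniform rarely changes the arg-max of the visual branch, which is one plausible reason the fused metrics match the visual-only ones.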

## Implementation Details: Code and Experimental Process

### Project Structure
Core files include:
- `HAICAI_2026.ipynb`: Main notebook with complete workflow
- `README.md`: Project documentation
- `requirements.txt`: Python dependencies

### Dependency Management
The project depends on mainstream deep learning libraries (PyTorch, Torchvision, Transformers) and on OpenAI CLIP, which must be installed from source.

### Experimental Process
The experiment follows the standard ML workflow: data loading and preprocessing → feature extraction → cross-modal pairing → model training and evaluation → performance metric calculation. Running these steps in a fixed order helps keep the results reproducible.

## Analysis of Experimental Results

The experiments compared three configurations:

| Model | Accuracy | Macro-average F1 |
|-------|----------|------------------|
| Pure Visual (ResNet-50) | 57.0% | 57.4% |
| Pure Text (CLIP) | 23.8% | 17.3% |
| Multimodal (Late Fusion) | 57.0% | 57.4% |

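**Analysis**:
1. The visual modality performs far better than text (57.0% vs. 23.8% accuracy), which is consistent with the nature of emotion recognition: facial expressions reflect emotions more directly;
2. Fusion changed neither accuracy nor macro-average F1 relative to the visual-only model, so the text branch added no measurable gain at weights 0.6/0.4; improving on this may require a better dataset or a different fusion strategy;
3. For the visual and fused models, macro-average F1 (57.4%) is close to accuracy (57.0%), which suggests that performance is not concentrated in a few emotion categories. The text branch's macro-average F1 (17.3%) falls well below its accuracy (23.8%), which points to uneven performance across categories.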
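To make the relationship between accuracy and macro-average F1 in the table concrete, here is a small sketch of both metrics in plain Python. The function names and the toy labels are illustrative assumptions, not taken from the project notebook.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


if __name__ == "__main__":
    # Toy data: class 0 is the majority, and the model always predicts it.
    y_true = [0, 0, 0, 0, 1, 1, 2, 2]
    y_pred = [0] * len(y_true)
    print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
    print(f"macro-F1 = {macro_f1(y_true, y_pred, num_classes=3):.3f}")
```

Because macro-average F1 weights every class equally, a model that ignores some classes scores well below its accuracy, while a model with similar per-class quality keeps the two numbers close.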

## Limitations and Future Optimization Directions

The current framework leaves several directions open for improvement:
1. **Early Fusion Architecture**: Achieve deeper modal interaction at the feature level;
2. **Attention Fusion**: Dynamically adjust modal contribution weights instead of fixed weights;
3. **Hyperparameter Optimization**: Systematically search for better fusion weights;
4. **Larger Dataset**: Solve the data volume bottleneck;
5. **Vision Transformer**: Replace ResNet-50 to explore a better visual encoder.
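As one example, the hyperparameter item above could start with a simple grid search over the visual weight. The sketch below assumes per-sample class probabilities from both branches on a held-out validation split; all names and the toy data are hypothetical.

```python
from typing import Sequence, Tuple

Probs = Sequence[Sequence[float]]  # one probability vector per sample


def fused_accuracy(p_visual: Probs, p_text: Probs, labels: Sequence[int], w_visual: float) -> float:
    """Accuracy of the fused prediction; the text branch gets weight 1 - w_visual."""
    correct = 0
    for pv, pt, label in zip(p_visual, p_text, labels):
        fused = [w_visual * a + (1.0 - w_visual) * b for a, b in zip(pv, pt)]
        correct += max(range(len(fused)), key=fused.__getitem__) == label
    return correct / len(labels)


def search_fusion_weight(
    p_visual: Probs, p_text: Probs, labels: Sequence[int], steps: int = 10
) -> Tuple[float, float]:
    """Try w_visual = 0, 1/steps, ..., 1 and return (best_weight, best_accuracy)."""
    best_w, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        acc = fused_accuracy(p_visual, p_text, labels, w)
        if acc > best_acc:  # keep the first weight that reaches the best score
            best_w, best_acc = w, acc
    return best_w, best_acc


if __name__ == "__main__":
    # Toy validation set with two classes and three samples.
    p_visual = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    p_text = [[0.35, 0.65], [0.1, 0.9], [0.5, 0.5]]
    labels = [0, 1, 1]
    print(search_fusion_weight(p_visual, p_text, labels))
```

The weight should be chosen on validation data rather than the test set, so that the reported test metrics remain an unbiased estimate.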

The project's current technical choices prioritize stability and interpretability, which suits an academic setting.

## Practical Application Value and Conclusion

### Application Value
- For multimodal learning beginners: Provides a complete workflow (environment configuration → model evaluation) with clear and reproducible code;
- For HAICAI 2026 students: An excellent case for practicing multimodal technology;
- Practical scenarios: Social media analysis (image-text emotion judgment), customer service (voice + expression evaluation), mental health monitoring (multimodal emotion tracking).

### Conclusion
Although this project is not large-scale, it clearly demonstrates the basic paradigm of multimodal learning: select single-modal encoders → design fusion strategies → experimental evaluation. Understanding basic concepts is more important than chasing the latest models. The open-source code of the project provides an extensible benchmark for the community, and we look forward to more improvements and innovations.