Zing Forum

MMSU: A New Benchmark for Evaluating Social Intelligence of Multimodal Large Language Models

MMSU is a benchmark designed specifically to evaluate the social intelligence of multimodal large language models. It addresses a gap in current AI evaluation: the lack of measures for social cognitive abilities.

Tags: multimodal models, social intelligence, benchmark, emotion recognition, human-computer interaction
Published 2026-05-05 19:55 | Recent activity 2026-05-05 20:22 | Estimated read 7 min

Section 01

MMSU: Introduction to the New Benchmark for Evaluating Social Intelligence of Multimodal Large Language Models

MMSU (Multimodal Social Understanding) is a benchmark for evaluating the social intelligence of multimodal large language models, an ability that current AI evaluation suites rarely measure. It provides a systematic framework for assessing how well models understand and reason about complex social scenarios, covering dimensions such as emotion recognition and social context reasoning. Preliminary evaluations show that mainstream models have significant shortcomings in social intelligence, which makes the benchmark useful for AI research, development, and industry applications.

Section 02

Background and Motivation: Limitations of Existing MLLM Evaluations

Current evaluations of multimodal large language models (MLLMs) focus mainly on traditional tasks such as visual question answering and image captioning. The social intelligence that humans rely on in daily communication, such as understanding sarcasm, recognizing emotions, and inferring intentions, is rarely covered by existing evaluations. These abilities are crucial for building natural human-computer interaction systems, and the MMSU project was created to fill this gap.

Section 03

Core Social Intelligence Evaluation Dimensions of MMSU

The MMSU dataset covers multiple social intelligence dimensions:

  • Emotion Recognition and Understanding: Recognize emotions from facial expressions, body language, and speech intonation
  • Social Context Reasoning: Understand behavioral norms, role relationships, and interaction patterns in social situations
  • Sarcasm and Humor Detection: Identify irony, puns, and humorous elements
  • Intention Inference: Infer real intentions and potential motivations from limited information
  • Cultural and Social Norms: Understand social etiquette and norms across different cultural backgrounds

Section 04

Technical Architecture and Design Principles of MMSU

MMSU adopts strict evaluation design principles:

  1. Multimodal Fusion: Questions require simultaneous processing of visual and textual information
  2. Distractor Design: Incorrect options are highly misleading and require true social understanding to distinguish
  3. Cross-Cultural Coverage: Includes scenarios from different cultural backgrounds to avoid Western-centric bias
  4. Difficulty Stratification: Forms a progressive difficulty curve from basic emotion recognition to complex social reasoning
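
One way to picture these four principles is as fields on a single multiple-choice item. The sketch below is a hypothetical schema, not MMSU's actual data format: the class name `SocialItem`, its field names, the three-level difficulty scale, and the `validate` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SocialItem:
    """Hypothetical multiple-choice item; not MMSU's real schema."""
    image_path: str      # visual input (multimodal fusion)
    question: str        # textual input (multimodal fusion)
    options: List[str]   # one correct answer plus misleading distractors
    answer_index: int    # position of the correct option in `options`
    culture: str         # assumed free-form tag for cross-cultural coverage
    difficulty: int      # assumed scale: 1 (basic emotion) to 3 (complex reasoning)


def validate(item: SocialItem) -> List[str]:
    """Return a list of problems; an empty list means the item is well-formed."""
    problems = []
    if not item.image_path or not item.question:
        problems.append("item must carry both an image and a question")
    if len(item.options) < 2:
        problems.append("need at least one distractor besides the answer")
    if len(set(item.options)) != len(item.options):
        problems.append("options must be distinct")
    if not 0 <= item.answer_index < len(item.options):
        problems.append("answer_index out of range")
    if item.difficulty not in (1, 2, 3):
        problems.append("difficulty must be 1, 2, or 3")
    return problems
```

A check like this only enforces structure. Whether a distractor is genuinely misleading still has to be judged by human annotators.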

Section 05

Preliminary Evaluation Results of MMSU: Social Intelligence Shortcomings of Mainstream Models

Preliminary evaluations based on MMSU found:

  • Even the best models score far lower on social intelligence tasks than on traditional visual tasks
  • Models have systematic defects in understanding subtle emotions and non-literal language
  • Generalization to cross-cultural social scenarios is generally weak
  • Increasing model scale does not automatically bring a matching improvement in social intelligence

These findings indicate that social intelligence requires specialized design and training strategies.
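
Findings like these depend on breaking accuracy down by dimension instead of reporting a single overall score. The following is a minimal sketch of that aggregation, assuming each result is a `(label, is_correct)` pair. The function name `accuracy_by_group` and the sample data are illustrative and are not taken from the MMSU evaluation code or its results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per group label from (label, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, is_correct in results:
        total[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / total[label] for label in total}


# Made-up results for illustration only; not real MMSU scores.
sample = [
    ("emotion", True), ("emotion", True), ("emotion", False), ("emotion", True),
    ("sarcasm", False), ("sarcasm", True), ("sarcasm", False), ("sarcasm", False),
]
scores = accuracy_by_group(sample)
print(scores)  # {'emotion': 0.75, 'sarcasm': 0.25}
```

The same function can group by culture or difficulty tag by passing that tag as the label, which is how a cross-cultural gap would show up in a report.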

Section 06

Practical Significance and Application Prospects of MMSU

MMSU offers value to several groups in the AI field:

  • Researchers: A standardized evaluation tool for identifying models' social cognitive defects and guiding improvements
  • Developers: Scores that help determine whether a model is suitable for scenarios requiring in-depth social understanding (e.g., virtual assistants, educational robots)
  • Industry: A push for AI to evolve from "able to converse" to "able to understand conversations", enhancing user experience and trust

Section 07

Usage and Community Contribution of MMSU

The MMSU project is fully open-source: researchers and developers can obtain the dataset, evaluation code, and benchmark results via GitHub. The project encourages the community to contribute diverse social scenario samples, especially cases from non-Western cultural backgrounds, to make the evaluation more comprehensive and fair.

Section 08

Conclusion: Social Intelligence is a Key Component of General AI

Social intelligence is key to artificial intelligence moving toward general intelligence. MMSU provides a "health check report" for current multimodal models and points the way for the design of next-generation models. We look forward to more empathetic AI systems that can handle complex social environments.