
S-Bench: A Benchmark for Evaluating Social Intelligence of Multimodal Large Language Models

The first comprehensive benchmark suite dedicated to evaluating the social intelligence capabilities of multimodal large language models

Tags: benchmark, social intelligence, multimodal, evaluation, theory of mind, emotion recognition
Published 2026-03-29 18:01 · Recent activity 2026-03-29 18:20 · Estimated read: 6 min

Section 01

Introduction

S-Bench is the first comprehensive benchmark suite dedicated to evaluating the social intelligence of multimodal large language models. To address the limitations of existing evaluations, it covers dimensions such as theory of mind, emotion recognition, and social norms, combining multimodal inputs with multi-dimensional evaluation metrics. It provides a standardized tool for model development, product selection, and academic research, and its open-source community is intended to drive future directions such as cross-cultural expansion and dynamic interactive evaluation.


Section 02

Background: Limitations of Existing Evaluations and the Necessity of Multimodal Social Intelligence

Limitations of Existing Evaluations

Traditional LLM evaluations focus on factual knowledge (e.g., MMLU), reasoning ability (e.g., GSM8K), and language skills, but fail to assess performance in real social scenarios, such as understanding sarcasm or reading microexpressions.

Necessity of Multimodality

Social interaction is inherently multimodal (language + facial expressions + body language + tone of voice), so evaluating social intelligence requires models to process text, images, video, and audio together.


Section 03

Methodology: Core Design and Technical Implementation of S-Bench

Evaluation Dimensions

  1. Theory of Mind: Inferring intentions, false beliefs, decision differences
  2. Emotion Recognition: Facial/voice/text emotions, complex emotional states
  3. Social Norms: Appropriate behavior, cultural etiquette, consequences of violations
  4. Interpersonal Reasoning: Relationship types, power structures, social strategies
  5. Moral Judgment: Dilemma analysis, fairness, cross-cultural differences
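The five dimensions above suggest a natural way to organize benchmark items so that scores can be reported per dimension. The sketch below is a hypothetical schema, assuming names like `Dimension` and `Task` that are not part of S-Bench's actual release:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical schema for organizing S-Bench-style tasks by dimension.
# All identifiers here are illustrative assumptions, not the benchmark's API.

class Dimension(Enum):
    THEORY_OF_MIND = "theory_of_mind"
    EMOTION_RECOGNITION = "emotion_recognition"
    SOCIAL_NORMS = "social_norms"
    INTERPERSONAL_REASONING = "interpersonal_reasoning"
    MORAL_JUDGMENT = "moral_judgment"

@dataclass
class Task:
    dimension: Dimension
    modalities: list          # e.g. ["text", "image"] or ["video", "audio"]
    prompt: str
    reference_answer: str

def group_by_dimension(tasks):
    """Bucket tasks so per-dimension scores can be reported separately."""
    buckets = {d: [] for d in Dimension}
    for t in tasks:
        buckets[t.dimension].append(t)
    return buckets
```

Grouping tasks this way is what makes the fine-grained, per-dimension results discussed later (e.g., for model development guidance) possible.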

Dataset Construction

Diversity (age, gender, culture), authenticity of scenarios, a graded difficulty spectrum, and anti-contamination measures
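A common way to implement the anti-contamination goal is to flag test items whose word n-grams overlap heavily with public training corpora. The summary does not specify S-Bench's actual procedure, so the following is a minimal sketch of that general technique:

```python
# Minimal sketch of an n-gram overlap contamination check.
# This is a standard technique, not S-Bench's documented procedure.

def ngrams(text, n=8):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(item_text, corpus_texts, n=8, threshold=0.5):
    """Flag an item if enough of its n-grams appear in any corpus document."""
    item = ngrams(item_text, n)
    if not item:  # item too short to form any n-gram
        return False
    for doc in corpus_texts:
        overlap = len(item & ngrams(doc, n)) / len(item)
        if overlap >= threshold:
            return True
    return False
```

Items flagged this way would be rewritten or dropped before release, keeping test scenarios out of likely pretraining data.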

Technical Implementation

  • Multimodal inputs: Image-text, video, plain text, audio
  • Evaluation metrics: Accuracy, consistency, interpretability, human alignment
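The multi-metric evaluation above can be sketched as a scoring function that combines accuracy against reference answers with agreement against human raters. The metric definitions and weights below are assumptions for illustration, not the benchmark's published formulas:

```python
# Hypothetical scoring sketch: accuracy plus human alignment.
# Weights and the agreement-based alignment measure are assumptions.

def accuracy(predictions, references):
    """Fraction of items where the prediction matches the reference answer."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def human_alignment(model_labels, human_labels):
    """Fraction of items where the model's judgment agrees with human raters."""
    agree = sum(m == h for m, h in zip(model_labels, human_labels))
    return agree / len(human_labels)

def overall(acc, align, w_acc=0.5, w_align=0.5):
    """Weighted combination of the two metrics into a single score."""
    return w_acc * acc + w_align * align
```

Consistency and interpretability would need their own measures (e.g., agreement across paraphrased prompts, quality of model-generated rationales), which this sketch leaves out.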

Section 04

Experimental Findings: Current State of Multimodal Fusion and Social Intelligence

  1. Multimodal Fusion Challenge: Models perform well on single-modal tasks, but performance drops significantly when modalities must be integrated
  2. Cultural Bias Exposure: More familiar with Western cultural norms, insufficient understanding of other cultures
  3. Superficial Emotion Understanding: Can recognize obvious emotions, but has limited understanding of subtle/contradictory/repressed emotions

Section 05

Application Scenarios: Practical Value of S-Bench

  1. Model Development Guidance: Optimize weak links through fine-grained results
  2. Product Selection Reference: Provide a basis for comparing models in applications such as virtual assistants and social robots
  3. Academic Research Platform: Standardized evaluation tools promote progress in the field of social intelligence

Section 06

Future Directions: Expansion Plans for S-Bench

  1. Dynamic interactive evaluation: Simulate real-time social dialogue
  2. Embodied intelligence expansion: Evaluate capabilities in physical social scenarios
  3. Cross-cultural deepening: Strengthen evaluation of non-Western cultural norms
  4. Long-term social memory: Evaluate a model's ability to retain and use social context across extended interactions

Section 07

Community and Open Source: Open Collaboration Model

S-Bench adopts an open-source model and encourages community participation:

  • Dataset expansion: Accept new test scenarios
  • Evaluation method improvement: Optimize metrics and processes
  • Cross-cultural contributions: Solicit test cases from different cultures
  • Model submission: Support applications for evaluating new models