# S-Bench: A Benchmark for Evaluating Social Intelligence of Multimodal Large Language Models

> The first comprehensive benchmark suite dedicated to evaluating the social intelligence capabilities of multimodal large language models

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-29T10:01:24.000Z
- Last activity: 2026-03-29T10:20:12.956Z
- Popularity: 146.7
- Keywords: benchmark, social intelligence, multimodal, evaluation, theory of mind, emotion recognition
- Page link: https://www.zingnex.cn/en/forum/thread/s-bench
- Canonical: https://www.zingnex.cn/forum/thread/s-bench

---

## [Introduction] S-Bench: A Benchmark for Evaluating Social Intelligence of Multimodal Large Language Models

S-Bench is the first comprehensive benchmark suite dedicated to evaluating the social intelligence of multimodal large language models. To address the gaps in existing evaluations, it covers five dimensions (theory of mind, emotion recognition, social norms, interpersonal reasoning, and moral judgment), accepts multimodal inputs, and scores models on several metrics. It offers a standardized tool for model development, product selection, and academic research, and it relies on its open-source community to pursue future directions such as cross-cultural expansion and dynamic interactive evaluation.

## Background: Limitations of Existing Evaluations and the Necessity of Multimodal Social Intelligence

### Limitations of Existing Evaluations
Traditional LLM evaluations focus on factual knowledge (e.g., MMLU), reasoning ability (e.g., GSM8K), and language skills, but they do not assess performance in real social scenarios, such as understanding sarcasm or reading microexpressions.

### Necessity of Multimodality
Social interaction is inherently multimodal: it combines language, facial expressions, body language, and tone of voice. Evaluating social intelligence therefore requires a model to process text, images, video, and other signals together.

## Methodology: Core Design and Technical Implementation of S-Bench

### Evaluation Dimensions
1. Theory of Mind: Inferring intentions, false beliefs, decision differences
2. Emotion Recognition: Facial/voice/text emotions, complex emotional states
3. Social Norms: Appropriate behavior, cultural etiquette, consequences of violations
4. Interpersonal Reasoning: Relationship types, power structures, social strategies
5. Moral Judgment: Dilemma analysis, fairness, cross-cultural differences
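
The five dimensions above can be treated as labels attached to each test item. The following is a minimal sketch of that idea, not S-Bench's actual schema: the `BenchItem` fields, the multiple-choice format, and the `count_by_dimension` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five evaluation dimensions listed in the post."""
    THEORY_OF_MIND = "theory_of_mind"
    EMOTION_RECOGNITION = "emotion_recognition"
    SOCIAL_NORMS = "social_norms"
    INTERPERSONAL_REASONING = "interpersonal_reasoning"
    MORAL_JUDGMENT = "moral_judgment"


@dataclass(frozen=True)
class BenchItem:
    """Hypothetical item layout (assumed multiple-choice format)."""
    item_id: str
    dimension: Dimension
    question: str
    choices: tuple
    answer_index: int

    def __post_init__(self):
        # Reject items whose gold answer does not point at a real choice.
        if not 0 <= self.answer_index < len(self.choices):
            raise ValueError("answer_index must point at one of the choices")


def count_by_dimension(items):
    """Return how many items fall under each dimension (zeros included)."""
    counts = {dimension: 0 for dimension in Dimension}
    for item in items:
        counts[item.dimension] += 1
    return counts
```

Tagging each item with exactly one dimension keeps per-dimension reporting simple. A real benchmark might allow several labels per item, which this sketch does not model.
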
### Dataset Construction
The dataset is designed around four principles: diversity (age, gender, and culture), authenticity, a graded range of difficulty, and resistance to training-data contamination.
### Technical Implementation
- Multimodal inputs: Image-text, video, plain text, audio
- Evaluation metrics: Accuracy, consistency, interpretability, human alignment
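
The post does not define how these metrics are computed. As a rough sketch, two of them could be implemented as below. The definitions are assumptions: accuracy is taken as exact match against a gold label, and consistency as the average share of repeated runs that agree with the most common answer for each item. Interpretability and human alignment would need human judgments and are not covered here.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def consistency(repeated_predictions):
    """Mean agreement with the modal answer across repeated runs.

    repeated_predictions holds one non-empty list of answers per item,
    e.g. the same question asked several times or in paraphrased form.
    """
    if not repeated_predictions:
        raise ValueError("need at least one item")
    scores = []
    for runs in repeated_predictions:
        modal_count = Counter(runs).most_common(1)[0][1]
        scores.append(modal_count / len(runs))
    return sum(scores) / len(scores)


# Toy inputs for illustration only; these are not S-Bench results.
print(accuracy(["a", "b", "c", "a"], ["a", "b", "b", "a"]))  # 0.75
print(consistency([["a", "a", "a"], ["a", "b", "a"]]))
```

Under this definition a model that always gives the same answer scores 1.0 on consistency even if that answer is wrong, so the two numbers should be read together.
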

## Experimental Findings: Current State of Multimodal Fusion and Social Intelligence

1. **Multimodal Fusion Challenge**: Models perform well on single-modality tasks, but their performance drops significantly on tasks that require integrating multiple modalities.
2. **Cultural Bias Exposure**: Models are more familiar with Western cultural norms and show insufficient understanding of other cultures.
3. **Superficial Emotion Understanding**: Models can recognize obvious emotions but have limited understanding of subtle, contradictory, or repressed emotions.
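
One way to quantify the first finding is to compare accuracy on single-modality items with accuracy on items that combine modalities. The sketch below assumes each result is recorded as a hypothetical `(modalities, is_correct)` pair. Neither the record format nor the `fusion_gap` name comes from S-Bench.

```python
def fusion_gap(records):
    """Compare single-modality and multimodal accuracy.

    records: iterable of (modalities, is_correct) pairs, where modalities
    is a set such as {"text"} or {"text", "image"}.
    Returns (single_accuracy, multi_accuracy, gap), with
    gap = single_accuracy - multi_accuracy.
    """
    records = list(records)  # allow generators, since we scan twice
    single = [ok for mods, ok in records if len(mods) == 1]
    multi = [ok for mods, ok in records if len(mods) > 1]
    if not single or not multi:
        raise ValueError("need both single-modality and multimodal records")
    single_acc = sum(single) / len(single)
    multi_acc = sum(multi) / len(multi)
    return single_acc, multi_acc, single_acc - multi_acc
```

A positive gap means the model does worse when it has to combine modalities. The comparison is only meaningful if the single-modality and multimodal items are of similar difficulty, which this sketch does not check.
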

## Application Scenarios: Practical Value of S-Bench

1. **Model Development Guidance**: Fine-grained results show where a model is weakest, so developers can target those areas.
2. **Product Selection Reference**: Results give teams a basis for comparing models for applications such as virtual assistants and social robots.
3. **Academic Research Platform**: A standardized evaluation tool supports progress in social intelligence research.

## Future Directions: Expansion Plans for S-Bench

1. Dynamic interactive evaluation: Simulate real-time social dialogue
2. Embodied intelligence expansion: Evaluate capabilities in physical social scenarios
3. Cross-cultural deepening: Strengthen evaluation of non-Western cultural norms
4. Long-term social memory: Evaluate whether a model can retain social context over extended interactions

## Community and Open Source: Open Collaboration Model

S-Bench is open source and encourages community participation in four areas:
- Dataset expansion: Contribute new test scenarios
- Evaluation method improvement: Refine metrics and processes
- Cross-cultural contributions: Submit test cases from different cultures
- Model submission: Request evaluation of new models
