# CNSL-bench: The First National Common Sign Language Benchmark in China, Revealing Systematic Gaps in Sign Language Understanding by Multimodal Large Models

> The research team launched CNSL-bench, the first authoritative benchmark based on the National Common Sign Language Dictionary, and evaluated 21 mainstream multimodal large models (MLLMs). It was found that current MLLMs still fall far below human level in sign language understanding tasks, with systematic differences across modalities and expression forms.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-24T08:59:33.000Z
- Last activity: 2026-04-27T01:55:55.356Z
- Heat: 86.1
- Keywords: sign language understanding, multimodal large models, CNSL-bench, National Common Sign Language, hearing-impaired people, AI inclusivity, video understanding, cross-modal alignment
- Page link: https://www.zingnex.cn/en/forum/thread/cnsl-bench
- Canonical: https://www.zingnex.cn/forum/thread/cnsl-bench

---

This post analyzes the work along several dimensions: background, benchmark construction, evaluation results, and conclusions and recommendations.

## AI Challenges in Sign Language Understanding: The Neglected Multimodal Frontier

With the rapid development of large language models (LLMs) and multimodal large language models (MLLMs), AI has made real progress on tasks such as visual understanding and speech recognition, yet sign language understanding remains a neglected frontier. Sign language is a complete visual-spatial language that conveys meaning through multiple channels at once: hand movements, facial expressions, and body posture. Understanding it therefore requires mastering an entire language system, not merely recognizing isolated gestures. The key question is: **how strong is the sign language understanding ability of current MLLMs?**

## CNSL-bench Construction: Authoritative, Multimodal, and Diversified Benchmark Design

### Three Core Features
1. **Authoritative Foundation**: Anchored to the National Common Sign Language Dictionary, which eliminates ambiguity, ensures consistency, and stays close to how hearing-impaired people actually use the language;
2. **Multimodal Coverage**: Each vocabulary entry includes a text description, illustrative images, and a sign language video, supporting cross-modal evaluation (a minimal sketch of such an entry follows this list);
3. **Diversity of Expression Forms**: Covers three key articulatory types: air writing, finger spelling, and the Chinese finger alphabet.
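
To make the data design concrete, here is a minimal sketch of what a single benchmark entry could look like, based only on the features listed above. All field names (`gloss`, `expression_form`, etc.) and the example values are hypothetical; the post does not specify CNSL-bench's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical entry schema; the post does not specify CNSL-bench's real format.
@dataclass
class CNSLEntry:
    gloss: str                  # vocabulary item from the National Common Sign Language Dictionary
    description: str            # text description of how the sign is articulated
    image_paths: list[str] = field(default_factory=list)  # illustrative images
    video_path: str = ""        # sign language video clip
    expression_form: str = "natural"  # e.g. "air_writing", "finger_spelling", "finger_alphabet"

example = CNSLEntry(
    gloss="朋友",          # "friend" (illustrative entry, not taken from the benchmark)
    description="...",     # articulation description elided
    image_paths=["friend_01.jpg"],
    video_path="friend.mp4",
    expression_form="natural",
)
```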

### Construction Methods
- **Data Processing**: Select representative vocabulary, align the modalities, run expert quality review, and add fine-grained annotations;
- **Evaluation Tasks**: Recognition (video/image → vocabulary), description (sign language → text), alignment (visual-text matching), and reasoning (inference grounded in sign language); a sketch of one recognition item follows this list.
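
As a concrete illustration of the recognition task, the sketch below poses one video → vocabulary item as multiple choice. The prompt wording, option count, and output format are assumptions for illustration, not CNSL-bench's actual protocol.

```python
import random

def build_recognition_item(target_gloss: str, video_path: str,
                           vocabulary: list[str], n_options: int = 4) -> dict:
    """Pose one video -> vocabulary recognition item as multiple choice."""
    distractors = random.sample(
        [g for g in vocabulary if g != target_gloss], n_options - 1)
    options = distractors + [target_gloss]
    random.shuffle(options)
    prompt = ("Watch the sign language video and choose the vocabulary item "
              "it expresses:\n" +
              "\n".join(f"{chr(65 + i)}. {g}" for i, g in enumerate(options)))
    # The model sees (video, prompt); scoring compares its letter to `answer`.
    return {"video": video_path, "prompt": prompt,
            "answer": chr(65 + options.index(target_gloss))}
```

Per-model accuracy over a pool of such items is directly comparable to a human baseline collected on the same set.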

## Evaluation Results of 21 MLLMs: Significant Systematic Gaps

### Key Findings
1. **Huge Gap from Human Level**: Even the most advanced models achieve accuracy far below human level; sign language understanding remains an open problem;
2. **Cross-modal Differences**: Video understanding is the weakest (models struggle to capture temporal dynamics), image understanding fares better, and text-visual alignment remains difficult;
3. **Differences in Expression Forms**: Finger spelling is comparatively easy to recognize, air writing is the most challenging (its 3D trajectories have no clear boundaries), and natural gestures fall in between (a sketch of this kind of breakdown follows the list);
4. **Fundamental Limitations**: The failures are not merely insufficient reasoning; base architecture and training data have a greater impact, and the models share common error patterns.
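
Findings 2 and 3 amount to slicing per-item correctness by input modality and expression form. The sketch below shows one way such a breakdown could be computed; the column names and records are illustrative, not CNSL-bench's released results.

```python
import pandas as pd

# Illustrative per-item results: one row per (model, test item).
results = pd.DataFrame([
    {"model": "model_a", "modality": "video", "form": "air_writing",     "correct": 0},
    {"model": "model_a", "modality": "image", "form": "finger_spelling", "correct": 1},
    # ... in practice, thousands of rows collected from the evaluation run
])

# Mean accuracy sliced by input modality and by expression form.
by_modality = results.groupby(["model", "modality"])["correct"].mean()
by_form = results.groupby(["model", "form"])["correct"].mean()
print(by_modality, by_form, sep="\n\n")
```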

## Conclusion: Sign Language Understanding Requires Fundamental Improvement of MLLM Architecture

The current gap of MLLMs in sign language understanding is substantial, indicating that sign language understanding is still a highly challenging open problem in the AI field. The systematic differences in model performance suggest that we need to fundamentally rethink visual encoders and multimodal alignment strategies, rather than relying only on fine-tuning to improve reasoning ability.

## Recommendations for MLLM Development: Focus on Video and Multimodal Alignment

1. **Improve Video Understanding**: Strengthen video encoders to capture fine-grained movements and improve temporal modeling;
2. **Optimize Multimodal Alignment**: Pre-train or fine-tune on sign language data and design targeted alignment objectives (a minimal sketch of one standard objective follows this list);
3. **Increase Inclusive Data**: Expand the scale and diversity of sign language datasets and the supply of high-quality aligned data;
4. **Diversify Evaluation Benchmarks**: Cover more sign languages, continuous sentence understanding, and generation ability.
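
As one concrete instance of recommendation 2, the sketch below implements a standard symmetric InfoNCE (CLIP-style) contrastive loss between video and text embeddings. This is a generic alignment objective, not one proposed by the CNSL-bench team.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(video_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (video, text) pairs."""
    v = F.normalize(video_emb, dim=-1)   # (batch, dim)
    t = F.normalize(text_emb, dim=-1)    # (batch, dim)
    logits = v @ t.T / temperature       # pairwise similarities
    labels = torch.arange(v.size(0), device=v.device)  # matches on the diagonal
    # Pull matched pairs together in both retrieval directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

A sign-language-targeted variant could, for example, weight hard negatives (signs sharing a handshape or trajectory) more heavily, addressing the common confusion patterns noted in the findings.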

## Social Significance and Future Directions: Towards Inclusive AI

### Social Significance
- Benefit tens of millions of hearing-impaired people and help bridge the communication gap;
- Promote inclusiveness in AI technology so that marginalized groups are not overlooked;
- Provide measurement standards for sign language AI and help innovations reach real-world use.

### Limitations and Future
- **Limitations**: Covers only isolated vocabulary and only Chinese National Common Sign Language, does not evaluate generation ability, and the coverage of evaluated models could be expanded;
- **Future**: Develop benchmarks for continuous sign language understanding, build multilingual evaluation frameworks, explore sign language generation, and study joint multimodal understanding.

CNSL-bench is an important step towards inclusive AI. We look forward to more researchers paying attention to the field of sign language understanding, so that AI can benefit users of all forms of communication.
