Zing Forum


CNSL-bench: The First National Common Sign Language Benchmark in China, Revealing Systematic Gaps in Sign Language Understanding by Multimodal Large Models

The research team launched CNSL-bench, the first authoritative benchmark based on the National Common Sign Language Dictionary, and evaluated 21 mainstream multimodal large models (MLLMs). It was found that current MLLMs still fall far below human level in sign language understanding tasks, with systematic differences across modalities and expression forms.

Tags: sign language understanding, multimodal large models, CNSL-bench, national common sign language, hearing-impaired people, AI inclusivity, video understanding, cross-modal alignment
Published 2026-04-24 16:59 · Recent activity 2026-04-27 09:55 · Estimated read: 8 min

Section 01

CNSL-bench: The First National Common Sign Language Benchmark in China, Revealing Systematic Gaps in Sign Language Understanding by Multimodal Large Models

This post analyzes the work along four dimensions: background, benchmark construction, evaluation results, and conclusions and recommendations.


Section 02

AI Challenges in Sign Language Understanding: The Neglected Multimodal Frontier

With the rapid development of large language models (LLMs) and multimodal large language models (MLLMs), AI has made strong progress in tasks such as visual understanding and speech recognition, yet sign language understanding remains a largely neglected frontier. Sign language is a complete visual-spatial language that conveys multi-dimensional information through hand movements, facial expressions, and body posture; understanding it requires mastering an entire linguistic system, not merely recognizing gestures. The key question: how strong is the sign language understanding ability of current MLLMs?


Section 03

CNSL-bench Construction: Authoritative, Multimodal, and Diversified Benchmark Design

Three Core Features

  1. Authoritative Foundation: Anchored to the National Common Sign Language Dictionary, eliminating ambiguity, ensuring consistency, and closely reflecting actual usage by hearing-impaired signers;
  2. Multimodal Coverage: Each vocabulary entry includes text description, illustrative images, and sign language videos, supporting cross-modal ability evaluation;
  3. Diversity of Expression Forms: Covers three key articulation types: air writing, finger spelling, and the Chinese finger alphabet.
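The multimodal entry structure described above can be sketched as a simple record. The field names and example values below are illustrative assumptions for exposition, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CNSLEntry:
    """One vocabulary entry in a CNSL-bench-style dataset (hypothetical schema)."""
    gloss: str              # the dictionary vocabulary item, e.g. "谢谢" (thank you)
    text_description: str   # textual description of how the sign is articulated
    image_paths: list       # illustrative still images
    video_path: str         # sign language video clip
    articulation_type: str  # e.g. "air_writing", "finger_spelling", "finger_alphabet"

# Toy entry; paths and description are made up for illustration.
entry = CNSLEntry(
    gloss="谢谢",
    text_description="Thumb of the right fist bends twice toward the other person.",
    image_paths=["thanks_1.jpg"],
    video_path="thanks.mp4",
    articulation_type="finger_spelling",
)
```

Keeping text, image, and video references in one aligned record is what makes cross-modal evaluation (e.g. video-to-text vs. image-to-text) straightforward to set up.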

Construction Methods

  • Data Processing: Select representative vocabulary, multimodal alignment, expert quality review, add fine-grained annotations;
  • Evaluation Tasks: Recognition tasks (video/image → vocabulary), description tasks (sign language → text), alignment tasks (visual-text matching), reasoning tasks (reasoning based on sign language).
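The recognition task family above reduces to exact-match accuracy over (media, vocabulary) pairs. A minimal scoring loop might look like the following, where `model_fn` is a placeholder for a real MLLM call and the example data is invented:

```python
def evaluate_recognition(model_fn, examples):
    """Score a recognition task (video/image -> vocabulary) by exact-match accuracy.

    model_fn: callable mapping a media path to a predicted vocabulary item
              (stand-in for an actual multimodal model).
    examples: list of (media_path, gold_vocabulary) pairs.
    """
    if not examples:
        return 0.0
    correct = sum(1 for media, gold in examples if model_fn(media) == gold)
    return correct / len(examples)

# Toy usage: a fake "model" that always answers "谢谢" gets one of two right.
examples = [("thanks.mp4", "谢谢"), ("hello.mp4", "你好")]
accuracy = evaluate_recognition(lambda media: "谢谢", examples)
# accuracy == 0.5
```

Description, alignment, and reasoning tasks would swap in different scoring functions (e.g. text-similarity metrics or multiple-choice accuracy), but the harness shape stays the same.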

Section 04

Evaluation Results of 21 MLLMs: Significant Systematic Gaps

Key Findings

  1. Huge Gap from Human Level: Even the most advanced models still have accuracy far below human level; sign language understanding is an open problem;
  2. Cross-modal Differences: Video understanding is the weakest (hard to capture temporal dynamics), image understanding is better, and text-visual alignment is difficult;
  3. Differences in Expression Forms: Finger spelling is relatively easy to recognize, air writing is the most challenging (3D trajectory has no clear boundaries), and natural gestures are in the middle;
  4. Fundamental Limitations: It's not just insufficient reasoning; basic architecture and training data have a greater impact, and there are common error patterns.

Section 05

Conclusion: Sign Language Understanding Requires Fundamental Improvement of MLLM Architecture

The current gap of MLLMs in sign language understanding is substantial, indicating that sign language understanding is still a highly challenging open problem in the AI field. The systematic differences in model performance suggest that we need to fundamentally rethink visual encoders and multimodal alignment strategies, rather than relying only on fine-tuning to improve reasoning ability.


Section 06

Recommendations for MLLM Development: Focus on Video and Multimodal Alignment

  1. Improve Video Understanding: Strengthen video encoders to capture fine-grained movements and optimize temporal modeling capabilities;
  2. Optimize Multimodal Alignment: Pre-train/fine-tune on sign language data and design targeted alignment objective functions;
  3. Increase Inclusive Data: Expand sign language datasets in scale and diversity, with high-quality cross-modal alignment;
  4. Diversify Evaluation Benchmarks: Cover more sign languages, continuous sentence understanding, and generation ability evaluation.
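A targeted alignment objective of the kind recommended above is commonly a contrastive (InfoNCE-style) loss that pulls each sign-video embedding toward its paired text embedding and away from other texts in the batch. The sketch below, with toy 2-D vectors, is one illustrative formulation, not the paper's actual training objective:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(video_embs, text_embs, temperature=0.1):
    """InfoNCE-style contrastive loss over paired video/text embeddings.

    Entry i of video_embs is assumed to match entry i of text_embs;
    all other texts in the batch act as negatives.
    """
    loss = 0.0
    for i, v in enumerate(video_embs):
        logits = [dot(v, t) / temperature for t in text_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # negative log-softmax of the true pair
    return loss / len(video_embs)

# Correctly paired embeddings yield a lower loss than mismatched ones:
aligned = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In practice the embeddings would come from a video encoder and a text encoder trained jointly on sign language data, which is exactly where the recommended pre-training on sign language pairs would enter.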

Section 07

Social Significance and Future Directions: Towards Inclusive AI

Social Significance

  • Benefit tens of millions of hearing-impaired people and bridge the communication gap;
  • Promote the inclusiveness of AI technology and avoid neglecting marginalized groups;
  • Provide measurement standards for sign language AI and accelerate the deployment of innovations.

Limitations and Future

  • Limitations: Covers only isolated vocabulary items and only Chinese national common sign language, does not evaluate generation ability, and the range of evaluated models could be expanded;
  • Future: Develop continuous sign language understanding benchmarks, build multilingual frameworks, explore sign language generation technology, and study multimodal joint understanding.

CNSL-bench is an important step towards inclusive AI. We look forward to more researchers paying attention to the field of sign language understanding, so that AI can benefit users of all forms of communication.