
Real-Time American Sign Language Recognition System Based on CNN and MediaPipe

A real-time American Sign Language (ASL) recognition system built with TensorFlow/Keras, OpenCV, and MediaPipe. It uses a convolutional neural network to recognize ASL alphabet signs from a live camera feed.

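Tags: sign language recognition, ASL, convolutional neural network, MediaPipe, OpenCV, TensorFlow, computer vision, deep learning, accessibility technology, real-time recognition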
Published 2026-05-22 19:44 | Recent activity 2026-05-22 19:50 | Estimated read 6 min

Section 01

[Original Post / Introduction] Real-Time American Sign Language Recognition System Based on CNN and MediaPipe

This article introduces an open-source American Sign Language (ASL) recognition system built with TensorFlow/Keras, OpenCV, and MediaPipe. It recognizes gestures in real time from a regular camera and runs without specialized hardware. The project aims to lower the barrier to sign language communication and to promote integration between the hearing-impaired community and the rest of society.


Section 02

Project Background and Core Objectives

Sign language is an important means of communication for the hearing-impaired, but most hearing people do not know it. The goal of this project is to build an end-to-end real-time ASL alphabet recognition system that helps bridge that gap. Unlike solutions that rely on specialized hardware, it needs only a regular computer camera, which significantly reduces deployment cost and the barrier to use.


Section 03

Technology Stack and Architecture Design

  • Deep learning framework: Uses TensorFlow as the underlying framework and Keras as the high-level API; the core model is a Convolutional Neural Network (CNN), which is suitable for image tasks;
  • Computer vision tools: OpenCV handles video stream capture and preprocessing; MediaPipe's Hands module tracks 21 hand key points in real time, helping to locate and crop the hand region to improve accuracy;
  • Dataset: Uses the Sign MNIST dataset (labeled 28×28 grayscale images of the 24 static ASL letters; J and Z are excluded because signing them requires motion) as the training foundation.
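
As a rough illustration of the dataset side, the sketch below parses one Sign MNIST row using only the standard library. It assumes the CSV layout of the commonly distributed Kaggle version (a `label` column followed by `pixel1` to `pixel784`), where labels index the alphabet by position and 9 (J) and 25 (Z) never occur. The helper names `label_to_letter` and `parse_row` are hypothetical, and the single row is synthetic, not real data.

```python
import csv
import io
import string

# Labels run 0-25 by alphabet position; 9 (J) and 25 (Z) are absent because
# those letters are signed with motion (assumed Kaggle CSV convention).
LETTERS = string.ascii_uppercase
MISSING = {9, 25}


def label_to_letter(label):
    """Map a Sign MNIST integer label to its letter."""
    if not 0 <= label <= 25 or label in MISSING:
        raise ValueError(f"label {label} is not a static Sign MNIST class")
    return LETTERS[label]


def parse_row(row):
    """Turn one CSV row (a dict) into (letter, 28x28 image scaled to [0, 1])."""
    letter = label_to_letter(int(row["label"]))
    pixels = [int(row[f"pixel{i}"]) / 255.0 for i in range(1, 785)]
    image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]
    return letter, image


# Tiny in-memory stand-in for the training CSV: one synthetic all-white row.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["label"] + [f"pixel{i}" for i in range(1, 785)])
writer.writerow([2] + [255] * 784)
sample.seek(0)

letter, image = parse_row(next(csv.DictReader(sample)))
print(letter, len(image), len(image[0]))
```

In a real run the `io.StringIO` object would be replaced by an opened dataset file, and the nested lists would typically be converted to an array before training.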

Section 04

Detailed System Workflow

  1. Data preprocessing: Raw images are converted to grayscale and normalized via OpenCV; the hand ROI (Region of Interest) is located from MediaPipe's hand key points, then cropped and scaled to a uniform size;
  2. Model training: Uses a lightweight CNN architecture (LeNet-style), trained on the Sign MNIST dataset, combined with data augmentation (rotation, scaling, brightness adjustment) to improve generalization;
  3. Real-time inference: Camera captures frames → MediaPipe detects key points → hand region is cropped → CNN classifies the crop → result is output; real-time performance is achievable on a regular CPU.
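
The cropping step in this workflow can be sketched without any vision library. The function below is a minimal illustration, not the project's actual code: it assumes the 21 key points arrive as normalized (x, y) pairs in [0, 1], which is how MediaPipe Hands reports them, and turns them into a square pixel box. The name `hand_crop_box` and the `pad` margin of 0.2 are hypothetical choices.

```python
def hand_crop_box(landmarks, frame_w, frame_h, pad=0.2):
    """Return a square (x0, y0, x1, y1) pixel box around normalized landmarks.

    landmarks: sequence of (x, y) pairs in [0, 1], relative to frame size.
    pad: extra margin on each side, as a fraction of the box side length.
    """
    xs = [x * frame_w for x, _ in landmarks]
    ys = [y * frame_h for _, y in landmarks]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    # Use the longer extent so the crop is square before resizing.
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + 2 * pad)
    half = side / 2
    # Clamp to the frame so the crop never indexes outside the image.
    x0 = max(0, int(round(cx - half)))
    y0 = max(0, int(round(cy - half)))
    x1 = min(frame_w, int(round(cx + half)))
    y1 = min(frame_h, int(round(cy + half)))
    return x0, y0, x1, y1


# 21 synthetic key points standing in for one detected hand.
landmarks = [(0.4 + 0.01 * i, 0.5 + 0.005 * i) for i in range(21)]
box = hand_crop_box(landmarks, frame_w=640, frame_h=480)
print(box)
```

In a live pipeline the pairs would presumably come from the `.x` and `.y` fields of a MediaPipe hand landmark list, and the box would be used to slice the frame (`frame[y0:y1, x0:x1]`) before resizing to the model's input size, for example with `cv2.resize`.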

Section 05

Technical Highlights and Innovations

  1. Lightweight model design: Balances accuracy and inference speed to ensure smooth operation on resource-constrained devices;
  2. Multimodal input fusion: Can flexibly combine image and hand key point features to improve robustness in complex scenarios;
  3. End-to-end open-source implementation: Provides complete code (preprocessing, training, inference), lowering the barrier to learning from the project and building on it.
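
To give a feel for what "lightweight" means here, the sketch below counts the trainable parameters of a LeNet-style network. The post only says the model is LeNet-style, so the layer sizes are assumptions borrowed from the classic LeNet-5 layout (two 5×5 convolutions with 6 and 16 filters, each followed by 2×2 pooling, then dense layers of 120, 84, and 24 units), applied to a 28×28 grayscale input with no padding.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1


def lenet_style_params(input_side=28, channels=1, num_classes=24):
    """Count weights and biases of an assumed LeNet-5-like layout."""
    side = input_side
    in_ch = channels
    total = 0
    # Two conv layers given as (filters, kernel), each followed by 2x2 pooling.
    for filters, kernel in [(6, 5), (16, 5)]:
        total += (kernel * kernel * in_ch + 1) * filters  # weights + biases
        side = conv_out(side, kernel)                     # convolution
        side = conv_out(side, 2, stride=2)                # pooling (no params)
        in_ch = filters
    features = side * side * in_ch                        # flattened size
    for units in [120, 84, num_classes]:
        total += (features + 1) * units                   # dense layer
        features = units
    return total


print(lenet_style_params())  # 45616
```

Under these assumptions the network has about 45.6 thousand parameters, which is under 200 KB when stored as 32-bit floats. That is consistent with the claim that inference can run smoothly on resource-constrained devices, though the real model's size depends on its actual layers.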

Section 06

Application Scenarios and Social Value

  • Educational assistance: Gives sign language learners self-test feedback and lets teachers evaluate students' gesture accuracy;
  • Accessible communication: Serves as a temporary translation tool in settings such as public service counters and medical institutions;
  • Human-computer interaction innovation: Extends to smart home control, virtual reality interaction, and other fields, providing a natural interaction method.

Section 07

Limitations and Improvement Directions

The current version recognizes only static ASL letters and has limited ability to handle continuous signing, which depends on dynamic hand trajectories and grammar. Possible improvements:

  • Introduce temporal models (LSTM/Transformer) to handle dynamic gestures;
  • Expand vocabulary to support more phrases;
  • Optimize mobile performance and develop mobile applications;
  • Combine NLP to achieve complete translation from sign language to natural language.
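
For the first direction, a temporal model such as an LSTM or Transformer consumes a fixed-length sequence of per-frame features rather than a single image. The sketch below shows one way such sequences might be buffered from a live stream. The class name `LandmarkWindow`, the window length of 30 frames, and the choice of flattened (x, y) key points as features are all assumptions for illustration, not part of the project.

```python
from collections import deque


class LandmarkWindow:
    """Keep the most recent `length` frames of flattened hand key points."""

    def __init__(self, length=30):
        # A bounded deque drops the oldest frame automatically when full.
        self.frames = deque(maxlen=length)

    def push(self, landmarks):
        """Flatten 21 (x, y) pairs into one 42-value vector and store it."""
        self.frames.append([v for point in landmarks for v in point])

    def ready(self):
        """True once enough frames have arrived to fill the window."""
        return len(self.frames) == self.frames.maxlen

    def sequence(self):
        """Return the window as a (length x 42) nested list, oldest first."""
        return list(self.frames)


window = LandmarkWindow(length=30)
for t in range(35):
    # Synthetic frame: every key point sits at (t/100, t/100).
    window.push([(t / 100, t / 100)] * 21)

seq = window.sequence()
print(window.ready(), len(seq), len(seq[0]))
```

Each time `ready()` is true, the current sequence could be passed to a sequence classifier, so that motion-based signs are judged over a span of frames instead of frame by frame.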

Section 08

Conclusion: Promoting Inclusive Technology Development

This project demonstrates the application potential of deep learning in the field of accessible technology, building a practical solution using mature tools and lightweight models. We look forward to more open-source projects emerging to jointly promote the development of inclusive technology, so that technology can truly serve everyone.