Zing Forum

Real-Time Arabic Sign Language Translation System: An AI-Assisted Communication Tool Based on MediaPipe and Neural Networks

This article introduces an open-source project that combines MediaPipe hand landmark detection with a multi-layer perceptron (MLP) neural network to translate Arabic Sign Language (ArSL) into text in real time, built on a FastAPI backend and a React frontend.

Tags: Arabic Sign Language, MediaPipe, MLP, Neural Network, FastAPI, React, Computer Vision, Accessibility
Published 2026-05-21 07:11 | Recent activity 2026-05-21 07:22 | Estimated read 7 min

Section 01

Introduction to the Real-Time Arabic Sign Language Translation System

The system is an open-source, real-time Arabic Sign Language translator that pairs MediaPipe hand landmark detection with a multi-layer perceptron (MLP) classifier, served by a FastAPI backend and a React frontend. Its goals are to reduce communication barriers between deaf and hearing people, to run on ordinary hardware, and to address the shortage of technical tooling for Arabic Sign Language.

Section 02

Project Background and Significance

About 70 million deaf people worldwide communicate in sign language, and Arabic Sign Language (ArSL) has more than 3 million users in the Middle East and North Africa. The barrier between signed and spoken language creates difficulties for the deaf community in education, employment, and social interaction. Human interpreters are costly and cannot cover everyday situations, whereas computer vision and deep learning make real-time sign language recognition feasible.

Section 03

Overview of Technical Architecture

Pose Detection Layer: Accurate Capture with MediaPipe

The project uses Google's MediaPipe Hands module to detect the coordinates of 21 hand key points in real time. It needs only an ordinary RGB camera, and the stated single-frame processing latency is under 10 milliseconds, which keeps hardware requirements low.
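As a rough illustration of how the 21 detected key points could become a model input, the sketch below flattens them into a 42-value list. In MediaPipe's legacy `mp.solutions.hands` Python API, each detected hand is exposed as a sequence of 21 landmarks with normalized `x`, `y`, and `z` fields. The function name `landmarks_to_vector`, the choice to drop `z`, and the stand-in landmarks are assumptions made for this example, not code from the project.

```python
from types import SimpleNamespace
from typing import Iterable, List


def landmarks_to_vector(landmarks: Iterable) -> List[float]:
    """Flatten 21 hand landmarks into [x0, y0, x1, y1, ...] (42 values).

    Each landmark only needs `.x` and `.y` attributes. The z value is
    ignored here, which is an assumption of this sketch.
    """
    points = list(landmarks)
    if len(points) != 21:
        raise ValueError(f"expected 21 landmarks, got {len(points)}")
    vector: List[float] = []
    for p in points:
        vector.extend((float(p.x), float(p.y)))
    return vector


# Stand-in landmarks so the sketch runs without a camera or MediaPipe.
fake_hand = [SimpleNamespace(x=i / 20, y=1 - i / 20, z=0.0) for i in range(21)]
features = landmarks_to_vector(fake_hand)
print(len(features))  # 42
```

With a real camera feed, the same function could be applied to `results.multi_hand_landmarks[0].landmark` after checking that a hand was detected.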

Gesture Recognition Layer: MLP Neural Network Design

The input layer receives a 42-dimensional vector of hand key point coordinates (the x and y coordinates of the 21 key points of one hand). Two hidden layers (128 and 64 neurons, ReLU activation) extract features, and the output layer produces a probability distribution over the letters of the Arabic Sign Language alphabet. An MLP was chosen for its simple structure, fast training and inference, and small model size, all of which suit real-time use.
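To make the layer sizes concrete, here is a minimal forward pass written with the Python standard library only. It assumes 28 output classes (the letter count given for the dataset) and a softmax output. The weights are random placeholders, not trained values, and all function names are illustrative.

```python
import math
import random

LAYER_SIZES = [42, 128, 64, 28]  # input, hidden 1, hidden 2, output classes (assumed 28)


def init_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases; placeholders for trained values."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Affine layer: one dot product plus bias per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """ReLU on the hidden layers, softmax on the output layer."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
n_params = sum(a * b + b for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:]))
probs = forward([rng.uniform(-1, 1) for _ in range(42)], layers)
print(n_params)    # 15580
print(len(probs))  # 28
```

Under these assumptions the network has 15,580 parameters, about 62 kB as 32-bit floats, which is consistent with the small model size cited above.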

Application Interaction Layer: FastAPI and React Combination

The backend uses FastAPI, which handles highly concurrent video stream requests and generates API documentation automatically. The React frontend keeps a WebSocket connection open to the backend, sending video frames and displaying recognition results as they arrive.

Section 04

Implementation Details and Key Technologies

Data Preprocessing

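The raw key point coordinates are translated so that the wrist is the origin and then scaled to the [-1, 1] range. This reduces the influence of camera resolution, hand position in the frame, and distance from the camera, which helps keep the model's input stable. Rotation of the hand is not removed by this step.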
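One possible reading of this normalization is sketched below: subtract the wrist position, then divide by the largest absolute coordinate so every value falls in [-1, 1]. The wrist is landmark 0 in MediaPipe Hands. The max-absolute-value scaling rule is an assumption, since the article does not say how the scaling is done.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize_hand(points: Sequence[Point], wrist_index: int = 0) -> List[Point]:
    """Translate so the wrist is (0, 0), then scale into [-1, 1]."""
    wx, wy = points[wrist_index]
    centered = [(x - wx, y - wy) for x, y in points]
    largest = max(max(abs(x), abs(y)) for x, y in centered)
    if largest == 0.0:  # degenerate case: all points coincide with the wrist
        return centered
    return [(x / largest, y / largest) for x, y in centered]


# The same hand shape, once near the origin and once shifted and twice as large.
shape = [(0.01 * i, 0.02 * i * (-1) ** i) for i in range(21)]
moved = [(0.3 + 2 * x, 0.1 + 2 * y) for x, y in shape]

a = normalize_hand(shape)
b = normalize_hand(moved)
max_diff = max(max(abs(ax - bx), abs(ay - by)) for (ax, ay), (bx, by) in zip(a, b))
print(max_diff < 1e-9)  # True
```

The two inputs differ in position and size but normalize to the same vector, which is the invariance the preprocessing step is after. A rotated copy of the hand would still normalize differently.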

Model Training Strategy

The dataset contains samples of the 28 letters of the Arabic Sign Language alphabet. Samples are collected from different Arab countries to improve generalization, and data augmentation (random rotation, scaling, and Gaussian noise) is applied.
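The three augmentations named above can be applied directly to the key point coordinates, as in the following sketch. It assumes the points are already wrist-centered, so rotation and scaling happen about the wrist. The parameter ranges (15 degrees, 0.9 to 1.1, noise standard deviation 0.01) are illustrative guesses, not values from the project.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def augment(
    points: Sequence[Point],
    rng: random.Random,
    max_rotation_deg: float = 15.0,      # assumed range
    scale_range: Tuple[float, float] = (0.9, 1.1),  # assumed range
    noise_std: float = 0.01,             # assumed noise level
) -> List[Point]:
    """Rotate and scale about the origin, then add Gaussian noise per coordinate."""
    angle = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    scale = rng.uniform(*scale_range)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out: List[Point] = []
    for x, y in points:
        rx = scale * (x * cos_a - y * sin_a)
        ry = scale * (x * sin_a + y * cos_a)
        out.append((rx + rng.gauss(0.0, noise_std), ry + rng.gauss(0.0, noise_std)))
    return out


rng = random.Random(42)
hand = [(0.05 * i, -0.03 * i) for i in range(21)]
augmented = augment(hand, rng)
print(len(augmented))  # 21
```

Because the noise also moves the wrist slightly off the origin, a pipeline built this way would likely re-apply the normalization step after augmenting.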

Real-Time Inference Optimization

Three measures keep inference responsive and stable. Frames are sampled at 15-20 fps to balance latency against load, a sliding window smooths predictions across consecutive frames to reduce misclassifications, and a confidence threshold suppresses low-confidence outputs.
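The sliding window and confidence threshold can be combined in a small helper such as the one below. It keeps the most recent predictions, ignores frames below the threshold, and reports a letter only once it has enough votes in the window. The class name, window size, threshold, vote count, and the labels "ba" and "ta" are all assumptions for illustration.

```python
from collections import Counter, deque
from typing import Deque, Optional


class PredictionSmoother:
    """Majority vote over recent frames, counting only confident predictions."""

    def __init__(self, window_size: int = 10, min_confidence: float = 0.8, min_votes: int = 6):
        self.window: Deque[Optional[str]] = deque(maxlen=window_size)
        self.min_confidence = min_confidence
        self.min_votes = min_votes

    def update(self, label: str, confidence: float) -> Optional[str]:
        """Add one frame's prediction and return the stable label, if any."""
        # Low-confidence frames still occupy a slot, but they cast no vote.
        self.window.append(label if confidence >= self.min_confidence else None)
        votes = Counter(item for item in self.window if item is not None)
        if not votes:
            return None
        best, count = votes.most_common(1)[0]
        return best if count >= self.min_votes else None


smoother = PredictionSmoother()
outputs = [smoother.update("ba", 0.95) for _ in range(6)]
print(outputs)                      # [None, None, None, None, None, 'ba']
print(smoother.update("ta", 0.40))  # ba
```

At 15-20 fps a 10-frame window spans roughly 0.5 to 0.7 seconds, so a larger window trades responsiveness for stability.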

Section 05

Application Scenarios and Social Value

The system targets three scenarios: in education, helping deaf students practice standard signs; in public services such as hospitals and banks, providing on-the-spot communication support; and at home, helping hearing people learn basic sign language. Because the project is open source, it lowers the barrier to entry, invites developers worldwide to contribute, and helps address the shortage of Arabic Sign Language technology.

Section 06

Technical Insights and Future Outlook

Technical insights: relying on a mature pre-trained model (MediaPipe) for feature extraction lets the project concentrate on application logic and shortens the development cycle. Future directions: expand the vocabulary from letters to complete sign language words; introduce temporal models (LSTM or Transformer) to recognize continuous gestures and grammar; and explore edge deployment so the system can run offline on mobile devices, improving practicality and accessibility.