Pose Detection Layer: Accurate Capture with MediaPipe
Google's MediaPipe Hands module detects the coordinates of 21 hand key points in real time. It needs only an ordinary RGB camera and processes a single frame in under 10 milliseconds, which keeps the hardware requirements low.
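As a rough illustration of this layer, the sketch below turns one frame into the flat list of x, y coordinates described here. It assumes the legacy `mp.solutions.hands` Python interface of MediaPipe. The function names `landmarks_to_features` and `detect_hand_features` are hypothetical and not taken from the project.

```python
from typing import List, Optional, Sequence, Tuple

NUM_LANDMARKS = 21  # MediaPipe Hands reports 21 key points per hand


def landmarks_to_features(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Flatten 21 (x, y) pairs into [x0, y0, x1, y1, ...], 42 values in total."""
    if len(points) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(points)}")
    features: List[float] = []
    for x, y in points:
        features.extend((float(x), float(y)))
    return features


def detect_hand_features(frame_rgb) -> Optional[List[float]]:
    """Run MediaPipe Hands on one RGB frame; return 42 features, or None if no hand."""
    # Imported lazily so the pure helper above works without MediaPipe installed.
    import mediapipe as mp

    # Assumption: the legacy "solutions" API. static_image_mode=True treats
    # every call as an independent image, which is simple but not the fastest.
    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        results = hands.process(frame_rgb)
    if not results.multi_hand_landmarks:
        return None  # no hand detected in this frame
    hand = results.multi_hand_landmarks[0]
    # Landmark x and y are normalized to the image width and height.
    return landmarks_to_features([(lm.x, lm.y) for lm in hand.landmark])
```

In a real-time loop it would likely be better to create the `Hands` object once with `static_image_mode=False` and reuse it across frames, so that tracking between frames can reduce the per-frame cost.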
Gesture Recognition Layer: MLP Neural Network Design
The input layer receives a 42-dimensional vector of hand key point coordinates (the x and y values of each of the 21 points of one hand). Two hidden layers (128 and 64 neurons, ReLU activation) extract features, and the output layer produces a probability distribution over the letters of the Arabic Sign Language alphabet. An MLP is chosen for its simple structure, fast training and inference, and small model size, all of which suit real-time use.
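The forward pass of such a network can be sketched with the standard library alone. This is only a minimal illustration of the 42, 128, 64, output shape described here, with random untrained weights; a real implementation would more likely use a framework such as PyTorch or Keras. The class name `SignMLP`, the He-style initialization, and the `num_classes` parameter are assumptions, since the number of output classes is not stated.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random weights (He-style scale, an assumed choice for ReLU) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(x: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


class SignMLP:
    """42 inputs -> 128 ReLU -> 64 ReLU -> num_classes softmax (untrained sketch)."""

    def __init__(self, num_classes: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden = [init_layer(42, 128, rng), init_layer(128, 64, rng)]
        self.output = init_layer(64, num_classes, rng)

    def predict_proba(self, features: List[float]) -> List[float]:
        if len(features) != 42:
            raise ValueError(f"expected 42 features, got {len(features)}")
        h = features
        for layer in self.hidden:
            h = relu(dense(h, layer))
        return softmax(dense(h, self.output))
```

Calling `SignMLP(num_classes=n).predict_proba(features)` returns `n` probabilities that sum to 1. With this layout the parameter count stays in the low tens of thousands for any realistic alphabet size, which is consistent with the small model size mentioned above.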
Application Interaction Layer: FastAPI and React Combination
The backend uses FastAPI, which handles highly concurrent video stream requests and generates API documentation automatically. The React frontend keeps a WebSocket connection open to the backend, sending video frames and displaying recognition results in real time.
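One possible shape for the backend side of this exchange is sketched below. The `/ws` route, the binary frame format, the JSON reply fields, and the names `build_result` and `create_app` are all assumptions made for illustration; the project's actual protocol is not described here. The `predict_proba` argument stands in for the detection and recognition layers.

```python
from typing import Callable, Dict, Sequence


def build_result(probabilities: Sequence[float], labels: Sequence[str]) -> Dict[str, object]:
    """Pick the most probable label and package it as a JSON-friendly dict."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return {"label": labels[best], "confidence": float(probabilities[best])}


def create_app(predict_proba: Callable[[bytes], Sequence[float]], labels: Sequence[str]):
    """Build a FastAPI app exposing a WebSocket endpoint at the assumed path /ws.

    `predict_proba` is a hypothetical callable that maps one encoded video
    frame (raw bytes) to class probabilities.
    """
    # Imported lazily so build_result can be used without FastAPI installed.
    from fastapi import FastAPI, WebSocket, WebSocketDisconnect

    app = FastAPI()

    @app.websocket("/ws")
    async def recognize(websocket: WebSocket) -> None:
        await websocket.accept()
        try:
            while True:
                # One binary message per video frame sent by the frontend.
                frame_bytes = await websocket.receive_bytes()
                probabilities = predict_proba(frame_bytes)
                await websocket.send_json(build_result(probabilities, labels))
        except WebSocketDisconnect:
            pass  # the client closed the connection; nothing to clean up

    return app
```

An app built this way could be served with an ASGI server such as uvicorn. Because `predict_proba` is called synchronously inside the handler, a slow model would block the event loop, so heavier inference would probably need to run in a worker thread or process.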