Zing Forum

CNN-Based American Sign Language Recognition System: A Complete Implementation from Baseline Model to Mobile Optimization

A fully reproducible deep learning project using PyTorch to implement convolutional neural networks for recognizing 24 static American Sign Language (ASL) gestures, comparing three approaches: baseline CNN, regularized custom CNN, and MobileNetV2 transfer learning.

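Tags: Sign Language Recognition, Convolutional Neural Network, Deep Learning, PyTorch, Transfer Learning, MobileNetV2, American Sign Language, Explainable AI, Grad-CAM, Computer Vision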
Published 2026-06-07 03:14 | Recent activity 2026-06-07 03:20 | Estimated read 6 min

Section 01

[Introduction] Core Overview of the CNN-Based American Sign Language Recognition System

Project Core Overview

This project was released by Tao-feek001 on GitHub on June 6, 2026 (repository name: Hand-Sign-Recognition-Using-CNN). It uses deep learning to recognize 24 static American Sign Language (ASL) gestures: the letters A through Y except J (J and Z are excluded because signing them involves motion). The project compares three CNN architectures: a baseline CNN, a regularized custom CNN, and a MobileNetV2 transfer learning model adapted for grayscale input. It covers the complete workflow, including dataset preprocessing, experimental design, interpretability analysis, and reproducibility guarantees. The custom CNN was ultimately selected as the best trade-off between accuracy and computational efficiency.
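To make the 24-class label space concrete, here is a minimal sketch of a class-index-to-letter mapping. It assumes class indices 0 to 23 follow the alphabetical order of the static letters; the repository may number its classes differently, and the names `STATIC_LETTERS`, `LETTER_TO_INDEX`, and `decode` are illustrative only.

```python
import string

# Assumption: indices 0-23 follow alphabetical order of the 24 static letters.
# J and Z are left out, as in the article.
STATIC_LETTERS = [c for c in string.ascii_uppercase if c not in ("J", "Z")]
LETTER_TO_INDEX = {letter: i for i, letter in enumerate(STATIC_LETTERS)}


def decode(indices):
    """Turn a sequence of predicted class indices into a string of letters."""
    return "".join(STATIC_LETTERS[i] for i in indices)


print(len(STATIC_LETTERS))
print(decode([7, 4, 10, 10, 13]))
```

Note that skipping J shifts every later letter down by one index, so K maps to 9 rather than 10 under this assumed ordering.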

Section 02

Research Background and Dataset Preprocessing

Research Background and Dataset

Background: Sign language is the primary means of communication for the hearing-impaired community, but it is not widely used outside that community, which creates communication barriers. Automatic recognition technology can help lower these barriers.

Dataset: 34,027 grayscale images of 28×28 pixels in total, with 26,755 in the training set and 7,272 in the test set, organized by category.

Preprocessing:

  • Statistical normalization using the dataset's own mean and standard deviation;
  • Stratified sampling for the training/validation split, so every class keeps the same proportion in both sets;
  • Domain-aware augmentation that excludes horizontal flipping (to avoid gesture confusion).
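The normalization and stratified split described above can be sketched as follows. This is a sketch rather than the repository's code: the data is synthetic, and the 20% validation fraction and the helper names `stratified_split` and `normalize` are assumptions.

```python
import random
from collections import defaultdict

import torch


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx), splitting each class in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)


def normalize(images, mean, std):
    """Standardize pixel values with precomputed dataset statistics."""
    return (images - mean) / std


# Synthetic stand-in data: 240 fake 28x28 grayscale images, 10 per class.
torch.manual_seed(0)
images = torch.rand(240, 1, 28, 28)
labels = [i % 24 for i in range(240)]

train_idx, val_idx = stratified_split(labels, val_fraction=0.2, seed=0)

# Statistics come from the training split only, then are reused for validation.
train_images = images[train_idx]
mean, std = train_images.mean(), train_images.std()
train_norm = normalize(train_images, mean, std)
val_norm = normalize(images[val_idx], mean, std)
```

Computing the mean and standard deviation on the training split alone keeps validation data from leaking into preprocessing; whether the project does the same is not stated in the article.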
Section 03

Comparison of Three Model Architectures

  1. Baseline CNN: A minimalist design (2 convolutional layers) that serves as the performance benchmark against which the gains of the more complex models are judged.
  2. Custom CNN: 4 convolutional blocks (each with a convolutional layer, batch normalization, and Dropout) that balance model capacity against regularization to prevent overfitting.
  3. MobileNetV2 Transfer Learning: The original RGB input layer is changed to accept single-channel grayscale images and the classification head is replaced with a 24-class output, to explore the potential of pre-trained models.
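A rough PyTorch sketch of the three architectures is given below. The article specifies only the layer counts and the grayscale adaptation, so the channel widths, dropout rate, pooling layout, and the trick of summing the pre-trained RGB filters into one channel are all assumptions. `grayscale_mobilenet_v2` needs torchvision and is defined but not called here.

```python
import torch
from torch import nn

NUM_CLASSES = 24  # static ASL letters, per the article


class BaselineCNN(nn.Module):
    """Two convolutional layers, no regularization (widths are assumed)."""

    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))


def conv_block(in_ch, out_ch, pool, p_drop=0.1):
    """Conv + batch norm + ReLU (+ optional pooling) + dropout."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU()]
    if pool:
        layers.append(nn.MaxPool2d(2))
    layers.append(nn.Dropout2d(p_drop))
    return nn.Sequential(*layers)


class CustomCNN(nn.Module):
    """Four regularized convolutional blocks followed by global pooling."""

    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32, pool=True),      # 28 -> 14
            conv_block(32, 64, pool=True),     # 14 -> 7
            conv_block(64, 128, pool=True),    # 7 -> 3
            conv_block(128, 256, pool=False),  # stays 3x3
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))


def grayscale_mobilenet_v2(num_classes=NUM_CLASSES, weights=None):
    """MobileNetV2 with a 1-channel stem and a new head (requires torchvision)."""
    from torchvision.models import mobilenet_v2

    model = mobilenet_v2(weights=weights)  # e.g. weights="IMAGENET1K_V1"
    old = model.features[0][0]             # original 3-channel stem convolution
    new = nn.Conv2d(1, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding, bias=False)
    with torch.no_grad():
        # Assumed initialization: collapse the RGB filters into one channel.
        new.weight.copy_(old.weight.sum(dim=1, keepdim=True))
    model.features[0][0] = new
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    return model


x = torch.randn(4, 1, 28, 28)  # dummy batch of four grayscale images
for model in (BaselineCNN(), CustomCNN()):
    model.eval()
    n_params = sum(p.numel() for p in model.parameters())
    print(type(model).__name__, tuple(model(x).shape), n_params)
```

Ending the custom CNN with global average pooling keeps its classifier small, which is one plausible way to reach the low parameter count the article credits it with.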
Section 04

Experimental Design and Reproducibility Guarantees

Experimental Design and Reproducibility

Experimental Optimization:

  • Optimizer comparison (SGD, Adam, RMSprop);
  • Learning rate grid search;
  • Augmentation ablation experiments;
  • Multi-seed evaluation (3 random seeds, reporting mean ± standard deviation).

Reproducibility Guarantees:

  • Fixed random seeds;
  • CUDA deterministic configuration;
  • Fixed dependency versions (requirements.txt);
  • Save model weights, visualization charts, and other intermediate products.
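The seeding, deterministic configuration, optimizer/learning-rate grid, and multi-seed reporting listed above might be wired together roughly as follows. The learning-rate values, the SGD momentum, the seed values, and the helper names are assumptions, and a linear layer stands in for the real models.

```python
import itertools
import random
import statistics

import torch
from torch import nn


def seed_everything(seed):
    """Fix the Python and PyTorch RNGs and request deterministic cuDNN kernels."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # If NumPy is used in the pipeline, seed it here as well.


# The three optimizers named in the article; hyperparameters are assumed.
OPTIMIZERS = {
    "sgd": lambda params, lr: torch.optim.SGD(params, lr=lr, momentum=0.9),
    "adam": lambda params, lr: torch.optim.Adam(params, lr=lr),
    "rmsprop": lambda params, lr: torch.optim.RMSprop(params, lr=lr),
}
LEARNING_RATES = [1e-2, 1e-3, 1e-4]  # assumed grid
SEEDS = [0, 1, 2]                    # three seeds, values assumed
GRID = list(itertools.product(OPTIMIZERS, LEARNING_RATES))


def build_run(optimizer_name, lr, seed):
    """Create a freshly seeded model/optimizer pair for one grid cell."""
    seed_everything(seed)
    model = nn.Linear(28 * 28, 24)  # stand-in for the real CNN
    optimizer = OPTIMIZERS[optimizer_name](model.parameters(), lr)
    return model, optimizer


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


for name, lr in GRID:
    for seed in SEEDS:
        model, optimizer = build_run(name, lr, seed)
        # Train and evaluate here, then collect one accuracy per seed
        # and report summarize(accuracies) for this (name, lr) cell.
```

Calling `seed_everything` before the model is constructed matters, because weight initialization also draws from the random number generator.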
Section 05

Experimental Results and Model Analysis

Result Analysis and Model Selection

Optimal model: the custom CNN, chosen for the following reasons:

  • Satisfactory test set accuracy;
  • Small number of parameters and fast inference speed (CPU/GPU latency test);
  • Stable training (effective regularization);
  • Strong interpretability (Grad-CAM visualization focuses on key gesture areas).

Analysis:

  • Error cases and confusion matrix identify easily confused gesture pairs;
  • Inference performance tests (CPU/GPU latency, throughput) provide references for deployment.
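The three analysis tools mentioned above (Grad-CAM, confusion-pair ranking, and latency/throughput measurement) can be sketched compactly in PyTorch. This illustrates the general techniques rather than the project's code: the tiny stand-in model, the choice of target layer, and the warm-up and run counts are all assumptions.

```python
import time

import torch
import torch.nn.functional as F
from torch import nn


def grad_cam(model, target_layer, x, class_idx=None):
    """Return Grad-CAM heatmaps of shape (N, 1, H, W), scaled to [0, 1]."""
    cache = {}

    def save_activation(_module, _inputs, output):
        output.retain_grad()  # keep the gradient of this non-leaf tensor
        cache["act"] = output

    handle = target_layer.register_forward_hook(save_activation)
    try:
        logits = model(x)
    finally:
        handle.remove()
    if class_idx is None:
        class_idx = logits.argmax(dim=1)
    model.zero_grad()
    logits.gather(1, class_idx.view(-1, 1)).sum().backward()
    act, grad = cache["act"], cache["act"].grad
    weights = grad.mean(dim=(2, 3), keepdim=True)           # per-channel importance
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))  # weighted sum, then ReLU
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)).detach()


def most_confused(preds, targets, num_classes, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cm = torch.bincount(targets * num_classes + preds, minlength=num_classes ** 2)
    off_diag = cm.reshape(num_classes, num_classes).clone()  # rows: true class
    off_diag.fill_diagonal_(0)
    flat = off_diag.flatten()
    order = torch.argsort(flat, descending=True)[:k]
    return [(int(i) // num_classes, int(i) % num_classes, int(flat[i])) for i in order]


@torch.no_grad()
def measure_latency(model, x, warmup=5, runs=20):
    """Average forward-pass time per batch and the implied throughput."""
    model.eval()
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # GPU kernels run asynchronously
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    per_batch = (time.perf_counter() - start) / runs
    return {"latency_ms": per_batch * 1e3, "images_per_s": x.shape[0] / per_batch}


# Tiny stand-in model and random inputs, only to exercise the helpers.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 24),
).eval()
x = torch.randn(2, 1, 28, 28)

heatmaps = grad_cam(model, model[0], x)
pairs = most_confused(torch.tensor([0, 1, 1, 1, 0, 0]),
                      torch.tensor([0, 0, 1, 1, 2, 2]), num_classes=3, k=2)
timing = measure_latency(model, x)
```

Grad-CAM is usually applied to the last convolutional layer, where feature maps are most class-specific. On a GPU, the explicit synchronization is what makes the wall-clock timing meaningful.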
Section 06

Application Value and Future Expansion Directions

Application Value and Future Directions

Applications:

  • Assistive communication tools (between hearing-impaired and hearing communities);
  • Sign language learning education aid;
  • Foundation for complex sign language recognition research.

Future Expansion:

  • Extend to complete ASL vocabulary (including dynamic gestures);
  • Integrate real-time recognition on mobile devices;
  • Combine pose estimation to handle complex scenarios;
  • Multimodal fusion (facial expressions + lip reading).