Section 01
MSP Multimodal Speech Recognition: Fusing Audio and Lip Reading to Overcome Noisy ASR Challenges (Introduction)
This article introduces the MSP (Multimodal Speech Perception) project, a multimodal speech recognition system that fuses audio with visual lip reading. It combines a Wav2Vec2 audio encoder and a visual lip-reading encoder through a cross-attention mechanism, and supports three modes: audio-only, visual-only, and audio-visual fusion. The project's goal is to counter the drop in automatic speech recognition (ASR) accuracy in noisy environments, where the visual stream supplies cues that degraded audio cannot. Built on Python 3.10+ and PyTorch 2.9, the system has been trained and evaluated on the LRS2 dataset.