Audio Feature Extraction
The audio branch usually extracts the following types of features:
Acoustic Features: Including fundamental frequency (F0), formants, Mel-frequency cepstral coefficients (MFCCs), etc., which reflect the physical characteristics of speech production.
Prosodic Features: Speech rate, pause duration and frequency, pitch variation range, etc., which are related to language fluency and cognitive load.
Speech Quality Features: Jitter, shimmer, harmonic-to-noise ratio (HNR), etc., which may reflect changes in neuromuscular control.
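Two of the features above can be computed with nothing beyond NumPy. The sketch below estimates F0 from a single frame via autocorrelation and computes jitter as the mean relative variation of consecutive pitch periods; the function names, thresholds (e.g. the 0.3 voicing peak cutoff), and frequency bounds are illustrative choices, not part of the project's pipeline.

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (F0) of one frame via autocorrelation.

    Returns 0.0 when no clear periodicity is found (e.g. unvoiced frames).
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                       # normalize so lag 0 == 1
    lo = int(sr / fmax)                   # shortest plausible pitch period
    hi = min(int(sr / fmin), len(ac) - 1)  # longest plausible pitch period
    if lo >= hi:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    # A weak autocorrelation peak suggests an unvoiced frame (threshold is heuristic)
    return sr / lag if ac[lag] > 0.3 else 0.0

def jitter_percent(periods):
    """Jitter: mean absolute difference of consecutive pitch periods,
    as a percentage of the mean period."""
    periods = np.asarray(periods, dtype=float)
    if len(periods) < 2:
        return 0.0
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Sanity check on a synthetic 200 Hz tone
sr = 16000
t = np.arange(0, 0.03, 1 / sr)
f0 = estimate_f0_autocorr(np.sin(2 * np.pi * 200 * t), sr)
```

In practice, libraries such as librosa or Praat-based tools are typically used for these measurements; the point here is only to make the definitions concrete.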
Text Feature Extraction
The text branch focuses on multiple dimensions of language use:
Lexical Features: Word frequency distribution, vocabulary diversity, word length distribution, semantic density, etc.
Syntactic Features: Sentence length, syntactic complexity, clause usage frequency, grammatical error rate, etc.
Semantic Features: Contextual semantic representations extracted using pre-trained language models (e.g., BERT, RoBERTa).
Pragmatic Features: Discourse coherence, topic maintenance ability, information content density, etc.
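A few of the lexical and syntactic features above reduce to simple counting over a transcript. The sketch below is a minimal, assumption-laden example: it uses the type-token ratio as the vocabulary-diversity measure and naive regex tokenization, which real pipelines would replace with proper tokenizers and length-normalized diversity metrics.

```python
import re

def lexical_features(text):
    """Compute simple lexical/syntactic features from a transcript:
    type-token ratio (vocabulary diversity), mean word length,
    and mean sentence length in words."""
    # Naive sentence split and tokenization (illustrative only)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {"ttr": 0.0, "mean_word_len": 0.0, "mean_sent_len": 0.0}
    return {
        "ttr": len(set(words)) / len(words),
        "mean_word_len": sum(len(w) for w in words) / len(words),
        "mean_sent_len": len(words) / max(len(sentences), 1),
    }

feats = lexical_features("The cat sat. The cat sat on the mat!")
```

Note that the raw type-token ratio shrinks as transcripts grow longer, so comparisons across speakers usually rely on length-corrected variants (e.g. moving-average TTR).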
Multimodal Fusion Strategies
The project explores various fusion strategies:
Early Fusion: Concatenate audio and text features at the feature level and input them into a unified classifier.
Mid Fusion: Learn audio and text representations separately, then fuse them interactively at intermediate network layers.
Late Fusion: Each modality produces an independent prediction, and the results are combined by voting or weighted averaging.
Attention Mechanism: Use cross-modal attention mechanisms to allow the model to learn the associations between audio and text features.
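The four strategies can be contrasted in a few lines of NumPy. The sketch below uses toy random features and stand-in predictions (all shapes and weights are illustrative, not the project's actual configuration): early fusion concatenates pooled features, late fusion averages per-modality probabilities, and cross-modal attention lets text tokens attend to audio frames via scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-utterance representations (dimensions are illustrative)
audio = rng.normal(size=(8, 16))   # 8 audio frames, 16-dim each
text = rng.normal(size=(5, 16))    # 5 token embeddings, 16-dim each

# Early fusion: concatenate pooled features into one vector for a unified classifier
early = np.concatenate([audio.mean(axis=0), text.mean(axis=0)])  # shape (32,)

# Late fusion: each branch predicts independently; combine by weighted averaging
p_audio = np.array([0.7, 0.3])     # stand-in for the audio branch's class probabilities
p_text = np.array([0.4, 0.6])      # stand-in for the text branch's class probabilities
late = 0.5 * p_audio + 0.5 * p_text

# Cross-modal attention: text tokens (queries) attend to audio frames (keys/values)
def cross_attention(queries, keys, values):
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ values

attended = cross_attention(text, audio, audio)  # one audio-informed vector per token
```

Mid fusion would interleave such attention (or simpler interaction layers) inside the two branches rather than at their outputs, which is why it is usually sketched as part of a full model rather than a standalone function.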