The audx-realtime processing pipeline breaks audio processing into a few distinct stages:
First, the input audio is resampled to 48 kHz if its original sample rate differs, using the SpeexDSP library for high-quality sample rate conversion. The audio is then split into frames of 480 samples (10 milliseconds at 48 kHz), which are the basic unit of processing.
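The framing arithmetic can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of audx-realtime, and zero-padding the final partial frame is an assumption made here for illustration; the library may handle trailing samples differently.

```python
SAMPLE_RATE_HZ = 48_000  # processing rate stated in the text
FRAME_SIZE = 480         # samples per frame stated in the text


def frame_duration_ms(frame_size=FRAME_SIZE, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return the duration of one frame in milliseconds."""
    return 1000.0 * frame_size / sample_rate_hz


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Split a sequence of samples into fixed-size frames.

    Assumption: the last frame is zero-padded to full length.
    """
    frames = []
    for start in range(0, len(samples), frame_size):
        frame = list(samples[start:start + frame_size])
        # Pad a short trailing frame with zeros so every frame has frame_size samples.
        frame.extend([0] * (frame_size - len(frame)))
        frames.append(frame)
    return frames
```

With these definitions, `frame_duration_ms()` evaluates to 10.0, matching the 10 millisecond figure, and 1000 input samples yield three frames, the last of which holds 40 real samples followed by padding.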
Next, the algorithm extracts 42 acoustic features from each frame, including the spectral envelope, a fundamental frequency estimate, and spectral flatness, which serve as input to the neural network. From these features, the network computes a gain for each of 22 frequency bands and outputs a voice activity probability.
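To make one of these features concrete, the sketch below computes spectral flatness using its textbook definition: the ratio of the geometric mean to the arithmetic mean of a power spectrum. This is not taken from the audx-realtime source, which may compute the feature differently; the function name and the `eps` parameter are assumptions for illustration.

```python
import math


def spectral_flatness(power_spectrum, eps=1e-12):
    """Textbook spectral flatness: geometric mean / arithmetic mean.

    Values near 1 indicate a flat, noise-like spectrum;
    values near 0 indicate a peaky, tonal spectrum.
    eps (an assumption) avoids taking the log of zero.
    """
    values = [p + eps for p in power_spectrum]
    # Compute the geometric mean in the log domain for numerical stability.
    log_mean = sum(math.log(v) for v in values) / len(values)
    geometric_mean = math.exp(log_mean)
    arithmetic_mean = sum(values) / len(values)
    return geometric_mean / arithmetic_mean
```

A perfectly flat spectrum gives a value of about 1, while a spectrum with all of its energy in a single bin gives a value close to 0, which is why the measure helps separate noise-like frames from tonal or voiced ones.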
Finally, these gains are applied to the original spectrum to achieve denoising. If needed, the processed audio is resampled back to the original sampling rate for output. The total delay of the entire processing pipeline is approximately 10 to 13 milliseconds, meeting the strict requirements of real-time communication.
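The gain-application step can be sketched as follows. This is a simplified illustration, not the audx-realtime implementation: the band edges are hypothetical, and each gain is applied as a constant across its band, whereas a real implementation may interpolate gains between neighbouring bands.

```python
def apply_band_gains(spectrum, band_edges, gains):
    """Scale each frequency bin by the gain of the band that contains it.

    spectrum:   list of complex frequency bins for one frame
    band_edges: hypothetical bin indices delimiting the bands;
                band i covers bins band_edges[i] .. band_edges[i + 1] - 1
    gains:      one gain per band (typically between 0 and 1)
    """
    if len(band_edges) != len(gains) + 1:
        raise ValueError("band_edges must have exactly one more entry than gains")
    out = list(spectrum)  # bins outside every band are left unchanged
    for band, gain in enumerate(gains):
        start = band_edges[band]
        stop = min(band_edges[band + 1], len(out))
        for k in range(start, stop):
            out[k] = spectrum[k] * gain
    return out
```

In the pipeline described above there would be 22 gains per frame, so `band_edges` would hold 23 entries. A gain near 1 leaves a band untouched, and a gain near 0 suppresses it.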