Section 01
RLAIF-SPA: Introduction to Emotional Speech Synthesis Driven by Reinforcement Learning from AI Feedback
RLAIF-SPA is a framework that integrates automatic speech recognition (Whisper) with large language models (Qwen2-Audio, GPT-4o). It uses Reinforcement Learning from AI Feedback (RLAIF) to address the trade-off between emotional expressiveness and intelligibility in emotional speech synthesis, without requiring expensive manual annotation. Its key innovations are a fine-grained prosody label system with four dimensions and optimization with GRPO (Group Relative Policy Optimization). Experiments show significant improvements in intelligibility (WER reduced by 26.1%) and speaker similarity (SIM-O increased by 9.1%), making RLAIF-SPA a successful example of applying RLAIF in a specific domain.
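To make the GRPO step more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several candidate outputs are sampled for the same input, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions. How RLAIF-SPA actually derives its rewards from the AI feedback is not specified in this section.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per candidate sampled for the same
    input. The name and signature are hypothetical, chosen for illustration.
    """
    mu = mean(rewards)
    # Assumption: population standard deviation; some implementations
    # use the sample standard deviation instead.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four synthesized candidates of one utterance.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean get a positive advantage and are reinforced, while those below the mean get a negative one. Because the baseline comes from the group itself, this formulation needs no separately trained value model.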