F5-TTS-DPS: Achieving Undetectable High-Naturalness Speech Synthesis via EMA-Stabilized Training and Dual-Score Prompt Selection

This article introduces F5-TTS-DPS, the winning solution for the TTS track of the WildSpoof 2026 Challenge. Built on the F5-TTS architecture, the model adds Exponential Moving Average (EMA) parameter smoothing and a dual-score prompt selection mechanism driven by a large language model (LLM) and a large audio-language model (LALM). It achieved the best a-DCF scores on three advanced SASV detection systems, producing highly natural synthetic speech that is difficult to detect.

Published 2026-05-23 01:18 | Recent activity 2026-05-25 14:18 | Estimated read 6 min

Section 01

F5-TTS-DPS: Guide to the Winning Solution for the WildSpoof 2026 TTS Track

F5-TTS-DPS builds on the F5-TTS architecture and combines EMA-stabilized fine-tuning with a dual-score prompt selection mechanism. It won the TTS track of the WildSpoof 2026 Challenge, achieving the best a-DCF scores on three advanced SASV detection systems with speech that is both natural and difficult to detect.

Original author team: WildSpoof 2026 TTS track participating team
Source platform: arXiv
Release date: May 22, 2026
Original link: http://arxiv.org/abs/2605.23859v1

Keywords: TTS, speech synthesis, anti-spoofing detection, EMA, prompt selection, WildSpoof, F5-TTS, deepfake, speech security


Section 02

Background: The Arms Race Between Speech Synthesis and Anti-Spoofing Detection

In recent years, Text-to-Speech (TTS) technology has advanced rapidly, and the naturalness of synthetic speech now approaches human levels. That progress also creates a security problem: deepfake speech is increasingly easy to produce. Spoofing-aware speaker verification (SASV) systems attempt to distinguish real speech from synthetic speech, but the competition is far from over: as detection systems improve, more advanced TTS models look for ways around them. The WildSpoof Challenge asks participants to train TTS models on real-world data and generate synthetic speech that is both natural and difficult for existing detection systems to identify. F5-TTS-DPS is the winning solution for the TTS track of this competition.


Section 03

Technical Solution: EMA-Stabilized Training and Dual-Score Prompt Selection

F5-TTS-DPS is based on the F5-TTS architecture. Its core innovations are:

  1. EMA-enhanced Supervised Fine-Tuning (SFT): Plain SFT is prone to parameter oscillations. EMA maintains a smoothed copy of the parameters, θ_EMA(t) = α·θ_EMA(t-1) + (1-α)·θ(t), where θ(t) denotes the current model parameters and α is the decay rate. This damps noisy updates and improves generalization.

  2. Dual-Score Prompt Selection: A large language model (LLM) evaluates the grammar, semantics, and naturalness of each text prompt, and a large audio-language model (LALM) assesses the acoustic quality, clarity, and text alignment of the reference audio. Filtering on both scores ensures high-quality training data.
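
The two components above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ema_update`, `PromptCandidate`, and `select_prompts`, the 0 to 1 score range, the thresholds, and the rule that a candidate must pass both scores are all assumptions made for illustration. The two scorer callables are hypothetical stand-ins for an LLM call and an LALM call, which the source does not describe in detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def ema_update(ema: Dict[str, float], params: Dict[str, float],
               alpha: float = 0.999) -> Dict[str, float]:
    """One EMA step: theta_EMA(t) = alpha * theta_EMA(t-1) + (1 - alpha) * theta(t)."""
    return {name: alpha * ema[name] + (1.0 - alpha) * value
            for name, value in params.items()}


@dataclass
class PromptCandidate:
    text: str        # transcript paired with the reference audio
    audio_path: str  # path to the reference audio (hypothetical field)


def select_prompts(candidates: List[PromptCandidate],
                   text_scorer: Callable[[str], float],
                   audio_scorer: Callable[[str, str], float],
                   text_min: float = 0.7,
                   audio_min: float = 0.7) -> List[PromptCandidate]:
    """Keep candidates whose text score and audio score both reach
    their thresholds (assumed combination rule)."""
    return [c for c in candidates
            if text_scorer(c.text) >= text_min
            and audio_scorer(c.audio_path, c.text) >= audio_min]


# Toy demonstration: the raw parameter jumps between 0.0 and 2.0,
# while the EMA copy moves only slightly around their mean.
ema_params = {"w": 1.0}
for step in range(100):
    raw_params = {"w": 0.0 if step % 2 == 0 else 2.0}
    ema_params = ema_update(ema_params, raw_params, alpha=0.9)
print(ema_params)
```

In a real training loop the dictionaries would be replaced by the model's parameter tensors, and the EMA copy, rather than the raw weights, would typically be the one used for evaluation and inference.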


Section 04

Experimental Results: Performance of High Naturalness and Undetectability

Performance of F5-TTS-DPS on the WildSpoof 2026 development set:

Metric              Value              Description
UTMOS               3.20               Speech naturalness score (higher means more natural)
Speaker Similarity  0.51               Similarity between synthetic speech and target speaker
WER                 Competitive level  Word Error Rate, reflecting pronunciation accuracy
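
The source does not say how these metrics were computed. As a rough illustration only, WER is conventionally the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words, and speaker similarity is often reported as the cosine similarity between speaker embeddings. The sketch below assumes both conventions; the function names are hypothetical, and the embeddings are plain lists standing in for the output of a speaker-verification model that is not shown. UTMOS is a learned predictor and is not sketched here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```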

a-DCF scores on three advanced SASV detection systems (lower means harder to detect):

Detection System  a-DCF Score  Rank
System 1          0.1582       1st
System 2          0.5233       1st
System 3          0.2562       1st

Section 05

Technical Insights: Balance Between Naturalness and Deceptiveness and Application Significance

The results suggest that the boundary between naturalness and deceptiveness is blurred. The traditional view holds that highly natural synthetic speech is still easy to detect, but F5-TTS-DPS balances the two through its training strategy. The key techniques are EMA-stabilized training and dual-score data filtering. Application significance:

  1. Positive aspects: Offers new ideas for high-quality personalized speech synthesis (voice assistants, audiobooks, etc.);
  2. Security challenges: Existing detection systems must be upgraded faster to counter the threat from new-generation TTS.

Section 06

Conclusion & Outlook: Technical Competition Drives Progress in the Field

The strong performance of F5-TTS-DPS in WildSpoof 2026 marks a new stage for TTS technology. Further innovations are likely to follow, and the speech security field must keep evolving to counter increasingly realistic synthetic speech. This technical contest drives progress on both sides.