# F5-TTS-DPS: Achieving Undetectable High-Naturalness Speech Synthesis via EMA-Stabilized Training and Dual-Score Prompt Selection

> This article introduces F5-TTS-DPS, the winning solution for the TTS track of the WildSpoof 2026 Challenge. Based on the F5-TTS architecture, this model incorporates Exponential Moving Average (EMA) and a dual-score prompt selection mechanism based on LLM/LALM. It achieved the best a-DCF scores on three advanced SASV detection systems, generating synthetic speech with extremely high naturalness that is difficult to detect and identify.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-22T17:18:50.000Z
- Last activity: 2026-05-25T06:18:48.380Z
- Popularity: 83.0
- Keywords: TTS, speech synthesis, anti-spoofing detection, EMA, prompt selection, WildSpoof, F5-TTS, deepfake, speech security
- Page link: https://www.zingnex.cn/en/forum/thread/f5-tts-dps-ema
- Canonical: https://www.zingnex.cn/forum/thread/f5-tts-dps-ema

---

## F5-TTS-DPS: Guide to the Winning Solution for the WildSpoof 2026 TTS Track

This article introduces F5-TTS-DPS, the winning solution for the TTS track of the WildSpoof 2026 Challenge. Based on the F5-TTS architecture, this model incorporates Exponential Moving Average (EMA) and a dual-score prompt selection mechanism. It achieved the best a-DCF scores on three advanced SASV detection systems, generating speech with high naturalness that is difficult to detect.

- Original author team: WildSpoof 2026 TTS track participating team
- Source platform: arXiv
- Release date: May 22, 2026
- Original link: http://arxiv.org/abs/2605.23859v1


## Background: The Arms Race Between Speech Synthesis and Anti-Spoofing Detection

Text-to-Speech (TTS) technology has advanced rapidly in recent years, and the naturalness of synthetic speech now approaches human levels. That progress also brings a security problem: deepfake speech is becoming widespread. Spoofing-Aware Speaker Verification (SASV) systems try to tell genuine speech from synthetic speech, but the contest is far from settled: as detection systems improve, more advanced TTS models look for ways around them. The WildSpoof Challenge asks participants to train TTS models on real-world data and generate synthetic speech that is both natural and hard for existing detection systems to identify. F5-TTS-DPS is the winning solution for the TTS track of this competition.

## Technical Solution: EMA-Stabilized Training and Dual-Score Prompt Selection

F5-TTS-DPS is based on the F5-TTS architecture, with two core innovations:

1. EMA-enhanced Supervised Fine-Tuning (SFT): Traditional SFT is prone to parameter oscillations. EMA maintains a smoothed parameter copy (θ_EMA(t) = α·θ_EMA(t-1)+(1-α)·θ(t)), suppressing noise disturbances and improving generalization ability.

2. Dual-Score Prompt Selection: An LLM rates the grammar, semantics, and naturalness of each text prompt, and a large audio-language model (LALM) rates the acoustic quality, clarity, and text alignment of the reference audio. Filtering on both scores keeps only high-quality training data.
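
The EMA update in item 1 can be illustrated with a minimal sketch. It uses plain Python lists in place of model tensors, and the parameter name `w`, the decay `alpha=0.9`, and the noisy values are illustrative assumptions rather than settings reported for F5-TTS-DPS.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(ema_params: Params, params: Params, alpha: float = 0.999) -> None:
    """Update the smoothed copy in place: ema = alpha * ema + (1 - alpha) * theta."""
    for name, values in params.items():
        smoothed = ema_params[name]
        for i, value in enumerate(values):
            smoothed[i] = alpha * smoothed[i] + (1.0 - alpha) * value


# Toy run: a single weight (hypothetical name "w") that oscillates around 1.0.
params: Params = {"w": [1.0]}
ema_params: Params = {"w": list(params["w"])}  # EMA copy starts from the current weights

for noisy_value in [1.4, 0.6, 1.3, 0.7, 1.2, 0.8]:
    params["w"][0] = noisy_value  # stand-in for one noisy optimizer step
    ema_update(ema_params, params, alpha=0.9)

print("raw weight:", params["w"][0])
print("EMA weight:", ema_params["w"][0])
```

In this toy run the raw weight swings between 0.6 and 1.4 while the EMA copy stays close to 1.0, which is the smoothing effect the formula describes. In practice the smoothed copy, not the raw weights, would typically be the one used for evaluation.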

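The dual filtering in item 2 can be sketched as follows. The sketch assumes the LLM and LALM ratings have already been computed and normalized to the range 0 to 1; the `PromptCandidate` class, the `select_prompts` function, the 0.7 thresholds, the ranking rule, and the file names are all hypothetical, since the source does not state how the two scores are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptCandidate:
    text: str
    audio_path: str
    text_score: float   # assumed LLM rating of the text prompt, in [0, 1]
    audio_score: float  # assumed LALM rating of the reference audio, in [0, 1]


def select_prompts(
    candidates: List[PromptCandidate],
    min_text: float = 0.7,   # hypothetical threshold
    min_audio: float = 0.7,  # hypothetical threshold
) -> List[PromptCandidate]:
    """Keep candidates that pass both thresholds, best weakest-score first."""
    kept = [
        c for c in candidates
        if c.text_score >= min_text and c.audio_score >= min_audio
    ]
    # Rank by the weaker of the two scores so one strong score cannot hide a weak one.
    return sorted(kept, key=lambda c: min(c.text_score, c.audio_score), reverse=True)


candidates = [
    PromptCandidate("The weather is nice today.", "ref_001.wav", 0.92, 0.88),
    PromptCandidate("weather nice the is", "ref_002.wav", 0.35, 0.90),
    PromptCandidate("Please call me back later.", "ref_003.wav", 0.85, 0.40),
    PromptCandidate("See you at the station at noon.", "ref_004.wav", 0.80, 0.95),
]
selected = select_prompts(candidates)
print([c.audio_path for c in selected])
```

The second candidate is dropped for its text score and the third for its audio score, so only pairs that are acceptable on both sides survive.
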
## Experimental Results: Performance of High Naturalness and Undetectability

Performance of F5-TTS-DPS on the WildSpoof 2026 development set:

| Metric | Value | Description |
|--------|-------|-------------|
| UTMOS | 3.20 | Speech naturalness score (higher means more natural) |
| Speaker Similarity | 0.51 | Similarity between synthetic speech and target speaker |
| WER | Competitive level | Word Error Rate, reflecting pronunciation accuracy |
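
For readers unfamiliar with the WER row, the metric is conventionally defined as (substitutions + deletions + insertions) divided by the number of reference words. The sketch below computes it with a word-level edit distance. It assumes a transcript of the synthetic speech is already available; the function name `word_error_rate` and the example sentences are illustrative, and the ASR system used in the challenge is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is missing from the transcript.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(example_wer)
```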

a-DCF scores on three advanced SASV detection systems (lower means harder to detect):

| Detection System | a-DCF Score | Rank |
|------------------|-------------|------|
| System 1 | 0.1582 | 1st |
| System 2 | 0.5233 | 1st |
| System 3 | 0.2562 | 1st |
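
As commonly defined for SASV evaluation, a-DCF is a weighted sum of three error rates at a decision threshold: misses on target trials, false acceptances of non-target speakers, and false acceptances of spoofed speech. The sketch below computes an unnormalized version and takes the minimum over candidate thresholds. The cost weights, priors, and scores are placeholders rather than the challenge configuration, and published implementations may additionally normalize the value.

```python
from typing import List


def a_dcf_at(
    threshold: float,
    tar: List[float],
    non: List[float],
    spf: List[float],
    c_miss: float = 1.0,     # placeholder cost of rejecting a target speaker
    c_fa_non: float = 10.0,  # placeholder cost of accepting a non-target speaker
    c_fa_spf: float = 10.0,  # placeholder cost of accepting spoofed speech
    pi_tar: float = 0.9,     # placeholder priors; they sum to 1
    pi_non: float = 0.05,
    pi_spf: float = 0.05,
) -> float:
    """Unnormalized cost at one threshold; a trial is accepted if score >= threshold."""
    p_miss = sum(s < threshold for s in tar) / len(tar)
    p_fa_non = sum(s >= threshold for s in non) / len(non)
    p_fa_spf = sum(s >= threshold for s in spf) / len(spf)
    return (
        c_miss * pi_tar * p_miss
        + c_fa_non * pi_non * p_fa_non
        + c_fa_spf * pi_spf * p_fa_spf
    )


def min_a_dcf(tar: List[float], non: List[float], spf: List[float]) -> float:
    """Minimum cost over every observed score plus a reject-everything threshold."""
    thresholds = sorted(set(tar + non + spf)) + [float("inf")]
    return min(a_dcf_at(t, tar, non, spf) for t in thresholds)


# Toy scores: one spoofed trial (0.85) scores as high as genuine target speech.
toy_cost = min_a_dcf(tar=[0.9, 0.8], non=[0.1], spf=[0.85, 0.3])
print(toy_cost)
```

In the toy data no threshold separates the 0.85 spoof from the target scores, so the minimum cost stays above zero. That is the mechanism by which convincing synthetic speech moves this metric.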

## Technical Insights: Balance Between Naturalness and Deceptiveness and Application Significance

The results suggest that the boundary between naturalness and deceptiveness is blurred. The traditional view holds that highly natural speech is still easy to detect, but F5-TTS-DPS achieves both through its training strategy, which rests on two techniques: EMA-stabilized training and dual-score data filtering. The work matters in two ways:

1. Positive aspects: it offers new ideas for high-quality personalized speech synthesis (voice assistants, audiobooks, etc.).
2. Security challenges: existing detection systems need to be upgraded faster to cope with new-generation TTS threats.

## Conclusion & Outlook: Technical Game Drives Domain Progress

The strong performance of F5-TTS-DPS in WildSpoof 2026 shows how far TTS technology has advanced. More innovations will follow, and the speech security field must keep evolving to counter realistic synthetic speech. This back-and-forth between synthesis and detection drives progress on both sides.
