F5-TTS-DPS: Achieving Undetectable High-Naturalness Speech Synthesis via EMA-Stabilized Training and Dual-Score Prompt Selection

This article introduces F5-TTS-DPS, the winning solution for the TTS track of the WildSpoof 2026 Challenge. Based on the F5-TTS architecture, this model incorporates Exponential Moving Average (EMA) and a dual-score prompt selection mechanism based on LLM/LALM. It achieved the best a-DCF scores on three advanced SASV detection systems, generating synthetic speech with extremely high naturalness that is difficult to detect and identify.

Published 2026-05-23 01:18 | Recent activity 2026-05-25 14:18 | Estimated read 6 min

Section 01

F5-TTS-DPS: Guide to the Winning Solution for the WildSpoof 2026 TTS Track

This article introduces F5-TTS-DPS, the winning solution for the TTS track of the WildSpoof 2026 Challenge. Based on the F5-TTS architecture, this model incorporates Exponential Moving Average (EMA) and a dual-score prompt selection mechanism. It achieved the best a-DCF scores on three advanced SASV detection systems, generating speech with high naturalness that is difficult to detect.

Original author team: WildSpoof 2026 TTS track participating team
Source platform: arXiv
Release date: May 22, 2026
Original link: http://arxiv.org/abs/2605.23859v1

Keywords: TTS, speech synthesis, anti-spoofing detection, EMA, prompt selection, WildSpoof, F5-TTS, deepfake, speech security

Section 02

Background: The Arms Race Between Speech Synthesis and Anti-Spoofing Detection

In recent years, text-to-speech (TTS) technology has made breakthrough progress, and the naturalness of synthetic speech now approaches human levels. That progress also brings a security problem: the spread of deepfake speech. Spoofing-aware speaker verification (SASV) systems attempt to distinguish real speech from synthetic speech, but the contest is far from over: as detection systems improve, more advanced TTS models look for ways around them. The WildSpoof Challenge asks participants to train TTS models on real-world data and generate synthetic speech that is both natural and hard for existing detection systems to identify. F5-TTS-DPS is the winning solution for the TTS track of this competition.

Section 03

Technical Solution: EMA-Stabilized Training and Dual-Score Prompt Selection

F5-TTS-DPS is based on the F5-TTS architecture. Its core innovations are:

EMA-enhanced Supervised Fine-Tuning (SFT): Plain SFT is prone to parameter oscillations. EMA maintains a smoothed copy of the parameters, θ_EMA(t) = α·θ_EMA(t-1) + (1-α)·θ(t), where θ(t) are the current weights and α is the decay factor. This suppresses noisy updates and improves generalization.
Dual-Score Prompt Selection: An LLM scores the grammar, semantics, and naturalness of each text prompt, and a LALM (large audio-language model) scores the acoustic quality, clarity, and text alignment of the reference audio. Filtering on both scores keeps the training data high quality.
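
The two components above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the article gives only the EMA formula and a verbal description of the two scores, so the names ema_update, PromptCandidate, and select_prompts, the 0-to-1 score scale, and the 0.7 thresholds are all hypothetical. The scores are assumed to have been produced already by an LLM and a LALM; no real model API is called here.

```python
from dataclasses import dataclass
from typing import List


def ema_update(ema_params: List[float], params: List[float],
               alpha: float = 0.999) -> List[float]:
    """One EMA step: theta_ema(t) = alpha * theta_ema(t-1) + (1 - alpha) * theta(t).

    Parameters are plain floats here for illustration; a real model would
    apply the same rule to every weight tensor after each optimizer step.
    """
    return [alpha * e + (1.0 - alpha) * p for e, p in zip(ema_params, params)]


@dataclass
class PromptCandidate:
    text: str
    audio_path: str
    text_score: float   # assumed LLM score: grammar, semantics, naturalness
    audio_score: float  # assumed LALM score: acoustic quality, clarity, alignment


def select_prompts(candidates: List[PromptCandidate],
                   text_min: float = 0.7,
                   audio_min: float = 0.7) -> List[PromptCandidate]:
    """Keep only candidates that pass both score thresholds (dual filtering).

    The thresholds and the 0..1 scale are assumptions made for this sketch.
    """
    return [c for c in candidates
            if c.text_score >= text_min and c.audio_score >= audio_min]
```

With α close to 1, the smoothed copy changes slowly and averages out step-to-step noise in the raw weights. In the filter, a candidate with a strong text score but a weak audio score is dropped, as is the reverse case, which is the sense in which the filtering is "dual".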

Section 04

Experimental Results: Performance of High Naturalness and Undetectability

Performance of F5-TTS-DPS on the WildSpoof 2026 development set:

Metric	Value	Description
UTMOS	3.20	Speech naturalness score (higher means more natural)
Speaker Similarity	0.51	Similarity between synthetic speech and target speaker
WER	Competitive level	Word Error Rate, reflecting pronunciation accuracy

a-DCF scores on three advanced SASV detection systems (lower means harder to detect):

Detection System	a-DCF Score	Rank
System 1	0.1582	1st
System 2	0.5233	1st
System 3	0.2562	1st

Section 05

Technical Insights: Balance Between Naturalness and Deceptiveness and Application Significance

The work suggests that the boundary between naturalness and deceptiveness is blurred. The traditional view holds that high naturalness is easy to detect, but F5-TTS-DPS balances the two through its training strategy. The key techniques are EMA-stabilized training and dual-score data filtering. Application significance:

Positive aspects: Provides new ideas for high-quality personalized speech synthesis (voice assistants, audiobooks, etc.);
Security challenges: Existing detection systems need to accelerate upgrades to deal with new-generation TTS threats.

Section 06

Conclusion & Outlook: Technical Competition Drives Progress in the Field

The strong showing of F5-TTS-DPS at WildSpoof 2026 signals that TTS technology has entered a new stage. Further innovations are likely to follow, and the speech security field will need to keep evolving to counter realistic synthetic speech. This technical contest drives progress on both sides.
