Zing Forum

NVMOS: The First Dedicated Model for Assessing Non-Verbal Vocalization Quality in Speech

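The research team built the first non-verbal vocalization (NV) quality dataset, found that general-purpose multimodal models cannot reliably assess NV quality, and proposed the NVMOS model, whose local NV event focus module reaches human-machine agreement at or above expert level, filling an important gap in speech synthesis quality assessment.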

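Tags: non-verbal vocalization, speech quality assessment, NVMOS, text-to-speech, multimodal large models, MOS scoring, speech synthesis, acoustic quality
Published 2026/06/15 00:18 | Last activity 2026/06/16 11:00 | Estimated reading time: 7 minutes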

Section 01

NVMOS: The First Dedicated Model for Non-Verbal Vocalization Quality Assessment

Core Points:

  • Non-verbal vocalizations (NVs, e.g., laughter, sighs) are critical for natural-sounding TTS, but assessing their quality has long been neglected.
  • The research team built the first NV quality dataset (NV-MOS) and found that general multimodal models (such as Gemini) cannot reliably evaluate NV quality.
  • The proposed NVMOS model uses a local NV event focus module and achieves human-machine agreement at or above expert level.
  • Fills an important gap in speech synthesis quality assessment.

Section 02

Research Background: Overlooked NV Quality in Speech Synthesis

Why NVs Matter

NVs (laughter, sighs, coughs, fillers) carry emotional and intentional information and directly affect how natural TTS output sounds.

Limitations of Existing Methods

  1. Traditional evaluation (PESQ/POLQA/MOS): treats NVs as "accessories" and cannot capture NV-specific quality issues.
  2. Current NV-TTS evaluation: only checks type correctness and position accuracy, ignoring the NV's own perceptual quality (e.g., a mechanical-sounding laugh is bad even if its type and position are right).

This gap led to the need for a dedicated NV quality assessment solution.

Section 03

NV-MOS Dataset: First Specialized Dataset for NV Quality

Dataset Composition

  • Synthetic samples: From multiple NV-TTS systems, covering various NV types (laughter, sighs, etc.) with a full quality spectrum (high to low).
  • Natural samples: Real recordings from diverse speakers/situations as reference.

Expert Annotation Process

  • Annotators: 3 acoustic experts (trained, blind to sample source).
  • Scoring: 5-point MOS scale (1 = very poor, 5 = excellent).
  • Quality control: Check inter-annotator consistency and retest reliability; remove low-consistency samples.
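
The quality-control step above can be sketched roughly as follows. This is only an illustration: the post does not say which consistency statistic or cutoff the team used, so the per-sample standard deviation, the `max_spread` threshold, the function name `aggregate_mos`, and the sample IDs are all assumptions made for the example.

```python
from statistics import mean, pstdev

def aggregate_mos(ratings, max_spread=1.0):
    """Average expert scores per sample, dropping low-consistency samples.

    ratings: dict mapping sample ID -> list of expert scores on the 1-5 scale.
    max_spread: hypothetical cutoff on the population standard deviation
                of the expert scores (not a value taken from the post).
    """
    kept = {}
    for sample_id, scores in ratings.items():
        if any(s < 1 or s > 5 for s in scores):
            raise ValueError(f"{sample_id}: scores must lie on the 1-5 scale")
        # Keep the sample only when the experts roughly agree.
        if pstdev(scores) <= max_spread:
            kept[sample_id] = mean(scores)
    return kept

# Made-up ratings from three experts, for illustration only.
ratings = {
    "laugh_001": [4, 5, 4],   # experts agree: kept
    "sigh_002": [1, 3, 5],    # experts disagree strongly: dropped
    "cough_003": [2, 2, 3],   # experts agree: kept
}
clean = aggregate_mos(ratings)
```

A stricter pipeline might use a chance-corrected agreement measure instead of a raw spread; the simple cutoff here is chosen only to keep the sketch short.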

Section 04

NVMOS Model: Architecture & Training Strategies

Core Architecture

  1. Local NV Event Focus Module:
    • Detects NV event positions/boundaries → extracts local acoustic features → uses attention to focus on these features → predicts MOS.
    • Advantages: Aligns with NV event granularity, avoids interference from other speech parts, efficient.
  2. Multi-scale Feature Fusion: Combines short-term (frame-level spectrum), medium-term (NV event prosody), and long-term (contextual semantic) features.
  3. Expert Knowledge Integration: Uses expert MOS as supervision, contrastive learning (distinguishing high from low quality), and multi-task learning (predicting MOS and NV type).
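
To make the local NV event focus idea more concrete, here is a minimal NumPy sketch of attention pooling restricted to the frames of one detected NV event. It is not the NVMOS implementation: the event boundaries, the query vector, the linear scoring head (`head_w`, `head_b`), and the random features are all hypothetical stand-ins, and the detection step is assumed to have already produced `start` and `end`.

```python
import numpy as np

def local_event_pooling(frames, start, end, query):
    """Attention-pool frame features inside one NV event.

    frames: (T, D) array of frame-level acoustic features.
    start, end: frame indices of the NV event, as the half-open range [start, end).
    query: (D,) vector used to score frames (learned in a real model).
    Returns the pooled (D,) feature and the attention weights over event frames.
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("event boundaries must satisfy 0 <= start < end <= T")
    event = frames[start:end]                      # ignore speech outside the event
    scores = event @ query / np.sqrt(event.shape[1])
    scores = scores - scores.max()                 # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ event, weights

def predict_mos(pooled, head_w, head_b):
    """Map the pooled feature to the 1-5 MOS range with a sigmoid head."""
    z = head_w @ pooled + head_b
    return 1.0 + 4.0 / (1.0 + np.exp(-z))

# Random features stand in for real acoustic frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 16))
query = rng.normal(size=16)
head_w = rng.normal(size=16)
pooled, weights = local_event_pooling(frames, 40, 60, query)
mos = predict_mos(pooled, head_w, head_b=0.0)
```

Pooling only over `frames[start:end]` is what keeps the surrounding speech from influencing the score, which is the advantage the list above attributes to the module.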

Training Strategies

  • Data Augmentation: Time-domain (speed change), frequency-domain (noise addition), mix augmentation.
  • Loss Functions: MSE (fit MOS), ranking loss (relative quality), consistency loss (stable scores across augmentations).
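
One plausible way to combine the three losses listed above is a weighted sum, sketched below with NumPy. The post does not give the exact formulation, so the hinge-style pairwise ranking loss, the `margin`, and the weights `w_rank` and `w_cons` are assumptions chosen for illustration.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between predicted and expert MOS."""
    return float(np.mean((pred - target) ** 2))

def pairwise_ranking_loss(pred, target, margin=0.1):
    """Hinge loss over every pair (i, j) where expert MOS ranks i above j."""
    diff_t = target[:, None] - target[None, :]
    diff_p = pred[:, None] - pred[None, :]
    mask = diff_t > 0
    if not mask.any():
        return 0.0
    # Penalize pairs whose predicted gap is smaller than the margin.
    return float(np.mean(np.maximum(0.0, margin - diff_p[mask])))

def consistency_loss(pred, pred_aug):
    """Penalize score changes between a sample and its augmented copy."""
    return float(np.mean((pred - pred_aug) ** 2))

def total_loss(pred, pred_aug, target, w_rank=0.5, w_cons=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (mse_loss(pred, target)
            + w_rank * pairwise_ranking_loss(pred, target)
            + w_cons * consistency_loss(pred, pred_aug))

target = np.array([1.0, 3.0, 5.0])
perfect = total_loss(target.copy(), target.copy(), target)
reversed_rank = pairwise_ranking_loss(np.array([5.0, 3.0, 1.0]), target)
```

Here `perfect` is zero because the predictions match the targets, preserve their order, and are unchanged under augmentation, while `reversed_rank` is positive because every pair is ordered the wrong way.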

Section 05

Experimental Results: Expert-Level Performance

Evaluation Metrics

PLCC (linear correlation), SRCC (ranking consistency), MSE (prediction error).
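
For readers unfamiliar with these metrics, the sketch below computes them with plain NumPy. The scores are made-up values rather than results from the study, and the helper names (`plcc`, `srcc`, `mse`, `average_ranks`) are chosen for this example.

```python
import numpy as np

def plcc(pred, target):
    """Pearson linear correlation coefficient."""
    p = pred - pred.mean()
    t = target - target.mean()
    return float((p @ t) / np.sqrt((p @ p) * (t @ t)))

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(pred), average_ranks(target))

def mse(pred, target):
    """Mean squared prediction error."""
    return float(np.mean((pred - target) ** 2))

# Made-up scores: the model orders samples correctly but overshoots the last one.
expert = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
model = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
scores = {"PLCC": plcc(model, expert),
          "SRCC": srcc(model, expert),
          "MSE": mse(model, expert)}
```

In this toy case SRCC is 1 because the ordering is perfect, while PLCC falls below 1 and MSE is large because of the single overshoot, which is why the three metrics are reported together.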

Key Findings

  1. Human-Machine Consistency:
    • PLCC > 0.9 (high correlation with expert MOS) and SRCC > 0.88 (strong ranking consistency).
    • In some cases it outperforms individual experts: its scores are more stable and align better with the average of the expert panel.
  2. Comparison with Baselines:
    • vs. multimodal models (Gemini): PLCC rises from ~0.6 to > 0.9, and MSE drops by more than 50%.
    • vs. traditional models (PESQ/MOSNet): the dedicated design brings significant improvements.
  3. Ablation Studies:
    • Removing the local focus module leads to a 15% performance drop.
    • Multi-scale features and expert knowledge integration each contribute ~5% improvement.

Section 06

Applications, Limitations & Future Directions

Applications

  • TTS Development: Real-time quality monitoring, model comparison, iterative optimization.
  • Speech Data Management: Auto-filter high-quality NV samples, control training data quality.
  • User Experience: A/B testing for NV strategies, preference learning.

Limitations

  • Language-dependent (mostly English data).
  • Cultural differences in NV perception not fully considered.
  • Limited modeling of long-range context impact on NV quality.

Future Directions

  • Multi-language dataset expansion.
  • Fine-grained assessment (naturalness, expressiveness).
  • Real-time lightweight version.
  • Generate improvement suggestions for NV-TTS systems.