Zing Forum

Evaluating Speech Recognition with Generative Large Language Models: A New Paradigm for Semantic Evaluation Beyond Word Error Rate

Speech recognition systems are traditionally evaluated with Word Error Rate (WER), a metric that is insensitive to meaning. This paper explores using generative large language models for semantic-level ASR evaluation. On the hypothesis selection task, the LLM-based approach reaches 92-94% agreement with human annotators, compared with 63% for WER.

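Tags: ASR, speech recognition, large language models, semantic evaluation, word error rate, generative AI, natural language processing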
Published 2026-04-24 01:59 | Recent activity 2026-04-24 13:18 | Estimated read 5 min

Section 01

Introduction: Generative LLMs Unlock a New Paradigm for Semantic ASR Evaluation

Automatic Speech Recognition (ASR) systems are traditionally evaluated with Word Error Rate (WER), but WER is insensitive to meaning. This paper explores using generative Large Language Models (LLMs) for semantic-level ASR evaluation. On the hypothesis selection task, the LLM-based approach reaches 92-94% agreement with human annotators, compared with 63% for WER, which points to a direction for ASR evaluation beyond string-matching metrics.

Section 02

Background: Semantic Gap in ASR Evaluation and Practical Needs

ASR technology has made significant progress, but evaluation still relies on WER, a string-matching metric. String distance and meaning can diverge in both directions. When "recognize speech" is transcribed as "wreck a nice beach", WER reports a severe error purely from the string mismatch, independent of how much meaning a listener could still recover. When "don't turn left" is transcribed as "don't turn right", only one word in three changes, so WER is low, yet the practical consequences are serious. In real-world use, users care about intent: "500 milligrams" and "500 mg" are semantically equivalent in a medical context even though the strings differ. Existing embedding-based semantic metrics capture meaning only shallowly, and the potential of generative LLMs for this task remains largely unexplored.
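
To make the two WER examples concrete, here is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words). The function name `wer` and the whitespace tokenization are assumptions for illustration; production toolkits usually normalize text first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


print(wer("recognize speech", "wreck a nice beach"))  # 2.0 (two substitutions, two insertions)
print(wer("don't turn left", "don't turn right"))     # one substitution over three words
```

The first pair scores 200% and the second about 33%, so WER ranks the direction-reversing error as far less severe.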

Section 03

Methodology: Detailed Explanation of Three LLM Evaluation Strategies

The study designs three complementary methods:
1. Hypothesis selection task: given two candidate transcriptions, the LLM judges which is better, and its choices are compared against the human annotations in the HATS dataset.
2. Generative embedding semantic distance: embeddings taken from a decoder LLM are used to compute semantic similarity between transcriptions.
3. Error classification and interpretability analysis: the LLM scores errors and explains their types and impact, which supports system iteration.
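
The hypothesis selection protocol can be sketched as follows: a metric picks whichever of two hypotheses it scores as closer to the reference, and agreement is the fraction of pairs where that pick matches the human preference. Everything named here (`Pair`, `metric_choice`, `human_agreement`, the toy `word_mismatch` metric, the tie-breaking rule, and the two example pairs) is a hypothetical illustration, not the paper's code or HATS data.

```python
from typing import Callable, List, NamedTuple


class Pair(NamedTuple):
    reference: str
    hyp_a: str
    hyp_b: str
    human_choice: str  # "a" or "b": the hypothesis annotators preferred


def metric_choice(distance: Callable[[str, str], float], pair: Pair) -> str:
    """Return the hypothesis the metric scores as closer to the reference."""
    d_a = distance(pair.reference, pair.hyp_a)
    d_b = distance(pair.reference, pair.hyp_b)
    return "a" if d_a <= d_b else "b"  # assumption: ties go to "a"


def human_agreement(distance: Callable[[str, str], float], pairs: List[Pair]) -> float:
    """Fraction of pairs where the metric's pick matches the human pick."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(metric_choice(distance, p) == p.human_choice for p in pairs)
    return hits / len(pairs)


def word_mismatch(reference: str, hypothesis: str) -> float:
    """Toy surface metric: count of words found in only one of the two strings."""
    return float(len(set(reference.split()) ^ set(hypothesis.split())))


# Invented examples for illustration only.
pairs = [
    Pair("set an alarm for seven", "set alarm for seven", "set a lamp for heaven", "a"),
    Pair("don't turn left", "do not turn left", "don't turn right", "a"),
]
print(human_agreement(word_mismatch, pairs))  # 0.5
```

The surface metric gets the second pair wrong because "do not turn left" differs in more words than "don't turn right". Swapping `word_mismatch` for an LLM-based judgment leaves the agreement computation unchanged.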

Section 04

Experimental Results: LLM Performance Significantly Outperforms Traditional Metrics

On the HATS dataset, the LLM reached 92-94% agreement with human annotators on the hypothesis selection task, far above WER's 63%, and it outperformed existing embedding-based semantic metrics. Embeddings from generative models performed on par with, or better than, dedicated encoders. LLMs can also classify and explain errors at fine granularity (e.g., synonym replacement, semantic drift).
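
An embedding-based semantic distance of the kind compared here can be sketched as a cosine distance between sentence vectors. The summary does not say how token-level states are pooled, so the mean pooling below is an assumption, and the two-dimensional vectors are toy stand-ins for real LLM hidden states.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one sentence vector (assumed pooling)."""
    n = len(token_vectors)
    if n == 0:
        raise ValueError("need at least one token vector")
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


# Toy token vectors standing in for decoder hidden states.
reference = mean_pool([[1.0, 0.0], [0.0, 1.0]])
hyp_close = mean_pool([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
hyp_far = mean_pool([[1.0, 0.0], [1.0, 0.0]])

d_close = cosine_distance(reference, hyp_close)
d_far = cosine_distance(reference, hyp_far)
print(d_close < d_far)  # True
```

Under this sketch, the hypothesis with the smaller distance to the reference would be the one selected in the hypothesis selection task.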

Section 05

Technical Details: Model, Prompt, and Efficiency Optimization

Model selection: large-scale LLMs perform better, but medium-scale models can also meet requirements.
Prompt engineering: chain-of-thought prompts improve accuracy.
Computational efficiency: quality and cost are balanced through batch processing, quantization, and distillation.
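
As an illustration of the chain-of-thought idea, the sketch below builds a prompt that asks for step-by-step reasoning before a final verdict, and parses that verdict from the model's reply. The template wording, the `Answer: A/B` convention, and both function names are hypothetical; the summary does not reproduce the paper's actual prompts. No LLM API is called here.

```python
import re
from typing import Optional

# Hypothetical template; the paper's real prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "You are evaluating speech recognition output.\n"
    "Reference transcript: {reference}\n"
    "Hypothesis A: {hyp_a}\n"
    "Hypothesis B: {hyp_b}\n"
    "Think step by step about which hypothesis better preserves the meaning "
    "of the reference. Finish with a line of the form 'Answer: A' or 'Answer: B'."
)


def build_prompt(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Fill the template for one hypothesis selection query."""
    return PROMPT_TEMPLATE.format(reference=reference, hyp_a=hyp_a, hyp_b=hyp_b)


def parse_choice(llm_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"Answer:\s*([AB])\b", llm_output)
    return matches[-1] if matches else None


prompt = build_prompt("don't turn left", "do not turn left", "don't turn right")
reply = "A keeps the instruction intact; B reverses the direction.\nAnswer: A"
print(parse_choice(reply))  # A
```

Taking the last match lets the model mention candidate answers while reasoning without confusing the parser, and returning `None` makes unparseable replies easy to count or retry.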

Section 06

Limitations and Future Research Directions

Limitations: domain specificity (HATS covers general scenarios, so results may not carry over to specialized domains), language coverage (mainly English), bias and fairness, and computational resource constraints.
Future directions: lightweight evaluation LLMs, multimodal evaluation that also uses the audio, and standardized semantic benchmarks.

Section 07

Conclusions and Implications: ASR Evaluation Needs to Shift to Semantic Awareness

Generative LLMs address the disconnect between WER and user experience and open up a new paradigm for ASR evaluation. Implications: practitioners should pay attention to semantic accuracy, not only string accuracy. LLMs can serve as quality gatekeepers, support end-to-end semantic optimization, and help voice interaction reach wider adoption.