# Evaluating Speech Recognition with Generative Large Language Models: A New Paradigm for Semantic Evaluation Beyond Word Error Rate

> Traditional speech recognition systems are evaluated with Word Error Rate (WER), a metric that is insensitive to semantics. This paper explores using generative large language models for semantic-level ASR evaluation, achieving 92-94% agreement with human judgments on the hypothesis selection task, compared with 63% for WER.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-23T17:59:47.000Z
- Last activity: 2026-04-24T05:18:02.405Z
- Popularity: 137.7
- Keywords: ASR, speech recognition, large language models, semantic evaluation, word error rate, generative AI, natural language processing
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-21928v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-21928v1
- Markdown source: floors_fallback

---

## Introduction: Generative LLMs Unlock a New Paradigm for Semantic ASR Evaluation

Automatic Speech Recognition (ASR) systems are traditionally evaluated with Word Error Rate (WER), which is insensitive to semantics. This paper explores using generative Large Language Models (LLMs) for semantic-level ASR evaluation. On the hypothesis selection task the LLM reaches 92-94% agreement with human judgments, compared with 63% for WER, which points to a direction for ASR evaluation beyond traditional metrics.

## Background: Semantic Gap in ASR Evaluation and Practical Needs

ASR technology has made significant progress, but evaluation still relies on WER, a string-matching metric. String distance and semantic impact often diverge. When "recognize speech" is transcribed as "wreck a nice beach", WER registers a severe error from the word mismatches alone, without asking how much meaning was preserved. When "don't turn left" is transcribed as "don't turn right", the WER penalty is small, yet the practical consequences are serious. In real-world scenarios users care more about intent: "500 milligrams" and "500 mg" are semantically equivalent in a medical context, but a plain string comparison treats them as different. Existing embedding-based semantic evaluation lacks deep understanding, and the potential of generative LLMs for this task remains to be explored.
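To make the mismatch concrete, here is a minimal sketch of WER as word-level edit distance divided by reference length, applied to the three examples above. It assumes plain lowercasing and whitespace tokenization with no text normalization; the function name `word_error_rate` is illustrative and not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


examples = [
    ("recognize speech", "wreck a nice beach"),
    ("don't turn left", "don't turn right"),
    ("500 milligrams", "500 mg"),
]
for reference, hypothesis in examples:
    wer = word_error_rate(reference, hypothesis)
    print(f"{reference!r} -> {hypothesis!r}: WER = {wer:.2f}")
```

Under these assumptions the three pairs score 2.00, 0.33, and 0.50 respectively. The harmless unit abbreviation is penalized more heavily than the reversed driving instruction, which is the gap a semantic metric is meant to close.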

## Methodology: Detailed Explanation of Three LLM Evaluation Strategies

The study designs three complementary methods:

1. Hypothesis selection task: given two candidate transcriptions, the LLM judges which one is better, using the manually annotated HATS dataset.
2. Generative embedding semantic distance: embeddings from a decoder LLM are used to compute semantic similarity.
3. Error classification and interpretability analysis: the LLM scores and explains error types and their impact, to support system iteration.
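The embedding-distance method can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the source does not say how token representations are pooled, so mean pooling is assumed here, and the tiny 3-dimensional vectors are invented stand-ins for hidden states that would come from a decoder LLM. The helper names `mean_pool`, `cosine_distance`, and `pick_closer` are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[k] for vec in token_vectors) / count for k in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pick_closer(reference_vec, hyp_a_vec, hyp_b_vec):
    """Return 'A' or 'B' for the hypothesis whose embedding is closer to the reference."""
    dist_a = cosine_distance(reference_vec, hyp_a_vec)
    dist_b = cosine_distance(reference_vec, hyp_b_vec)
    return "A" if dist_a <= dist_b else "B"


# Toy token vectors, invented for illustration only.
reference_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hyp_a_tokens = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
hyp_b_tokens = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

reference_vec = mean_pool(reference_tokens)
hyp_a_vec = mean_pool(hyp_a_tokens)
hyp_b_vec = mean_pool(hyp_b_tokens)

choice = pick_closer(reference_vec, hyp_a_vec, hyp_b_vec)
print("closer hypothesis:", choice)
```

The same `pick_closer` decision rule lets an embedding metric take part in the hypothesis selection task: for each pair, the metric's pick can be compared with the human annotators' pick.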

## Experimental Results: LLM Performance Significantly Outperforms Traditional Metrics

On the HATS dataset, the LLM achieved 92-94% agreement with human judgments on the hypothesis selection task, far above the 63% achieved by WER, and it also outperformed existing embedding-based semantic metrics. Generative embeddings performed on par with, or better than, dedicated encoders. LLMs can also classify and explain errors at fine granularity (e.g., synonym replacement, semantic drift).
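For readers unfamiliar with the agreement figure, the sketch below shows one straightforward way such a number can be computed: the fraction of hypothesis pairs on which a metric picks the same candidate as the human annotators. The labels are toy values invented for illustration and are not drawn from HATS or from the paper's results; `agreement_rate` is a hypothetical helper name.

```python
def agreement_rate(human_choices, metric_choices):
    """Fraction of pairs where the metric picks the same hypothesis as humans."""
    if len(human_choices) != len(metric_choices):
        raise ValueError("both lists must cover the same hypothesis pairs")
    if not human_choices:
        raise ValueError("at least one hypothesis pair is required")
    matches = sum(h == m for h, m in zip(human_choices, metric_choices))
    return matches / len(human_choices)


# Toy labels: for each pair, which hypothesis ('A' or 'B') was preferred.
human_choices = ["A", "B", "A", "A"]
metric_choices = ["A", "B", "B", "A"]

rate = agreement_rate(human_choices, metric_choices)
print(f"agreement with humans: {rate:.0%}")
```

On this toy data the metric matches the annotators on three of four pairs, so the rate is 75%.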

## Technical Details: Model, Prompt, and Efficiency Optimization

- Model selection: large-scale LLMs perform better, but medium-scale models can also meet requirements.
- Prompt engineering: chain-of-thought prompts improve accuracy.
- Computational efficiency: batch processing, quantization, and distillation help balance quality against cost.
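As a rough illustration of the prompt-engineering and batching points, the sketch below builds a chain-of-thought prompt for one hypothesis pair, parses the final answer from a model response, and groups items into batches. The prompt wording, the `Answer:` output convention, and the helper names are all assumptions made for this example; the source does not give the paper's actual prompt. No model is called, and the response shown is a hand-written placeholder.

```python
import re

# Hypothetical prompt wording, not taken from the paper.
PROMPT_TEMPLATE = """You are evaluating speech recognition output.
Reference transcript: {reference}
Hypothesis A: {hyp_a}
Hypothesis B: {hyp_b}

Think step by step about which hypothesis better preserves the meaning of the
reference. Finish with a final line of the form "Answer: A" or "Answer: B"."""


def build_prompt(reference, hyp_a, hyp_b):
    """Fill the chain-of-thought template for one hypothesis pair."""
    return PROMPT_TEMPLATE.format(reference=reference, hyp_a=hyp_a, hyp_b=hyp_b)


def parse_choice(response):
    """Return the last 'Answer: A/B' verdict in the response, or None if absent."""
    matches = re.findall(r"Answer:\s*([AB])\b", response)
    return matches[-1] if matches else None


def chunked(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


prompt = build_prompt("don't turn left", "don't turn right", "do not turn left")

# Hand-written placeholder standing in for a model response.
placeholder_response = (
    "Hypothesis A reverses the direction, which changes the instruction.\n"
    "Hypothesis B only expands the contraction.\n"
    "Answer: B"
)
print(parse_choice(placeholder_response))

batches = list(chunked([prompt] * 5, 2))
print([len(batch) for batch in batches])
```

Taking the last match in `parse_choice` is a deliberate choice: a chain-of-thought response may mention both candidates while reasoning, and only the final verdict should count.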

## Limitations and Future Research Directions

Limitations: domain specificity (HATS is for general scenarios), language coverage (mainly English), bias and fairness, and computational resource constraints.

Future directions: lightweight evaluation LLMs, multimodal evaluation (combining audio), and standardized semantic benchmarks.

## Conclusions and Implications: ASR Evaluation Needs to Shift to Semantic Awareness

Generative LLMs address the disconnect between WER and user experience, opening up a new paradigm for ASR evaluation. Implications: Practitioners should focus on semantic accuracy; LLMs can serve as quality gatekeepers, promote end-to-end semantic optimization, and help popularize voice interaction.