# Abbott-Costello-Benchmark: Evaluating Large Language Models' Cultural Understanding Ability Using Classic Comedy Dialogues

> An open-source benchmark based on the classic comedy dialogues of Abbott and Costello, specifically designed to evaluate large language models' capabilities in personality analysis, character distinction, cultural context understanding, and more.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-28T14:16:36.000Z
- Last activity: 2026-03-28T14:19:24.733Z
- Popularity: 157.9
- Keywords: large language models, benchmarking, personality analysis, cultural understanding, Abbott and Costello, AI evaluation, natural language processing
- Page link: https://www.zingnex.cn/en/forum/thread/abbott-costello-benchmark
- Canonical: https://www.zingnex.cn/forum/thread/abbott-costello-benchmark
- Markdown source: floors_fallback

---

## Abbott-Costello-Benchmark: Evaluating LLM Cultural Understanding Ability Using Classic Comedy Dialogues

This article introduces the Abbott-Costello-Benchmark, an open-source benchmark built on dialogues from the classic comedy duo Abbott and Costello. It evaluates large language models (LLMs) on personality analysis, character distinction, and cultural context understanding, addressing a gap left by traditional benchmarks, which largely overlook cultural and social context.

## Project Background and Motivation

Traditional LLM benchmarks such as GLUE and SuperGLUE focus on tasks like knowledge retrieval and reasoning, and do not evaluate cultural context, personality traits, or linguistic humor. Abbott and Costello's dialogues are known for wordplay, sharply contrasting characters, and cultural references, which makes them well suited as test material for these capabilities.

## Test Framework Design

The test feeds 20 classic dialogues to the model. The model must produce scores for 8 personality traits (directness, emotional expression, warmth, etc.) and 7 environmental variables (educational level, income, etc.). These scores are then compared with reference personality cards to compute the evaluation metrics.
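As a rough illustration of the comparison step, the sketch below models a personality card as a small data structure and computes the per-trait gap between a prediction and a reference. The class name `PersonalityCard`, the helper `trait_gaps`, the key names, and the 0 to 1 score range are all assumptions made for this example; the source does not specify the card schema or the scoring scale.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalityCard:
    """Hypothetical card layout; the benchmark's real schema is not given in the source."""
    character: str
    traits: dict = field(default_factory=dict)       # e.g. {"directness": 0.5}
    environment: dict = field(default_factory=dict)  # e.g. {"educational_level": 0.25}


def trait_gaps(predicted: PersonalityCard, reference: PersonalityCard) -> dict:
    """Return the absolute difference for every trait listed on the reference card."""
    missing = set(reference.traits) - set(predicted.traits)
    if missing:
        raise ValueError(f"prediction is missing traits: {sorted(missing)}")
    return {name: abs(predicted.traits[name] - ref)
            for name, ref in reference.traits.items()}


# Illustrative values only (assumed 0-1 scale), not benchmark data.
reference = PersonalityCard("Costello", traits={"directness": 0.25, "warmth": 0.75})
predicted = PersonalityCard("Costello", traits={"directness": 0.5, "warmth": 0.75})
gaps = trait_gaps(predicted, reference)
```

Raising on a missing trait, rather than silently skipping it, keeps an incomplete model response from looking better than it is.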

## Establishment of Reference Standards and Source of Materials

The reference standards are averages over nine runs: three iterations each from three models, Claude Sonnet 4.6, GPT-4o, and Gemini 1.5 Pro. The dialogue materials come from the Generic Radio Workshop Vintage Radio Script Library and include classics such as 'Christmas Turkey', 'Lion Hunting', and 'Who's on First?'.
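The averaging described above could be sketched as follows. The function name `build_reference`, the placeholder model keys, and the dictionary layout are hypothetical. The sketch also assumes a flat mean over all nine runs, which equals the mean of per-model means because every model contributes the same number of iterations.

```python
from statistics import mean


def build_reference(runs_by_model: dict) -> dict:
    """Average each trait over every run of every reference model."""
    all_runs = [run for runs in runs_by_model.values() for run in runs]
    trait_names = all_runs[0].keys()
    return {name: mean(run[name] for run in all_runs) for name in trait_names}


# Placeholder model keys and made-up scores: three iterations per model.
runs_by_model = {
    "model_a": [{"warmth": 0.5}, {"warmth": 0.5}, {"warmth": 0.5}],
    "model_b": [{"warmth": 0.25}, {"warmth": 0.25}, {"warmth": 0.25}],
    "model_c": [{"warmth": 0.75}, {"warmth": 0.75}, {"warmth": 0.75}],
}
reference = build_reference(runs_by_model)
```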

## Test Difficulty Levels

The 55 test dialogues are divided into three levels by type of cognitive challenge: easy (12), medium (23), and hard (20). They cover six dimensions, including wordplay, character dynamics, and cultural references. This spread of difficulty lets the benchmark evaluate model performance on both simple and demanding material.

## Evaluation Metrics and Output Format Requirements

The metrics include Mean Absolute Error (MAE), cosine similarity, accuracy, character distinction, and a weighted total score. The model must return structured JSON, so the benchmark is best suited to LLMs that can reliably follow an output format.
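Two of these metrics have standard definitions and can be sketched directly. The example below parses a JSON response and computes MAE and cosine similarity against a reference. The JSON keys and score values are invented for illustration. Accuracy, character distinction, and the weighted total score are left out because the source does not define them.

```python
import json
import math


def mean_absolute_error(pred: list, ref: list) -> float:
    """Average absolute difference between paired scores."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def cosine_similarity(pred: list, ref: list) -> float:
    """Cosine of the angle between two score vectors (0.0 if either is all zeros)."""
    dot = sum(p * r for p, r in zip(pred, ref))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(r * r for r in ref))
    return dot / norm if norm else 0.0


# Hypothetical model response and reference; keys and values are made up.
raw_output = '{"directness": 0.5, "warmth": 1.0}'
reference = {"directness": 1.0, "warmth": 1.0}

predicted = json.loads(raw_output)   # raises json.JSONDecodeError on malformed output
names = sorted(reference)            # fixed key order so the vectors line up
pred_vec = [predicted[n] for n in names]
ref_vec = [reference[n] for n in names]

mae = mean_absolute_error(pred_vec, ref_vec)
cos = cosine_similarity(pred_vec, ref_vec)
```

MAE reflects how far the individual scores are from the reference, while cosine similarity reflects whether the overall profile has the same shape, so the two can disagree for a model that is consistently too high or too low.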

## Practical Significance and Application Prospects

This benchmark offers a new perspective on LLM evaluation. It helps researchers identify and address models' shortcomings in cultural understanding, so that users ultimately get models that better understand human contexts.

## Conclusion

The Abbott-Costello-Benchmark takes a creative yet rigorous approach to an underexplored area of AI evaluation, encouraging the development of LLMs that better understand human culture and emotion.
