Zing Forum


Abbott-Costello-Benchmark: Evaluating Large Language Models' Cultural Understanding Ability Using Classic Comedy Dialogues

An open-source benchmark based on the classic comedy dialogues of Abbott and Costello, specifically designed to evaluate large language models' capabilities in personality analysis, character distinction, cultural context understanding, and more.

Tags: large language models, benchmarking, personality analysis, cultural understanding, Abbott and Costello, AI evaluation, natural language processing
Published 2026-03-28 22:16 · Recent activity 2026-03-28 22:19 · Estimated read: 4 min

Section 01

Abbott-Costello-Benchmark: Evaluating LLM Cultural Understanding Ability Using Classic Comedy Dialogues

This article introduces the Abbott-Costello-Benchmark, an open-source benchmark built on dialogues from the classic comedy duo Abbott and Costello. It evaluates large language models (LLMs) on personality analysis, character distinction, cultural context understanding, and related capabilities, addressing a gap left by traditional benchmarks, which largely ignore cultural and social context comprehension.


Section 02

Project Background and Motivation

Traditional LLM benchmarks (such as GLUE, SuperGLUE) focus on tasks like knowledge retrieval and reasoning, but lack evaluation of cultural context, personality traits, and linguistic humor. Abbott and Costello's comedy dialogues are known for wordplay, distinct character contrasts, and cultural connotations, making them suitable as test materials to examine models' relevant capabilities.


Section 03

Test Framework Design

The test feeds 20 classic dialogues to the model. For each dialogue, the model must generate scores for 8 personality traits (directness, emotional expression, warmth, etc.) and 7 environmental variables (educational level, income, etc.). These scores are then compared against reference personality cards to compute the evaluation metrics.
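The scoring loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the trait and variable names shown are examples (the real benchmark uses 8 traits and 7 environmental variables), and the scoring scale is an assumption.

```python
# Hypothetical sketch of the per-dialogue scoring step.
# Trait/variable names and the numeric scale are assumptions.
TRAITS = ["directness", "emotional_expression", "warmth"]   # 8 in the benchmark
ENV_VARS = ["education_level", "income"]                    # 7 in the benchmark

def score_dialogue(model_scores: dict, reference_card: dict) -> float:
    """Mean absolute error between the model's scores and the reference card."""
    keys = TRAITS + ENV_VARS
    errors = [abs(model_scores[k] - reference_card[k]) for k in keys]
    return sum(errors) / len(errors)

# Example: a model whose scores match the reference card exactly scores 0.0.
card = {"directness": 4, "emotional_expression": 3, "warmth": 5,
        "education_level": 2, "income": 3}
print(score_dialogue(card, card))
```

A lower value means the model's read of the characters is closer to the reference standard.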


Section 04

Establishment of Reference Standards and Source of Materials

The reference standards are obtained by averaging three iterations from each of three models: Claude Sonnet 4.6, GPT-4o, and Gemini 1.5 Pro. The dialogue materials come from the Generic Radio Workshop Vintage Radio Script Library and include classic works such as 'Christmas Turkey', 'Lion Hunting', and 'Who's on First?'.
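The averaging step can be sketched as below: pool all runs (3 models × 3 iterations = 9 score dicts) and take the per-key mean. This is an assumed implementation consistent with the description, not the benchmark's own code.

```python
from statistics import mean

def build_reference_card(runs: list[dict]) -> dict:
    """Average per-trait scores over all model runs (e.g. 3 models x 3 iterations)."""
    keys = runs[0].keys()
    return {k: mean(run[k] for run in runs) for k in keys}

# Example with a single hypothetical trait across three runs.
runs = [{"warmth": 4}, {"warmth": 5}, {"warmth": 3}]
print(build_reference_card(runs))
```

Averaging across several strong models smooths out any single model's idiosyncratic reading of a character.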


Section 05

Test Difficulty Levels

The 55 test dialogues are divided into three levels based on cognitive challenge types: easy (12), medium (23), and hard (20). They cover six dimensions including wordplay, character dynamics, and cultural references. The diverse difficulty design allows for a comprehensive evaluation of model performance.
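The split above can be written down and sanity-checked directly; the counts come from the article, while the dictionary layout is just one convenient representation.

```python
# Difficulty split as stated in the article: 12 + 23 + 20 = 55 dialogues.
DIFFICULTY_SPLIT = {"easy": 12, "medium": 23, "hard": 20}

total = sum(DIFFICULTY_SPLIT.values())
print(total)  # 55
```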


Section 06

Evaluation Metrics and Output Format Requirements

Metrics include Mean Absolute Error (MAE), cosine similarity, accuracy, character distinction, and a weighted total score. The model must emit structured JSON output, so the benchmark is best suited to LLMs with reliable structured-output capabilities.
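Two of the named metrics, MAE and cosine similarity, can be computed from the model's JSON output as sketched below. The JSON field names are hypothetical; only the metric formulas are standard.

```python
import json
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def evaluate(model_json: str, reference: dict) -> dict:
    """Parse the model's structured JSON output and score it against a reference card."""
    scores = json.loads(model_json)
    keys = sorted(reference)               # fixed key order for both vectors
    m = [scores[k] for k in keys]
    r = [reference[k] for k in keys]
    mae = sum(abs(x - y) for x, y in zip(m, r)) / len(keys)
    return {"mae": mae, "cosine": cosine_similarity(m, r)}

# Example: a perfect match gives MAE 0 and cosine similarity 1.
out = evaluate('{"warmth": 4, "directness": 2}',
               {"warmth": 4, "directness": 2})
print(out)
```

Requiring JSON output also makes malformed responses easy to detect: a `json.loads` failure can simply be scored as a formatting error.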


Section 07

Practical Significance and Application Prospects

This benchmark provides a new perspective for LLM evaluation, helping researchers identify and improve models' shortcomings in cultural understanding, so that users can obtain models that better understand human contexts.


Section 08

Conclusion

The Abbott-Costello-Benchmark solves AI evaluation challenges with creative and rigorous methods, promoting the development of LLMs toward better understanding of human culture and emotions.