# Study on the Film Taste of Large Language Models: Comparative Analysis of Preferences Among Eight Mainstream LLMs

> The study uses pairwise forced-choice experiments to compare film preferences across eight models from four major model families (Anthropic, OpenAI, Alibaba, and Mistral), and finds that large language models exhibit a significant "critical acclaim orientation".

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T22:15:18.000Z
- Last activity: 2026-06-11T22:21:22.711Z
- Popularity: 141.9
- Keywords: large language models, film recommendation, preference analysis, Bradley-Terry model, AI ethics, cultural bias, critical acclaim orientation, content recommendation systems
- Page link: https://www.zingnex.cn/en/forum/thread/llm-24f884b3
- Canonical: https://www.zingnex.cn/forum/thread/llm-24f884b3

---

## [Introduction] Study on Film Taste of Large Language Models: Core Summary of Preference Comparison Among Eight Mainstream LLMs

- Original Authors: Jonghyun Jee and Aaron Shaw
- Source: GitHub Project llm-film-preference (released on June 11, 2026)
- Related Paper: Jee, J., & Shaw, A. (2026). Critical Acclaim Orientation in Large Language Models: Evidence from Film Preference Elicitation. AIES 2026.

Core Content:
This study uses pairwise forced-choice experiments to compare the film preferences of eight mainstream LLMs from four major model families (Anthropic, OpenAI, Alibaba, Mistral). It finds that all eight models show a significant "critical acclaim orientation", meaning they prefer films recognized by professional critics over popular commercial films, with subtle differences between model families. The findings bear on both AI content recommendation and AI ethics.

## Research Background: Need to Explore Cultural Biases in LLMs

Large language models (LLMs) are increasingly used in content recommendation and cultural analysis. Two questions are central to understanding cultural bias in these systems: where the "taste" a model absorbs during training comes from, and whether that taste reflects the aesthetic preferences of a specific group.
Jonghyun Jee and Aaron Shaw conducted this study to measure differences in LLMs' film preferences quantitatively and to characterize the cultural biases those preferences reveal.

## Research Methods: Pairwise Comparison and Bradley-Terry Model

1. Experimental Design: 200 representative films were selected and divided into Group A (commercially successful and critically acclaimed), Group B (critically acclaimed only), and Group C (commercially successful only).
2. Core Method: Pairwise forced-choice comparisons (up to 4,000, fewer than the 19,900 distinct pairs that 200 films allow) combined with the Bradley-Terry model to estimate a preference strength for each film. Because the estimate is built from relative choices rather than absolute scores, it avoids differences in how each model would calibrate a rating scale.
3. Participating Models: 8 models from four families (Anthropic Claude series, OpenAI GPT-5.4 series, Alibaba Qwen2.5 series, Mistral Small/Large), all run at temperature=0 to make outputs as reproducible as possible.
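To make the core method concrete, here is a minimal sketch of fitting a Bradley-Terry model to forced-choice outcomes. The model assumes that film i is chosen over film j with probability p_i / (p_i + p_j), where p_i is the film's preference strength. The sketch uses the standard iterative (minorization-maximization) update rather than whatever estimation routine the project itself uses, which the summary does not describe. The function name `fit_bradley_terry`, the film labels, and the toy comparison counts are all hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    followed by normalization so that the strengths sum to 1.
    Assumes winner != loser in every comparison.
    """
    wins = defaultdict(int)         # total wins per film
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        denom = defaultdict(float)
        for pair, n in pair_counts.items():
            a, b = tuple(pair)
            shared = n / (strength[a] + strength[b])
            denom[a] += shared
            denom[b] += shared
        updated = {i: wins[i] / denom[i] for i in items}
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical toy data: each pair of films is compared four times.
toy_comparisons = (
    [("acclaimed_film", "blockbuster_film")] * 3
    + [("blockbuster_film", "acclaimed_film")] * 1
    + [("acclaimed_film", "crossover_film")] * 3
    + [("crossover_film", "acclaimed_film")] * 1
    + [("crossover_film", "blockbuster_film")] * 3
    + [("blockbuster_film", "crossover_film")] * 1
)

strengths = fit_bradley_terry(toy_comparisons)
for film, p in sorted(strengths.items(), key=lambda kv: -kv[1]):
    print(f"{film}: {p:.3f}")
```

A film that never wins gets a strength of zero under this update, so real data usually needs every film to win and lose at least once, or some form of regularization.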

## Key Findings: Universal Critical Acclaim Orientation and Family Differences

- Commonality: All 8 models significantly prefer critically acclaimed films. Even after controlling for factors such as era and region, the coefficient on the "critically acclaimed" variable remains positive and significant.
- Differences: OpenAI models have the highest internal consistency; the Alibaba Qwen series shows additional preferences for films from specific regions; Mistral's lightweight and flagship models differ noticeably in their preferences, suggesting that model size may affect cultural taste. These differences may stem from differences in the composition of each family's training corpora.
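The summary does not say how internal consistency within a family was measured. One simple possibility, shown here purely as an assumption, is the share of identical film pairs on which two models from the same family make the same choice. The function `agreement_rate`, the model names, and the film identifiers below are hypothetical.

```python
def agreement_rate(choices_a, choices_b):
    """Share of commonly judged pairs on which two models picked the same film.

    Each argument maps a film pair (a tuple) to the film that model chose.
    """
    shared = choices_a.keys() & choices_b.keys()
    if not shared:
        raise ValueError("the two models have no film pairs in common")
    same = sum(1 for pair in shared if choices_a[pair] == choices_b[pair])
    return same / len(shared)


# Hypothetical choices from a lightweight and a larger model of one family.
model_small = {
    ("f1", "f2"): "f1",
    ("f1", "f3"): "f1",
    ("f2", "f3"): "f3",
    ("f2", "f4"): "f2",
}
model_large = {
    ("f1", "f2"): "f1",
    ("f1", "f3"): "f3",
    ("f2", "f3"): "f3",
    ("f2", "f4"): "f2",
}

rate = agreement_rate(model_small, model_large)
print(rate)  # 0.75
```

Raw agreement is easy to read but ignores how lopsided each pair is; a rank correlation between the two models' Bradley-Terry strengths would be another reasonable choice.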

## Practical Significance: Implications for AI Applications and Ethics

- Application Developers: Content recommendation products should account for the "elitist" taste bias of LLMs; using model outputs directly may produce recommendations that diverge from ordinary users' preferences.
- AI Ethics: The results suggest that training data shapes the cultural values of models. LLMs are not neutral tools; they carry the aesthetic judgments of professional critics, and this bias may extend to other fields such as literature and art.

## Reproducibility and Future Considerations

- Reproducibility: Three levels of reproduction are provided (Level 1: rerun the precomputed analysis; Level 2: reaggregate the raw data; Level 3: run the experiments from scratch, which requires API keys). The code is implemented in Python/R with clear dependencies.
- Future Directions: Explore whether similar biases exist in other cultural fields (music, literature); study how to mitigate biases through fine-tuning or prompt engineering; consider how AI systems should balance "elite taste" against "mass preferences".