Zing Forum


ConnectionsBench: A Benchmark Suite for Evaluating Semantic Grouping and Lateral Reasoning Capabilities of Large Language Models

ConnectionsBench is a benchmark suite specifically designed to evaluate the performance of large language models (LLMs) on The New York Times Connections puzzles. It tests models' semantic grouping and lateral reasoning abilities using over 1000 puzzles of varying difficulty levels.

Tags: Benchmarking · LLM Evaluation · Semantic Reasoning · Lateral Thinking · Connections Puzzles · AI Capability Testing · Open-Source Tools
Published 2026-04-22 18:53 · Recent activity 2026-04-22 19:26 · Estimated read: 5 min

Section 01

Introduction: Core Overview of the ConnectionsBench Benchmark Suite

ConnectionsBench is a dedicated benchmark suite for evaluating the semantic grouping and lateral reasoning capabilities of large language models (LLMs). Built around The New York Times Connections puzzles, it contains over 1000 puzzles of varying difficulty, aiming to fill the gap in traditional LLM evaluations around complex semantic reasoning.


Section 02

Background: Why Do We Need Specialized Reasoning Ability Evaluation?

While LLMs now achieve strong results on standardized tests and academic benchmarks, traditional NLP benchmarks focus on language understanding, knowledge retrieval, or text generation, and do not sufficiently evaluate complex semantic grouping and lateral reasoning. The New York Times Connections puzzles require identifying four sets of semantic associations among 16 seemingly random words, which demands semantic grouping ability, lateral reasoning ability, and the ability to handle multiple difficulty levels; this is the gap the benchmark is designed to fill.
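The puzzle structure described above (16 words partitioned into 4 hidden groups of 4) can be sketched as a small data type. This is a minimal illustration, not the project's actual data format; the category labels and words below are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionsPuzzle:
    """One Connections puzzle: 16 words hiding 4 groups of 4."""

    groups: dict  # category label -> frozenset of its 4 words

    def words(self):
        """All 16 words, flattened into one set."""
        return frozenset(w for g in self.groups.values() for w in g)

    def is_valid(self):
        # 4 categories, 4 words each, no word reused across categories
        return (len(self.groups) == 4
                and all(len(g) == 4 for g in self.groups.values())
                and len(self.words()) == 16)


# Illustrative puzzle (categories and words invented for this sketch)
puzzle = ConnectionsPuzzle(groups={
    "DOG BREEDS": frozenset({"BOXER", "LAB", "PUG", "HUSKY"}),
    "___ SHOT": frozenset({"MUG", "LONG", "BIG", "MOON"}),
    "KEYBOARD KEYS": frozenset({"SHIFT", "TAB", "ESCAPE", "ENTER"}),
    "PALINDROMES": frozenset({"LEVEL", "KAYAK", "CIVIC", "RADAR"}),
})
```

Note how the difficulty comes from overlap: "BOXER" could plausibly belong to a boxing-themed group, which is exactly the kind of distractor the puzzles exploit.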


Section 03

Methodology: Test Design and Difficulty Grading

ConnectionsBench includes over 1000 puzzles, divided into four difficulty levels: Yellow (Easy, intuitive semantic associations like pet categories), Green (Medium, requiring polysemy or professional knowledge), Blue (Hard, involving cultural references or abstract associations), and Purple (Extremely Hard, containing puns or specialized knowledge). The grading system allows precise analysis of models' performance under different reasoning complexities.
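The four-level grading described above lends itself to a simple per-level accuracy breakdown. The enum values and helper below are a sketch, not the suite's actual API:

```python
from collections import defaultdict
from enum import IntEnum


class Difficulty(IntEnum):
    YELLOW = 1  # Easy: intuitive associations (e.g. pet categories)
    GREEN = 2   # Medium: polysemy or professional knowledge
    BLUE = 3    # Hard: cultural references or abstract associations
    PURPLE = 4  # Extremely hard: puns or specialized knowledge


def accuracy_by_level(results):
    """results: iterable of (Difficulty, solved: bool) pairs.
    Returns {level: fraction of puzzles fully solved at that level}."""
    tally = defaultdict(lambda: [0, 0])  # level -> [solved, total]
    for level, solved in results:
        tally[level][0] += int(solved)
        tally[level][1] += 1
    return {level: s / t for level, (s, t) in tally.items()}
```

Reporting accuracy per level rather than one aggregate number is what lets the benchmark show where a model's reasoning degrades as puzzles move from intuitive (Yellow) to pun-based (Purple) associations.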


Section 04

Methodology: Evaluation Methodology

The evaluation methods include: 1. Complete puzzle-solving evaluation (a puzzle counts as solved only if all four sets of associations are identified correctly); 2. Progressive difficulty analysis (accuracy statistics broken down by level); 3. Error pattern analysis (recording error types such as irrelevant combinations, failure on specific associations, and being misled by distractors).
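The complete-solve criterion and a rough version of the error-pattern analysis can be sketched as follows. The error taxonomy here is a simplification of the categories listed above, and the function names are illustrative, not the suite's actual scorer API:

```python
def normalize(grouping):
    """Make group order and within-group word order irrelevant."""
    return {frozenset(g) for g in grouping}


def fully_solved(proposed, gold):
    """Complete-solve criterion: all four groups must match exactly."""
    return normalize(proposed) == normalize(gold)


def classify_errors(proposed, gold):
    """Label each proposed group: exactly right, one word off
    (likely misled by a distractor), or irrelevant combination."""
    gold_sets = normalize(gold)
    labels = []
    for group in proposed:
        g = frozenset(group)
        if g in gold_sets:
            labels.append("correct")
        elif any(len(g & gs) == 3 for gs in gold_sets):
            labels.append("one_away")    # misled by a single distractor
        else:
            labels.append("irrelevant")  # no close gold match
    return labels
```

Distinguishing "one away" from "irrelevant" matters for the analysis: the former suggests the model found the right semantic axis but was caught by a distractor, while the latter suggests the grouping signal was missed entirely.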


Section 05

Significance: Value for LLM Research

ConnectionsBench fills the gap in traditional benchmarks (e.g., MMLU does not fully evaluate semantic grouping and lateral reasoning); reveals models' true reasoning abilities (distinguishing between semantic understanding and statistical co-occurrence); evaluates creative association abilities (Purple-level puzzles); and provides a standardized tool for cross-model comparison.


Section 06

Current Status and Future Development

The project is in active development. The scaffolding is complete; data pipelines, model loaders, scorers, CLI tools, the first benchmark runs, result analysis, and leaderboards are currently under development.


Section 07

Value: Impact on the AI Research Community

It promotes the shift from general ability evaluation to specific cognitive ability evaluation; helps model developers identify improvement directions (e.g., improving representation learning if semantic grouping is poor); and assists AI safety researchers in evaluating models' reasoning boundaries and risks.


Section 08

Conclusion: Future Outlook of ConnectionsBench

As a benchmark focused on a specific cognitive ability, ConnectionsBench provides a standardized evaluation platform. With ongoing development, it is expected to become an important component of the LLM evaluation toolbox, helping researchers understand models' capabilities and limitations.