# DixitWorld: Evaluating the Abductive Reasoning Ability of Multimodal Vision-Language Models Using a Board Game

> An open-source project from an ACL 2026 paper that builds a multi-agent benchmark using the Dixit board game, revealing the structural asymmetry in hypothesis generation and selection tasks among current VLMs.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-04T13:36:32.000Z
- Last activity: 2026-06-04T13:48:33.813Z
- Popularity: 163.8
- Keywords: DixitWorld, ACL 2026, multimodal benchmark, abductive reasoning, vision-language models, multi-agent, board game AI, pragmatic reasoning, hypothesis generation, VLM evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/dixitworld
- Canonical: https://www.zingnex.cn/forum/thread/dixitworld

---

## [Introduction] DixitWorld: An Innovative Benchmark for Evaluating the Abductive Reasoning Ability of VLMs Using a Board Game

DixitWorld is an open-source project from an ACL 2026 paper. It constructs a multi-agent benchmark framework using the classic board game Dixit, aiming to evaluate the abductive reasoning ability of Vision-Language Models (VLMs) and reveal the structural asymmetry in hypothesis generation and selection tasks among current VLMs. The project includes the dynamic game environment DixitArena and the static dataset DixitBench, providing new tools for evaluating the high-order cognitive abilities of VLMs.

## Project Background: Why Do We Need DixitWorld?

Abductive reasoning is the ability to generate explanatory hypotheses from partial observations, and it remains a weak point in how current large models "understand" their inputs. Traditional VLM benchmarks are mostly static and do not evaluate creative hypothesis generation or pragmatic reasoning. DixitWorld turns the board game into a dynamic multi-agent setting that mimics human analogical reasoning and creative thinking, filling this evaluation gap.

## Core Methods: Analysis of the Dual-Component Architecture

1. **DixitArena**: A dynamic multi-agent game environment in which agents take turns as Storyteller (who produces a moderately ambiguous clue, corresponding to hypothesis generation) and Listener (who picks the target image from the clue, corresponding to hypothesis selection). Scoring well requires pragmatic skill, not just image recognition.
2. **DixitBench**: A static multiple-choice dataset built from 84 images at 3 difficulty levels, where the difficulty of the distractors is controlled by semantic similarity. Its results correlate strongly with those of the DixitArena Listener task (Pearson r = 0.947).
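The Pearson coefficient quoted for DixitBench measures how closely two sets of scores move together. The sketch below shows the standard formula in plain Python. The function name `pearson_r` and the accuracy values are illustrative placeholders, not code or results from the project.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model accuracies (placeholders, NOT numbers from the paper).
bench_accuracy = [0.42, 0.55, 0.61, 0.70]
arena_listener_accuracy = [0.40, 0.52, 0.63, 0.71]

r = pearson_r(bench_accuracy, arena_listener_accuracy)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means that models ranking high on the static benchmark also tend to rank high as Listeners in the dynamic game. That is the sense in which the static dataset can stand in for the more expensive Arena runs.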

## Key Findings: Structural Asymmetry Between Generation and Selection Abilities

An evaluation of six mainstream VLMs, including Qwen2.5-VL and GPT-4o, found:
- Storyteller task: over 78% of rounds scored zero. Models struggle to give clues that are ambiguous enough not to be obvious yet still comprehensible enough to be guessed, which reflects a deficit in pragmatic control.
- Listener task: the best model reached roughly 75.6% accuracy, showing strong discriminative ability.

This reveals a significant gap between the generation and selection abilities of VLMs.
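Why a Storyteller can score zero is easiest to see from the scoring rule itself. The sketch below assumes DixitArena follows the standard tabletop Dixit rules, which this summary does not confirm. The function `score_round` and its argument names are hypothetical and are not taken from `src/game.py`.

```python
def score_round(storyteller, card_owner, votes):
    """Score one round under standard tabletop Dixit rules (assumed, not confirmed).

    storyteller: id of the player who gave the clue
    card_owner:  dict mapping card id -> id of the player who submitted it
    votes:       dict mapping listener id -> card id that listener voted for
    """
    players = set(card_owner.values()) | set(votes)
    scores = {p: 0 for p in players}
    storyteller_card = next(c for c, p in card_owner.items() if p == storyteller)
    correct = [listener for listener, card in votes.items() if card == storyteller_card]

    if len(correct) == 0 or len(correct) == len(votes):
        # Clue too obscure (nobody found it) or too obvious (everybody did):
        # the Storyteller scores zero and every Listener gets 2 points.
        for listener in votes:
            scores[listener] += 2
    else:
        # Clue was balanced: Storyteller and each correct Listener get 3 points.
        scores[storyteller] += 3
        for listener in correct:
            scores[listener] += 3

    # Each vote that lands on a decoy card gives 1 bonus point to the decoy's owner.
    for card in votes.values():
        owner = card_owner[card]
        if owner != storyteller:
            scores[owner] += 1
    return scores


cards = {1: "A", 2: "B", 3: "C", 4: "D"}  # player "A" is the Storyteller
print(score_round("A", cards, {"B": 1, "C": 1, "D": 1}))  # everyone guesses card 1
print(score_round("A", cards, {"B": 1, "C": 2, "D": 3}))  # only "B" guesses card 1
```

Under these rules the Storyteller is penalized at both extremes, so a clue that simply describes the image is as bad as one that nobody can decode. The sketch does not check that Listeners avoid voting for their own card, which the tabletop rules forbid.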

## Technical Implementation: Modular Architecture and Usage Guide

The project is organized as Python modules:
- Core code: `src/game.py` (game engine), `src/agents.py` (agent definitions), `src/call_api_*.py` (clients for multiple APIs);
- Experiment scripts: `experiments/` supports Arena tournaments, batch testing, and DixitBench evaluation;
- Analysis tools: `analysis/` provides statistics and visualization.

API keys can be configured for providers such as OpenRouter and Together, which makes the experiments easier to reproduce and extend.
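One common way to handle per-provider API keys is to read them from environment variables. The sketch below is a generic pattern, not the project's actual configuration mechanism. The variable names `OPENROUTER_API_KEY` and `TOGETHER_API_KEY` and the helper `load_api_key` are assumptions made for illustration.

```python
import os

# Hypothetical mapping from provider name to environment variable.
# The real project may use different names or a config file instead.
PROVIDER_ENV_VARS = {
    "openrouter": "OPENROUTER_API_KEY",
    "together": "TOGETHER_API_KEY",
}


def load_api_key(provider, environ=None):
    """Return the API key for a provider, or raise with a clear message."""
    environ = os.environ if environ is None else environ
    name = provider.lower()
    if name not in PROVIDER_ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    var = PROVIDER_ENV_VARS[name]
    key = environ.get(var, "").strip()
    if not key:
        raise RuntimeError(f"set the {var} environment variable before running")
    return key


# Usage with an explicit mapping, so the example runs without real credentials.
demo_env = {"OPENROUTER_API_KEY": "sk-demo-not-a-real-key"}
print(load_api_key("OpenRouter", demo_env))
```

Passing the environment as an argument keeps the helper easy to test and keeps real keys out of source control.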

## Academic Value: Filling the Blind Spot in VLM Evaluation

DixitWorld accompanies an ACL 2026 paper, and its contributions include:
1. Filling the gap left by existing benchmarks, which do not evaluate creative generation or pragmatic reasoning;
2. Providing a multi-agent platform, with both cooperative and competitive elements, for studying communication games and emergent behavior;
3. Informing AI applications: the weak Storyteller results suggest that human oversight remains indispensable in areas such as education and creative writing.

## Limitations and Future Directions

Limitations: The 84 Dixit cards are not distributed with the project for copyright reasons, which raises the barrier to reproduction. Only English-language play is evaluated.

Future directions: Improve Storyteller performance through fine-tuning or prompt engineering, extend the evaluation to other languages, and resolve the copyright constraints on the card images.

## Conclusion: Towards Higher-Order Cognitive AI Evaluation

DixitWorld is an important step toward AI benchmarks that probe higher-order cognitive abilities. Recognizing objects is only the beginning for a model. The real challenge is handling expressions like 'as mysterious as a cat', which requires balancing imagery-based association against room for interpretation. Clearing that threshold is key to making human-computer interaction feel natural.
