Zing Forum

Reading

Multimodal Large Language Models Playing Tetris: Benchmark Tests Reveal the True Capabilities of Visual Reasoning

A groundbreaking study systematically evaluated the visual understanding and spatial reasoning capabilities of multimodal LLMs (such as GPT-4V, Gemini Pro Vision, and LLaVA-13b) by having them play Tetris, and established a $200 prize to incentivize the community to develop better prompt strategies.

Tags: Multimodal LLM · Visual Reasoning · Tetris · Benchmark · GPT-4V · Gemini Pro Vision · LLaVA · Prompt Engineering · AI Agent · Spatial Reasoning
Published 2026-04-26 08:37 · Recent activity 2026-04-26 08:48 · Estimated read 5 min

Section 01

[Main Post/Introduction] Multimodal Large Language Models Playing Tetris: Benchmark Tests Reveal the True Capabilities of Visual Reasoning

An open-source project called "Models Playing Tetris" systematically evaluates the visual understanding and spatial reasoning capabilities of multimodal large language models (including GPT-4V, Gemini Pro Vision, and LLaVA-13b) by having them play Tetris. It also sets up a $200 prize to incentivize the community to optimize prompt strategies, providing experimental data to understand the current boundaries of AI visual reasoning.


Section 02

Research Background and Motivation

With the development of vision-language models like GPT-4V and Gemini Pro Vision, the industry expects them to "understand" images and make decisions. However, most benchmarks focus on static image understanding and lack evaluation of dynamic interactive scenarios. Tetris requires continuous observation of the board state, prediction of landing positions, and planning of action sequences; these are core skills for next-generation AI agents, and evaluating them in an interactive game fills this gap.


Section 03

Testing Methods and Experimental Design

Three models—GPT-4V, Gemini Pro Vision, and LLaVA-13b—were tested using four prompt strategies: basic prompt, few-shot learning (k=2), Chain of Thought (CoT), and CoT + few-shot combination. The core metric was "average number of placed blocks", with a random movement baseline (about 11.5 blocks) as a reference.
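The four prompt strategies differ only in how the chat messages are assembled. A minimal sketch of that assembly is below; the message layout, constant names, and example contents are our own illustration (the project's real prompts live in assets/prompts.json and may be structured differently):

```python
# Hypothetical assembly of the four prompt strategies: basic,
# few-shot (k=2), CoT, and CoT + few-shot. All text is invented
# for illustration, not taken from the project's actual prompts.

BASE = "You are playing Tetris. Given the board screenshot, choose your moves."
COT = "Think step by step: identify the falling piece, scan column heights, then decide."

# k=2 worked examples for the few-shot variants (contents invented)
FEWSHOT = [
    {"role": "user", "content": "[example board 1]"},
    {"role": "assistant", "content": "move left, rotate, drop"},
    {"role": "user", "content": "[example board 2]"},
    {"role": "assistant", "content": "move right, drop"},
]

def build_messages(strategy: str, screenshot_desc: str) -> list[dict]:
    """Build the chat messages for one of: basic, fewshot, cot, cot_fewshot."""
    system = BASE if strategy in ("basic", "fewshot") else BASE + " " + COT
    msgs = [{"role": "system", "content": system}]
    if strategy in ("fewshot", "cot_fewshot"):
        msgs += FEWSHOT
    msgs.append({"role": "user", "content": screenshot_desc})
    return msgs
```

Each game step would render the board to a screenshot, build the messages for the configured strategy, and send them to the model; the average number of placed blocks is then compared against the random baseline.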


Section 04

Analysis of Key Experimental Results

  1. GPT-4V's best performance was 21.2 blocks (CoT + few-shot, multiple actions per screenshot), significantly better than the random baseline.
  2. Gemini Pro Vision was highly volatile: its best configuration reached nearly 20 blocks while others were close to random, highlighting the decisive impact of prompt engineering.
  3. LLaVA-13b peaked at 10.7 blocks, comparable to the random baseline, reflecting the capability gap between open-source and closed-source models.
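The headline metric is simply the mean pieces placed across runs, judged against the ~11.5-block random baseline. A small sketch of that comparison (the per-game counts below are invented, not the study's raw data):

```python
# Average-pieces metric vs. the random-movement baseline (~11.5 blocks).
# The sample run counts are invented purely to illustrate the arithmetic.
RANDOM_BASELINE = 11.5

def average_pieces(games: list[int]) -> float:
    """Mean number of placed blocks across a list of per-game counts."""
    return sum(games) / len(games)

def beats_baseline(games: list[int], margin: float = 0.0) -> bool:
    """True if the configuration's average clears the baseline plus margin."""
    return average_pieces(games) > RANDOM_BASELINE + margin

sample = [19, 23, 22, 20, 22]      # invented run results
print(average_pieces(sample))       # 21.2
print(beats_baseline(sample))       # True
```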

Section 05

$200 Community Incentive Mechanism

The research team established a prize for contributors who exceed the current best results (Gemini Pro Vision: 19.96 blocks; GPT-4V: 21.2 blocks) by at least 10 blocks. The payout is calculated as min(2 × achieved_pieces, 200) USD, to attract the community to optimize prompt strategies. The prize is currently still open.
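The payout rule is easy to express directly. A sketch, assuming eligibility means beating the stated best result by at least 10 blocks (the function and parameter names are ours, not the project's):

```python
def prize_usd(achieved_pieces: float, best_so_far: float = 21.2,
              required_margin: float = 10.0) -> float:
    """Prize per the stated rule: payout only if the achieved average
    exceeds the current best by at least `required_margin` blocks, then
    min(2 * achieved_pieces, 200) USD. Eligibility interpretation is ours."""
    if achieved_pieces < best_so_far + required_margin:
        return 0.0
    return min(2 * achieved_pieces, 200.0)

print(prize_usd(25.0))    # 0.0  -- improvement under 10 blocks, ineligible
print(prize_usd(31.5))    # 63.0
print(prize_usd(150.0))   # 200.0 -- capped at $200
```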


Section 06

Technical Implementation and Reproducibility

The project is implemented in Python, with dependencies managed by uv and models called through the LiteLLM interface. It supports custom prompts (added to assets/prompts.json) and uses the zeroize318 open-source Tetris engine to ensure a stable environment. Experimental data is saved locally, and analysis tools compute statistics such as performance and number of lines cleared.


Section 07

Implications for AI Development

This study reveals the actual capability of multimodal AI in dynamic visual tasks: the models show some spatial planning and decision-making ability, but still fall short in long-horizon planning and complex reasoning. This matters for AI agent development: only when models perform stably in controlled environments like Tetris can we expect reliable behavior in complex real-world scenarios.