# BloomBench: A Bilingual Visual-Language Model Evaluation Benchmark Based on Bloom's Taxonomy of Cognitive Objectives

> BloomBench, developed by the Qatar Computing Research Institute (QCRI), is a bilingual (English-Arabic) multimodal evaluation benchmark. It systematically assesses the reasoning capabilities of visual-language models (VLMs) across six cognitive levels based on Bloom's Taxonomy of Cognitive Objectives, revealing the cognitive asymmetry of current VLMs in cross-lingual multimodal reasoning.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-06T18:45:43.000Z
- Last activity: 2026-06-06T18:50:45.013Z
- Popularity: 152.9
- Keywords: visual-language models, evaluation benchmark, Bloom's taxonomy, multimodal, bilingual evaluation, Arabic, cognitive reasoning, machine learning, artificial intelligence
- Page link: https://www.zingnex.cn/en/forum/thread/bloombench
- Canonical: https://www.zingnex.cn/forum/thread/bloombench

---

## Introduction: BloomBench, a Bilingual Visual-Language Model Evaluation Benchmark Based on Bloom's Taxonomy of Cognitive Objectives

The Qatar Computing Research Institute (QCRI) launched BloomBench on June 6, 2026. It is a bilingual (English-Arabic) multimodal evaluation benchmark based on Bloom's Taxonomy of Cognitive Objectives, designed to systematically assess the reasoning capabilities of visual-language models (VLMs) across six cognitive levels: memory, comprehension, application, analysis, evaluation, and creation. It reveals the cognitive asymmetry of current VLMs in cross-lingual multimodal reasoning.

Source Information:
- Original Author/Maintainer: QCRI
- Source Platform: GitHub
- Original Title: Almieyar-Oryx-BloomBench
- Original Link: https://github.com/qcri/Almieyar-Oryx-BloomBench
- Paper Link: https://arxiv.org/abs/2606.05531
- Dataset: https://huggingface.co/datasets/QCRI/BloomBench
- Release Date: June 6, 2026

## Background: Current VLM Evaluations Lack Systematic Diagnosis of Cognitive Capabilities

Most current evaluation benchmarks for visual-language models (VLMs) focus on isolated tasks or overall accuracy and offer no systematic diagnosis of a model's cognitive capabilities. They leave key questions unanswered: How do models perform across different cognitive levels? Do they truly understand content, or do they only perform pattern matching? To address this, QCRI launched BloomBench, which analyzes how a model's capabilities are distributed across six cognitive levels instead of reporting only a final accuracy figure.

## Methodology: Transformation of Bloom's Cognitive Levels and Data Generation Process

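BloomBench converts the six levels of the revised Bloom's Taxonomy (Remember, Understand, Apply, Analyze, Evaluate, Create; listed below under the names used in the source) into specific visual question-answering (VQA) tasks: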
1. Memory: Basic perceptual abilities such as identifying/recalling objects and attributes in images;
2. Comprehension: Combinatorial/relational understanding (semantics, emotion, etc.);
3. Application: Applying knowledge/rules in new scenarios (e.g., negation reasoning);
4. Analysis: Decomposition and reasoning (logic, context, chart analysis, etc.);
5. Evaluation: Judgment abilities (consistency checks, safety assessments, etc.);
6. Creation: Discriminative creativity (selecting the best synthetic result from options).

Data Generation Process: The pipeline combines scenario design with cognition-oriented Q&A generation using Gemini 2.5 Pro, together with a multiple-choice question converter and an Arabic translator. Quality is verified via LLM-as-judge plus human arbitration. All samples are four-option multiple-choice questions with distractor options. Images are collected from the web, and the translation step is designed to preserve semantic alignment between the English and Arabic versions.
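To make the sample format concrete, here is a minimal sketch of how one bilingual four-option item could be represented and validated. The class name `BloomBenchItem`, every field name, and the example values are hypothetical and are not taken from the published dataset schema. The sketch also assumes that the option order is preserved in translation, so that one answer label serves both languages.

```python
from dataclasses import dataclass
from typing import Tuple

# The six cognitive levels, named as in this article.
LEVELS = ("Memory", "Comprehension", "Application",
          "Analysis", "Evaluation", "Creation")
OPTION_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class BloomBenchItem:
    """Hypothetical record layout for one bilingual multiple-choice sample."""
    image_path: str
    level: str
    task_type: str
    question_en: str
    question_ar: str
    options_en: Tuple[str, ...]
    options_ar: Tuple[str, ...]
    answer: str  # one of "A".."D"; assumed to be shared by both languages

    def __post_init__(self) -> None:
        # Reject records that break the four-option format described above.
        if self.level not in LEVELS:
            raise ValueError(f"unknown cognitive level: {self.level!r}")
        if len(self.options_en) != 4 or len(self.options_ar) != 4:
            raise ValueError("each language needs exactly four options")
        if self.answer not in OPTION_LABELS:
            raise ValueError(f"answer must be one of {OPTION_LABELS}")


# Illustrative item only; it is not a real sample from the benchmark.
item = BloomBenchItem(
    image_path="images/example.jpg",
    level="Memory",
    task_type="color_recognition",
    question_en="What color is the car?",
    question_ar="ما لون السيارة؟",
    options_en=("Red", "Blue", "Green", "Yellow"),
    options_ar=("أحمر", "أزرق", "أخضر", "أصفر"),
    answer="A",
)
```

Validating at construction time means a malformed item (for example, one with a missing distractor) fails early instead of silently skewing the scores.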

## Evidence: Dataset Scale and Quality Control

BloomBench contains 7747 bilingual image-question-answer samples, covering 106 task types and all six cognitive levels:
- Memory: 2948 samples
- Comprehension: 1592 samples
- Application: 499 samples
- Analysis: 1431 samples
- Evaluation: 592 samples
- Creation: 685 samples
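
The per-level counts above can be checked against the stated total and turned into shares with a few lines of Python. This is only an arithmetic sketch; the names `LEVEL_COUNTS` and `level_shares` are illustrative.

```python
# Per-level sample counts as listed above.
LEVEL_COUNTS = {
    "Memory": 2948,
    "Comprehension": 1592,
    "Application": 499,
    "Analysis": 1431,
    "Evaluation": 592,
    "Creation": 685,
}


def level_shares(counts):
    """Return each level's share of the dataset, in percent."""
    total = sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}


total = sum(LEVEL_COUNTS.values())  # 7747, matching the stated dataset size
shares = level_shares(LEVEL_COUNTS)

for level, share in shares.items():
    print(f"{level:<14}{LEVEL_COUNTS[level]:>6}{share:>7.1f}%")
```

Memory accounts for roughly 38% of the samples and Application for roughly 6%, so a single overall accuracy is dominated by the Memory level. This is one reason per-level reporting is more informative than an aggregate score.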

Quality Control: A stratified sample of 969 samples (about 1/8 of the dataset) was audited using Gemini 3 Pro, which flagged only 15 errors. After human verification, the quality rate was 98.45% (954 of 969).
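
The audit figures can be reproduced directly. The sketch below assumes that the 98.45% quality rate is the share of audited samples without a confirmed error, which is consistent with 15 errors out of 969.

```python
DATASET_SIZE = 7747  # total samples in the benchmark
AUDITED = 969        # stratified audit sample
ERRORS = 15          # errors reported for the audit sample

# Fraction of the dataset that was audited (about one eighth).
sampling_fraction = AUDITED / DATASET_SIZE

# Share of audited samples without an error.
quality_rate = (AUDITED - ERRORS) / AUDITED

print(f"audited fraction: {sampling_fraction:.3f}")
print(f"quality rate:     {quality_rate:.2%}")
```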

## Findings: Cognitive Asymmetry and Cross-Lingual Gaps in VLMs

BloomBench supports two scoring modes:
1. RAE (Regular Expression Answer Extraction): Parses the chosen option out of the model's free-form output, reflecting how users actually interact with the model;
2. LBS (Likelihood-Based Scoring): Scores each option by its length-normalized conditional log probability and picks the highest, which reduces dependence on output format.
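
The two scoring modes can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the regular expressions are guesses at typical answer formats, the normalization is assumed to be per token, and the log probabilities in `example` are made-up placeholders that a real run would obtain from the model for each option given the image and question.

```python
import re
from typing import Mapping, Optional, Sequence

# Assumed patterns; the benchmark's actual expressions are not given here.
# "answer is (B)", "Answer: C" and similar phrasings.
_ANSWER_PHRASE = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\b")
# A bare option letter at the start of the output: "B", "(C) ...", "D. ...".
_LEADING_LETTER = re.compile(r"^\s*\(?([A-D])\s*(?:[).:]|$)")


def extract_choice(output: str) -> Optional[str]:
    """RAE-style parsing: return the option letter, or None if none is found."""
    for pattern in (_ANSWER_PHRASE, _LEADING_LETTER):
        match = pattern.search(output)
        if match:
            return match.group(1)
    return None


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """Mean log probability per token of one option's continuation."""
    if not token_logprobs:
        raise ValueError("an option needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pick_by_likelihood(option_logprobs: Mapping[str, Sequence[float]]) -> str:
    """LBS-style choice: the option with the highest length-normalized score."""
    return max(option_logprobs,
               key=lambda label: length_normalized_score(option_logprobs[label]))


# Placeholder token log probabilities for four options of different lengths.
example = {
    "A": [-0.9],
    "B": [-0.3, -0.3, -0.3, -0.3],
    "C": [-1.5, -1.5],
    "D": [-2.0],
}

rae_choice = extract_choice("The answer is (B).")
lbs_choice = pick_by_likelihood(example)
raw_sum_choice = max(example, key=lambda label: sum(example[label]))
```

In `example`, summing raw log probabilities favors the one-token option A, while the per-token mean favors the longer option B. This shows why length normalization matters when options differ in length. An RAE parser like this one returns `None` for output that never states a letter, and how such cases are counted is a separate design decision.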

Key Findings:
- Gemma4 31B leads in RAE accuracy (89.8% in English/87.6% in Arabic) but struggles in LBS;
- Qwen2.5-VL-7B has the strongest internal consistency; the Gemma3 series shows inverse scaling in LBS (27B has the highest RAE but the steepest LBS drop);
- Arabic lags behind English across the board, with the Gemma3 series having the smallest cross-lingual gap; Spanish ablation experiments confirm the gap stems from tokenization fertility and non-English probability priors.
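
As a rough illustration of the quantities mentioned above, the sketch below computes the English-Arabic gap from the reported Gemma4 31B RAE accuracies and defines tokenization fertility as the average number of tokens per whitespace-separated word. The `toy_tokenize` function is a hypothetical stand-in that splits words into three-character chunks. A real measurement would use each model's own tokenizer, and this summary does not state how the paper measures fertility.

```python
from typing import Callable, Iterable, List


def accuracy_gap(english_acc: float, arabic_acc: float) -> float:
    """Cross-lingual gap in percentage points (English minus Arabic)."""
    return english_acc - arabic_acc


def fertility(texts: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: split every word into chunks of up to 3 characters."""
    return [word[i:i + 3] for word in text.split()
            for i in range(0, len(word), 3)]


# Reported RAE accuracies for Gemma4 31B (English, Arabic).
gap = accuracy_gap(89.8, 87.6)
print(f"RAE gap: {gap:.1f} points")

# Fertility of an English and an Arabic phrase under the toy tokenizer.
print(fertility(["the red car"], toy_tokenize))
print(fertility(["السيارة الحمراء"], toy_tokenize))
```

Higher fertility means the same content is spread over more tokens, which can affect likelihood-based scores even when the underlying answer is the same.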

## Implications: Insights and Recommendations for VLM Development

Insights from BloomBench for VLM development:
1. Uneven cognitive capability distribution: Discriminative skills (e.g., comprehension, evaluation) are strong, but factual recall, procedural application, and creative synthesis are weak;
2. Persistent cross-lingual gaps: The Arabic-English gap poses challenges for multilingual applications;
3. Importance of evaluation methods: It is recommended to report both RAE and LBS for comprehensive assessment.

## Conclusion: Value of Cognitive-Oriented Evaluation

BloomBench provides a cognitive-oriented VLM evaluation framework that focuses not only on 'accuracy' but also on 'performance across cognitive levels'. This fine-grained diagnosis helps understand the strengths and limitations of VLMs and guides model improvement. As multimodal AI becomes more prevalent, such cognitive evaluation benchmarks will play an important role in ensuring AI reliability and safety.
