Zing Forum

BloomBench: A Bilingual Visual-Language Model Evaluation Benchmark Based on Bloom's Taxonomy of Cognitive Objectives

BloomBench, developed by the Qatar Computing Research Institute (QCRI), is a bilingual (English-Arabic) multimodal evaluation benchmark. It systematically assesses the reasoning capabilities of visual-language models (VLMs) across six cognitive levels based on Bloom's Taxonomy of Cognitive Objectives, revealing the cognitive asymmetry of current VLMs in cross-lingual multimodal reasoning.

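visual-language models, evaluation benchmarks, Bloom's Taxonomy, multimodal, bilingual evaluation, Arabic, cognitive reasoning, machine learning, artificial intelligence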
Published 2026-06-07 02:45 | Recent activity 2026-06-07 02:50 | Estimated read 8 min

Section 01

[Introduction] BloomBench: A Bilingual Visual-Language Model Evaluation Benchmark Based on Bloom's Taxonomy of Cognitive Objectives

The Qatar Computing Research Institute (QCRI) launched BloomBench on June 6, 2026. It is a bilingual (English-Arabic) multimodal evaluation benchmark built on Bloom's Taxonomy of Cognitive Objectives. It systematically assesses the reasoning capabilities of visual-language models (VLMs) at six cognitive levels (memory, comprehension, application, analysis, evaluation, and creation) and reveals a cognitive asymmetry in current VLMs: their cross-lingual multimodal reasoning is uneven across cognitive levels and across the two languages.


Section 02

Background: Current VLM Evaluations Lack Systematic Diagnosis of Cognitive Capabilities

Most current benchmarks for visual-language models (VLMs) focus on isolated tasks or overall accuracy and do not systematically diagnose a model's cognitive capabilities. They leave key questions unanswered: How does a model perform at different cognitive levels? Does it truly understand the content, or does it merely match patterns? To address this gap, QCRI launched BloomBench, which analyzes how a model's capabilities are distributed across six cognitive levels rather than reporting final accuracy alone.


Section 03

Methodology: Transformation of Bloom's Cognitive Levels and Data Generation Process

BloomBench converts the six levels of the revised Bloom's Taxonomy into specific visual question-answering (VQA) tasks:

  1. Memory: Basic perceptual abilities such as identifying/recalling objects and attributes in images;
  2. Comprehension: Combinatorial/relational understanding (semantics, emotion, etc.);
  3. Application: Applying knowledge/rules in new scenarios (e.g., negation reasoning);
  4. Analysis: Decomposition and reasoning (logic, context, chart analysis, etc.);
  5. Evaluation: Judgment abilities (consistency checks, safety assessments, etc.);
  6. Creation: Discriminative creativity (selecting the best synthetic result from options).

Data Generation Process: The pipeline combines scenario design with cognition-oriented Q&A generation using Gemini 2.5 Pro, together with a multiple-choice question converter and an Arabic translator. Quality is verified via LLM-as-judge plus human arbitration. All samples are four-option multiple-choice questions with distractor options. Images are collected from the web, and the Arabic translations are kept semantically aligned with the English text.
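
To make the sample format described above concrete, here is a minimal Python sketch of one bilingual four-option item. The class name, field names, placeholder image path, and example question are all hypothetical and chosen only for illustration; the actual BloomBench schema is not given in this post.

```python
from dataclasses import dataclass
from typing import Tuple

# Level names as used in this post's list of the six cognitive levels.
LEVELS = ("Memory", "Comprehension", "Application",
          "Analysis", "Evaluation", "Creation")


@dataclass(frozen=True)
class BilingualSample:
    """One image-question-answer item (all field names are hypothetical)."""
    image_path: str
    level: str
    question_en: str
    question_ar: str
    options_en: Tuple[str, ...]
    options_ar: Tuple[str, ...]
    answer_index: int  # position of the correct option, shared by both languages

    def __post_init__(self) -> None:
        # Enforce the constraints stated in the text: a known level and four options.
        if self.level not in LEVELS:
            raise ValueError(f"unknown cognitive level: {self.level!r}")
        if len(self.options_en) != 4 or len(self.options_ar) != 4:
            raise ValueError("each language needs exactly four options")
        if not 0 <= self.answer_index < 4:
            raise ValueError("answer_index must be in 0..3")


sample = BilingualSample(
    image_path="images/example.jpg",  # placeholder path, not a real dataset file
    level="Memory",
    question_en="What color is the car?",
    question_ar="ما لون السيارة؟",
    options_en=("Red", "Blue", "Green", "Yellow"),
    options_ar=("أحمر", "أزرق", "أخضر", "أصفر"),
    answer_index=1,
)
print(sample.options_en[sample.answer_index])  # the correct English option
```

Keeping a single answer index for both languages is one simple way to guarantee that the English and Arabic versions of an item stay aligned.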


Section 04

Evidence: Dataset Scale and Quality Control

BloomBench contains 7747 bilingual image-question-answer samples, covering 106 task types and all six cognitive levels:

  • Memory: 2948 samples
  • Comprehension: 1592 samples
  • Application: 499 samples
  • Analysis: 1431 samples
  • Evaluation: 592 samples
  • Creation: 685 samples
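
As a quick sanity check on the figures above, the short Python sketch below confirms that the per-level counts add up to the stated total and computes each level's share. The variable and function names are illustrative only.

```python
# Per-level sample counts as listed in this post.
LEVEL_COUNTS = {
    "Memory": 2948,
    "Comprehension": 1592,
    "Application": 499,
    "Analysis": 1431,
    "Evaluation": 592,
    "Creation": 685,
}
STATED_TOTAL = 7747


def level_shares(counts):
    """Return each level's fraction of the total number of samples."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


assert sum(LEVEL_COUNTS.values()) == STATED_TOTAL

for level, share in level_shares(LEVEL_COUNTS).items():
    print(f"{level:<14}{share:6.1%}")
```

The distribution is noticeably skewed: Memory accounts for roughly 38% of the samples and Application for roughly 6%, so per-level accuracy is more informative than a single pooled score.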

Quality Control: A stratified subset of 969 samples (about 1/8 of the dataset) was audited using Gemini 3 Pro, and only 15 errors were found. After human verification, the quality rate reached 98.45%.
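
The audit numbers can be reproduced with simple arithmetic. The sketch below assumes the quality rate is the share of audited samples without errors; that reading is an assumption, but it matches the reported figure.

```python
TOTAL_SAMPLES = 7747   # dataset size stated in this post
AUDITED = 969          # size of the stratified audit subset
ERRORS = 15            # errors found in the audit

audit_fraction = AUDITED / TOTAL_SAMPLES        # close to 1/8
quality_rate = (AUDITED - ERRORS) / AUDITED     # error-free share of audited samples

print(f"audited fraction: {audit_fraction:.4f}")
print(f"quality rate:     {quality_rate:.2%}")
```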


Section 05

Findings: Cognitive Asymmetry and Cross-Lingual Gaps in VLMs

BloomBench supports two scoring modes:

  1. RAE (Regular Expression Answer Extraction): Extracts the chosen option from the model's free-form output, reflecting real user scenarios;
  2. LBS (Likelihood-Based Scoring): Scores each option by its length-normalized conditional log probability, reducing dependence on output format.
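
The two scoring modes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BloomBench's actual implementation: the option labels A-D, the regular expressions, the function names, and per-token (rather than per-character) length normalization are all assumptions, and the token log probabilities are taken as given rather than computed by a model.

```python
import re
from typing import Dict, List, Optional

# "answer"/"is" match case-insensitively, but the option letter must be
# uppercase, so the article "a" in "the answer is a red car" is not read as A.
_EXPLICIT = re.compile(r"(?i:answer\s*(?:is)?)\s*[:\-]?\s*\(?([A-D])\b")
# Fallback: the whole output is just an option letter, e.g. "B", "(B)", "B."
_BARE = re.compile(r"\s*\(?([A-D])\)?[.:]?\s*")


def extract_answer(output: str) -> Optional[str]:
    """RAE-style sketch: pull an option letter out of free-form text."""
    match = _EXPLICIT.search(output)
    if match:
        return match.group(1)
    match = _BARE.fullmatch(output)
    return match.group(1) if match else None


def lbs_choice(token_logprobs: Dict[str, List[float]],
               length_normalize: bool = True) -> str:
    """LBS-style sketch: pick the option with the best (mean) log probability.

    token_logprobs maps an option label to the log probabilities of that
    option's tokens, conditioned on the image and question.
    """
    scores = {}
    for label, logprobs in token_logprobs.items():
        if not logprobs:
            raise ValueError(f"option {label!r} has no tokens")
        total = sum(logprobs)
        scores[label] = total / len(logprobs) if length_normalize else total
    return max(scores, key=scores.get)


print(extract_answer("The answer is (C)."))
print(extract_answer("The answer is a red car"))

# Made-up log probabilities: option A is longer but more likely per token.
toy = {"A": [-0.5, -0.5, -0.5, -0.5], "B": [-0.9]}
print(lbs_choice(toy), lbs_choice(toy, length_normalize=False))
```

In the toy example the raw sum favors the one-token option B simply because it is shorter, while the length-normalized score favors A. The extraction function returns None when it cannot find an option, which a harness would typically count as an incorrect answer; this is why RAE results depend on how well a model follows the output format, whereas LBS does not.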

Key Findings:

  • Gemma4 31B leads in RAE accuracy (89.8% in English, 87.6% in Arabic) but struggles under LBS;
  • Qwen2.5-VL-7B has the strongest internal consistency; the Gemma3 series shows inverse scaling under LBS (its 27B model has the highest RAE accuracy but the steepest LBS drop);
  • Arabic lags behind English across the board, with the Gemma3 series showing the smallest cross-lingual gap; ablation experiments with Spanish confirm that the gap stems from tokenization fertility and non-English probability priors.
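
Tokenization fertility is commonly defined as the average number of subword tokens a tokenizer produces per word; a higher value means the same sentence is split into more pieces. The Python sketch below shows the calculation only. The post does not say how the authors measured fertility, and `toy_tokenize` is a hypothetical stand-in for a real model tokenizer that simply cuts words into fixed-size chunks.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)


def toy_tokenize(word: str, max_len: int = 3) -> List[str]:
    """Hypothetical tokenizer: split a word into chunks of at most max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


print(fertility("the cat sat", toy_tokenize))
print(fertility("tokenization matters", toy_tokenize))
```

With a real tokenizer, the same function could be run on parallel English and Arabic questions to compare how finely each language is split.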

Section 06

Implications: Insights and Recommendations for VLM Development

Insights from BloomBench for VLM development:

  1. Uneven cognitive capability distribution: Discriminative skills (e.g., comprehension, evaluation) are strong, but factual recall, procedural application, and creative synthesis are weak;
  2. Persistent cross-lingual gaps: The Arabic-English gap poses challenges for multilingual applications;
  3. Importance of evaluation methods: It is recommended to report both RAE and LBS for comprehensive assessment.

Section 07

Conclusion: Value of Cognitive-Oriented Evaluation

BloomBench provides a cognitive-oriented VLM evaluation framework that focuses not only on 'accuracy' but also on 'performance across cognitive levels'. This fine-grained diagnosis helps understand the strengths and limitations of VLMs and guides model improvement. As multimodal AI becomes more prevalent, such cognitive evaluation benchmarks will play an important role in ensuring AI reliability and safety.