
Comparative Evaluation of Multimodal Image Captioning Models: Semantic Alignment Analysis of Open-Source vs. Commercial Solutions

This study evaluates the image captioning performance of two multimodal vision-language models, Gemini 2.5 Flash-Lite and Qwen3-VL-8B, on the Flickr8k dataset, using ROUGE-L and BERTScore to analyze their semantic alignment capabilities and weighing their deployment trade-offs.

Tags: Multimodal Models · Image Captioning · Vision-Language Models · Model Evaluation · Semantic Alignment · Open-Source vs. Commercial · Flickr8k · BERTScore · ROUGE-L · Gemini
Published 2026-04-30 15:38 · Recent activity 2026-04-30 15:53 · Estimated read: 7 min

Section 01

Introduction

This project evaluates the image captioning performance of the commercial model Gemini 2.5 Flash-Lite and the open-source model Qwen3-VL-8B-Abliterated-Caption-it on the Flickr8k dataset. It analyzes their semantic alignment capabilities using ROUGE-L and BERTScore, and discusses deployment-level trade-offs to give developers and research teams a reference for model selection.

Section 02

Research Background and Core Questions

Multimodal large language models (MLLMs) are transforming the intersection of computer vision and natural language processing, yet developers face an information asymmetry when choosing between a commercial API and open-source local deployment. The core question: given the same dataset, how do commercial and open-source vision-language models compare at generating semantically accurate image captions? The two representative models compared in this study are the commercial Gemini 2.5 Flash-Lite (API access) and the open-source Qwen3-VL-8B-Abliterated-Caption-it (local inference via Hugging Face).
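
As a concrete illustration of the two access paths, here is a minimal sketch, not the project's actual harness: the google-genai SDK usage, the Hugging Face repository path, the transformers class names, and the prompt text are all assumptions, and the chat-template handling of images varies by transformers version.

```python
# Sketch of the two access paths (all identifiers below are illustrative).
from PIL import Image

PROMPT = "Describe this image in one sentence."  # stand-in for the study's neutral prompt

# --- Commercial route: Gemini 2.5 Flash-Lite over the API (google-genai SDK) ---
from google import genai

client = genai.Client()  # reads the API key from the environment

def caption_gemini(image_path: str) -> str:
    resp = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[Image.open(image_path), PROMPT],
    )
    return resp.text

# --- Open-source route: Qwen3-VL-8B local inference (Hugging Face Transformers) ---
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen3-VL-8B-Abliterated-Caption-it"  # placeholder; use the actual hub path
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def caption_qwen(image_path: str) -> str:
    messages = [{"role": "user", "content": [
        {"type": "image", "image": Image.open(image_path)},
        {"type": "text", "text": PROMPT},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the prompt.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0].strip()
```

Keeping both routes behind the same `caption_*(image_path)` signature is what makes the downstream evaluation loop model-agnostic.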

Section 03

Evaluation Methods and Experimental Design

The Flickr8k dataset was selected (8,000 images, each with five human-written reference captions); samples were chosen with a fixed random seed to ensure fairness. Evaluation workflow: load images → apply a standardized neutral prompt → generate captions with each model → store results → compute semantic metrics. Tech stack: Python, Google Colab, Hugging Face Transformers, ROUGE/BERTScore evaluation tools.
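
The workflow condenses to a loop like the following, a sketch under assumptions: it reuses the hypothetical `caption_gemini` / `caption_qwen` helpers from the previous snippet, presumes a prepared `flickr8k_refs.json` mapping image paths to their five reference captions, and picks an illustrative subset size.

```python
import json
import random

random.seed(42)  # fixed seed so both models see the identical sample

# Assumes a prepared mapping {image_path: [five reference captions]} from Flickr8k.
with open("flickr8k_refs.json") as f:  # hypothetical prepared file
    refs = json.load(f)

sample = random.sample(sorted(refs), k=100)  # subset size illustrative

results = []
for path in sample:
    results.append({
        "image": path,
        "references": refs[path],
        "gemini": caption_gemini(path),  # helpers from the access-path sketch
        "qwen": caption_qwen(path),
    })

# Persist generated captions for the metric stage.
with open("captions_out.json", "w") as f:
    json.dump(results, f, indent=2)
```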

Section 04

Rationale for Evaluation Metric Selection

BLEU was initially considered, but we switched to metrics that better reflect semantic similarity (a usage sketch follows this list):

  1. ROUGE-L: measures surface similarity via the longest common subsequence, capturing sentence structure and word order;
  2. BERTScore: computes semantic similarity from pre-trained contextual embeddings, reporting precision, recall, and F1 scores.

METEOR was not included in the final analysis due to implementation constraints.
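
A minimal sketch of the metric stage, assuming the `rouge-score` and `bert-score` packages. Flickr8k provides five references per image; taking the best-matching reference for ROUGE-L is one common convention (an assumption here, not necessarily the project's choice), while `bert-score` accepts multi-reference lists directly.

```python
import json
from rouge_score import rouge_scorer
from bert_score import score as bert_score

with open("captions_out.json") as f:  # produced by the workflow sketch
    results = json.load(f)

rl = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_best(prediction: str, references: list[str]) -> float:
    # Multi-reference handling (an assumption): keep the best-matching reference.
    return max(rl.score(ref, prediction)["rougeL"].fmeasure for ref in references)

for model_name in ("gemini", "qwen"):
    cands = [r[model_name] for r in results]
    ref_lists = [r["references"] for r in results]

    rouge_avg = sum(rouge_l_best(c, rs) for c, rs in zip(cands, ref_lists)) / len(cands)

    # bert-score accepts a list of reference lists for multi-reference scoring.
    P, R, F1 = bert_score(cands, ref_lists, lang="en")

    print(f"{model_name}: ROUGE-L={rouge_avg:.4f}  "
          f"BERTScore P/R/F1={P.mean().item():.4f}/"
          f"{R.mean().item():.4f}/{F1.mean().item():.4f}")
```

Averaging per-image scores over the same fixed sample keeps the two models directly comparable.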

Section 05

Evaluation Results and Key Findings

Overall performance: Gemini 2.5 Flash-Lite outperformed Qwen3-VL-8B on average ROUGE-L and BERTScore, and BERTScore F1 indicated stronger semantic alignment in complex scenarios. Qwen3-VL-8B generated coherent captions but showed high variance in action-dense scenes (quantified in the sketch after the breakdown below). Scene breakdown:

  • Person-centric scenes: the commercial model consistently captured relational dynamics, while the open-source model occasionally missed details;
  • Object-centric scenes: performance was comparable between the two;
  • Complex interaction scenes: the commercial model achieved more accurate semantic alignment, while the open-source model tended to overgeneralize.

Key observations: the commercial model showed a more consistent understanding of interpersonal relationships; the open-source model occasionally produced incomplete descriptions of complex actions; the gap in object recognition was minimal.
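
The variance observation can be made concrete with a small per-image aggregation; this sketch assumes the `results` list and `rouge_l_best` helper from the earlier snippets.

```python
from statistics import mean, stdev

# Per-image ROUGE-L scores per model, reusing rouge_l_best and results from above.
per_image = {
    name: [rouge_l_best(r[name], r["references"]) for r in results]
    for name in ("gemini", "qwen")
}

for name, vals in per_image.items():
    # A larger standard deviation flags inconsistent captions (e.g. action-dense scenes).
    print(f"{name}: mean={mean(vals):.4f}  stdev={stdev(vals):.4f}")
```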

Section 06

Deployment Trade-off Analysis

Commercial solution (Gemini):

  • Advantages: accurate semantic alignment, no hardware investment, ready to use out of the box;
  • Disadvantages: API rate limits, latency subject to network conditions, cost scales with usage, opaque architecture.

Open-source solution (Qwen3):

  • Advantages: transparent and reproducible, full control over preprocessing and inference configuration, no API costs, supports offline deployment, facilitates research;
  • Disadvantages: requires local compute resources, Colab stability and memory constraints, slightly weaker performance in complex scenarios.

Section 07

Project Limitations and Future Directions

Current limitations: the evaluated sample was reduced due to API rate limits and runtime constraints; the lack of formal scene-category labels prevented deeper per-category statistics; the commercial model's architecture details are unavailable. Future directions: incorporate human evaluation to complement the automatic metrics; conduct segmented analysis based on category labels; experiment with prompt variations; perform cost-performance benchmarking.

Section 08

Conclusions and Implications

Core implications: model selection is a multi-dimensional decision. Commercial models offer better semantic accuracy, but the transparency, reproducibility, and deployment flexibility of open-source models can outweigh that advantage in specific scenarios. Understanding these trade-offs supports informed model selection, and this project's evaluation methodology offers a reference framework for subsequent multimodal model comparisons.