Zing Forum

SKKU Multimodal AI Challenge 2026: Building a Fair and Reliable Image-Text Visual Question Answering Model

A solution for the image-text visual question answering task of the 2026 Sungkyunkwan University (SKKU) Multimodal AI Challenge. It uses the Qwen3-VL MoE model and a multi-agent debate architecture to address data bias and the calibration of answer abstention.

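Tags: Multimodal AI, Visual Question Answering, VQA, Qwen3-VL, MoE, Multi-Agent, Bias Detection, Abstention Calibration, BBQ Dataset, Competition Solution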
Published 2026-06-04 00:42 | Recent activity 2026-06-04 00:52 | Estimated read 6 min

Section 01

Guide to the 2026 SKKU Multimodal AI Challenge Solution

The 2026 Sungkyunkwan University Multimodal AI Challenge centers on an image-text Visual Question Answering (VQA) task, with the goal of building a fair and reliable model. This solution uses the Qwen3-VL MoE model and a multi-agent debate architecture to tackle two problems: data bias and the calibration of answer abstention. Its text-first principle avoids image-induced bias and drives calibrated abstention decisions, which makes it a useful reference for designing fair multimodal AI systems.

Section 02

Competition Background and Challenge Objectives

The 2026 Sungkyunkwan University Multimodal AI Challenge asks participants to develop a fair and reliable image-text question answering model that reaches the balanced accuracy benchmark of 0.98-1.0. The core challenge is handling bias in multimodal data. Each sample in the dataset consists of an image, a text context, a question, and three answer options, one of which is an "unknown" option. The evaluation metric, balanced accuracy, is the mean of two accuracies: the accuracy on ambiguous samples and the accuracy on clear samples.
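
The metric described above can be sketched in a few lines. This is only an illustration of the definition as stated in this post, not the official scoring script; the function name `balanced_accuracy` and the `is_ambiguous` flag list are hypothetical and assume that sample types are available offline.

```python
from typing import Sequence


def balanced_accuracy(
    preds: Sequence[int],
    golds: Sequence[int],
    is_ambiguous: Sequence[bool],
) -> float:
    """Mean of the accuracy on ambiguous samples and the accuracy on clear samples.

    The sample-type flags are an assumption: they are only known offline,
    for example on a labeled development set.
    """
    ambiguous_hits = [p == g for p, g, a in zip(preds, golds, is_ambiguous) if a]
    clear_hits = [p == g for p, g, a in zip(preds, golds, is_ambiguous) if not a]
    if not ambiguous_hits or not clear_hits:
        raise ValueError("need at least one ambiguous and one clear sample")
    ambiguous_acc = sum(ambiguous_hits) / len(ambiguous_hits)
    clear_acc = sum(clear_hits) / len(clear_hits)
    return 0.5 * (ambiguous_acc + clear_acc)
```

Because the two groups are weighted equally regardless of size, a model that always answers "unknown" scores 1.0 on the ambiguous group and 0.0 on the clear group, so it cannot exceed 0.5 overall.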

Section 03

Analysis of Core Task Difficulties

  • Sample Differentiation: For ambiguous samples the correct choice is the unknown option, because the context gives no basis for a specific answer; for clear samples the correct choice is a specific answer. The sample type is hidden at inference time, which is what makes calibrated abstention difficult;
  • Image Bias: Images act as bait that induces bias; the real signal lies in the text;
  • Value of BBQ Dataset: It provides labels and pattern structures, which support offline balanced accuracy measurement and model tuning.

Section 04

Technical Architecture and Solution

  • Model Selection: The Qwen3-VL MoE model (31 billion total parameters, 3 billion activated) is used. Its advantages are fast inference (0.5 seconds per sample), support for multi-agent use, and memory efficiency (it runs on 48GB of VRAM);
  • Multi-agent Debate: A single model switches between roles (analyst, supporter, skeptic, referee) instead of loading a separate model per role, which saves memory;
  • Auxiliary Tools: An unknown-option detector identifies the position of the unknown option with 100% accuracy, which supports both passing that information along and calculating metrics offline.
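
An unknown-option detector of the kind mentioned above could be sketched as a phrase matcher. The post does not describe how the actual detector works, so everything here is an assumption: the function name `find_unknown_option` is hypothetical, and the phrase list is modeled on English BBQ-style wordings and may need adapting to the competition data.

```python
from typing import Optional, Sequence

# Assumed phrase list, modeled on English BBQ-style "unknown" wordings.
UNKNOWN_PHRASES = (
    "unknown",
    "not known",
    "cannot be determined",
    "can't be determined",
    "cannot answer",
    "can't answer",
    "not enough info",
    "not answerable",
    "undetermined",
)


def find_unknown_option(options: Sequence[str]) -> Optional[int]:
    """Return the index of the single option that reads as 'unknown'.

    Returns None when no option matches or when more than one matches.
    """
    hits = []
    for index, option in enumerate(options):
        # Normalize curly apostrophes, case, and whitespace before matching.
        text = " ".join(option.replace("\u2019", "'").lower().split())
        if any(phrase in text for phrase in UNKNOWN_PHRASES):
            hits.append(index)
    return hits[0] if len(hits) == 1 else None
```

Returning `None` on zero or multiple matches, rather than guessing, lets the caller flag those samples for manual inspection.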

Section 05

Core Strategy: Calibrated Abstention Mechanism

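  • Metric Monitoring: Strategies are tuned using two error rates: the over-commitment rate (the share of ambiguous samples for which a specific answer is selected) and the over-abstention rate (the share of clear samples for which the unknown option is selected);
  • Text-First Principle: First analyze whether the text context alone settles the question. If it does, select the specific answer it supports; otherwise select the unknown option. In both cases, bias cues from the image are ignored.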
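
The two monitoring rates and the text-first rule can be written down directly. The sketch below follows the definitions in the list above; the function names, the per-sample `unknown_idx` list, and the `context_is_clear` flag are hypothetical, and how clarity is actually judged (for example, by the model itself) is not specified in this post.

```python
from typing import Dict, Sequence


def abstention_error_rates(
    preds: Sequence[int],
    unknown_idx: Sequence[int],
    is_ambiguous: Sequence[bool],
) -> Dict[str, float]:
    """Over-commitment and over-abstention rates on a labeled offline set."""
    over_commit = over_abstain = n_ambiguous = n_clear = 0
    for pred, unknown, ambiguous in zip(preds, unknown_idx, is_ambiguous):
        if ambiguous:
            n_ambiguous += 1
            over_commit += pred != unknown   # specific answer on an ambiguous sample
        else:
            n_clear += 1
            over_abstain += pred == unknown  # "unknown" chosen on a clear sample
    return {
        "over_commitment": over_commit / n_ambiguous if n_ambiguous else 0.0,
        "over_abstention": over_abstain / n_clear if n_clear else 0.0,
    }


def text_first_decision(context_is_clear: bool, text_answer_idx: int, unknown_idx: int) -> int:
    """Answer from the text when it is clear; otherwise abstain. The image is not consulted."""
    return text_answer_idx if context_is_clear else unknown_idx
```

A high over-commitment rate suggests the decision rule is too eager to answer, while a high over-abstention rate suggests it is too cautious; tracking both shows which direction to adjust.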

Section 06

Execution Flow and Development Roadmap

  • Environment Usage: A local Mac is used for data inspection and code editing, while inference runs on Colab or an A6000 GPU (installation and running commands are provided);
  • Development Plan: The inference pipeline is complete; prompt optimization, a LangGraph implementation of the debate, and LoRA fine-tuning are still pending.

Section 07

Technical Innovations and Value

  • Bias Avoidance: The solution identifies images as bias bait and establishes a text-first framework;
  • Abstention Mechanism: The calibrated abstention approach can be applied to other AI scenarios that require reliability and uncertainty quantification;
  • Multi-agent Architecture: Role switching within a single model reduces memory requirements, which suits resource-constrained environments.
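
Single-model role switching can be illustrated with a plain loop in which only the role instruction changes between calls. This is a minimal sketch, not the project's implementation and not the pending LangGraph version: the `generate` callable stands in for the loaded model, the function name `run_debate` is hypothetical, and the role instructions are assumed wordings for the four roles named in this post.

```python
from typing import Callable, Dict, Sequence

# Assumed role instructions; the post names the roles but not their prompts.
ROLE_PROMPTS = {
    "analyst": "Analyze whether the text context alone determines the answer.",
    "supporter": "Argue for the specific answer best supported by the text.",
    "skeptic": "Argue that the context is insufficient and 'unknown' is correct.",
    "referee": "Weigh the arguments above and output the final option index.",
}


def run_debate(
    generate: Callable[[str], str],
    context: str,
    question: str,
    options: Sequence[str],
) -> Dict[str, str]:
    """Run the four roles in sequence through one shared model callable."""
    option_text = "; ".join(f"{i}: {opt}" for i, opt in enumerate(options))
    base = f"Context: {context}\nQuestion: {question}\nOptions: {option_text}"
    transcript: Dict[str, str] = {}
    for role in ("analyst", "supporter", "skeptic", "referee"):
        # Each role sees the earlier turns, so the referee reads the full debate.
        history = "\n".join(f"[{r}] {t}" for r, t in transcript.items())
        prompt = f"{ROLE_PROMPTS[role]}\n{base}\n{history}".strip()
        transcript[role] = generate(prompt)
    return transcript
```

Because every role goes through the same `generate` callable, only one set of model weights needs to be resident in memory, at the cost of running the roles sequentially.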

Section 08

Summary and Insights

This solution demonstrates a systematic approach to addressing multimodal AI bias: identifying bias sources through data analysis, establishing a calibrated decision mechanism, and adopting a resource-efficient architecture. Its text-first principle and calibrated abstention mechanism provide a reusable methodological framework for developing fair multimodal AI systems.