Zing Forum


Colon-Bench: A Large-Scale Colonoscopy Video Lesion Annotation Benchmark Dataset Based on Agent Workflow

The research team has released Colon-Bench, the largest colonoscopy video dataset to date. It achieves scalable, dense annotation of full-length procedure videos through a multi-stage agent workflow, providing an important benchmark for evaluating the capabilities of multimodal large language models (MLLMs) in medical video understanding.

Medical AI · Colonoscopy · Multimodal LLMs · Video Understanding · Datasets · Agent Workflow · Lesion Detection
Published 2026-03-27 00:58 · Recent activity 2026-03-27 12:50 · Estimated read 4 min

Section 01

Introduction / Main Floor



Section 02

Research Background

Early screening is crucial for colorectal cancer prevention, and colonoscopy is the primary screening method. However, developing robust AI systems faces a significant challenge: the lack of densely annotated long-sequence video datasets. Existing datasets mainly focus on single-class polyp detection and lack the spatial, temporal, and linguistic annotations required to evaluate modern multimodal large language models (MLLMs).


Section 03

Core Contributions

The research team proposes the Colon-Bench benchmark dataset, generated using an innovative multi-stage agent workflow:

  • Temporal Proposal Generation: identify candidate lesion segments along the video timeline
  • Bounding Box Tracking: propagate lesion positions across the frames of each segment
  • AI Visual Confirmation: automatically verify annotation quality
  • Human-AI Collaborative Review: final check by clinical experts
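The four stages above can be sketched as a toy pipeline. This is a minimal illustration of the control flow only; every function body is a placeholder (the paper's actual proposal, tracking, and confirmation models are not reproduced here):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)

@dataclass
class Segment:
    start: int                                   # first frame of the proposal
    end: int                                     # last frame (inclusive)
    boxes: Dict[int, Box] = field(default_factory=dict)
    confirmed: bool = False

def propose_segments(lesion_flags: List[bool]) -> List[Segment]:
    """Stage 1: group consecutive lesion-positive frames into temporal proposals."""
    segments, start = [], None
    for i, flag in enumerate(lesion_flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append(Segment(start, i - 1))
            start = None
    if start is not None:
        segments.append(Segment(start, len(lesion_flags) - 1))
    return segments

def track_boxes(seg: Segment, init_box: Box) -> None:
    """Stage 2: propagate a box across the segment (a real tracker would
    update the box per frame; this placeholder just copies it)."""
    for f in range(seg.start, seg.end + 1):
        seg.boxes[f] = init_box

def ai_confirm(seg: Segment, min_len: int = 3) -> None:
    """Stage 3: automatic quality check, e.g. discard very short segments."""
    seg.confirmed = (seg.end - seg.start + 1) >= min_len

def human_review(segments: List[Segment]) -> List[Segment]:
    """Stage 4: experts keep only the confirmed segments."""
    return [s for s in segments if s.confirmed]

# Toy video represented as per-frame lesion flags.
flags = [False, True, True, True, False, True, False]
segs = propose_segments(flags)
for s in segs:
    track_boxes(s, (10, 10, 32, 32))
    ai_confirm(s)
kept = human_review(segs)
print([(s.start, s.end) for s in kept])  # [(1, 3)]
```

The 3-frame proposal survives all four stages, while the 1-frame proposal is filtered out at the automatic confirmation step.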

Section 04

Dataset Scale

The scale of Colon-Bench is unprecedented:

  • 528 full-length procedure videos
  • 14 lesion categories (polyps, ulcers, bleeding, etc.)
  • Over 300,000 bounding box annotations
  • 213,000 segmentation masks
  • 133,000 words of clinical descriptions

Section 05

Experimental Findings

The research team evaluated state-of-the-art MLLMs on three tasks: lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and Video Visual Question Answering (VQA). Surprisingly, MLLMs achieved higher localization performance than SAM-3 in the medical domain.
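Localization comparisons like the one above are typically scored with Intersection-over-Union between predicted and ground-truth boxes. A minimal sketch of that metric, assuming boxes in (x, y, w, h) form (the paper's exact evaluation protocol may differ):

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Overlap width/height, clamped to zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 (perfect overlap)
print(box_iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 25/175 ≈ 0.1429
```

Averaging this score over all annotated frames gives a per-model localization number that can be compared across MLLMs and segmentation models.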

In addition, by analyzing VQA error patterns, the team proposed a novel "Colon Skill" prompting strategy, which improved zero-shot MLLM performance by up to 9.7%.
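The article does not reproduce the "Colon Skill" prompt itself, but the general pattern of skill-style prompting is to prepend domain hints to each question. A hypothetical sketch (the hint texts and function below are illustrative inventions, not the paper's prompt):

```python
# Hypothetical domain hints; the paper's actual "Colon Skill" prompt is not public here.
COLON_SKILL_HINTS = [
    "Check whether a structure persists across frames to distinguish polyps from folds.",
    "Bleeding appears as diffuse red regions; ulcers show pale centers with red margins.",
]

def build_colon_skill_prompt(question: str) -> str:
    """Wrap a VQA question with domain-skill hints before sending it to an MLLM."""
    hints = "\n".join(f"- {h}" for h in COLON_SKILL_HINTS)
    return (
        "You are analyzing a colonoscopy video.\n"
        f"Domain skills to apply:\n{hints}\n\n"
        f"Question: {question}"
    )

print(build_colon_skill_prompt("How many distinct lesions appear in this clip?"))
```

The same wrapper is applied uniformly to every VQA item, so any gain over the bare question isolates the effect of the injected domain knowledge.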