# Colon-Bench: A Large-Scale Colonoscopy Video Lesion Annotation Benchmark Dataset Based on Agent Workflow

> The research team has released Colon-Bench, the largest colonoscopy video dataset to date. A multi-stage agent workflow produces scalable, dense annotations of full-procedure videos, providing a benchmark for evaluating multimodal large language models (MLLMs) on medical video understanding.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-26T16:58:43.000Z
- Last activity: 2026-03-27T04:50:18.846Z
- Popularity: 128.1
- Keywords: medical AI, colonoscopy, multimodal large language models, video understanding, dataset, agent workflow, lesion detection
- Page link: https://www.zingnex.cn/en/forum/thread/colon-bench
- Canonical: https://www.zingnex.cn/forum/thread/colon-bench
- Markdown source: floors_fallback

---

## Research Background

Early screening is crucial for preventing colorectal cancer, and colonoscopy is the primary screening method. However, developing robust AI systems for it faces a significant obstacle: the lack of densely annotated long-sequence video datasets. Existing datasets focus mainly on single-class polyp detection and lack the spatial, temporal, and linguistic annotations needed to evaluate modern multimodal large language models (MLLMs).

## Core Contributions

The research team proposes the **Colon-Bench** benchmark dataset, generated with a multi-stage agent workflow:

- **Temporal Proposal Generation**: Identify potential lesion segments
- **Bounding Box Tracking**: Track lesion positions across frames
- **AI Visual Confirmation**: Automatically verify annotation quality
- **Human-AI Collaborative Review**: Final check by experts
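
To make the ordering of the four stages concrete, here is a minimal sketch of how such a workflow could be chained. It is not the authors' implementation: the `Proposal` class, the `annotate_video` function, and the four callables are hypothetical names introduced for illustration, and the toy stand-ins at the bottom replace what would be real models and expert reviewers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Proposal:
    """A candidate lesion segment; every field name here is illustrative."""
    start_frame: int
    end_frame: int
    label: str
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box
    ai_confirmed: bool = False
    expert_approved: bool = False


def annotate_video(
    num_frames: int,
    propose: Callable[[int], List[Proposal]],
    track: Callable[[Proposal], Dict[int, Box]],
    confirm: Callable[[Proposal], bool],
    review: Callable[[Proposal], bool],
) -> List[Proposal]:
    """Run the four stages in order, dropping proposals that fail a check."""
    accepted = []
    for proposal in propose(num_frames):              # 1. temporal proposals
        proposal.boxes = track(proposal)              # 2. bounding box tracking
        proposal.ai_confirmed = confirm(proposal)     # 3. AI visual confirmation
        if not proposal.ai_confirmed:
            continue
        proposal.expert_approved = review(proposal)   # 4. human-AI review
        if proposal.expert_approved:
            accepted.append(proposal)
    return accepted


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_propose(num_frames):
    return [Proposal(10, 12, "polyp"), Proposal(40, 41, "ulcer")]


def toy_track(p):
    return {f: (50, 60, 120, 140) for f in range(p.start_frame, p.end_frame + 1)}


def toy_confirm(p):
    return p.label == "polyp"  # pretend the verifier rejects the second proposal


def toy_review(p):
    return True


result = annotate_video(100, toy_propose, toy_track, toy_confirm, toy_review)
print([(p.label, sorted(p.boxes)) for p in result])
```

Placing the automated confirmation before the expert review means that, in this sketch, reviewers only see proposals that already passed the AI check, which is one plausible way such a pipeline keeps human effort manageable.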

## Dataset Scale

Colon-Bench is reported at the following scale:
- 528 full-procedure videos
- 14 lesion categories (polyps, ulcers, bleeding, etc.)
- Over 300,000 bounding box annotations
- 213,000 segmentation masks
- 133,000 words of clinical descriptions
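
For a rough sense of annotation density, the totals above can be divided by the number of videos. The snippet below is plain arithmetic on the figures in this post; because the bounding box count is given as "over 300,000", that average is a lower bound, and real per-video counts will vary with video length.

```python
NUM_VIDEOS = 528

totals = {
    "bounding boxes": 300_000,      # "over 300,000", so treat as a lower bound
    "segmentation masks": 213_000,
    "description words": 133_000,
}

# Average number of each annotation type per video.
per_video = {name: count / NUM_VIDEOS for name, count in totals.items()}

for name, avg in per_video.items():
    print(f"{name}: ~{avg:.0f} per video")
```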

## Experimental Findings

The research team evaluated state-of-the-art MLLMs on three tasks: lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and Video Visual Question Answering (VQA). Surprisingly, the MLLMs localized lesions more accurately than SAM-3 in this medical domain.
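
This post does not say which localization metric the paper uses. As an assumption, the sketch below shows intersection over union (IoU) between predicted and ground-truth boxes, a common way to score localization, averaged over the frames of a clip. The function names and the sample boxes are illustrative only.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average per-frame IoU over a clip; both sequences must align by frame."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Made-up boxes for two frames: a half-overlapping prediction and a perfect one.
targets = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(5, 0, 15, 10), (0, 0, 10, 10)]
print(round(mean_iou(preds, targets), 3))
```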

In addition, by analyzing VQA error patterns, the team proposed a novel "Colon Skill" prompting strategy, which improved zero-shot MLLM performance by up to 9.7%.
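
The post does not describe what a "Colon Skill" prompt contains. The sketch below only illustrates the general idea of turning observed error patterns into prompt guidance: it counts error categories from a log and prepends hints for the most frequent ones to a zero-shot question. The category names, the hint texts, and `build_skill_prompt` are all hypothetical placeholders, not content from the paper.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical error categories mapped to placeholder guidance text.
SKILL_HINTS: Dict[str, str] = {
    "temporal": "Consider how the scene changes across frames before answering.",
    "localization": "State where in the frame the finding appears.",
    "category": "Compare the finding against each candidate lesion type.",
}


def build_skill_prompt(question: str, error_log: List[str], top_k: int = 2) -> str:
    """Prepend hints for the top_k most frequent error categories."""
    counts = Counter(error_log)
    hints = [
        SKILL_HINTS[category]
        for category, _ in counts.most_common(top_k)
        if category in SKILL_HINTS
    ]
    lines = ["Guidance:"] + [f"- {hint}" for hint in hints]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Made-up error log: which kind of mistake each failed VQA answer showed.
error_log = ["temporal", "category", "temporal", "localization", "category", "temporal"]
prompt = build_skill_prompt("What type of lesion is visible in this clip?", error_log)
print(prompt)
```

Because the guidance is added to the prompt rather than learned, a strategy of this kind needs no fine-tuning, which fits the zero-shot setting the post describes.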

## Resource Links

- Dataset and Code: https://abdullahamdi.com/colon-bench
- Paper: http://arxiv.org/abs/2603.25645v1