# Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

> An open-source toolkit for automatic chest X-ray report generation and multi-dimensional evaluation based on large language models, supporting five complementary evaluation dimensions to ensure AI-generated medical reports are both fluent and clinically accurate.

- 板块: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- 发布时间: 2026-06-03T08:42:07.000Z
- 最近活动: 2026-06-03T09:24:07.739Z
- 热度: 163.3
- 关键词: 医学影像, 放射学报告, 大语言模型, 多模态AI, CheXpert, 胸部X光, 医疗AI评估, Qwen, 临床准确性, NLG评估
- 页面链接: https://www.zingnex.cn/en/forum/thread/ai-cfdd96ba
- Canonical: https://www.zingnex.cn/forum/thread/ai-cfdd96ba
- Markdown 来源: floors_fallback

---

## Introduction / Main Floor: Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

An open-source toolkit for automatic chest X-ray report generation and multi-dimensional evaluation based on large language models, supporting five complementary evaluation dimensions to ensure AI-generated medical reports are both fluent and clinically accurate.

## Original Author and Source

- **Original Author/Maintainer**: jinghanSunn
- **Source Platform**: GitHub
- **Original Title**: LLM-based Radiology Report Generation & Evaluation Toolkit (with Web Demo)
- **Original Link**: https://github.com/jinghanSunn/LLM-based-Radiology-Report-Generation-Evaluation-Toolkit
- **Publication Date**: June 3, 2026

---

## Project Background and Significance

In the field of medical imaging diagnosis, radiologists need to read a large number of X-rays, CT, and MRI images every day and write detailed diagnostic reports. This work is both time-consuming and prone to omissions due to fatigue. In recent years, large language models (LLMs) have made breakthroughs in visual understanding, bringing new possibilities for automated medical imaging report generation.

However, a core problem has always plagued researchers and clinicians: **How to accurately evaluate the quality of AI-generated medical reports?** Traditional natural language generation (NLG) metrics such as BLEU and ROUGE mainly measure surface text similarity, but a report that "reads like" a human-written one may be completely wrong in clinical facts. Conversely, an AI report with different wording from the reference report may accurately describe all key lesions.

This project was born to solve this evaluation problem. It provides a complete toolchain that not only uses multimodal large models to generate chest X-ray diagnostic reports but also introduces **five complementary evaluation dimensions**, allowing researchers to comprehensively and objectively measure the quality of generated reports.

---

## 1. Multi-Dimensional Evaluation System

The core innovation of the toolkit lies in its multi-dimensional evaluation framework, covering comprehensive assessments from surface text quality to deep clinical accuracy:

| Evaluation Dimension | Measured Content | Implementation Method |
|---------|---------|---------|
| **NLG Quality** | Surface similarity between text and reference report | BLEU-1/2/3/4, ROUGE-L, METEOR, BERTScore |
| **Clinical Accuracy (Model-Driven)** | Correctness of 14 CheXpert pathological labels | CheXbert extracts labels → calculate AUC/F1/Recall/Specificity |
| **Clinical Accuracy (LLM as Annotator)** | Same as above, but using LLM instead of CheXbert to extract labels | Pure API-driven, no local model files required |
| **Radiological Semantics** | Entity/relationship overlap and clinically weighted term similarity | RadGraph F1, RaTEScore |
| **LLM as Judge** | Clinical quality score (1-10) per case and omission/hallucination detection | Human-judgment-style scoring based on 4 clinical dimensions |

This combination of **NLG ⨯ Clinical ⨯ Semantics ⨯ Judgment** allows researchers to distinguish between reports that "read well" and those that are "clinically correct"—these two are often not the same.

## 2. Multi-Model Support

The toolkit natively supports multimodal models from the Qwen series:

- **Qwen2.5-VL-7B**: Lightweight visual-language model, suitable for rapid prototyping
- **Qwen3-VL-8B**: Next-generation visual-language model with better performance
- **Qwen3.5-27B**: Large-parameter model supporting thinking mode, suitable for high-quality report generation

Notably, the evaluation pipeline is **model-agnostic**—any LLM that generates JSON reports in the required format can be evaluated.

## 3. Gradio Interactive Web Demo

The project provides a Gradio-based web interface supporting:

- **Single-image report generation**: Upload a chest X-ray to generate a diagnostic report instantly
- **Dual-mode inference**: Supports API mode (OpenAI-compatible interface) and local mode (HuggingFace model)
- **Web search RAG**: Optional web retrieval enhancement; the model automatically obtains evidence from authoritative medical websites (Radiopaedia, PubMed, Mayo Clinic, etc.) to support diagnosis
- **Knowledge graph visualization**: Automatically generates Mermaid diagrams showing the association between detected lesions and retrieved evidence
- **LLM-as-Judge**: Let another LLM score the generated report on four clinical quality dimensions (1-10) and point out omitted or hallucinated lesions

---

## Experimental Results and Key Findings

The project conducted a systematic evaluation on the MIMIC-CXR test set, comparing the performance of different models. Below are some key findings:

## Evaluation Results with CheXbert as Annotator

Taking Qwen3.5-27B and Qwen3-VL-8B as examples, the average performance on 14 pathological labels:

| Model | Average AUC | Average F1 | Average Recall | Average Specificity |
|-----|--------|--------|-----------|-----------|
| Qwen3.5-27B | 0.5918 | 0.2931 | 0.3252 | 0.8585 |
| Qwen3-VL-8B | 0.5301 | 0.1854 | 0.2477 | 0.8125 |
