Reading

Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

An open-source toolkit for automatic chest X-ray report generation and multi-dimensional evaluation based on large language models, supporting five complementary evaluation dimensions to ensure AI-generated medical reports are both fluent and clinically accurate.

医学影像放射学报告大语言模型多模态AICheXpert胸部X光医疗AI评估Qwen临床准确性NLG评估

Published 2026-06-03 16:42Recent activity 2026-06-03 17:24Estimated read 8 min

Section 01

Introduction / Main Floor: Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

Section 02

Original Author and Source

Original Author/Maintainer: jinghanSunn
Source Platform: GitHub
Original Title: LLM-based Radiology Report Generation & Evaluation Toolkit (with Web Demo)
Original Link: https://github.com/jinghanSunn/LLM-based-Radiology-Report-Generation-Evaluation-Toolkit
Publication Date: June 3, 2026

Section 03

Project Background and Significance

In the field of medical imaging diagnosis, radiologists need to read a large number of X-rays, CT, and MRI images every day and write detailed diagnostic reports. This work is both time-consuming and prone to omissions due to fatigue. In recent years, large language models (LLMs) have made breakthroughs in visual understanding, bringing new possibilities for automated medical imaging report generation.

However, a core problem has always plagued researchers and clinicians: How to accurately evaluate the quality of AI-generated medical reports? Traditional natural language generation (NLG) metrics such as BLEU and ROUGE mainly measure surface text similarity, but a report that "reads like" a human-written one may be completely wrong in clinical facts. Conversely, an AI report with different wording from the reference report may accurately describe all key lesions.

This project was born to solve this evaluation problem. It provides a complete toolchain that not only uses multimodal large models to generate chest X-ray diagnostic reports but also introduces five complementary evaluation dimensions, allowing researchers to comprehensively and objectively measure the quality of generated reports.

Section 04

1. Multi-Dimensional Evaluation System

The core innovation of the toolkit lies in its multi-dimensional evaluation framework, covering comprehensive assessments from surface text quality to deep clinical accuracy:

Evaluation Dimension	Measured Content	Implementation Method
NLG Quality	Surface similarity between text and reference report	BLEU-1/2/3/4, ROUGE-L, METEOR, BERTScore
Clinical Accuracy (Model-Driven)	Correctness of 14 CheXpert pathological labels	CheXbert extracts labels → calculate AUC/F1/Recall/Specificity
Clinical Accuracy (LLM as Annotator)	Same as above, but using LLM instead of CheXbert to extract labels	Pure API-driven, no local model files required
Radiological Semantics	Entity/relationship overlap and clinically weighted term similarity	RadGraph F1, RaTEScore
LLM as Judge	Clinical quality score (1-10) per case and omission/hallucination detection	Human-judgment-style scoring based on 4 clinical dimensions

This combination of NLG ⨯ Clinical ⨯ Semantics ⨯ Judgment allows researchers to distinguish between reports that "read well" and those that are "clinically correct"—these two are often not the same.

Section 05

2. Multi-Model Support

The toolkit natively supports multimodal models from the Qwen series:

Qwen2.5-VL-7B: Lightweight visual-language model, suitable for rapid prototyping
Qwen3-VL-8B: Next-generation visual-language model with better performance
Qwen3.5-27B: Large-parameter model supporting thinking mode, suitable for high-quality report generation

Notably, the evaluation pipeline is model-agnostic—any LLM that generates JSON reports in the required format can be evaluated.

Section 06

3. Gradio Interactive Web Demo

The project provides a Gradio-based web interface supporting:

Single-image report generation: Upload a chest X-ray to generate a diagnostic report instantly
Dual-mode inference: Supports API mode (OpenAI-compatible interface) and local mode (HuggingFace model)
Web search RAG: Optional web retrieval enhancement; the model automatically obtains evidence from authoritative medical websites (Radiopaedia, PubMed, Mayo Clinic, etc.) to support diagnosis
Knowledge graph visualization: Automatically generates Mermaid diagrams showing the association between detected lesions and retrieved evidence
LLM-as-Judge: Let another LLM score the generated report on four clinical quality dimensions (1-10) and point out omitted or hallucinated lesions

Section 07

Experimental Results and Key Findings

The project conducted a systematic evaluation on the MIMIC-CXR test set, comparing the performance of different models. Below are some key findings:

Section 08

Evaluation Results with CheXbert as Annotator

Taking Qwen3.5-27B and Qwen3-VL-8B as examples, the average performance on 14 pathological labels:

Model	Average AUC	Average F1	Average Recall	Average Specificity
Qwen3.5-27B	0.5918	0.2931	0.3252	0.8585
Qwen3-VL-8B	0.5301	0.1854	0.2477	0.8125

Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

Introduction / Main Floor: Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

Original Author and Source

Project Background and Significance

1. Multi-Dimensional Evaluation System

2. Multi-Model Support

3. Gradio Interactive Web Demo

Experimental Results and Key Findings

Evaluation Results with CheXbert as Annotator

Continue Reading

SignalCut: An Intelligent Tool for Turning AI Search Visibility Gaps into Video Marketing Campaigns

ExoVision: AI-Driven Exoplanet Detection and Habitability Assessment Platform

Vertica Expert Skills: A One-Stop Guide to Enterprise Database Migration and Optimization

Building an Enterprise-Grade Real-Time MLOps Platform: A Complete Practice from Automated Training to Continuous Deployment