Zing Forum

Medical Imaging Report Generation and Multi-Dimensional Evaluation Toolkit: Making AI Diagnosis More Reliable

An open-source toolkit for automatic chest X-ray report generation and multi-dimensional evaluation based on large language models, supporting five complementary evaluation dimensions to ensure AI-generated medical reports are both fluent and clinically accurate.

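Tags: Medical Imaging | Radiology Report | Large Language Model | Multimodal AI | CheXpert | Chest X-ray | Medical AI Evaluation | Qwen | Clinical Accuracy | NLG Evaluation
Published 2026-06-03 16:42 | Recent activity 2026-06-03 17:24 | Estimated read 8 min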

Section 03

Project Background and Significance

In medical imaging diagnosis, radiologists read a large number of X-ray, CT, and MRI images every day and write a detailed diagnostic report for each. The work is time-consuming, and fatigue makes omissions more likely. In recent years, large language models (LLMs) have made breakthroughs in visual understanding, opening new possibilities for automated medical imaging report generation.

However, one core problem has long troubled researchers and clinicians: how can the quality of AI-generated medical reports be evaluated accurately? Traditional natural language generation (NLG) metrics such as BLEU and ROUGE mainly measure surface text similarity, so a report that "reads like" a human-written one can still be completely wrong on clinical facts. Conversely, an AI report worded differently from the reference report may describe every key lesion accurately.

This project was born to solve this evaluation problem. It provides a complete toolchain that not only uses multimodal large models to generate chest X-ray diagnostic reports but also introduces five complementary evaluation dimensions, allowing researchers to comprehensively and objectively measure the quality of generated reports.
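
The gap between surface similarity and clinical correctness can be shown with a toy calculation. The sketch below uses clipped unigram precision, a simplified stand-in for BLEU-1 (no brevity penalty, whitespace tokenization); it is not the toolkit's metric code, and the three sentences are made-up examples rather than data from the project.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: share of candidate tokens also found in the reference."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical sentences, invented for illustration only.
reference = "there is a small left pleural effusion"
clinically_wrong = "there is no left pleural effusion"           # negates the finding
clinically_right = "minimal fluid in the left pleural space"     # same finding, new wording

wrong_score = unigram_precision(clinically_wrong, reference)
right_score = unigram_precision(clinically_right, reference)
print(f"wrong but similar wording:   {wrong_score:.3f}")
print(f"correct but different words: {right_score:.3f}")
```

The sentence that negates the finding shares five of its six tokens with the reference, while the correct paraphrase shares only two of seven, so a pure overlap metric ranks the clinically wrong sentence higher.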


Section 04

1. Multi-Dimensional Evaluation System

The core innovation of the toolkit lies in its multi-dimensional evaluation framework, covering comprehensive assessments from surface text quality to deep clinical accuracy:

| Evaluation Dimension | Measured Content | Implementation Method |
|---|---|---|
| NLG Quality | Surface similarity between text and reference report | BLEU-1/2/3/4, ROUGE-L, METEOR, BERTScore |
| Clinical Accuracy (Model-Driven) | Correctness of 14 CheXpert pathological labels | CheXbert extracts labels → calculate AUC/F1/Recall/Specificity |
| Clinical Accuracy (LLM as Annotator) | Same as above, but using LLM instead of CheXbert to extract labels | Pure API-driven, no local model files required |
| Radiological Semantics | Entity/relationship overlap and clinically weighted term similarity | RadGraph F1, RaTEScore |
| LLM as Judge | Clinical quality score (1-10) per case and omission/hallucination detection | Human-judgment-style scoring based on 4 clinical dimensions |

This combination of NLG × Clinical × Semantics × Judgment lets researchers distinguish reports that "read well" from reports that are "clinically correct"; the two are often not the same.
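
To make the clinical accuracy dimension concrete, here is a minimal sketch of how F1, recall, and specificity could be computed for one pathology label once labels have been extracted from the reference and generated reports. It assumes each label has already been reduced to 0/1 per case (how uncertain or unmentioned findings are mapped is not specified in the source), and the function name and sample vectors are hypothetical, not taken from the toolkit.

```python
def binary_label_metrics(y_true, y_pred):
    """Precision, recall, specificity and F1 for one label across a set of cases."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical labels for a single pathology over six cases:
# reference-report labels vs. labels extracted from generated reports.
reference_labels = [1, 1, 0, 0, 1, 0]
generated_labels = [1, 0, 0, 1, 1, 0]

metrics = binary_label_metrics(reference_labels, generated_labels)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Repeating this for each of the 14 labels and taking the unweighted mean would give macro-averaged scores; whether the toolkit averages this way is an assumption here.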

Section 05

2. Multi-Model Support

The toolkit natively supports multimodal models from the Qwen series:

  • Qwen2.5-VL-7B: Lightweight visual-language model, suitable for rapid prototyping
  • Qwen3-VL-8B: Next-generation visual-language model with better performance
  • Qwen3.5-27B: Large-parameter model supporting thinking mode, suitable for high-quality report generation

Notably, the evaluation pipeline is model-agnostic: any LLM that generates JSON reports in the required format can be evaluated.

Section 06

3. Gradio Interactive Web Demo

The project provides a Gradio-based web interface supporting:

  • Single-image report generation: Upload a chest X-ray to generate a diagnostic report instantly
  • Dual-mode inference: Supports API mode (OpenAI-compatible interface) and local mode (HuggingFace model)
  • Web search RAG: Optional web retrieval enhancement; the model automatically obtains evidence from authoritative medical websites (Radiopaedia, PubMed, Mayo Clinic, etc.) to support diagnosis
  • Knowledge graph visualization: Automatically generates Mermaid diagrams showing the association between detected lesions and retrieved evidence
  • LLM-as-Judge: A second LLM scores the generated report on four clinical quality dimensions (1-10) and flags omitted or hallucinated lesions
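
As an illustration of the knowledge graph feature listed above, the following sketch turns a mapping of detected findings to evidence sources into Mermaid flowchart text. The function name, the input structure, and the sample findings are all hypothetical; the source does not describe how the project builds its diagrams, so treat this as one possible approach rather than the project's implementation.

```python
def build_mermaid(links: dict) -> str:
    """Render {finding: [evidence source, ...]} as a left-to-right Mermaid flowchart."""
    lines = ["graph LR"]
    node_ids = {}

    def declare(label, prefix, open_bracket, close_bracket):
        # Declare each node once, then reuse its id for later edges.
        key = (prefix, label)
        if key not in node_ids:
            node_ids[key] = f"{prefix}{len(node_ids)}"
            lines.append(f'    {node_ids[key]}{open_bracket}"{label}"{close_bracket}')
        return node_ids[key]

    for finding, sources in links.items():
        finding_id = declare(finding, "F", "[", "]")       # findings as rectangles
        for source in sources:
            source_id = declare(source, "E", "(", ")")     # evidence as rounded nodes
            lines.append(f"    {finding_id} --> {source_id}")
    return "\n".join(lines)


# Hypothetical findings and sources, for illustration only.
diagram = build_mermaid({
    "Pleural effusion": ["Radiopaedia", "PubMed"],
    "Cardiomegaly": ["Radiopaedia"],
})
print(diagram)
```

Labels containing double quotes would need escaping before being embedded in node text; the sketch leaves that out for brevity.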

Section 07

Experimental Results and Key Findings

The project conducted a systematic evaluation on the MIMIC-CXR test set, comparing the performance of different models. Below are some key findings:

Section 08

Evaluation Results with CheXbert as Annotator

Taking Qwen3.5-27B and Qwen3-VL-8B as examples, their average performance across the 14 pathological labels is as follows:

| Model | Average AUC | Average F1 | Average Recall | Average Specificity |
|---|---|---|---|---|
| Qwen3.5-27B | 0.5918 | 0.2931 | 0.3252 | 0.8585 |
| Qwen3-VL-8B | 0.5301 | 0.1854 | 0.2477 | 0.8125 |
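
One way to read the AUC column: when a classifier outputs hard 0/1 labels rather than continuous scores, the ROC curve has a single operating point and its area equals (recall + specificity) / 2, that is, balanced accuracy. The short check below applies this identity to the reported averages. That the toolkit computes AUC from binary extracted labels is an inference from the numbers, not something the source states.

```python
# Values copied from the results above: (average AUC, average recall, average specificity).
reported = {
    "Qwen3.5-27B": (0.5918, 0.3252, 0.8585),
    "Qwen3-VL-8B": (0.5301, 0.2477, 0.8125),
}


def balanced_accuracy(recall: float, specificity: float) -> float:
    """Mean of recall (TPR) and specificity (TNR); equals ROC AUC for hard 0/1 predictions."""
    return (recall + specificity) / 2


for model, (auc, recall, specificity) in reported.items():
    implied = balanced_accuracy(recall, specificity)
    print(f"{model}: reported AUC = {auc:.4f}, "
          f"(recall + specificity) / 2 = {implied:.4f}, "
          f"difference = {abs(auc - implied):.4f}")
```

Both rows agree to within rounding, so under this reading the AUC values sit only modestly above the 0.5 chance level, driven by high specificity and low recall.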