Zing Forum

ScanFormer: A Multimodal Medical Image Report Generation Model Fine-Tuned with LoRA

ScanFormer, an undergraduate project from IIT Gandhinagar, combines the LLaVA-Med vision-language architecture with LoRA parameter-efficient fine-tuning. Trained on 224,316 chest X-ray images, it generates radiology reports automatically while preventing catastrophic forgetting via Elastic Weight Consolidation (EWC).

Tags: ScanFormer · Medical Imaging · Radiology Reports · LoRA · LLaVA-Med · CheXpert · Multimodal Models · Catastrophic Forgetting · EWC · Vision-Language Models
Published 2026-04-01 12:00 · Recent activity 2026-04-01 12:21 · Estimated read 7 min

Section 01

Introduction to ScanFormer: A Multimodal Medical Image Report Generation Model Fine-Tuned with LoRA

ScanFormer is an independent research project by Divya Rahul Shah, an undergraduate at the Indian Institute of Technology Gandhinagar (IIT Gandhinagar). It combines modern multimodal large language model technology with parameter-efficient fine-tuning to build a practical medical image report generation system. Built on the LLaVA-Med vision-language architecture, the model uses LoRA (Low-Rank Adaptation) fine-tuning, training only about 2% of parameters, together with Elastic Weight Consolidation (EWC) to prevent catastrophic forgetting. Trained on the CheXpert dataset (224,316 chest X-ray images), it generates radiology reports automatically. Key results: a BLEU-4 report-quality score of 38.4, clinical factuality of 89.7%, general language ability retention of 96.2%, and a hallucination rate of 4.1%.


Section 02

Project Background and Core Challenges

Automated medical image analysis is urgently needed, yet professional radiologists are in short supply. The core problem ScanFormer addresses is how to specialize a general vision-language model (VLM) for medical imaging while avoiding "catastrophic forgetting": the tendency of a model to lose previously learned knowledge when trained on a new task. When a general VLM is fine-tuned on medical data, it may lose its general visual understanding; conversely, if general ability is preserved too aggressively, it may fail to fully absorb medical domain knowledge.


Section 03

Detailed Technical Architecture

ScanFormer is built based on LLaVA-Med (a medical-adapted version of LLaVA), integrating the following key technologies:

  1. LoRA Fine-Tuning: freeze the pre-trained weights and introduce trainable low-rank matrices (rank 16, alpha 32), so that only about 2% of parameters are trained, achieving parameter-efficient adaptation to medical tasks;
  2. Elastic Weight Consolidation (EWC): identify parameters important to the original task and penalize changes to them, preventing the model from forgetting its general language abilities;
  3. Visual Grounding Checker: monitor the visual attention distribution as the model generates each report and flag potential hallucinations where the generated description does not match the attended image regions, reducing the risk of misdiagnosis.
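The first two techniques above can be sketched in a few lines. This is a minimal illustration using the reported rank-16 / alpha-32 settings; the single-layer setup, layer size, and initialization are assumptions for demonstration, not ScanFormer's actual implementation (which would apply LoRA inside LLaVA-Med's projection layers, typically via a library such as PEFT):

```python
import numpy as np

# Sketch of a LoRA update on a single frozen linear layer, using the
# rank-16 / alpha-32 settings reported for ScanFormer. The layer size
# (4096x4096) and initialization here are illustrative assumptions.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 4096, 4096, 16, 32

W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # zero init: adapter starts as a no-op

scaling = alpha / rank                        # = 2.0
W_effective = W + scaling * (B @ A)           # adapted weight; W itself never changes

# Per-layer trainable fraction: rank * (d_in + d_out) vs. d_in * d_out
trainable = rank * (d_in + d_out)
total = d_in * d_out
print(f"trainable fraction for this layer: {trainable / total:.2%}")

# EWC regularizer: quadratic penalty on drifting away from the old
# parameters theta_star, weighted by a (diagonal) estimate of the
# Fisher information, so parameters important to the old task move least.
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    return lam / 2 * np.sum(fisher * (theta - theta_star) ** 2)
```

Because B starts at zero, the adapted model is identical to the base model before training, and only A and B (a small fraction of the layer's parameters) receive gradients.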

Section 04

Dataset and Training Objectives

The model is trained on the CheXpert chest X-ray dataset released by Stanford University, which contains 224,316 images with multi-label pathology annotations (e.g., lung opacity, pleural effusion). The training objective is to generate structured radiology reports covering pathological sign recognition, natural language description, and a structured output format that facilitates clinical processing and archiving.
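To make the training target concrete, the step from multi-label annotations to a structured report draft can be sketched as follows. The label names follow CheXpert's convention (including its 1 / 0 / -1 present-absent-uncertain coding), but the report schema, phrasing templates, and the `draft_report` helper are hypothetical assumptions, not the project's published format:

```python
# CheXpert-style multi-label annotation for one image:
# 1 = present, 0 = absent, -1 = uncertain (illustrative subset of labels).
labels = {
    "Cardiomegaly": 1,
    "Lung Opacity": 1,
    "Pleural Effusion": 0,
    "Pneumothorax": -1,
}

def draft_report(labels):
    # Hypothetical templates mapping each label state to a findings sentence.
    phrase = {
        1: "{} is present.",
        0: "No evidence of {}.",
        -1: "{} cannot be excluded.",
    }
    findings = [phrase[v].format(k.lower()) for k, v in labels.items()]
    positives = [k.lower() for k, v in labels.items() if v == 1]
    impression = (
        "Findings suggestive of " + ", ".join(positives) + "."
        if positives
        else "No acute cardiopulmonary abnormality."
    )
    # Structured output: separate findings and impression for archiving.
    return {"findings": findings, "impression": impression}

print(draft_report(labels))
```

The actual model generates such reports end-to-end from pixels rather than from gold labels; this sketch only shows what "structured" means for downstream clinical processing.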


Section 05

Performance Evaluation Results

ScanFormer performs well across a multi-dimensional evaluation:

  • Report Quality: BLEU-4 of 38.4, indicating high n-gram overlap with radiologist-written reports;
  • Clinical Factuality: 89.7%, showing strong consistency between generated content and the actual images;
  • General Language Retention: 96.2%, confirming that EWC effectively prevents forgetting;
  • Hallucination Rate: 4.1%, a good level for the medical domain, which the grounding checker can further reduce.
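For readers unfamiliar with the headline metric, BLEU-4 measures clipped 1- to 4-gram precision against a reference, scaled by a brevity penalty. Below is a minimal single-reference implementation to make the score concrete; reported results would normally come from standard tooling (e.g., sacrebleu or NLTK), and the example sentences are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, 5):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / 4)  # uniform 1/4 weights

ref = "no evidence of pleural effusion or pneumothorax is seen"
hyp = "no evidence of pleural effusion or pneumothorax"
print(round(100 * bleu4(ref, hyp), 1))
```

Note that n-gram overlap alone does not guarantee clinical correctness, which is why the evaluation pairs BLEU-4 with a separate clinical-factuality measure.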

Section 06

Application Scenarios and Value

The application value of ScanFormer is reflected in:

  1. Auxiliary Diagnosis: Automatically screen abnormal images, generate report drafts, and mark omissions by comparing with manual reports;
  2. Medical Resource Balance: Provide basic image interpretation capabilities for areas with a shortage of radiologists;
  3. Teaching and Research: Generate structured reports as teaching materials for medical students, and help build large-scale medical image-text datasets.

Section 07

Limitations and Future Improvement Directions

As an undergraduate project, ScanFormer has limitations: it only supports chest X-rays (single modality), the CheXpert dataset is biased towards the U.S. population (generalization needs to be verified), and it requires regulatory approval for clinical deployment. Future improvement directions include: expanding to multi-modalities such as CT/MRI, improving generalization ability through larger-scale training, optimizing human-computer collaboration interfaces, and adding prediction confidence estimation.