Zing Forum

ScanFormer: A Multimodal Medical Image Report Generation Model Fine-Tuned with LoRA

ScanFormer, an undergraduate project from IIT Gandhinagar, combines the LLaVA-Med vision-language architecture with LoRA parameter-efficient fine-tuning. Trained on 224,316 chest X-ray images, it generates radiology reports automatically while preventing catastrophic forgetting via Elastic Weight Consolidation (EWC).

Tags: ScanFormer · Medical Imaging · Radiology Reports · LoRA · LLaVA-Med · CheXpert · Multimodal Models · Catastrophic Forgetting · EWC · Vision-Language Models
Published 2026-04-01 12:00 · Recent activity 2026-04-01 12:21 · Estimated read 7 min

Section 01

Introduction to ScanFormer: A Multimodal Medical Image Report Generation Model Fine-Tuned with LoRA

ScanFormer is an independent research project by Divya Rahul Shah, an undergraduate at the Indian Institute of Technology Gandhinagar (IIT Gandhinagar). It combines modern multimodal large language model technology with parameter-efficient fine-tuning to build a practical medical image report generation system. Built on the LLaVA-Med vision-language architecture, the model uses LoRA (Low-Rank Adaptation) fine-tuning, training only about 2% of parameters, together with Elastic Weight Consolidation (EWC) to prevent catastrophic forgetting. Trained on the CheXpert dataset (224,316 chest X-ray images), it generates radiology reports automatically. Key results: a BLEU-4 report-quality score of 38.4, clinical factuality of 89.7%, general language ability retention of 96.2%, and a hallucination rate of 4.1%.


Section 02

Project Background and Core Challenges

Automated medical image analysis is urgently needed, yet professional radiologists are in short supply. The core problem ScanFormer addresses is how to specialize a general vision-language model (VLM) for medical imaging while avoiding "catastrophic forgetting": the tendency of a model to lose previously learned knowledge when trained on a new task. When a general VLM is fine-tuned on medical data, it may lose its general visual understanding; conversely, if general ability is preserved too aggressively, it may fail to fully absorb medical domain knowledge.


Section 03

Detailed Technical Architecture

ScanFormer is built based on LLaVA-Med (a medical-adapted version of LLaVA), integrating the following key technologies:

  1. LoRA Fine-Tuning: freeze the pre-trained weights and introduce trainable low-rank matrices (rank 16, alpha 32), so that only about 2% of parameters are trained, achieving parameter-efficient adaptation to medical tasks;
  2. Elastic Weight Consolidation (EWC): identify parameters important to the original task and penalize changes to them, preventing the model from forgetting its general language abilities;
  3. Visual Grounding Checker: monitor the visual attention distribution as the model generates each report and flag potential hallucinations where the generated description does not match the attended image regions, reducing the risk of misdiagnosis.
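The first two techniques above can be sketched in a few lines. This is a minimal illustration using the reported rank-16 / alpha-32 settings; the single-layer setup, layer size, and initialization are assumptions for demonstration, not ScanFormer's actual implementation (which would apply LoRA inside LLaVA-Med's projection layers, typically via a library such as PEFT):

```python
import numpy as np

# Sketch of a LoRA update on a single frozen linear layer, using the
# rank-16 / alpha-32 settings reported for ScanFormer. The layer size
# (4096x4096) and initialization here are illustrative assumptions.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 4096, 4096, 16, 32

W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # zero init: adapter starts as a no-op

scaling = alpha / rank                        # = 2.0
W_effective = W + scaling * (B @ A)           # adapted weight; W itself never changes

# Per-layer trainable fraction: rank * (d_in + d_out) vs. d_in * d_out
trainable = rank * (d_in + d_out)
total = d_in * d_out
print(f"trainable fraction for this layer: {trainable / total:.2%}")

# EWC regularizer: quadratic penalty on drifting away from the old
# parameters theta_star, weighted by a (diagonal) estimate of the
# Fisher information, so parameters important to the old task move least.
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    return lam / 2 * np.sum(fisher * (theta - theta_star) ** 2)
```

Because B starts at zero, the adapted model is identical to the base model before training, and only A and B (a small fraction of the layer's parameters) receive gradients.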

Section 04

Dataset and Training Objectives

The model is trained on the CheXpert chest X-ray dataset released by Stanford University, which contains 224,316 images with multi-label pathology annotations (e.g., lung opacity, pleural effusion). The training objective is to generate structured radiology reports covering pathological sign recognition, natural language description, and a structured output format that facilitates clinical processing and archiving.
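To make the training target concrete, the step from multi-label annotations to a structured report draft can be sketched as follows. The label names follow CheXpert's convention (including its 1 / 0 / -1 present-absent-uncertain coding), but the report schema, phrasing templates, and the `draft_report` helper are hypothetical assumptions, not the project's published format:

```python
# CheXpert-style multi-label annotation for one image:
# 1 = present, 0 = absent, -1 = uncertain (illustrative subset of labels).
labels = {
    "Cardiomegaly": 1,
    "Lung Opacity": 1,
    "Pleural Effusion": 0,
    "Pneumothorax": -1,
}

def draft_report(labels):
    # Hypothetical templates mapping each label state to a findings sentence.
    phrase = {
        1: "{} is present.",
        0: "No evidence of {}.",
        -1: "{} cannot be excluded.",
    }
    findings = [phrase[v].format(k.lower()) for k, v in labels.items()]
    positives = [k.lower() for k, v in labels.items() if v == 1]
    impression = (
        "Findings suggestive of " + ", ".join(positives) + "."
        if positives
        else "No acute cardiopulmonary abnormality."
    )
    # Structured output: separate findings and impression for archiving.
    return {"findings": findings, "impression": impression}

print(draft_report(labels))
```

The actual model generates such reports end-to-end from pixels rather than from gold labels; this sketch only shows what "structured" means for downstream clinical processing.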


Section 05

Performance Evaluation Results

ScanFormer performs well across a multi-dimensional evaluation:

  • Report Quality: BLEU-4 of 38.4, indicating high n-gram overlap with radiologist-written reports;
  • Clinical Factuality: 89.7%, showing strong consistency between generated content and the actual images;
  • General Language Retention: 96.2%, confirming that EWC effectively prevents forgetting;
  • Hallucination Rate: 4.1%, a good level for the medical domain, which the grounding checker can further reduce.
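For readers unfamiliar with the headline metric, BLEU-4 measures clipped 1- to 4-gram precision against a reference, scaled by a brevity penalty. Below is a minimal single-reference implementation to make the score concrete; reported results would normally come from standard tooling (e.g., sacrebleu or NLTK), and the example sentences are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, 5):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / 4)  # uniform 1/4 weights

ref = "no evidence of pleural effusion or pneumothorax is seen"
hyp = "no evidence of pleural effusion or pneumothorax"
print(round(100 * bleu4(ref, hyp), 1))
```

Note that n-gram overlap alone does not guarantee clinical correctness, which is why the evaluation pairs BLEU-4 with a separate clinical-factuality measure.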

Section 06

Application Scenarios and Value

The application value of ScanFormer is reflected in:

  1. Auxiliary Diagnosis: Automatically screen abnormal images, generate report drafts, and mark omissions by comparing with manual reports;
  2. Medical Resource Balance: Provide basic image interpretation capabilities for areas with a shortage of radiologists;
  3. Teaching and Research: Generate structured reports as teaching materials for medical students, and help build large-scale medical image-text datasets.

Section 07

Limitations and Future Improvement Directions

As an undergraduate project, ScanFormer has limitations: it only supports chest X-rays (single modality), the CheXpert dataset is biased towards the U.S. population (generalization needs to be verified), and it requires regulatory approval for clinical deployment. Future improvement directions include: expanding to multi-modalities such as CT/MRI, improving generalization ability through larger-scale training, optimizing human-computer collaboration interfaces, and adding prediction confidence estimation.