In medical imaging diagnosis, radiologists read a large number of X-ray, CT, and MRI images every day and write detailed diagnostic reports. The work is time-consuming, and fatigue makes omissions more likely. In recent years, large language models (LLMs) have made breakthroughs in visual understanding, opening new possibilities for automated medical imaging report generation.
However, one core problem has long troubled researchers and clinicians: how can the quality of AI-generated medical reports be evaluated accurately? Traditional natural language generation (NLG) metrics such as BLEU and ROUGE mainly measure surface text similarity, so a report that "reads like" a human-written one can score well while being completely wrong on the clinical facts. Conversely, an AI report whose wording differs from the reference report can score poorly even though it accurately describes every key lesion.
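The gap between surface similarity and clinical correctness can be seen in a small sketch. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1, not the official BLEU or ROUGE implementation, and the three report sentences are invented for illustration only.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (ROUGE-1-like, illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most as often
    # as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, made up for this example.
reference = "no pleural effusion or pneumothorax is seen"
contradictory = "large pleural effusion and pneumothorax is seen"  # opposite finding
paraphrase = "pleural spaces are clear without fluid or air"       # same finding

score_contradictory = unigram_f1(contradictory, reference)
score_paraphrase = unigram_f1(paraphrase, reference)

print(f"contradictory: {score_contradictory:.2f}")  # 0.71
print(f"paraphrase:    {score_paraphrase:.2f}")     # 0.27
```

The clinically opposite sentence scores far higher than the clinically equivalent paraphrase, which is the failure mode that motivates evaluating reports on more than text overlap.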
This project addresses that evaluation problem. It provides a complete toolchain that uses multimodal large models to generate chest X-ray diagnostic reports and evaluates them along five complementary dimensions, so that researchers can measure the quality of generated reports comprehensively and objectively.