CheXOne: A Visual-Language Foundation Model for Chest X-Rays with Reasoning Capabilities

CheXOne is a chest X-ray interpretation model developed by Stanford University. Through explicit reasoning chain generation and GRPO reinforcement learning optimization, its report quality meets or exceeds the level of resident physicians in over 50% of cases.

Tags: CheXOne · Medical Imaging · Chest X-ray · Vision-Language Model · Reasoning · AI Diagnosis · Radiology · GRPO
Published 2026-04-02 15:16 · Recent activity 2026-04-02 15:23 · Estimated read: 6 min
Section 01

CheXOne: Introduction to the Visual-Language Foundation Model for Chest X-Rays with Reasoning Capabilities

CheXOne is a visual-language model for chest X-ray interpretation developed by the AIMI Lab at Stanford University. Its core features are explicit reasoning capability and GRPO reinforcement-learning optimization. In over 50% of cases, its report quality meets or exceeds that of resident physicians. It aims to address the shortage of radiologists, improve the interpretability of AI diagnoses, and serve as an assistive tool in medical practice.

Section 02

Background of Medical Imaging AI Development

Medical imaging diagnosis is a key part of healthcare. Chest X-rays (CXR) are widely used, but their interpretation relies on professional radiologists, who are in short supply globally. Most existing medical-imaging AI models are black boxes; the absence of an explanation process creates trust issues. General-purpose visual-language models struggle to integrate clinical knowledge for reasoning and to generate structured reports, since the medical domain demands both strong multi-modal understanding and deep medical knowledge.

Section 03

Core Innovations of CheXOne

1. Explicit reasoning capability: generates a chain-of-thought that derives step by step from image observations to diagnostic conclusions, improving interpretability.
2. Multi-task support: covers visual question answering, report generation, and visual localization, adapting to different clinical scenarios.
3. Report quality: in over 50% of cases, report quality reaches the level of resident physicians, giving the model practical clinical value.
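The explicit reasoning described above implies that the model's raw output can be split into a visible chain-of-thought and a final conclusion. A minimal sketch of such post-processing is below; the `<think>…</think>` delimiters and the sample output are assumptions for illustration (a common convention in reasoning models), not CheXOne's documented format.

```python
import re

# Hypothetical raw output from a reasoning-mode generation; the
# <think> tags are an assumed delimiter, not confirmed by the paper.
raw = (
    "<think>Step 1: The cardiac silhouette is enlarged. "
    "Step 2: No focal consolidation is seen.</think>"
    "Impression: Cardiomegaly without acute disease."
)

def split_reasoning(output: str):
    """Separate the chain-of-thought from the final answer."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.S)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", output.strip()  # no reasoning block present

reasoning, answer = split_reasoning(raw)
```

Keeping the two parts separate lets a clinical interface show only the impression by default while exposing the reasoning on demand.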

Section 04

Technical Architecture and Training Methods of CheXOne

CheXOne is post-trained from the Qwen2.5VL-3B-Instruct model in two stages:
1. Supervised fine-tuning (SFT): uses the CheXInstruct-v2 and CheXReason datasets to learn to convert visual information into structured medical language and to generate reasoning chains.
2. GRPO reinforcement learning: filters out low-variance samples during preprocessing to retain information-rich ones, optimizing the reliability and robustness of the reasoning.
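The two GRPO ingredients named above, group-relative advantages and low-variance sample filtering, can be sketched as follows. This is a minimal illustration, not the authors' training code; the function names, the variance threshold, and the interpretation of "low variance" as reward variance within a sampling group are assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sampled response's
    reward by the mean and std of its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def keep_sample(rewards, min_std=0.05):
    """Filter step: a prompt whose sampled responses all earn
    near-identical rewards yields near-zero advantages and
    carries almost no learning signal, so it is discarded."""
    return pstdev(rewards) >= min_std

# One prompt, four sampled reports scored by a reward function
rewards = [0.9, 0.4, 0.6, 0.1]
if keep_sample(rewards):
    advs = group_advantages(rewards)
```

Because advantages are normalized within each group, they sum to roughly zero: better-than-average responses are reinforced and worse ones suppressed, without a separate value network.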

Section 05

Dual-Mode Reasoning Design of CheXOne

CheXOne provides two modes:
1. Reasoning mode: generates a complete reasoning process before giving its conclusion; higher performance, suited to medical education and difficult-case discussion.
2. Instruction mode: outputs the answer directly; faster, suited to emergency screening and large-scale health checks.
Flexible switching between the two adapts the model to different clinical workflows.
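One way such mode switching is typically wired up is by swapping the system prompt and generation budget per request. The sketch below is hypothetical: the prompt wording, the dispatcher, and the token limits are invented for illustration and are not CheXOne's actual interface.

```python
# Hypothetical dispatcher between the two modes; prompts and
# limits are assumptions, not CheXOne's documented API.
SYSTEM_PROMPTS = {
    "reasoning": "Think step by step, then give your conclusion.",
    "instruct": "Answer directly and concisely.",
}

def build_request(question: str, mode: str = "reasoning") -> dict:
    if mode not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "system": SYSTEM_PROMPTS[mode],
        "user": question,
        # Emergency screening favors short, fast generations.
        "max_new_tokens": 1024 if mode == "reasoning" else 128,
    }

req = build_request("Is there a pleural effusion?", mode="instruct")
```

The same underlying model serves both workflows; only the request wrapper changes, which is what makes per-case switching cheap in a clinical pipeline.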

Section 06

Clinical Application Prospects and Limitations of CheXOne

Application prospects: radiology assistance (triaging urgent cases first), medical education (demonstrating interpretation approaches), and basic screening in resource-poor regions. Limitations: the training data carries population bias, only chest X-rays are supported, and multi-modal clinical data is not integrated. Future directions: expand to more imaging modalities, integrate electronic medical records, develop disease-specific versions, and strengthen clinical validation.

Section 07

Open-Source Ecosystem and Technical Implementation Details of CheXOne

Open-source ecosystem: a complete codebase (reproduction instructions, data scripts, training/inference code, user-study scripts, etc.) facilitates academic verification and industrial application. Technical details: inference is supported through the vLLM, SGLang, and LMDeploy frameworks, and distributed training through DeepSpeed; the visual encoder supports variable visual token counts, and Flash Attention 2 is recommended to accelerate inference.
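"Variable token counts" here means the number of visual tokens scales with image resolution. In Qwen2.5-VL-style encoders, 14-pixel patches are merged 2×2 before reaching the language model, so each token covers roughly a 28×28-pixel block. A back-of-the-envelope estimate (an approximation: the real preprocessor also resizes images to patch-aligned dimensions and caps the token budget):

```python
import math

def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Approximate visual token count for a Qwen2.5-VL-style
    encoder: 14-px patches, merged 2x2 before the LLM.
    Image dimensions are rounded up to whole patches."""
    grid_h = math.ceil(height / patch)
    grid_w = math.ceil(width / patch)
    return (grid_h // merge) * (grid_w // merge)

# A 512x512 chest X-ray vs. a 1024x1024 one
small = visual_token_count(512, 512)    # 324 tokens
large = visual_token_count(1024, 1024)  # 1369 tokens
```

Doubling the resolution roughly quadruples the visual token count, which is why higher-resolution radiographs cost noticeably more compute and memory at inference time.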