# CheXOne: A Visual-Language Foundation Model for Chest X-Rays with Reasoning Capabilities

> CheXOne is a chest X-ray interpretation model developed by Stanford University. Through explicit reasoning chain generation and GRPO reinforcement learning optimization, its report quality meets or exceeds the level of resident physicians in over 50% of cases.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-04-02T07:16:21.000Z
- 最近活动: 2026-04-02T07:23:01.053Z
- 热度: 150.9
- 关键词: CheXOne, 医学影像, 胸部X光, 视觉语言模型, 推理能力, AI诊断, 放射科, GRPO
- 页面链接: https://www.zingnex.cn/en/forum/thread/chexone-x
- Canonical: https://www.zingnex.cn/forum/thread/chexone-x
- Markdown 来源: floors_fallback

---

## CheXOne: Introduction to the Visual-Language Foundation Model for Chest X-Rays with Reasoning Capabilities

CheXOne is a visual-language model for chest X-ray interpretation developed by the AIMI Lab at Stanford University. Its core features include explicit reasoning capabilities and GRPO reinforcement learning optimization. In over 50% of cases, its report quality meets or exceeds the level of resident physicians. It aims to address the shortage of radiologists, enhance the interpretability of AI diagnoses, and provide auxiliary support for medical practice.

## Background of Medical Imaging AI Development

Medical imaging diagnosis is a key part of healthcare. Chest X-rays (CXR) are widely used but their interpretation relies on professional radiologists, and there is a global shortage of such physicians. Existing medical imaging AI models are mostly black-box models; the lack of an explanation process leads to trust issues. General-purpose visual-language models struggle to integrate clinical knowledge for reasoning and generate structured reports in the medical field, which requires high multi-modal understanding and medical knowledge reserves.

## Core Innovations of CheXOne

1. **Explicit Reasoning Capability**: Generates a chain-of-thought reasoning process, such as step-by-step derivation from image observations to diagnostic conclusions, enhancing interpretability; 2. **Multi-task Support**: Covers visual question answering, report generation, and visual localization, adapting to different clinical scenarios; 3. **Report Quality**: In over 50% of cases, report quality reaches the level of resident physicians, with clinical practical value.

## Technical Architecture and Training Methods of CheXOne

Post-trained on the Qwen2.5VL-3B-Instruct model, it has two stages: 1. **Supervised Fine-tuning (SFT)**: Uses the CheXInstruct-v2 and CheXReason datasets to learn converting visual information into structured medical language and generating reasoning chains; 2. **GRPO Reinforcement Learning**: Preprocesses and filters samples with low variance to select information-rich ones, optimizing reasoning reliability and robustness.

## Dual-Mode Reasoning Design of CheXOne

Provides two modes: 1. **Reasoning Mode**: Generates a complete reasoning process before giving conclusions, with high performance, suitable for medical education and difficult case discussions; 2. **Instruction Mode**: Directly outputs answers, with fast speed, suitable for emergency screening and large-scale physical examinations. Flexible switching adapts to different clinical workflows.

## Clinical Application Prospects and Limitations of CheXOne

**Application Prospects**: Radiology assistance (prioritizing urgent cases), medical education (demonstrating interpretation approaches), basic screening in resource-poor areas. **Limitations**: Training data has population bias, only supports chest X-rays, and does not integrate multi-modal clinical data. **Future Directions**: Expand to more imaging modalities, integrate electronic medical records, develop disease-specific versions, and improve clinical validation.

## Open-Source Ecosystem and Technical Implementation Details of CheXOne

**Open-Source Ecosystem**: Provides a complete codebase (reproduction methods, data scripts, training/inference code, user research scripts, etc.) to facilitate academic verification and industrial applications. **Technical Details**: Supports vLLM/SGLang/LMDeploy inference frameworks and DeepSpeed distributed training; the visual encoder supports variable token counts, and Flash Attention 2 is recommended for accelerating inference.
