# ClinicRealm: Systematic Re-evaluation of Large Language Models in Clinical Prediction Tasks

> A study by a Peking University team, published in npj Digital Medicine, reports that modern large language models (LLMs) can outperform conventional machine learning methods on non-generative clinical prediction tasks, most clearly on unstructured clinical notes and in data-scarce settings, opening new paths for zero-shot medical AI applications.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-25T09:14:21.000Z
- Last activity: 2026-05-25T09:19:02.086Z
- Popularity: 159.9
- Keywords: large language models, clinical prediction, electronic health records, medical AI, machine learning, MIMIC-IV, zero-shot learning, open-source models
- Page link: https://www.zingnex.cn/en/forum/thread/clinicrealm-74fb23ba
- Canonical: https://www.zingnex.cn/forum/thread/clinicrealm-74fb23ba
- Markdown source: floors_fallback

---

## Original Authors and Sources

- **Original Author/Maintainer**: Yinghao Zhu (PKU-AICare Team)
- **Source Platform**: GitHub
- **Original Title**: ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks
- **Original Link**: https://github.com/yhzhu99/ehr-llm-benchmark
- **Paper Publication**: npj Digital Medicine (2026), DOI: 10.1038/s41746-026-02539-z
- **Source Code Update Time**: 2026-05-25

## Research Background and Motivation

As large language models (LLMs) such as ChatGPT and GPT-4 spread through the medical field, attention has centered on their performance in generative tasks (e.g., medical record summarization, medical Q&A). What has been missing is a systematic comparison of LLMs against traditional machine learning and deep learning methods on non-generative clinical prediction tasks, such as in-hospital mortality prediction, readmission risk assessment, and length of stay (LOS) estimation.

Clinical prediction is a core component of precision medicine. Traditional methods rely on structured electronic health record (EHR) data and use models like XGBoost, LSTM, and GRU for prediction. The emergence of LLMs raises new questions: Can they process unstructured clinical notes directly? Can they generalize better in data-scarce scenarios? The answers bear directly on model selection for clinical AI systems.

## ClinicRealm Research Framework

ClinicRealm, built by the AI Medicine Team of Peking University, is a comprehensive benchmark that systematically compares 31 models on two types of data: unstructured clinical notes and structured EHR data.

## Model Lineup

**Large Language Models (15 models)**
- General-purpose LLMs: GPT-4o, GPT-5, DeepSeek-V3, Gemma-3, Qwen2.5
- Medical-fine-tuned LLMs: BioGPT, Meditron, OpenBioLLM, BioMistral
- Reasoning-enhanced LLMs: DeepSeek-R1 (7B/671B), HuatuoGPT-o1-7B, GPT o3-mini-high

**BERT Series Models (5 models)**
- BERT, BioBERT, ClinicalBERT, GatorTron, Clinical-Longformer

**Traditional Machine Learning and Deep Learning Methods (11 models)**
- Classic ML: CatBoost, XGBoost, Random Forest, Decision Tree
- Deep Learning: GRU, LSTM, RNN
- Longitudinal EHR-specific models: AdaCare, ConCare, GRASP, AICare

## Datasets and Tasks

The study is based on two public medical datasets:
- **MIMIC-IV**: Contains structured EHR data and unstructured clinical notes
- **TJH**: Tongji Hospital COVID-19 Dataset (structured EHR)

Evaluation tasks include:
1. In-hospital mortality prediction
2. 30-day readmission prediction
3. Length of Stay (LOS) prediction
4. Medical sentence matching
5. ICD code clustering
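
To make the first three tasks concrete, here is a minimal sketch of how such predictions are commonly scored. The summary above does not list the study's metrics, so the choice of AUROC for the binary tasks (mortality, readmission) and mean absolute error for LOS is an assumption, and the function names are illustrative rather than taken from the benchmark code.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    Assumed metric for binary tasks such as mortality or readmission.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute gap, e.g. between true and predicted LOS in days."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only; these are not results from the study.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_mae = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 6.0, 2.0])
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a toy check; a rank-based implementation would be the usual choice at dataset scale.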

## Unstructured Clinical Text: LLMs Lead Across the Board

When processing clinical notes written by doctors, leading LLMs (such as DeepSeek-R1, DeepSeek-V3.1-Think, GPT-5) significantly outperformed fine-tuned BERT models in zero-shot settings. The finding matters for three reasons:

- **Zero-shot capability**: Without fine-tuning for specific tasks, LLMs can directly extract predictive signals from clinical text
- **Text understanding advantage**: LLMs demonstrate deep understanding of medical terminology and disease course descriptions
- **Deployment convenience**: Zero-shot use greatly lowers the barrier to deploying clinical AI systems
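
A zero-shot setup of the kind described above might look roughly like the following sketch. The prompt wording, the `llm` callable, and the reply parsing are all assumptions made for illustration; the study's actual prompts and output handling are not given here.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; not the wording used in the study.
PROMPT_TEMPLATE = (
    "You are assisting with a clinical risk estimate.\n"
    "Task: {task}\n"
    "Clinical note:\n{note}\n\n"
    "Answer with a single probability between 0 and 1."
)


def build_prompt(note: str, task: str) -> str:
    """Fill the template with the task description and the raw note text."""
    return PROMPT_TEMPLATE.format(task=task, note=note.strip())


def parse_probability(reply: str) -> Optional[float]:
    """Return the first number in [0, 1] found in the reply, else None."""
    for match in re.finditer(r"\d*\.?\d+", reply):
        value = float(match.group())
        if 0.0 <= value <= 1.0:
            return value
    return None


def predict_zero_shot(note: str, task: str, llm: Callable[[str], str]) -> Optional[float]:
    """Send one prompt to any text-in, text-out model; no task-specific training."""
    return parse_probability(llm(build_prompt(note, task)))


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "Estimated probability: 0.82"


risk = predict_zero_shot(
    "72-year-old admitted with septic shock, on vasopressors.",
    "in-hospital mortality",
    stub_llm,
)
```

The parser is deliberately naive: a reply such as "Step 1: ..." would be read as 1.0, so a real pipeline would likely constrain the output format more tightly.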

## Structured EHR Data: Data Volume Determines the Outcome

In structured data scenarios, the results present a more complex picture:

- **When data is sufficient**: Specialized models (e.g., AICare, ConCare) perform best due to their dedicated modeling of longitudinal EHR sequences
- **When data is scarce**: Advanced LLMs (e.g., GPT-4o, GPT-5, DeepSeek-V3.1-Think) can outperform traditional methods with their zero-shot capability
- **Practical implication**: For hospitals that have not yet accumulated enough data, or for rare disease prediction, LLMs provide a feasible high-performance alternative
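
For an LLM to be applied zero-shot to structured EHR data, the tabular record has to be rendered as text first. The sketch below shows one simple way to do that; the feature names, units, and output layout are hypothetical and are not the serialization format used by ClinicRealm.

```python
from typing import Mapping, Optional, Sequence


def serialize_visits(
    visits: Sequence[Mapping[str, Optional[float]]],
    units: Mapping[str, str],
) -> str:
    """Render a longitudinal record as one text line per visit.

    Missing values are spelled out instead of imputed, so the model can
    see that a measurement was not taken.
    """
    lines = []
    for index, visit in enumerate(visits, start=1):
        parts = []
        for name in sorted(visit):  # stable feature order across visits
            value = visit[name]
            if value is None:
                parts.append(f"{name}: not recorded")
            else:
                unit = units.get(name, "")
                parts.append(f"{name}: {value:g} {unit}".rstrip())
        lines.append(f"Visit {index}: " + "; ".join(parts))
    return "\n".join(lines)


# Hypothetical two-visit record for illustration only.
example_text = serialize_visits(
    [
        {"heart_rate": 88.0, "lactate": None},
        {"heart_rate": 104.0, "lactate": 3.1},
    ],
    {"heart_rate": "bpm", "lactate": "mmol/L"},
)
```

The resulting text can then be placed in a prompt, which is what lets a general-purpose model produce a prediction without any site-specific training data.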
