Zing Forum

ClinicRealm: Systematic Re-evaluation of Large Language Models in Clinical Prediction Tasks

A study by a Peking University team, published in npj Digital Medicine, shows that modern large language models (LLMs) can outperform traditional machine learning methods in non-generative clinical prediction tasks, opening up new paths for zero-shot medical AI applications.

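Tags: large language models, clinical prediction, electronic health records, medical AI, machine learning, MIMIC-IV, zero-shot learning, open-source models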
Published 2026-05-25 17:14 | Recent activity 2026-05-25 17:19 | Estimated read 7 min

Section 01

Introduction / Original Post: ClinicRealm, a Systematic Re-evaluation of Large Language Models in Clinical Prediction Tasks

The study, conducted by a Peking University team and published in npj Digital Medicine, finds that modern LLMs can outperform traditional machine learning methods in non-generative clinical prediction tasks, most clearly on unstructured clinical notes and in data-scarce settings. This opens up new paths for zero-shot medical AI applications.

Section 02

Original Authors and Sources

  • Original Author/Maintainer: Yinghao Zhu (PKU-AICare Team)
  • Source Platform: GitHub
  • Original Title: ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks
  • Original Link: https://github.com/yhzhu99/ehr-llm-benchmark
  • Paper Publication: npj Digital Medicine (2026), DOI: 10.1038/s41746-026-02539-z
  • Source Code Last Updated: 2026-05-25

Section 03

Research Background and Motivation

As large language models (LLMs) such as ChatGPT and GPT-4 have spread through the medical field, attention has centered on their performance in generative tasks (e.g., medical record summarization, medical Q&A). What has long been missing is a systematic comparison of LLMs with traditional machine learning and deep learning methods on non-generative clinical prediction tasks, such as in-hospital mortality prediction, readmission risk assessment, and length of stay (LOS) estimation.

Clinical prediction is a core component of precision medicine. Traditional methods rely on structured electronic health record (EHR) data and use models like XGBoost, LSTM, and GRU for prediction. The emergence of LLMs raises new questions: Can they directly process unstructured clinical notes? Can they generalize better in data-scarce scenarios? The answers bear directly on model selection for clinical AI systems.

Section 04

ClinicRealm Research Framework

ClinicRealm, built by the AI Medicine Team of Peking University (PKU-AICare), is a comprehensive benchmark platform that systematically compares the performance of 31 different models on two types of data sources: unstructured clinical notes and structured EHR data.

Section 05

Model Lineup

Large Language Models (15 models)

  • General-purpose LLMs: GPT-4o, GPT-5, DeepSeek-V3, Gemma-3, Qwen2.5
  • Medical-fine-tuned LLMs: BioGPT, Meditron, OpenBioLLM, BioMistral
  • Reasoning-enhanced LLMs: DeepSeek-R1 (7B/671B), HuatuoGPT-o1-7B, GPT o3-mini-high

BERT Series Models (5 models)

  • BERT, BioBERT, ClinicalBERT, GatorTron, Clinical-Longformer

Traditional Machine Learning Methods (11 models)

  • Classic ML: CatBoost, XGBoost, Random Forest, Decision Tree
  • Deep Learning: GRU, LSTM, RNN
  • Longitudinal EHR-specific models: AdaCare, ConCare, GRASP, AICare

Section 06

Datasets and Tasks

The study is based on two public medical datasets:

  • MIMIC-IV: Contains structured EHR data and unstructured clinical notes
  • TJH: Tongji Hospital COVID-19 Dataset (structured EHR)

Evaluation tasks include:

  1. In-hospital mortality prediction
  2. 30-day readmission prediction
  3. Length of Stay (LOS) prediction
  4. Medical sentence matching
  5. ICD code clustering
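
This summary does not state which metrics the benchmark reports. As a sketch under that caveat: the two binary tasks (mortality, readmission) are commonly scored with AUROC, and LOS with mean absolute error. Both metric choices and the function names below are assumptions for illustration, not taken from the ClinicRealm code.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    # Fraction of (positive, negative) pairs where the positive is ranked higher.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, e.g. between true and predicted LOS in days."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, not real patient data.
mortality_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
los_mae = mean_absolute_error([3, 5], [2, 7])
```

The pairwise form is quadratic in the number of samples, which is fine for a small illustration; a library implementation would be the usual choice at benchmark scale.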

Section 07

Unstructured Clinical Text: LLMs Lead Across the Board

On clinical notes written by doctors, leading LLMs (such as DeepSeek-R1, DeepSeek-V3.1-Think, and GPT-5) significantly outperformed fine-tuned BERT models while running in zero-shot settings. The finding matters for three reasons:

  • Zero-shot capability: Without task-specific fine-tuning, LLMs can extract predictive signals directly from clinical text
  • Text understanding advantage: LLMs show a deep understanding of medical terminology and of descriptions of disease course
  • Deployment convenience: Zero-shot use greatly lowers the barrier to deploying clinical AI systems
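
To make "zero-shot" concrete, here is a minimal sketch of how a clinical note could be turned into a prompt and how a numeric risk could be read back from a model's reply. The prompt wording, the names `build_zero_shot_prompt` and `parse_probability`, and the 0.5 fallback are all assumptions for illustration. The actual ClinicRealm prompts are not reproduced here, and the model call itself is left out.

```python
import re


def build_zero_shot_prompt(note: str, task: str = "in-hospital mortality") -> str:
    """Build a prompt with instructions only: no labelled examples, no fine-tuning."""
    return (
        "You are assisting with a clinical prediction task.\n"
        f"Task: estimate the probability of {task} for the patient described below.\n"
        "Answer with a single number between 0 and 1.\n\n"
        f"Clinical note:\n{note.strip()}\n\n"
        "Probability:"
    )


def parse_probability(response: str, default: float = 0.5) -> float:
    """Take the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        return default  # hypothetical fallback when the reply has no number
    return min(1.0, max(0.0, float(match.group())))


# Made-up note and reply, for illustration only.
prompt = build_zero_shot_prompt("72-year-old admitted with sepsis and acute kidney injury.")
risk = parse_probability("Probability: 0.82 (high risk)")
```

Clamping keeps malformed replies such as "85" from producing invalid probabilities, although a real pipeline would probably want to log and inspect such cases rather than silently coerce them.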

Section 08

Structured EHR Data: Data Volume Determines the Outcome

In structured data scenarios, the results present a more complex picture:

  • When data is sufficient: Specialized models (e.g., AICare, ConCare) perform best because they are designed to model longitudinal EHR sequences
  • When data is scarce: Advanced LLMs (e.g., GPT-4o, GPT-5, DeepSeek-V3.1-Think) can outperform traditional methods through their zero-shot capability
  • Practical implication: For hospitals with little accumulated data, or for rare-disease prediction, LLMs offer a feasible high-performance alternative
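
One way to probe the "sufficient versus scarce" contrast is to train a conventional model on progressively smaller subsets of the training data and compare each result with a fixed zero-shot LLM score. The sketch below covers only the subsetting step. The helper name `nested_subsets`, the sizes, and the seed are hypothetical, and the study's actual low-data protocol is not described in this summary.

```python
import random


def nested_subsets(patient_ids, sizes, seed=0):
    """Return {size: ids}, where every smaller subset is contained in the larger ones."""
    rng = random.Random(seed)      # fixed seed so the subsets are reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    # Taking prefixes of one shuffle guarantees the subsets are nested.
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


# Illustrative IDs and sizes only.
subsets = nested_subsets(range(1000), sizes=[10, 100, 1000])
```

Nesting the subsets means a change in performance between sizes reflects the added patients rather than a different random draw.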