
ClinicRealm: Systematic Re-evaluation of Large Language Models in Clinical Prediction Tasks

A study by a Peking University team, published in npj Digital Medicine, shows that modern large language models (LLMs) can outperform traditional machine learning methods on non-generative clinical prediction tasks, notably on unstructured clinical notes and in data-scarce settings, opening new paths for zero-shot medical AI applications.

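Tags: large language models, clinical prediction, electronic health records, medical AI, machine learning, MIMIC-IV, zero-shot learning, open-source models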

Published 2026-05-25 17:14 | Recent activity 2026-05-25 17:19 | Estimated read 7 min

Section 01

Introduction: ClinicRealm: Systematic Re-evaluation of Large Language Models in Clinical Prediction Tasks

Section 02

Original Authors and Sources

Original Author/Maintainer: Yinghao Zhu (PKU-AICare Team)
Source Platform: GitHub
Original Title: ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks
Original Link: https://github.com/yhzhu99/ehr-llm-benchmark
Paper Publication: npj Digital Medicine (2026), DOI: 10.1038/s41746-026-02539-z
Source Code Update Time: 2026-05-25

Section 03

Research Background and Motivation

As large language models (LLMs) such as ChatGPT and GPT-4 spread through the medical field, attention has centered on generative tasks such as medical record summarization and medical Q&A. What has been missing is a systematic comparison of LLMs against traditional machine learning and deep learning methods on non-generative clinical prediction tasks, such as in-hospital mortality prediction, readmission risk assessment, and length of stay (LOS) estimation.

Clinical prediction is a core component of precision medicine. Traditional methods rely on structured electronic health record (EHR) data and use models such as XGBoost, LSTM, and GRU. LLMs raise new questions: Can they process unstructured clinical notes directly? Can they generalize better when data is scarce? The answers bear directly on model selection for clinical AI systems.

Section 04

ClinicRealm Research Framework

ClinicRealm, built by the AI Medicine Team of Peking University, is a comprehensive benchmark platform that systematically compares 31 models on two types of data sources: structured EHR data and unstructured clinical notes.

Section 05

Model Lineup

Large Language Models (15 models)

General-purpose LLMs: GPT-4o, GPT-5, DeepSeek-V3, Gemma-3, Qwen2.5
Medical-fine-tuned LLMs: BioGPT, Meditron, OpenBioLLM, BioMistral
Reasoning-enhanced LLMs: DeepSeek-R1 (7B/671B), HuatuoGPT-o1-7B, GPT o3-mini-high

BERT Series Models (5 models)

BERT, BioBERT, ClinicalBERT, GatorTron, Clinical-Longformer

Traditional Machine Learning Methods (11 models)

Classic ML: CatBoost, XGBoost, Random Forest, Decision Tree
Deep Learning: GRU, LSTM, RNN
Longitudinal EHR-specific models: AdaCare, ConCare, GRASP, AICare

Section 06

Datasets and Tasks

The study is based on two public medical datasets:

MIMIC-IV: Contains structured EHR data and unstructured clinical notes
TJH: Tongji Hospital COVID-19 Dataset (structured EHR)

Evaluation tasks include:

In-hospital mortality prediction
30-day readmission prediction
Length of Stay (LOS) prediction
Medical sentence matching
ICD code clustering
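The first three tasks can be pictured as labels derived from a single admission record. The sketch below is illustrative only: the `Admission` class and its field names are hypothetical and are not taken from MIMIC-IV, TJH, or the ClinicRealm code, and the exact label definitions used in the paper may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Admission:
    """One hospital stay; field names are hypothetical, not a real schema."""
    admit_time: datetime
    discharge_time: datetime
    died_in_hospital: bool
    next_admit_time: Optional[datetime] = None  # None if no later admission is known


def mortality_label(adm: Admission) -> int:
    """Binary target: 1 if the patient died during this stay."""
    return int(adm.died_in_hospital)


def readmission_label(adm: Admission, window_days: int = 30) -> int:
    """Binary target: 1 if a new admission starts within the window after discharge."""
    if adm.died_in_hospital or adm.next_admit_time is None:
        return 0
    gap = adm.next_admit_time - adm.discharge_time
    return int(timedelta(0) <= gap <= timedelta(days=window_days))


def los_days(adm: Admission) -> float:
    """Regression target: length of stay in days."""
    return (adm.discharge_time - adm.admit_time).total_seconds() / 86400.0


# Made-up example stay: 3.5 days long, readmitted about 15 days after discharge.
stay = Admission(
    admit_time=datetime(2024, 1, 1, 8, 0),
    discharge_time=datetime(2024, 1, 4, 20, 0),
    died_in_hospital=False,
    next_admit_time=datetime(2024, 1, 20, 9, 0),
)
labels = (mortality_label(stay), readmission_label(stay), los_days(stay))
```

Mortality and readmission are binary classification targets, while LOS is a regression target, so the three tasks would normally be scored with different metrics.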

Section 07

Unstructured Clinical Text: LLMs Lead Across the Board

On clinical notes written by doctors, leading LLMs (such as DeepSeek-R1, DeepSeek-V3.1-Think, and GPT-5) significantly outperformed fine-tuned BERT models while running zero-shot. The finding matters for three reasons:

Zero-shot capability: Without fine-tuning for specific tasks, LLMs can directly extract predictive signals from clinical text
Text understanding advantage: LLMs demonstrate deep understanding of medical terminology and disease course descriptions
Deployment convenience: Zero-shot operation greatly lowers the barrier to deploying clinical AI systems
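A minimal sketch of what zero-shot prediction from a clinical note could look like: build a prompt with no labeled examples, send it to a model, and parse a probability from the reply. The prompt wording, function names, and the stubbed reply are all assumptions for illustration, not the prompts or code used in ClinicRealm; the model call itself is left out, so any LLM client could be plugged in.

```python
import re
from typing import Optional

# Hypothetical prompt wording; the study's actual prompts are not shown in this article.
PROMPT_TEMPLATE = (
    "You are assisting with a clinical prediction task.\n"
    "Task: estimate the probability of {outcome} for the patient below.\n"
    "Clinical note:\n{note}\n"
    "Answer with a single number between 0 and 1."
)


def build_zero_shot_prompt(note: str, outcome: str) -> str:
    """Fill the template; no labeled examples are included (zero-shot)."""
    return PROMPT_TEMPLATE.format(outcome=outcome, note=note.strip())


def parse_probability(reply: str) -> Optional[float]:
    """Return the first number in [0, 1] found in the reply, else None."""
    for token in re.findall(r"\d*\.?\d+", reply):
        value = float(token)
        if 0.0 <= value <= 1.0:
            return value
    return None


prompt = build_zero_shot_prompt(
    note="Elderly patient admitted with sepsis, on vasopressors.",  # made-up note
    outcome="in-hospital mortality",
)
fake_reply = "Estimated probability: 0.82"  # stand-in for a real model reply
risk = parse_probability(fake_reply)
```

Returning `None` for replies without a usable number keeps malformed outputs visible, so they can be counted or retried rather than silently scored as a prediction.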

Section 08

Structured EHR Data: Data Volume Determines the Outcome

In structured data scenarios, the results present a more complex picture:

When data is sufficient: Specialized models (e.g., AICare, ConCare) perform best due to their dedicated modeling of longitudinal EHR sequences
When data is scarce: Advanced LLMs (e.g., GPT-4o, GPT-5, DeepSeek-V3.1-Think) can outperform traditional methods with their zero-shot capability
Practical implication: For hospitals with little accumulated data, or for rare-disease prediction, LLMs offer a feasible, high-performing alternative
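One way to probe the data-volume effect described above is to train the supervised baselines on progressively smaller, class-balanced subsets while evaluating the zero-shot LLM unchanged. The sketch below shows only the subsampling step; the function name, subset sizes, and toy data are assumptions and do not reproduce the paper's protocol.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, binary label)


def stratified_subsample(data: List[Example], n: int, seed: int = 0) -> List[Example]:
    """Draw about n examples while keeping each class's share of the data."""
    rng = random.Random(seed)
    by_label: Dict[int, List[Example]] = defaultdict(list)
    for example in data:
        by_label[example[1]].append(example)
    sample: List[Example] = []
    for _, group in sorted(by_label.items()):
        # At least one example per class so every class stays represented.
        k = max(1, round(n * len(group) / len(data)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


# Toy data: 200 stays, 10% positive (numbers are illustrative only).
toy = [([float(i)], int(i % 10 == 0)) for i in range(200)]
subsets = {n: stratified_subsample(toy, n) for n in (10, 50, 200)}
```

Plotting each baseline's score against subset size, next to the LLM's fixed zero-shot score, would show where the two curves cross for a given dataset.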
