With large language models (LLMs) such as ChatGPT and GPT-4 now widely applied in medicine, attention has focused mostly on their performance in generative tasks (e.g., medical record summarization, medical Q&A). What has long been missing is a systematic comparison of LLMs against traditional machine learning and deep learning methods on non-generative clinical prediction tasks, such as in-hospital mortality prediction, readmission risk assessment, and length-of-stay (LOS) estimation.
Clinical prediction is a core component of precision medicine. Traditional methods rely on structured electronic health record (EHR) data and use models such as XGBoost, LSTM, and GRU. The emergence of LLMs raises new questions: Can they process unstructured clinical notes directly? Do they generalize better in data-scarce scenarios? The answers bear directly on model selection for clinical AI systems.