Zing Forum

Study on Systematic Preprocessing Methods for Machine Learning of Clinical COVID-19 Data

This project provides a complete implementation of machine learning preprocessing for clinical COVID-19 data, including the IFOSS outlier handling process, benchmark testing of six classifiers, and UMAP visualization, supporting reproducible research in multimodal clinical modeling.

COVID-19, machine learning, clinical data, outlier detection, Isolation Forest, class imbalance, data preprocessing, medical AI
Published 2026-04-08 01:26 | Recent activity 2026-04-08 01:53 | Estimated read 5 min
Section 01

Introduction to the Study on Systematic Preprocessing Methods for Machine Learning of Clinical COVID-19 Data

This project provides a complete implementation of machine learning preprocessing for clinical COVID-19 data: the IFOSS outlier handling process, a benchmark of six classifiers, and UMAP visualization. It targets four core challenges in clinical data preprocessing (data quality, class imbalance, feature complexity, and reproducibility) and is intended to support reproducible research in multimodal clinical modeling.

Section 02

Background of Challenges in Clinical Data Preprocessing

The COVID-19 pandemic generated large volumes of multimodal clinical data (demographics, symptoms, laboratory results, and so on). Preprocessing that data raises four challenges: data quality (missing values, outliers, measurement errors), class imbalance (an uneven ratio between severe and mild cases), feature complexity (complex relationships between features), and reproducibility (medical research demands strict documentation of every step).

Section 03

Core Method: IFOSS Outlier Handling Process

IFOSS (Isolation Forest Outlier Sampling Strategy) is the project's core contribution. It combines Isolation Forest, which isolates anomalous samples quickly through random partitioning, with the One-Sided Selection (OSS) undersampling strategy. The combined process identifies and handles outliers while balancing the class distribution, which eliminates noisy samples and reduces the bias caused by class imbalance.
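
The two stages described above can be sketched as follows. This is a minimal illustration built on scikit-learn, not the project's actual benchmark_ifoss.py: the function names, the `contamination` value, the choice of class 0 as the majority class, and the decision to run Isolation Forest over all classes before OSS are all assumptions made for the example. The OSS step follows the usual recipe of 1-NN condensation followed by Tomek-link removal.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors


def isolation_forest_filter(X, y, contamination=0.05, seed=0):
    """Drop the rows that Isolation Forest flags as outliers (prediction -1)."""
    iso = IsolationForest(contamination=contamination, random_state=seed)
    keep = iso.fit_predict(X) == 1
    return X[keep], y[keep]


def one_sided_selection(X, y, majority=0, seed=0):
    """Simplified One-Sided Selection for a binary problem (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y != majority)
    maj_idx = np.flatnonzero(y == majority)

    # Step 1: start from every minority sample plus one random majority sample.
    seed_maj = rng.choice(maj_idx)
    store = np.append(min_idx, seed_maj)
    rest = maj_idx[maj_idx != seed_maj]

    # Step 2: add only the majority samples that 1-NN on the store misclassifies;
    # correctly classified ones are treated as redundant and dropped.
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[store], y[store])
    hard = rest[knn.predict(X[rest]) != y[rest]]
    store = np.concatenate([store, hard])

    # Step 3: remove majority samples that form a Tomek link, i.e. a pair of
    # mutual nearest neighbours with different labels. Column 0 of the neighbour
    # table is assumed to be the point itself (no exact duplicate rows).
    Xs, ys = X[store], y[store]
    nn = NearestNeighbors(n_neighbors=2).fit(Xs)
    nearest = nn.kneighbors(Xs, return_distance=False)[:, 1]
    mutual = nearest[nearest] == np.arange(len(Xs))
    tomek = mutual & (ys != ys[nearest]) & (ys == majority)
    return Xs[~tomek], ys[~tomek]


def ifoss(X, y, majority=0, contamination=0.05, seed=0):
    """Isolation Forest filtering followed by One-Sided Selection."""
    X_in, y_in = isolation_forest_filter(X, y, contamination, seed)
    return one_sided_selection(X_in, y_in, majority, seed)
```

In a benchmark setting, a function like `ifoss` would be applied to the training portion only, so that the held-out test set keeps its original distribution.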

Section 04

Benchmark Testing Methodology and Evaluation

The benchmark uses two nested stratified 80/20 splits. The outer split keeps 80% of the data as the training set and holds out 20% as the test set; the inner split divides the training set 80/20 again for model fitting and Optuna hyperparameter tuning. Tuning maximizes G-Mean at the Youden's J threshold, where Youden's J is sensitivity + specificity - 1 and G-Mean is the geometric mean of sensitivity and specificity. Evaluation reports AUC, weighted F1 score, accuracy, balanced accuracy, and G-Mean.
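
The split and the threshold-based metrics can be illustrated with the sketch below. It assumes a synthetic dataset and uses logistic regression as a stand-in classifier; the helper names are hypothetical, and the Optuna search itself is left out (an Optuna objective would return the validation G-Mean computed here). Choosing the threshold on the inner validation part and reusing it on the test set is also an assumption of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split


def youden_threshold(y_true, y_score):
    """Score threshold that maximizes Youden's J = TPR - FPR on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (class 1 recall) and specificity (class 0 recall)."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)
    return float(np.sqrt(sensitivity * specificity))


def evaluate(y_true, y_score, threshold):
    """Compute the reported metrics after binarizing scores at `threshold`."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "g_mean": g_mean(y_true, y_pred),
    }


# Synthetic imbalanced data standing in for the clinical dataset.
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# Outer split: 80% training, 20% held-out test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Inner split: the training set is split 80/20 again for fitting and tuning.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_score = model.predict_proba(X_val)[:, 1]
threshold = youden_threshold(y_val, val_score)

val_metrics = evaluate(y_val, val_score, threshold)  # g_mean is the tuning objective
test_metrics = evaluate(y_test, model.predict_proba(X_test)[:, 1], threshold)
```

Stratifying both splits keeps the severe-to-mild ratio roughly constant across the fitting, tuning, and test parts, which matters when the minority class is small.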

Section 05

Visualization Analysis and Technical Implementation Details

UMAP visualization compares the distributions of four datasets: the original training data, the independent test data, the Isolation Forest-filtered data, and the OSS-undersampled data. The comparison helps assess class separability and whether the preprocessing steps are reasonable. The project depends on Python libraries including scikit-learn, XGBoost/LightGBM/CatBoost, Optuna, and UMAP. The code includes benchmark_ifoss.py (benchmark testing) and umap_visualization.py (visualization).
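
One way to put the four datasets into a comparable picture is to fit a single 2-D reducer on the original training data and project every subset with it. The sketch below shows that idea; it is not the project's umap_visualization.py. The function names, the subset names, and the fit-on-train design are assumptions, and the reducer is passed in so that any object with `fit` and `transform` methods can be used.

```python
import numpy as np


def embed_subsets(reducer, subsets, reference="train"):
    """Fit `reducer` on the reference subset and project every subset into that space."""
    reducer.fit(subsets[reference])
    return {name: np.asarray(reducer.transform(X)) for name, X in subsets.items()}


def plot_embeddings(embeddings, labels):
    """Draw one scatter panel per subset, colored by class label."""
    import matplotlib.pyplot as plt  # imported lazily: only needed when plotting

    fig, axes = plt.subplots(1, len(embeddings), figsize=(4 * len(embeddings), 4),
                             sharex=True, sharey=True, squeeze=False)
    for ax, (name, emb) in zip(axes[0], embeddings.items()):
        ax.scatter(emb[:, 0], emb[:, 1], c=labels[name], s=5, cmap="coolwarm")
        ax.set_title(name)
    return fig
```

With umap-learn installed, the reducer would be `umap.UMAP(n_components=2, random_state=0)`, and `subsets` would be a dict such as `{"train": X_train, "test": X_test, "if_filtered": X_if, "oss": X_oss}` (hypothetical keys), with `labels` holding the matching class vectors. Sharing one fitted embedding keeps the axes comparable across panels; fitting a separate UMAP per subset would produce layouts that cannot be compared directly.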

Section 06

Application Scenarios and Limitations

Application scenarios include COVID-19 severity prediction, patient risk stratification, and the development of clinical decision support systems. The methodology can also be extended to data on other infectious diseases, to other imbalanced medical datasets, and to outlier detection tasks. Three limitations should be noted: data privacy (clinical data must be handled in compliance with HIPAA/GDPR), the need to validate the assumptions behind IFOSS, and computational cost, which can be reduced through parallelization, early stopping, and similar techniques.

Section 07

Project Summary and Value

This project provides a systematic solution for preprocessing clinical COVID-19 data. By combining IFOSS, a strict nested validation process, and multi-classifier testing, it supports reliable and reproducible results and offers a useful reference for medical AI research. In the future, it can be extended to multimodal clinical modeling that integrates imaging, time series, text, and other data.