Machine Learning for Predicting Medication Adherence in Diabetic Patients: A Practice Using Zimbabwean Healthcare Data

Using real-world data from Zimbabwe's Cimas Medical Aid Society, this study builds classical machine learning models to predict medication adherence in patients with diabetes and hypertension. Through feature group comparison experiments and clinical cost-sensitive evaluation, it derives data-driven intervention strategies for non-communicable disease (NCD) management in sub-Saharan Africa.

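Tags: machine learning, medical AI, medication adherence, diabetes, hypertension, sub-Saharan Africa, health data science, XGBoost, SHAP, interpretability, cost-sensitive learning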
Published 2026-06-06 18:16 | Recent activity 2026-06-06 18:26 | Estimated read 7 min

Section 01

Introduction to Machine Learning for Predicting Medication Adherence in Diabetic Patients: A Practice Using Zimbabwean Healthcare Data

This project uses real-world data from Zimbabwe's Cimas Medical Aid Society to build classical machine learning models that predict medication adherence in patients with diabetes and hypertension. It combines feature group comparison experiments, clinical cost-sensitive evaluation, and SHAP interpretability analysis to derive data-driven intervention strategies for non-communicable disease (NCD) management in sub-Saharan Africa. The core objectives are to verify the predictive value of pharmacy refill and insurance data, analyze the role of socioeconomic and clinical consumption features, identify the key predictive factors, and optimize model performance.

Section 02

NCD Crisis and Medication Adherence Challenges in Sub-Saharan Africa

Sub-Saharan Africa faces a double burden: infectious diseases that remain uncontrolled and NCDs that are rising rapidly. The International Diabetes Federation projects that the prevalence of diabetes in Africa will increase by 129% by 2045. Hypertension affects about 30% of adults in the region, yet awareness and treatment rates there are the lowest globally. In Zimbabwe, structural barriers (a shortage of specialists, uneven access to drugs, fragmented insurance, and high out-of-pocket costs) exacerbate this burden. Medication non-adherence is costly on three levels: clinically, it leads to complications such as retinopathy and nephropathy; economically, hospitalization costs are 3 to 5 times higher than medication costs; systemically, it consumes scarce medical resources.

Section 03

Dataset Features and Innovative Derived Metrics

The project uses public data from Cimas Medical Aid Society covering approximately 8,141 patients from January to December 2022 (source: Mendeley Data). Adherence is defined by the medication possession ratio (MPR), the share of days in the observation period for which the patient had medication on hand: MPR ≥ 75% is labeled adherent and MPR < 75% non-adherent. Innovative derived features include the cost burden ratio (annual claims divided by premium amount), refill interval in days, refill regularity (standard deviation of the refill intervals), number of units per refill, comorbidity markers, and insurance tier (basic/standard/premium).
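The derived metrics above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's code: the function names, the input layout (a list of refill dates plus days supplied per refill), the cap of MPR at 1.0, and the use of the population standard deviation for regularity are all assumptions made for the example, and the patient record is invented.

```python
from datetime import date
from statistics import mean, pstdev

ADHERENCE_THRESHOLD = 0.75  # MPR cut-off stated in the text


def medication_possession_ratio(days_supplied, period_days):
    """Total days of medication supplied divided by days in the period.

    Capping at 1.0 is an assumption; the source does not say how
    oversupply is handled.
    """
    return min(sum(days_supplied) / period_days, 1.0)


def refill_intervals(refill_dates):
    """Days between consecutive refills, in chronological order."""
    ordered = sorted(refill_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def derive_features(refill_dates, days_supplied, annual_claims,
                    annual_premium, period_days=365):
    """Build one patient's feature dict (hypothetical field names)."""
    intervals = refill_intervals(refill_dates)
    mpr = medication_possession_ratio(days_supplied, period_days)
    return {
        "mpr": mpr,
        "adherent": mpr >= ADHERENCE_THRESHOLD,
        "cost_burden_ratio": annual_claims / annual_premium,
        "mean_refill_interval": mean(intervals) if intervals else None,
        # Population standard deviation of the intervals (assumed variant).
        "refill_regularity": pstdev(intervals) if intervals else None,
    }


# Invented example patient: four 30-day refills spread over 2022.
example_dates = [date(2022, 1, 10), date(2022, 2, 9),
                 date(2022, 3, 21), date(2022, 5, 20)]
example = derive_features(example_dates, [30, 30, 30, 30],
                          annual_claims=1200.0, annual_premium=800.0)
print(example)
```

In this invented record the gaps between refills grow over the year, so the patient ends up with a low MPR and a high refill-interval spread, which is the pattern the regularity feature is meant to capture.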

Section 04

Feature Group Experiments and Machine Learning Workflow

The feature group experiments compare three feature sets: Group A (socioeconomic features such as insurance tier and cost burden), Group B (clinical consumption features such as refill interval and regularity), and Group C (both groups combined). The machine learning workflow consists of preprocessing (standardization and encoding), SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance, a 70/15/15 stratified split, several classifiers (logistic regression, XGBoost, etc.), and hyperparameter tuning with RandomizedSearchCV, using macro F1 as the optimization target. The evaluation is cost-sensitive: false negatives (missed non-adherence cases) are penalized more heavily than false positives.
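To make the cost-sensitive idea concrete, the following is a rough sketch of how an asymmetric penalty can drive the choice of decision threshold. The source does not state its cost weights, so the 5:1 ratio, the function names, and the toy labels and scores are all hypothetical; label 1 stands for a non-adherent patient.

```python
def predict_at(scores, threshold):
    """Flag a patient as non-adherent (1) when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def expected_cost(y_true, y_pred, fn_cost, fp_cost):
    """Average misclassification cost per patient.

    A false negative is a non-adherent patient (1) predicted adherent (0);
    a false positive is the reverse.
    """
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (fn_cost * fn + fp_cost * fp) / len(y_true)


def cheapest_threshold(y_true, scores, thresholds, fn_cost, fp_cost):
    """Return the candidate threshold with the lowest expected cost."""
    return min(
        thresholds,
        key=lambda t: expected_cost(y_true, predict_at(scores, t), fn_cost, fp_cost),
    )


# Toy data: three non-adherent patients, five adherent ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.4, 0.3, 0.2, 0.1, 0.05]
candidates = [0.3, 0.5, 0.7]

# Hypothetical 5:1 penalty: missing a non-adherent patient costs five
# times as much as a needless follow-up.
asymmetric = cheapest_threshold(y_true, scores, candidates, fn_cost=5.0, fp_cost=1.0)
symmetric = cheapest_threshold(y_true, scores, candidates, fn_cost=1.0, fp_cost=1.0)
print(asymmetric, symmetric)
```

With equal costs the middle threshold wins on this toy data, while the heavier false-negative penalty pushes the choice to the lowest threshold, accepting two extra false alarms in order to miss no non-adherent patient.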

Section 05

Experimental Results and Feature Contribution Analysis

Among the baseline models, XGBoost and Random Forest performed best. The feature group comparison showed that Group B (clinical consumption features) alone came close to the full model, Group A (socioeconomic features) added supplementary value, and Group C (combined) performed best overall. SHAP analysis provides interpretability at three levels: global (which features matter most across patients), local (why an individual patient received a given prediction), and feature interactions.
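SHAP attributions are based on Shapley values, which can be computed exactly by brute force when there are only a few features. The sketch below does that for a made-up risk function; it does not use the SHAP library, and the model, its coefficients, the feature names, and the baseline patient are all hypothetical. It is only meant to show what a local explanation is and why the attributions add up to the difference from the baseline prediction.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) - predict(baseline).

    Features outside a coalition are set to their baseline value.
    Cost grows as 2**n, so this is only usable for a handful of features.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    attributions = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                total += weight * (value(without | {name}) - value(without))
        attributions[name] = total
    return attributions


def toy_risk(x):
    """Made-up non-adherence risk score with one interaction term."""
    return (0.02 * x["refill_interval_sd"]
            + 0.10 * x["cost_burden_ratio"]
            + 0.05 * x["comorbid"] * x["cost_burden_ratio"])


patient = {"refill_interval_sd": 20, "cost_burden_ratio": 2.0, "comorbid": 1}
reference = {"refill_interval_sd": 5, "cost_burden_ratio": 1.0, "comorbid": 0}

phi = shapley_values(toy_risk, patient, reference)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.3f}")
```

The interaction between comorbidity and cost burden is split between the two features involved, which is the behavior that lets SHAP-style analysis report both per-feature and interaction effects.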

Section 06

Clinical Practice and Policy Implications

Socioeconomic features can help community health workers identify high-risk patients when pharmacy data is unavailable. The cost-sensitive framework balances model performance and clinical safety. Targeted intervention strategies include: providing financial assistance to patients with high cost burdens, implementing reminder systems for patients with irregular refills, and enhancing education for patients with comorbidities.

Section 07

Limitations and Ethical Considerations

The dataset is limited to insured urban populations in Harare, so the results may not generalize to rural populations or the informal sector. The model is an academic prototype and requires prospective validation before any clinical use. On the ethical side, the data is de-identified, sourced from a CC0-licensed repository, and contains no patient identity information. Deployment requires stakeholder participation and transparent communication of the model's limitations.