Zing Forum

Machine Learning for Predicting Thyroid Cancer Recurrence Risk: RF and XGBoost Achieve 97.4% Accuracy

A study combining Random Forest, XGBoost, KNN, and Deep Neural Networks leverages the UCI clinicopathological dataset to develop a high-precision model for thyroid cancer recurrence prediction, providing a new tool for early clinical decision-making.

Thyroid cancer, machine learning, deep learning, Random Forest, XGBoost, medical AI, recurrence prediction, clinical decision support
Published 2026-05-20 16:13 | Recent activity 2026-05-20 16:18 | Estimated read 8 min

Section 01

Machine Learning for Predicting Thyroid Cancer Recurrence: RF and XGBoost Achieve 97.4% Accuracy (Introduction)

A study comparing Random Forest (RF), XGBoost, K-Nearest Neighbors (KNN), and a Deep Neural Network (DNN) uses the UCI clinicopathological dataset to predict thyroid cancer recurrence. RF and XGBoost both reach a test accuracy of 97.4%, offering a potential tool to support early clinical decision-making.

Section 02

Research Background and Clinical Significance

Thyroid cancer is one of the most common endocrine system malignancies globally, with differentiated thyroid cancer (DTC) accounting for the majority of cases. Although DTC has a good prognosis, recurrence risk is a key clinical concern. Traditional recurrence assessment relies on manual analysis of clinicopathological features, which is time-consuming and prone to subjective biases.

In recent years, machine learning has been widely applied in the medical field, enabling the identification of complex patterns for precise prediction. This open-source study proposes a systematic solution for DTC recurrence prediction.

Section 03

Data Source and Feature Engineering

The clinicopathological dataset from the UCI Machine Learning Repository is used, which includes multi-dimensional features such as age, gender, tumor size, pathological type, and lymph node metastasis.

Preprocessing consists of visual analysis to identify outliers and missing values, standardization of numerical features, and encoding of categorical variables. The dataset is then split into training and test sets in an 8:2 ratio, with a fixed random seed to ensure reproducibility.
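The preprocessing and split described here could be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the study's code: the toy records, the column names (`age`, `stage`), the seed value 42, and the choice to fit the scaler and encoder on training rows only are all assumptions, since the article does not give these details.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy records; the article does not list real column names or values.
age = np.array([[34.0], [51.0], [28.0], [62.0], [45.0],
                [39.0], [70.0], [55.0], [31.0], [48.0]])
stage = np.array([["I"], ["II"], ["I"], ["III"], ["I"],
                  ["II"], ["III"], ["II"], ["I"], ["I"]])
recurred = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0])

# 8:2 split with a fixed seed (42 is an assumed value);
# stratify keeps the class ratio similar in both parts.
rows = np.arange(len(recurred))
train_rows, test_rows = train_test_split(
    rows, test_size=0.2, random_state=42, stratify=recurred
)

# Fit the transformers on training rows only, then reuse them on the test rows.
scaler = StandardScaler().fit(age[train_rows])
encoder = OneHotEncoder(handle_unknown="ignore").fit(stage[train_rows])

def transform(idx):
    """Standardize the numeric column and one-hot encode the categorical one."""
    numeric = scaler.transform(age[idx])
    categorical = encoder.transform(stage[idx]).toarray()
    return np.hstack([numeric, categorical])

X_train, X_test = transform(train_rows), transform(test_rows)
y_train, y_test = recurred[train_rows], recurred[test_rows]
print(X_train.shape, X_test.shape)
```

Fitting the scaler and encoder on the training rows alone keeps test-set statistics out of the model inputs; whether the study did this is not stated.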

Section 04

Model Architecture and Algorithm Selection

Four algorithms are compared:

Random Forest (RF)

An ensemble learning algorithm that builds multiple decision trees and combines their results, providing feature importance evaluation.
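As a rough illustration of how a Random Forest is trained and how its feature importances are read, the sketch below uses scikit-learn on synthetic data. The data, the 200 trees, and the seed are assumptions for demonstration only; the study's actual settings are not given in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the thyroid dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each tree sees a bootstrap sample; predictions are combined across trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)

# Impurity-based importances: one value per feature, normalized to sum to 1.
ranking = sorted(
    enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for index, importance in ranking:
    print(f"feature_{index}: {importance:.3f}")
```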

XGBoost

A gradient boosting algorithm that iteratively trains weak learners and combines them with weights to capture non-linear relationships.
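In the standard formulation of gradient-boosted trees, the iterative scheme can be written as below. The notation is the usual one from the XGBoost literature, not taken from the article: $f_t$ is the tree added in round $t$, $\eta$ the learning rate, $T$ the number of leaves, $w$ the leaf weights, and $\gamma$, $\lambda$ the regularization coefficients.

```latex
% Additive model: each round t adds a new tree f_t, scaled by the learning rate \eta
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)

% Objective minimized when fitting f_t: training loss plus a complexity penalty
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t)}\bigr) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```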

K-Nearest Neighbors (KNN)

An instance-based learning algorithm that classifies samples by calculating distances, suitable for small-scale datasets.
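The distance-based vote can be shown in a few lines of plain Python. The sample points, labels, and `k=3` below are hypothetical and only illustrate the mechanism; they are not values from the study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already standardized two-feature samples
# (0 = no recurrence, 1 = recurrence).
points = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.7),
          (1.0, 0.9), (1.1, 1.2), (0.8, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (-0.95, -0.9)))  # query near the class-0 cluster
print(knn_predict(points, labels, (0.9, 1.1)))     # query near the class-1 cluster
```

Because the vote depends on raw distances, features on larger scales would dominate without standardization.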

Deep Neural Network (DNN)

The network has two hidden layers: 64 neurons (ReLU) → 32 neurons (ReLU) → output layer (Sigmoid). It is trained with binary cross-entropy loss and the Adam optimizer for 50 epochs with a batch size of 10.
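To make the architecture concrete, the following NumPy sketch runs one forward pass through a 64 → 32 → 1 network and computes the binary cross-entropy on a mini-batch of 10. It is untrained: Adam and the 50-epoch loop are omitted. The input width of 16 and the random weights and labels are assumptions, since the article does not state the number of input features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 16  # hypothetical input width after encoding; not stated in the article

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

layers = [init_layer(n_features, 64), init_layer(64, 32), init_layer(32, 1)]

def forward(x):
    """64 (ReLU) -> 32 (ReLU) -> 1 (sigmoid); returns recurrence probabilities."""
    h = x
    for weights, bias in layers[:-1]:
        h = np.maximum(0.0, h @ weights + bias)
    weights, bias = layers[-1]
    logits = h @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, with clipping to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

batch = rng.normal(size=(10, n_features))   # one mini-batch of 10 samples
targets = rng.integers(0, 2, size=(10, 1))  # random 0/1 labels, for illustration only
probabilities = forward(batch)
loss = binary_cross_entropy(targets, probabilities)
n_parameters = sum(w.size + b.size for w, b in layers)
print(probabilities.shape, round(loss, 3), n_parameters)
```

With 16 inputs this network has 3,201 trainable parameters, which is small enough that overfitting on a modest clinical dataset is a realistic concern.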

Section 05

Experimental Results and Performance Comparison

Model | Test Accuracy | Core Advantages
Random Forest | 97.40% | High precision, suitable for reducing false positives
XGBoost | 97.40% | High recall, suitable for reducing missed diagnoses
Deep Neural Network | 94.81% | Excels at capturing complex feature interactions
KNN | 93.51% | Simple and efficient, suitable for small datasets
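The reported percentages can be checked for internal consistency without any outside information. The sketch below searches for the smallest test-set size at which every reported accuracy equals a whole number of correct predictions, rounded to two decimals. It only shows what the figures are consistent with; the article does not state the actual test-set size.

```python
# Accuracies exactly as reported in the table (percent).
reported = {
    "Random Forest": 97.40,
    "XGBoost": 97.40,
    "Deep Neural Network": 94.81,
    "KNN": 93.51,
}

def correct_counts(n, accuracies):
    """Return {accuracy: correct predictions} if every accuracy fits n samples, else None."""
    counts = {}
    for acc in accuracies:
        k = round(acc * n / 100)          # nearest whole number of correct predictions
        if round(100 * k / n, 2) != acc:  # must reproduce the reported figure
            return None
        counts[acc] = k
    return counts

def smallest_test_size(accuracies, max_n=1000):
    """Smallest n for which all reported accuracies are achievable."""
    for n in range(1, max_n + 1):
        counts = correct_counts(n, accuracies)
        if counts is not None:
            return n, counts
    return None

n, counts = smallest_test_size(set(reported.values()))
print(n, counts)
```

The search returns 77 samples, with 75, 73, and 72 correct predictions for the three distinct accuracies. On a test set of that size, the gap between the best and worst model would be three patients.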

RF and XGBoost tie for first place in accuracy. RF's high precision reduces unnecessary examinations, while XGBoost's high recall reduces missed diagnoses. The DNN shows potential for modeling complex feature interactions, and KNN trails the other models, which the study attributes to noise in high-dimensional data.
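Two models can share the same accuracy while making different kinds of errors. The sketch below uses hypothetical confusion-matrix counts, which are not reported in the article, to show how one model can favor precision and the other recall at identical accuracy.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision_recall(tp, fp, fn):
    """Precision: share of predicted recurrences that are real.
    Recall: share of real recurrences that are caught."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: each model makes two errors on the same test set.
model_a = {"tp": 18, "tn": 57, "fp": 0, "fn": 2}  # both errors are missed recurrences
model_b = {"tp": 20, "tn": 55, "fp": 2, "fn": 0}  # both errors are false alarms

for name, c in (("A", model_a), ("B", model_b)):
    p, r = precision_recall(c["tp"], c["fp"], c["fn"])
    print(f"model {name}: accuracy={accuracy(**c):.4f} precision={p:.3f} recall={r:.3f}")
```

In a recurrence setting a missed case is usually costlier than a false alarm, so reporting precision and recall alongside accuracy gives a clearer basis for choosing between models.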

Section 06

Hyperparameter Tuning and Model Validation

Hyperparameter tuning for each algorithm: For RF, adjust the number of trees, maximum depth, etc.; for XGBoost, optimize learning rate, regularization coefficients, etc.; for KNN, try different numbers of neighbors and distance metrics; for DNN, adjust network structure, activation functions, etc.

Cross-validation combined with grid search makes parameter selection systematic and improves the credibility of the results.
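A grid search with cross-validation could look like the sketch below, shown here for Random Forest with scikit-learn's `GridSearchCV`. The synthetic data, the grid values, and the 5-fold setting are assumptions; the article does not list the grids or fold count actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the thyroid dataset.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Hypothetical grid over the number of trees and maximum depth.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Every combination is scored by 5-fold cross-validation on accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

To keep the final test accuracy unbiased, the search would normally run on the training split only, with the held-out 20% used once at the end.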

Section 07

Clinical Application Prospects and Challenges

Application Value: A model with 97.4% accuracy could help doctors quickly assess recurrence risk and develop personalized plans; the feature importance scores from RF improve model interpretability and can increase doctors' trust.

Challenges: Data formats need standardization; the model's generalization ability needs cross-population validation; and deployment must comply with ethical and regulatory requirements for privacy protection and algorithmic fairness.

Section 08

Future Research Directions and Conclusion

Future Research Directions

  • Expand data scale: Multi-center prospective cohort data to improve robustness
  • Personalized treatment: Identify high-risk groups to develop proactive management plans
  • Explainable AI: Enhance model transparency
  • Multimodal fusion: Integrate text and imaging data
  • Longitudinal follow-up: Time-series analysis to dynamically update risk

Conclusion

This study demonstrates the potential of machine learning for thyroid cancer recurrence prediction. RF and XGBoost achieve an accuracy of 97.4%, providing a technical path for clinical decision support. The project code is open source, making it easier for others to get started with medical AI and to extend the work.