Machine Learning for Predicting Thyroid Cancer Recurrence Risk: RF and XGBoost Achieve 97.4% Accuracy

A study combining Random Forest, XGBoost, KNN, and Deep Neural Networks leverages the UCI clinicopathological dataset to develop a high-precision model for thyroid cancer recurrence prediction, providing a new tool for early clinical decision-making.

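Tags: thyroid cancer, machine learning, deep learning, random forest, XGBoost, medical AI, recurrence prediction, clinical decision support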

Published 2026-05-20 16:13 | Recent activity 2026-05-20 16:18 | Estimated read 8 min

Section 01

Machine Learning for Predicting Thyroid Cancer Recurrence: RF and XGBoost Achieve 97.4% Accuracy (Introduction)

A study compares Random Forest (RF), XGBoost, K-Nearest Neighbors (KNN), and a Deep Neural Network (DNN) on the UCI clinicopathological dataset to predict thyroid cancer recurrence. RF and XGBoost both reach a test accuracy of 97.4%, which the study presents as a new tool for early clinical decision-making.

Section 02

Research Background and Clinical Significance

Thyroid cancer is one of the most common endocrine system malignancies globally, with differentiated thyroid cancer (DTC) accounting for the majority of cases. Although DTC has a good prognosis, recurrence risk is a key clinical concern. Traditional recurrence assessment relies on manual analysis of clinicopathological features, which is time-consuming and prone to subjective biases.

In recent years, machine learning has been widely applied in the medical field, enabling the identification of complex patterns for precise prediction. This open-source study proposes a systematic solution for DTC recurrence prediction.

Section 03

Data Source and Feature Engineering

The study uses the clinicopathological dataset from the UCI Machine Learning Repository, which includes features such as age, gender, tumor size, pathological type, and lymph node metastasis.

Preprocessing consists of three steps: visual analysis to identify outliers and missing values, standardization of numerical features, and encoding of categorical variables. The dataset is then split into training and test sets in an 80:20 ratio, with a fixed random seed for reproducibility.
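The preprocessing and split described above could be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the study's code: the rows are synthetic, the column names (age, gender, pathology, recurred) and the seed value 42 are hypothetical, and the stratified split is an added assumption that the source does not mention.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in rows; column names and values are hypothetical.
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "pathology": rng.choice(["Papillary", "Follicular"], size=n),
    "recurred": rng.choice([0, 1], size=n, p=[0.7, 0.3]),
})

X = df.drop(columns="recurred")
y = df["recurred"]

# 80/20 split with a fixed seed. stratify=y is an assumption added here;
# it keeps the class ratio similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "pathology"]),
])

# Fit the scaler and encoder on training data only, then reuse them on the test set.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```

Fitting the transformers on the training portion only keeps test-set statistics from leaking into the model.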

Section 04

Model Architecture and Algorithm Selection

Four algorithms are compared:

Random Forest (RF)

An ensemble learning algorithm that builds multiple decision trees and combines their results, providing feature importance evaluation.
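As a rough sketch of how a Random Forest exposes feature importance, the example below trains scikit-learn's RandomForestClassifier on synthetic data. The feature names and the choice of 100 trees are illustrative assumptions, not settings reported by the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features and 3 pure-noise features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```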

XGBoost

A gradient boosting algorithm that iteratively trains weak learners and combines them with weights to capture non-linear relationships.
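The iterative idea can be illustrated with scikit-learn's GradientBoostingClassifier, used here as a stand-in for XGBoost so that the sketch needs no extra dependency. XGBoost adds regularization and other refinements that this example does not show, and the data and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each shallow tree is a weak learner fitted to the errors of the ensemble so far;
# learning_rate scales the contribution of every new tree.
booster = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
booster.fit(X, y)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees.
staged_acc = [accuracy_score(y, pred) for pred in booster.staged_predict(X)]
for stage in (1, 10, 100):
    print(f"after {stage:3d} trees: training accuracy = {staged_acc[stage - 1]:.3f}")
```

Training accuracy is shown only to make the stage-by-stage improvement visible; model quality should be judged on held-out data.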

K-Nearest Neighbors (KNN)

An instance-based learning algorithm that classifies samples by calculating distances, suitable for small-scale datasets.
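The distance-and-vote mechanism is small enough to write out directly. The following pure-Python sketch uses Euclidean distance and k=3 on made-up two-feature points; the function name knn_predict and all values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, made-up points: two features per sample, label 1 = recurrence.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # prints 0
print(knn_predict(train_points, train_labels, (4.0, 4.0)))  # prints 1
```

Because the vote depends on raw distances, features on larger scales dominate unless the numerical features are standardized first.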

Deep Neural Network (DNN)

Contains two hidden layers: 64 neurons (ReLU) → 32 neurons (ReLU) → output layer (Sigmoid). The network is trained with binary cross-entropy loss and the Adam optimizer for 50 epochs with a batch size of 10.
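The source does not name the deep learning framework, so the sketch below approximates the described architecture with scikit-learn's MLPClassifier as a stand-in. For a binary target it uses a logistic (sigmoid) output and minimizes log-loss, which is binary cross-entropy. The data is synthetic, and the 10 input features are an assumption.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

net = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # two hidden layers: 64 -> 32
    activation="relu",
    solver="adam",
    batch_size=10,
    max_iter=50,                  # with solver="adam", this caps the number of epochs
    random_state=0,
)

# 50 epochs may stop before the optimizer's convergence criterion is met;
# the resulting warning is expected here and is silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    net.fit(X, y)

proba = net.predict_proba(X[:5])  # columns: P(class 0), P(class 1)
print([w.shape for w in net.coefs_])
print(proba.round(3))
```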

Section 05

Experimental Results and Performance Comparison

Model	Test Accuracy	Core Advantages
Random Forest	97.40%	High precision, suitable for reducing false positives
XGBoost	97.40%	High recall, suitable for reducing missed diagnoses
Deep Neural Network	94.81%	Excels at capturing complex feature interactions
KNN	93.51%	Simple and efficient, suitable for small datasets

RF and XGBoost tie for the highest test accuracy. According to the study, RF's high precision reduces unnecessary examinations, while XGBoost's high recall reduces missed diagnoses. The DNN is credited with capturing complex feature interactions, and KNN's weaker result is attributed to noise in high-dimensional data.
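Two models can share the same accuracy while differing in precision and recall, which is why the table lists different strengths for RF and XGBoost. The confusion-matrix counts below are hypothetical, not the study's results. They assume a 77-sample test set, chosen only because 75 correct out of 77 reproduces the reported 97.40%; the source does not state the test-set size.

```python
def summarize(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted recurrences that are real
        "recall": tp / (tp + fn),     # share of real recurrences that are caught
    }


# Hypothetical counts: both models get 75 of 77 test cases right.
model_a = summarize(tp=20, fp=0, fn=2, tn=55)  # no false positives, 2 missed cases
model_b = summarize(tp=22, fp=2, fn=0, tn=53)  # no missed cases, 2 false positives

for name, metrics in (("model_a", model_a), ("model_b", model_b)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

In this illustration model_a has perfect precision but lower recall, and model_b the reverse, even though their accuracies are identical.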

Section 06

Hyperparameter Tuning and Model Validation

Each algorithm is tuned separately. For RF, the number of trees and the maximum depth are adjusted, among other settings. For XGBoost, the learning rate and regularization coefficients are optimized. For KNN, different numbers of neighbors and distance metrics are tried. For DNN, the network structure and activation functions are adjusted.

Grid search combined with cross-validation makes parameter selection systematic and improves the credibility of the results.
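A grid search with cross-validation can be sketched with scikit-learn's GridSearchCV. The parameter grid, the 5 folds, and the synthetic data are illustrative assumptions; the source does not list the values that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set only
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:   {search.score(X_test, y_test):.3f}")
```

Keeping the test set out of the search means the final score is measured on data that played no part in choosing the hyperparameters.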

Section 07

Clinical Application Prospects and Challenges

Application Value: The 97.4% accuracy helps doctors quickly assess recurrence risk and develop personalized plans; the feature importance from RF improves model interpretability and enhances doctors' trust.

Challenges: Data formats need standardization; the model's generalization ability needs cross-population validation; and deployment must comply with ethical and regulatory requirements for privacy protection and algorithm fairness.

Section 08

Future Research Directions and Conclusion

Future Research Directions

Expand data scale: Multi-center prospective cohort data to improve robustness
Personalized treatment: Identify high-risk groups to develop active plans
Explainable AI: Enhance model transparency
Multimodal fusion: Integrate text and imaging data
Longitudinal follow-up: Time-series analysis to dynamically update risk

Conclusion

This study demonstrates the potential of machine learning in the medical field. RF and XGBoost achieve a test accuracy of 97.4%, providing a technical path for clinical decision support. The project code has been open-sourced so that others can get started with medical AI and extend the work.
