# Machine Learning for Predicting Thyroid Cancer Recurrence Risk: RF and XGBoost Achieve 97.4% Accuracy

> A study combining Random Forest, XGBoost, KNN, and Deep Neural Networks leverages the UCI clinicopathological dataset to develop a high-precision model for thyroid cancer recurrence prediction, providing a new tool for early clinical decision-making.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-20T08:13:17.000Z
- Last activity: 2026-05-20T08:18:07.840Z
- Popularity: 159.9
- Keywords: thyroid cancer, machine learning, deep learning, Random Forest, XGBoost, medical AI, recurrence prediction, clinical decision support
- Page link: https://www.zingnex.cn/en/forum/thread/rfxgboost97-4
- Canonical: https://www.zingnex.cn/forum/thread/rfxgboost97-4

---

## Machine Learning for Predicting Thyroid Cancer Recurrence: RF and XGBoost Achieve 97.4% Accuracy (Introduction)

The study compares Random Forest (RF), XGBoost, K-Nearest Neighbors (KNN), and a Deep Neural Network (DNN) for predicting thyroid cancer recurrence on the UCI clinicopathological dataset. RF and XGBoost both reach a test accuracy of 97.4%, offering a potential tool for early clinical decision-making.

## Research Background and Clinical Significance

Thyroid cancer is one of the most common endocrine system malignancies globally, with differentiated thyroid cancer (DTC) accounting for the majority of cases. Although DTC has a good prognosis, recurrence risk is a key clinical concern. Traditional recurrence assessment relies on manual analysis of clinicopathological features, which is time-consuming and prone to subjective biases.

In recent years, machine learning has been widely applied in the medical field, enabling the identification of complex patterns for precise prediction. This open-source study proposes a systematic solution for DTC recurrence prediction.

## Data Source and Feature Engineering

The study uses the clinicopathological dataset from the UCI Machine Learning Repository, which includes features such as age, gender, tumor size, pathological type, and lymph node metastasis.

Preprocessing consists of visual analysis to identify outliers and missing values, standardization of numerical features, and encoding of categorical variables. The dataset is then split into training and test sets in an 8:2 ratio, with a fixed random seed for reproducibility.
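
The steps above can be sketched with scikit-learn. This is a minimal illustration, not the study's code: the toy columns (`age`, `pathology`), their values, the seed `42`, and the stratified split are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the clinical table; columns and values are made up.
rng = np.random.default_rng(42)
n = 50
age = rng.integers(20, 80, size=(n, 1)).astype(float)  # numeric feature
pathology = np.where(np.arange(n) % 3 == 0, "Follicular", "Papillary").reshape(-1, 1)
recurred = np.array([0, 1] * (n // 2))  # binary label

# 8:2 split with a fixed seed; stratify (an assumption) keeps the class ratio.
age_tr, age_te, path_tr, path_te, y_train, y_test = train_test_split(
    age, pathology, recurred, test_size=0.2, random_state=42, stratify=recurred
)

# Fit the transformers on the training split only, so no test data leaks in.
scaler = StandardScaler().fit(age_tr)
encoder = OneHotEncoder(handle_unknown="ignore").fit(path_tr)


def transform(age_part, path_part):
    """Standardize the numeric column and one-hot encode the categorical one."""
    return np.hstack([
        scaler.transform(age_part),
        encoder.transform(path_part).toarray(),
    ])


X_train = transform(age_tr, path_tr)
X_test = transform(age_te, path_te)
print(X_train.shape, X_test.shape)
```

Fitting the scaler and encoder on the training split and only applying them to the test split keeps the held-out accuracy honest.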

## Model Architecture and Algorithm Selection

Four algorithms are compared:

### Random Forest (RF)
An ensemble learning algorithm that builds many decision trees on bootstrap samples of the data and combines their votes. It also provides feature importance scores.
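
A minimal sketch of reading feature importances from a fitted forest, using scikit-learn's `RandomForestClassifier`. The data is synthetic and the feature names are hypothetical; the label is built to depend on the first column only, so that column should rank first.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column carries signal (by construction).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)
feature_names = ["signal", "noise_1", "noise_2", "noise_3"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```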

### XGBoost
A gradient boosting algorithm that trains weak learners sequentially, each one correcting the errors of the current ensemble, and sums their weighted outputs to capture non-linear relationships.
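
The boosting idea can be illustrated from scratch. The sketch below is plain gradient boosting with one-split trees (stumps) on a single feature, not XGBoost itself, which adds regularization and second-order gradient information. The data, `rounds`, and `lr` are arbitrary choices for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Least-squares fit of a one-split tree (stump) to the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def boost(xs, ys, rounds=50, lr=0.5):
    """Each round fits a stump to the negative gradient of the log loss."""
    stumps, scores = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lr * lv, lr * rv))  # store learning-rate-scaled leaves
        scores = [s + (lr * lv if x <= t else lr * rv) for x, s in zip(xs, scores)]
    return stumps


def predict_proba(stumps, x):
    """Sum the stump outputs and squash the total score to a probability."""
    return sigmoid(sum(lv if x <= t else rv for t, lv, rv in stumps))


def log_loss(stumps, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(stumps, x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(xs)


# A non-monotone pattern that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 1, 0, 0, 1, 1]
model = boost(xs, ys)
print("loss before boosting:", round(log_loss([], xs, ys), 3))
print("loss after boosting: ", round(log_loss(model, xs, ys), 3))
```

Each stump alone is a weak learner, but their sum is a piecewise-constant function that can follow the alternating pattern.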

### K-Nearest Neighbors (KNN)
An instance-based learning algorithm that classifies a sample by the labels of its nearest training samples under a chosen distance metric, suitable for small-scale datasets.
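
A minimal sketch of the KNN decision rule in plain Python, assuming Euclidean distance and majority voting. The six training points and `k=3` are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Two well-separated toy clusters (hypothetical data).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.5, 0.5)))  # nearest three are all class 0
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # nearest three are all class 1
```

Because the rule depends directly on distances, features on different scales should be standardized before KNN is applied.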

### Deep Neural Network (DNN)
The network has two hidden layers of 64 and 32 neurons (both ReLU) followed by a Sigmoid output layer: 64 → 32 → output. It is trained with binary cross-entropy loss and the Adam optimizer for 50 epochs with a batch size of 10.
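
The architecture can be sketched as a NumPy forward pass plus the binary cross-entropy loss. This is an illustration of the layer shapes only, with no training loop or Adam update. The input width `n_features = 16`, the weight initialization, and the random batch are assumptions, since the post does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 16  # assumption: the post does not give the encoded input width


def init_layer(n_in, n_out):
    """He-style random weights and zero biases (an arbitrary choice here)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)


W1, b1 = init_layer(n_features, 64)  # hidden layer 1: 64 units, ReLU
W2, b2 = init_layer(64, 32)          # hidden layer 2: 32 units, ReLU
W3, b3 = init_layer(32, 1)           # output layer: 1 unit, Sigmoid


def forward(X):
    h1 = np.maximum(0.0, X @ W1 + b1)
    h2 = np.maximum(0.0, h1 @ W2 + b2)
    logits = h2 @ W3 + b3
    return 1.0 / (1.0 + np.exp(-logits))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


# One mini-batch of 10 samples, matching the batch size in the text.
X_batch = rng.normal(size=(10, n_features))
y_batch = rng.integers(0, 2, size=(10, 1))

probs = forward(X_batch)
loss = binary_cross_entropy(y_batch, probs)
n_params = sum(w.size + b.size for w, b in [(W1, b1), (W2, b2), (W3, b3)])
print(probs.shape, n_params)
```

With 16 inputs the network has 3,201 trainable parameters, which is small enough that a dataset of a few hundred records can still overfit it without care.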

## Experimental Results and Performance Comparison

| Model | Test Accuracy | Core Advantages |
|------|-----------|---------|
| Random Forest | 97.40% | High precision, suitable for reducing false positives |
| XGBoost | 97.40% | High recall, suitable for reducing missed diagnoses |
| Deep Neural Network | 94.81% | Excels at capturing complex feature interactions |
| KNN | 93.51% | Simple and efficient, suitable for small datasets |

RF and XGBoost tie for the highest accuracy. RF's high precision reduces unnecessary examinations, while XGBoost's high recall reduces missed diagnoses; no per-model precision or recall figures are given alongside the accuracy values. The DNN shows potential for modeling complex feature interactions, and KNN trails the other models, which is attributed to noise in high-dimensional data.
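
Two models can share the same accuracy and still differ in precision and recall. The sketch below shows this with two hypothetical confusion matrices. The 77-sample test set is an inference from the reported percentages (75/77 ≈ 97.40%, 73/77 ≈ 94.81%, 72/77 ≈ 93.51%), not a figure stated in the post, and the individual counts are invented for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # of predicted recurrences, how many were real
        "recall": tp / (tp + fn),     # of real recurrences, how many were caught
    }


# Hypothetical counts: both models make 2 errors on 77 test samples.
model_a = classification_metrics(tp=20, fp=0, fn=2, tn=55)  # misses 2 recurrences
model_b = classification_metrics(tp=22, fp=2, fn=0, tn=53)  # raises 2 false alarms

for name, m in [("model_a", model_a), ("model_b", model_b)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Under this assumed test size, each misclassified sample shifts accuracy by about 1.3 percentage points, so the gaps between the four models correspond to only a few test cases.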

## Hyperparameter Tuning and Model Validation

Each algorithm's hyperparameters are tuned: for RF, settings such as the number of trees and maximum depth; for XGBoost, settings such as the learning rate and regularization coefficients; for KNN, the number of neighbors and the distance metric; and for the DNN, settings such as the network structure and activation functions.

Cross-validation combined with grid search makes parameter selection systematic and improves the credibility of the results.
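
A minimal sketch of grid search with cross-validation, using scikit-learn's `GridSearchCV` on the KNN settings named above. The synthetic data, the grid values, and `cv=5` are assumptions chosen for the example, not the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the clinical table.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own
# training part only.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7],
    "kneighborsclassifier__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

The test set is touched only once, after the search has picked its parameters, so the final score is not inflated by the tuning itself.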

## Clinical Application Prospects and Challenges

**Application Value**: A model with 97.4% test accuracy could help doctors assess recurrence risk quickly and develop personalized plans; RF's feature importance scores improve interpretability, which can strengthen doctors' trust in the model.

**Challenges**: Clinical data formats are not standardized; the model's generalization ability needs validation across populations; and deployment must comply with ethical and regulatory requirements for privacy protection and algorithm fairness.

## Future Research Directions

- Expand data scale: Multi-center prospective cohort data to improve robustness
- Personalized treatment: Identify high-risk groups and develop proactive treatment plans for them
- Explainable AI: Enhance model transparency
- Multimodal fusion: Integrate text and imaging data
- Longitudinal follow-up: Time-series analysis to dynamically update risk

## Conclusion

This study demonstrates the potential of machine learning for thyroid cancer recurrence prediction. RF and XGBoost reach 97.4% test accuracy, offering a technical path toward clinical decision support. The project code is open source, which makes it easier for others to get started with and extend medical AI work.