Zing Forum

Machine Learning Applications in Breast Cancer Diagnosis: A Complete Practice of SVM and PCA Dimensionality Reduction

A breast cancer tumor classification study that reaches 98.25% accuracy and examines the practical use and performance trade-offs of Support Vector Machines (SVM) and Principal Component Analysis (PCA) in medical diagnosis.

machine learning, breast cancer, SVM, PCA, medical diagnosis, classification, dimensionality reduction, scikit-learn, healthcare AI
Published 2026-06-12 09:46 | Recent activity 2026-06-12 09:53 | Estimated read 7 min

Section 01

Introduction: Research on the Application of SVM and PCA in Breast Cancer Diagnosis

This article introduces a machine learning study based on features extracted from digitized images of Fine Needle Aspiration (FNA) biopsies. It classifies breast tumors as benign or malignant using Support Vector Machine (SVM) and K-Nearest Neighbors (K-NN) algorithms, and examines how Principal Component Analysis (PCA) dimensionality reduction affects model performance. The best model, an SVM with an RBF kernel, reached 98.25% accuracy on the original features, which makes it a practical reference case for AI-assisted medical diagnosis.

Section 02

Research Background and Clinical Significance

Breast cancer is one of the most common malignant tumors among women worldwide, and early diagnosis is crucial for improving the cure rate. Traditional pathological diagnosis relies on doctors' experience, while machine learning offers new possibilities for assisted diagnosis. The core question of this study is how to reduce model complexity while maintaining high accuracy. Comparing models trained on the original feature space with models trained on the PCA-reduced feature space shows the practical value of dimensionality reduction in medical machine learning.

Section 03

Dataset Overview

The study uses the Kaggle Diagnostic Breast Cancer Dataset, which contains 569 samples, each with 30 numerical features describing the morphology and texture of cell nuclei (radius, texture, perimeter, area, smoothness, concavity, concave points, symmetry, fractal dimension, etc.). The target variable is binary: benign (about 63%) or malignant (about 37%). The dataset has no missing values.
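
The figures above can be checked with a few lines of Python. This is a minimal sketch that assumes the Kaggle file is the Wisconsin Diagnostic Breast Cancer data, and it uses the copy bundled with scikit-learn (`load_breast_cancer`) as a stand-in for the Kaggle CSV. Note that scikit-learn encodes malignant as 0 and benign as 1, so the labels are flipped here to match the article's benign=0/malignant=1 convention.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the Kaggle CSV.
data = load_breast_cancer()
X = data.data

# scikit-learn uses malignant=0, benign=1; flip so that malignant=1.
y = 1 - data.target

n_samples, n_features = X.shape
n_malignant = int(y.sum())
n_benign = n_samples - n_malignant
has_missing = bool(np.isnan(X).any())

print(f"samples={n_samples}, features={n_features}")
print(f"benign={n_benign} ({n_benign / n_samples:.1%}), "
      f"malignant={n_malignant} ({n_malignant / n_samples:.1%})")
print(f"missing values present: {has_missing}")
```

On the scikit-learn copy this reports 357 benign and 212 malignant samples, in line with the roughly 63%/37% split described above.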

Section 04

Methodology: Data Processing and Model Construction

Data Preprocessing: Remove the irrelevant ID column, encode the labels (benign=0, malignant=1), and select features based on the point-biserial correlation coefficient. StandardScaler is fitted on the training set only and then applied to the test set, which prevents data leakage.

PCA Application: PCA is fitted on the standardized training data and the first 10 principal components are retained (explaining over 95% of the variance), a 66.7% reduction in dimensionality (from 30 features to 10).

Model Selection and Tuning: The study compares SVM (linear and RBF kernels, with parameter tuning) against K-NN (with a search over K). All models are trained and evaluated on an 80/20 stratified split, which keeps the class proportions consistent between the training and test sets.
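
The workflow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the study's notebook: it loads scikit-learn's bundled copy of the data instead of the Kaggle CSV, skips the point-biserial feature selection step, and uses an arbitrary `random_state=42`. The RBF settings (C=10, gamma=0.01) are the ones reported in the results section, while the linear SVM keeps scikit-learn's defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy stands in for the Kaggle CSV.
data = load_breast_cancer()
X, y = data.data, 1 - data.target  # recode so that malignant=1

# 80/20 stratified split; random_state is an arbitrary choice for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit PCA on the standardized training data and keep 10 components.
pca = PCA(n_components=10).fit(X_train_s)
X_train_p = pca.transform(X_train_s)
X_test_p = pca.transform(X_test_s)
explained = pca.explained_variance_ratio_.sum()

# RBF SVM on the original 30 features, linear SVM on the 10 components.
rbf_svm = SVC(kernel="rbf", C=10, gamma=0.01).fit(X_train_s, y_train)
lin_svm = SVC(kernel="linear").fit(X_train_p, y_train)

rbf_acc = rbf_svm.score(X_test_s, y_test)
lin_acc = lin_svm.score(X_test_p, y_test)

print(f"variance explained by 10 components: {explained:.3f}")
print(f"RBF SVM, 30 features: accuracy={rbf_acc:.4f}")
print(f"linear SVM, 10 components: accuracy={lin_acc:.4f}")
```

Because the scaler and PCA never see the test rows during fitting, the test scores are not contaminated by test-set statistics. Exact numbers depend on the split, so they may differ from the figures reported in the article.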

Section 05

Experimental Results and Key Findings

Best Model Performance: On the original 30-dimensional features, the SVM with an RBF kernel (C=10, γ=0.01) achieved 98.25% accuracy, 100% precision, and 95.24% recall, with only 2 false negatives and no false positives.

PCA Dimensionality Reduction Comparison: Reducing the data to 10 dimensions lowered the F1 score by less than 1.2%. Linear SVM performance improved slightly, and K-NN performance remained unchanged.

Key Findings: PCA dimensionality reduction has high practical value; linear models benefit from dimensionality reduction; SVM outperforms K-NN; the RBF kernel works best on the original features, while a linear kernel combined with PCA achieves similar results.
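
The reported metrics are mutually consistent, which a short calculation can confirm. The counts below are inferred rather than taken from the study: they assume a 114-sample test set (20% of 569) containing 42 malignant and 72 benign cases, which is what a stratified split of this dataset yields.

```python
# Confusion-matrix counts inferred from the reported metrics (assumption:
# 114 test samples, 42 malignant and 72 benign).
tp, fn = 40, 2   # malignant cases classified correctly / missed
fp, tn = 0, 72   # benign cases flagged wrongly / classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%}")    # 112 / 114
print(f"precision={precision:.2%}")  # 40 / 40
print(f"recall={recall:.2%}")        # 40 / 42
print(f"f1={f1:.2%}")                # 80 / 82
```

Under these assumptions, 2 missed malignant cases out of 42 give the 95.24% recall, and 112 correct predictions out of 114 give the 98.25% accuracy.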

Section 06

Technical Implementation and Reproducibility

The project uses the Python ecosystem: scikit-learn (core algorithms), pandas/NumPy (data processing), matplotlib/seaborn (visualization), and Jupyter Notebook (interactive development). A complete Notebook file is provided, covering the entire process from data loading to model evaluation, so the experimental results can be reproduced.

Section 07

Practical Insights and Future Directions

Insights: Medical data features are often highly correlated, so PCA can reduce computational cost and improve efficiency; SVM is more robust on small-sample, high-dimensional data; recall matters more than raw accuracy in medical diagnosis, because a missed malignant case (a false negative) is costlier than a false alarm; and data leakage must be strictly prevented.

Future Directions: Explore deep learning models, extend the work to multi-modal data, and add model interpretability analysis (e.g., SHAP values).
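
Two of these insights, prioritizing recall and preventing leakage, can be combined in one tuning setup. The sketch below is one possible arrangement rather than the study's method: wrapping the scaler and the SVM in a `Pipeline` means scaling is refitted inside every cross-validation fold, and `scoring="recall"` selects hyperparameters by recall on the malignant class. The parameter grid, `cv=5`, and `random_state=0` are illustrative assumptions, as is the use of scikit-learn's bundled copy of the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy stands in for the Kaggle CSV.
data = load_breast_cancer()
X, y = data.data, 1 - data.target  # recode so that malignant=1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler lives inside the pipeline, so each CV fold fits it on that
# fold's training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Hypothetical grid, chosen only for illustration.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1],
}

# scoring="recall" measures recall for the positive class (malignant=1).
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print(f"recall on held-out test set: {test_recall:.4f}")
```

Selecting on recall alone can favor models that over-predict malignancy, so in practice it is worth checking precision as well before settling on a configuration.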

Section 08

Conclusion: Practical Value of Medical AI

This study demonstrates the potential of machine learning in medical diagnosis: it achieves high accuracy through leakage-free, stratified experiments and shows the value of dimensionality reduction. PCA combined with a linear SVM offers a solution that balances performance and efficiency. For medical AI learners, this project is a complete and reproducible entry-level case covering the entire machine learning process.