
Unsupervised Machine Learning for Network Attack Detection: Practical Analysis of PCA, K-Means, Isolation Forest, and LOF

Based on the CICIDS2017 dataset, this study applies unsupervised learning techniques, including PCA dimensionality reduction, K-Means clustering, Isolation Forest, and Local Outlier Factor (LOF), to detect network attacks, and compares the effectiveness of classical statistical methods with that of machine learning methods.

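Network security, anomaly detection, unsupervised learning, Isolation Forest, LOF, PCA dimensionality reduction, intrusion detection, machine learning, CICIDS2017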

Published 2026-06-15 12:15 | Recent activity 2026-06-15 12:18 | Estimated read 9 min

Section 01

Introduction to Practical Analysis of Unsupervised Machine Learning for Network Attack Detection

This project uses the CICIDS2017 dataset to explore unsupervised learning techniques for detecting network attacks, including PCA dimensionality reduction, K-Means clustering, Isolation Forest, and Local Outlier Factor (LOF), and compares classical statistical methods (e.g., Z-Score) with machine learning methods. The core goal is to verify that unsupervised learning can identify malicious network behavior without labeled attack samples, providing a technical reference for network security defense.

Section 02

Project Background and Data Challenges

Project Background

Network security threats are becoming increasingly complex, and traditional signature-based intrusion detection systems struggle to handle new types of attacks. This project aims to use unsupervised learning techniques to automatically detect abnormal behaviors without pre-labeled attack samples. The CICIDS2017 real network traffic dataset (containing multiple attack types) is selected as the experimental basis.

Data Challenges

Heavy-tailed Distribution: Network traffic features are heavy-tailed and the data is extremely imbalanced, with a small number of anomalies mixed into a large number of normal samples, which challenges traditional statistical methods;
Multicollinearity: The 51 features are highly correlated, leading to model redundancy and inefficient computation;
Curse of Dimensionality: Distance metrics lose discriminative power in high-dimensional spaces, degrading the performance of clustering and anomaly detection algorithms.
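
The curse-of-dimensionality point can be illustrated with a small sketch. It uses uniformly random synthetic points, not CICIDS2017 data, and the helper name relative_contrast is hypothetical. It measures how much farther the farthest point is than the nearest one from a query point, in 2 dimensions and in 51.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))  # synthetic stand-in data
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(n_dims=2)
contrast_high = relative_contrast(n_dims=51)
print(f"relative contrast in 2 dims:  {contrast_low:.2f}")
print(f"relative contrast in 51 dims: {contrast_high:.2f}")
```

A smaller contrast means the nearest and farthest neighbors look almost equally far away, which is what weakens distance-based clustering and density estimates.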

Section 03

Core Methods: Dimensionality Reduction, Clustering, and Anomaly Detection

Dimensionality Reduction Strategy

PCA: Compresses the 51 features into 15 principal components that preserve over 90% of the variance, alleviating the curse of dimensionality and improving efficiency;
t-SNE: Assists in visualizing the local structure of high-dimensional data and reveals distribution differences between normal and attack traffic.
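
A minimal sketch of the PCA step with scikit-learn follows. The data is a synthetic stand-in for the 51 correlated features (8 hidden factors mixed into 51 columns), and the standardization step is an assumption, since the source does not state how features were scaled.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 51 correlated flow features (NOT CICIDS2017 data):
# 8 hidden factors are mixed into 51 observed columns, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 51))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 51))

# PCA is scale-sensitive, so each feature is standardized first (assumption).
X_scaled = StandardScaler().fit_transform(X)

# Keep 15 principal components, as described in the text.
pca = PCA(n_components=15)
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, f"variance retained: {retained:.3f}")
```

If the target is a variance share rather than a fixed count, PCA(n_components=0.90, svd_solver="full") lets scikit-learn pick the number of components that reaches that share.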

Clustering Analysis

Three algorithms are compared:

K-Means: Identifies large-scale dense clusters of normal traffic, but has limited sensitivity to irregular attack clusters;
DBSCAN: Automatically identifies noise points, suitable for scattered abnormal traffic, but parameter tuning has a significant impact;
Hierarchical Clustering: Provides a tree-like clustering structure, helping to understand the similarity of attack types.
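
The contrast between the three algorithms can be sketched on toy data. The points below are synthetic (two dense blobs plus scattered points), and the parameter values such as eps=0.5 are illustrative assumptions, not the project's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data (NOT CICIDS2017): two dense blobs plus scattered points.
rng = np.random.default_rng(0)
dense, _ = make_blobs(n_samples=600, centers=[[-4, 0], [4, 0]],
                      cluster_std=0.5, random_state=0)
scattered = rng.uniform(low=-10, high=10, size=(30, 2))
X = np.vstack([dense, scattered])

# K-Means assigns every point, including the scattered ones, to a cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN labels points in low-density regions as noise (-1).
# eps and min_samples are illustrative and need tuning on real data.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Agglomerative (hierarchical) clustering, with the tree cut at two clusters.
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("K-Means clusters:", np.unique(kmeans_labels))
print("DBSCAN noise points:", int(np.sum(dbscan_labels == -1)))
print("Hierarchical clusters:", np.unique(agglo_labels))
```

Only DBSCAN has a noise label, which is why it suits scattered abnormal traffic, while K-Means and hierarchical clustering place every point in some cluster.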

Anomaly Detection Methods

Z-Score: A classical statistical method that assumes a normal distribution and fails due to the heavy-tailed nature of network traffic;
Isolation Forest: An efficient tree-based algorithm that excels at detecting global anomalies (extreme attacks at the edge of the distribution);
LOF: Identifies local anomalies (stealthy attacks lurking near normal traffic) through local density differences.
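
The global versus local distinction can be sketched with scikit-learn on synthetic data. The two planted points and all parameter values below are assumptions chosen for illustration, not CICIDS2017 samples or the project's configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-D data (NOT CICIDS2017): one tight cluster and one loose cluster.
rng = np.random.default_rng(0)
tight = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(200, 2))
loose = rng.normal(loc=[6.0, 6.0], scale=1.0, size=(200, 2))
global_outlier = np.array([[20.0, -20.0]])  # far from everything
local_outlier = np.array([[0.8, 0.8]])      # near the tight cluster, but outside it
X = np.vstack([tight, loose, global_outlier, local_outlier])
global_idx, local_idx = 400, 401

# Both detectors return -1 for anomalies and 1 for inliers.
iso_flags = IsolationForest(n_estimators=100, random_state=0).fit_predict(X)
lof_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

print("Isolation Forest (global, local):", iso_flags[global_idx], iso_flags[local_idx])
print("LOF              (global, local):", lof_flags[global_idx], lof_flags[local_idx])
```

The point at (0.8, 0.8) is unremarkable in absolute terms but sits in a region far sparser than its neighbors' surroundings, which is the situation LOF is designed for. Whether Isolation Forest also flags it depends on the data and the random seed.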

Section 04

Experimental Results Comparison: Classical Statistics vs. Machine Learning

Limitations of Z-Score

Experiments show that Z-Score fails catastrophically on network traffic data, with reasons including:

Assuming a normal distribution, which conflicts with the heavy-tailed nature of network traffic;
Lack of local context awareness, unable to distinguish between global and local anomalies.
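
One way the distribution mismatch can show up is sketched below. The samples are synthetic and contain no attacks, and the |z| > 3 cutoff is a common convention assumed here rather than the project's stated threshold.

```python
import numpy as np

def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute Z-Score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)                              # matches the assumption
heavy_tailed = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # violates it

gaussian_rate = zscore_flags(gaussian).mean()
heavy_rate = zscore_flags(heavy_tailed).mean()
print(f"flagged share, Gaussian:     {gaussian_rate:.4f}")
print(f"flagged share, heavy-tailed: {heavy_rate:.4f}")
```

Every sample here is benign by construction, so each flag on the heavy-tailed data is a false alarm. Under a Gaussian the rule flags roughly 0.3% of samples, and the heavy-tailed sample produces several times that share.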

Advantages of Machine Learning Methods

Isolation Forest: Effectively identifies global anomalies (extreme attacks at the edge of the distribution);
LOF: Successfully captures local anomalies (attacks lurking near normal traffic).

The two are complementary, and their combined use can cover more attack scenarios.
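
The source does not say how the two detectors are combined. One simple option, assumed here, is to take the union of their flags. The helper name combined_flags and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def combined_flags(X, n_neighbors=20, seed=0):
    """Return boolean masks: flagged by Isolation Forest, by LOF, and by either."""
    iso = IsolationForest(random_state=seed).fit_predict(X) == -1
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X) == -1
    return iso, lof, iso | lof

# Synthetic stand-in data (NOT CICIDS2017): a Gaussian bulk plus a few extremes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 5)), rng.normal(scale=8.0, size=(10, 5))])

iso_mask, lof_mask, either_mask = combined_flags(X)
print("Isolation Forest:", int(iso_mask.sum()))
print("LOF:", int(lof_mask.sum()))
print("either:", int(either_mask.sum()))
```

A union can only add alerts, so it trades a higher alert volume for broader coverage. Requiring both detectors to agree (iso & lof) would be the stricter alternative.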

Key Role of Dimensionality Reduction

PCA dimensionality reduction not only improves computational efficiency but also makes distance metrics meaningful again, which the distance- and density-based algorithms that follow (K-Means, DBSCAN, LOF) depend on.

Section 05

Key Findings and Insights

Algorithm Complementarity: Isolation Forest and LOF complement rather than compete with each other; the former excels at global anomalies, while the latter excels at local anomalies, and their combined use provides more comprehensive protection;
Curse of Dimensionality Mitigation: PCA dimensionality reduction is a key preprocessing step for handling high-dimensional network data;
Limitations of Statistical Methods: Classical statistical methods (e.g., Z-Score) perform poorly on complex network data due to their assumption of simple data distributions, while machine learning methods are more adaptable to nonlinear structures.

Section 06

Practical Significance and Application Recommendations

Practical Value

Unsupervised learning does not require expensive labeling of attack samples, reducing deployment barriers; the algorithm combination strategy (dimensionality reduction + clustering + anomaly detection) can serve as a prototype architecture for intrusion detection systems.
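
A prototype of such an architecture might be wired as a scikit-learn pipeline. The sketch below is an assumption about how the pieces could fit together, not the project's code: it uses synthetic data, a hypothetical make_flows generator, and leaves out the clustering stage to stay short.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 51 correlated flow features (NOT CICIDS2017 data).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 51))

def make_flows(n_rows, factor_scale=1.0):
    """Hypothetical generator: 8 hidden factors mixed into 51 columns."""
    latent = factor_scale * rng.normal(size=(n_rows, 8))
    return latent @ mixing + 0.1 * rng.normal(size=(n_rows, 51))

X_train = make_flows(2000)
# New batch: 200 ordinary rows followed by one row with inflated factors.
X_new = np.vstack([make_flows(200), make_flows(1, factor_scale=10.0)])

# Scale, reduce to 15 components, then score with Isolation Forest.
detector = make_pipeline(
    StandardScaler(),
    PCA(n_components=15),
    IsolationForest(random_state=0),
)
detector.fit(X_train)  # no labels are needed

flags = detector.predict(X_new)             # -1 = anomaly, 1 = normal
scores = detector.decision_function(X_new)  # lower = more anomalous
print("flagged rows:", int(np.sum(flags == -1)))
print("score of the inflated row:", round(float(scores[-1]), 3))
```

Fitting on one batch and scoring another mirrors how a deployed detector would see traffic, and the pipeline guarantees that new rows pass through the same scaling and projection as the training data.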

Reproduction Recommendations

The cleaned CICIDS2017 data is approximately 700MB, which exceeds GitHub's file size limit. To run the code, download the preprocessed data from Kaggle and place it in the project root directory.

Section 07

Project Summary

This project verifies the effectiveness of unsupervised machine learning in network anomaly detection through systematic comparative experiments. Core conclusions:

There is no single optimal algorithm; Isolation Forest and LOF each have their strengths;
PCA dimensionality reduction is key to handling high-dimensional network data;
Classical statistical methods have obvious limitations in complex scenarios.

These insights lay the foundation for building an intelligent adaptive network security defense system.