Hands-On Machine Learning-Based Network Intrusion Detection System: Complete Analysis of the NSL-KDD Dataset

This article details how to build an intrusion detection system using the NSL-KDD cybersecurity dataset, covering the full workflow of data preprocessing, feature engineering, model training, and evaluation, while comparing the performance of three algorithms: Random Forest, Decision Tree, and Logistic Regression.

Tags: Machine Learning, Intrusion Detection, Cybersecurity, NSL-KDD, Random Forest, Decision Tree, Logistic Regression, Python, Scikit-Learn

Published 2026-05-30 00:45 | Recent activity 2026-05-30 00:51 | Estimated read 7 min

Section 01

Hands-On Guide to NSL-KDD Intrusion Detection System Using Machine Learning

This article walks through building an intrusion detection system on the NSL-KDD dataset, from data preprocessing and feature engineering to model training and evaluation, and compares three algorithms: Random Forest, Decision Tree, and Logistic Regression. The project is maintained by Love Solanki (B.Tech CSE, Amity University Uttar Pradesh), sourced from the GitHub repository NSL-KDD-Intrusion-Detection, and published on May 29, 2026. Its core goal is to use machine learning where traditional rule-based intrusion detection struggles, namely against evolving attacks, and to provide interpretable cybersecurity insights.

Section 02

Project Background and NSL-KDD Dataset Analysis

Project Background

Cybersecurity threats are growing more severe. Intrusion Detection Systems (IDS) are an important part of network defense, but traditional rule-based methods only match known signatures and struggle to keep up with evolving attack methods. Machine learning offers an alternative: learning the patterns that separate normal from malicious traffic directly from labeled connection data.

NSL-KDD Dataset

NSL-KDD is a refined version of the KDD Cup 1999 dataset, which was itself derived from the 1998 DARPA intrusion detection evaluation data. It was released by researchers at the University of New Brunswick in Canada, who removed redundant duplicate records so that models are not biased toward frequently repeated connections. The project uses KDDTrain+_20Percent.txt (training subset) and KDDTest+.txt (test set). Each record represents one network connection and carries a label naming either normal traffic or a specific attack type; this project maps those labels to normal (0) or attack (1). Features include numerical values (e.g., src_bytes, connection duration) and categorical values (e.g., protocol_type, service), which expand to 118 feature dimensions after one-hot encoding.
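
As a minimal sketch of loading such a file, the snippet below assumes the commonly distributed layout: comma-separated, no header row, 43 fields per line (41 features, then the label, then a difficulty score). The helper name load_nsl_kdd, the column names assigned by position, and the two sample rows are illustrative assumptions, not taken from the repository.

```python
import io

import pandas as pd

# Two made-up rows in the assumed 43-field layout (41 features, label, difficulty).
SAMPLE = (
    "0,tcp,ftp_data,SF,491,0," + ",".join(["0"] * 35) + ",normal,20\n"
    "0,tcp,private,S0,0,0," + ",".join(["0"] * 35) + ",neptune,19\n"
)


def load_nsl_kdd(source):
    """Read a headerless NSL-KDD style file and add a binary target column."""
    df = pd.read_csv(source, header=None)
    # Name only the columns this sketch needs; positions are an assumption.
    df = df.rename(columns={1: "protocol_type", 2: "service", 3: "flag",
                            41: "label", 42: "difficulty"})
    # normal -> 0, any named attack -> 1
    df["target"] = (df["label"] != "normal").astype(int)
    return df


df = load_nsl_kdd(io.StringIO(SAMPLE))
print(df[["protocol_type", "service", "flag", "label", "target"]])
```

With the real files, the same helper would be called with a path such as data/KDDTrain+_20Percent.txt instead of the in-memory sample (the data/ location is an assumption based on the directory layout described later).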

Section 03

Data Preprocessing and Model Training Strategy

Data Preprocessing

Label conversion: Unify multi-class attack labels into binary classification (normal/attack);
Feature encoding: Apply one-hot encoding to categorical features (protocol_type, service, flag);
Numerical standardization: Use StandardScaler to put numerical features on a common scale (zero mean, unit variance);
Feature selection: Remove columns that are not needed as model inputs to reduce dimensionality.
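
The four steps above can be sketched with scikit-learn's ColumnTransformer. This is an illustration on a tiny made-up frame, not the project's actual script: the sample values, the choice to drop a difficulty column, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame that mimics a few NSL-KDD columns.
raw = pd.DataFrame({
    "duration": [0, 2, 0, 5],
    "src_bytes": [491, 0, 232, 0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "http", "ecr_i"],
    "flag": ["SF", "SF", "S0", "SF"],
    "label": ["normal", "normal", "neptune", "smurf"],
    "difficulty": [20, 21, 19, 18],
})

# Label conversion: every attack type collapses to 1, normal stays 0.
y = (raw["label"] != "normal").astype(int)

# Feature selection: drop columns that are not model inputs (assumed here).
X = raw.drop(columns=["label", "difficulty"])

categorical = ["protocol_type", "service", "flag"]
numeric = [c for c in X.columns if c not in categorical]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),                              # standardize numbers
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),    # one-hot encode
    ],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

In practice the transformer would be fitted on the training set only and then applied to the test set with transform, so that scaling statistics and category lists do not leak from test data. handle_unknown="ignore" keeps the transform from failing when the test set contains a service value never seen in training.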

Model Selection

Logistic Regression: Baseline model with efficient computation and strong interpretability;
Decision Tree: Captures non-linear relationships and feature interactions, easy to understand via visualization;
Random Forest: Integrates multiple decision trees to reduce overfitting and improve generalization ability.
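
A comparison of the three models might look like the sketch below. It runs on synthetic data from make_classification rather than NSL-KDD, and the hyperparameters (max_iter, n_estimators, random_state) are placeholder assumptions, so the printed scores say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix and binary labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))
    print(f"{name}: F1 = {scores[name]:.3f}")
```

Keeping the models in a dictionary makes it easy to train all of them on an identical split, which is what makes the comparison fair.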

Section 04

Model Evaluation Results and Key Feature Analysis

Model Evaluation

Models are evaluated with accuracy, precision, recall, F1-score, ROC-AUC, and the confusion matrix. Random Forest performance on the test set: Accuracy 78.71%, Precision 96.86%, Recall 64.69%, F1-score 77.57%. The high precision means few false alarms, which suits false-positive-sensitive scenarios, but the recall of 64.69% means that roughly 35% of attack connections in the test set go undetected.
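
To make the relationship between these metrics concrete, the sketch below recomputes the F1-score from the reported precision and recall and shows how all four metrics follow from confusion-matrix counts. The counts passed to metrics_from_counts are invented for illustration; the article does not publish the actual confusion matrix.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Derive the headline metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of raised alerts that are real attacks
    recall = tp / (tp + fn)      # share of real attacks that raise an alert
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1_from(precision, recall)}


# Check the reported figures against each other.
reported_f1 = f1_from(0.9686, 0.6469)
print(f"{reported_f1:.4f}")  # 0.7757

# Made-up counts with a similar profile: few false alarms, many missed attacks.
example = metrics_from_counts(tp=60, fp=2, fn=40, tn=98)
print(example)
```

The recomputed value agrees with the reported F1-score of 77.57%, so the published precision, recall, and F1 are mutually consistent.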

Feature Importance

Key features identified by Random Forest include: src_bytes (source bytes), dst_bytes (destination bytes), dst_host_srv_count (destination host service count), diff_srv_rate (different service rate), flag_SF (normal close flag), covering dimensions like traffic volume and connection patterns.
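
A ranking like this is typically read from the fitted forest's feature_importances_ attribute. The sketch below shows the mechanics only: it attaches NSL-KDD style names to synthetic data, so the order it prints is meaningless and does not reproduce the project's findings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from the article; the data behind them is synthetic.
feature_names = ["src_bytes", "dst_bytes", "dst_host_srv_count",
                 "diff_srv_rate", "flag_SF", "duration"]

X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can overstate features with many distinct values, so sklearn.inspection.permutation_importance on held-out data is a common cross-check.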

Section 05

Project Architecture and Tech Stack Deployment

Project Architecture

data/: Raw and processed data;
models/: Trained model files;
notebooks/: Interactive analysis notebooks;
src/: Preprocessing, training, and evaluation scripts;
visuals/: Visualization outputs;
results/: Experimental results storage.

Tech Stack

Python, Pandas, NumPy, Scikit-Learn, Matplotlib/Seaborn, Joblib, Jupyter Notebook.
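
Joblib in this stack is most likely used to persist trained models into models/. A minimal round-trip sketch follows; the file name random_forest.joblib, the temporary directory, and the synthetic data are assumptions made so the example is self-contained.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Save and reload; a temporary directory stands in for models/.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "random_forest.joblib"
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

If preprocessing is wrapped together with the classifier in a single scikit-learn Pipeline, one dump call stores both, which avoids a mismatch between the saved model and the scaler or encoder at prediction time.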

Deployment Steps

Clone the repository and enter the directory;
Create and activate a virtual environment;
Install dependencies: pip install -r requirements.txt;
Run the main program: python main.py.

Section 06

Project Summary and Future Improvement Directions

Summary

This project demonstrates the application of machine learning in cybersecurity, building an IDS prototype with high precision (96.86%) and interpretable feature-level insights. It is suitable for developers new to machine learning for cybersecurity. Because recall on the test set is 64.69%, it is better treated as a prototype foundation for production systems than as a production-ready detector.

Future Improvements

Hyperparameter optimization (grid search/Bayesian optimization);
Introduce XGBoost ensemble algorithm;
Try deep learning models like LSTM and CNN;
Build a real-time detection system;
Develop a visualization dashboard to display results.