Zing Forum

Machine Learning-Based Phishing Email Detection System: TF-IDF and Naive Bayes Achieve 97.82% Accuracy

This article introduces a phishing email detection system built on TF-IDF text vectorization and a Naive Bayes classifier. It reports a classification accuracy of 97.82% on the test dataset and supports real-time prediction for new emails.

Phishing email detection, Machine learning, Naive Bayes, TF-IDF, Cybersecurity, Text classification, Python, Scikit-Learn
Published 2026-06-09 20:45 | Recent activity 2026-06-09 20:48 | Estimated read 5 min

Section 03

Background and Motivation

Email remains the primary delivery channel for phishing attacks. Phishing emails threaten the privacy and security of individual users and are a major entry point for corporate data breaches. A widely cited industry estimate holds that over 90% of cyberattacks start with a phishing email. Rule-based filters struggle to keep up with evolving phishing techniques, which is why automatic detection with machine learning has become an important research direction in cybersecurity.

Section 04

Project Overview

This project is a machine learning-based phishing email detection system that automatically classifies each email as either "safe" or "phishing". It combines Natural Language Processing (NLP) techniques with a Naive Bayes classifier and flags potentially malicious emails by analyzing features of their text content.

Section 05

Core Features

  • Automatic Email Classification: Label each email as safe or phishing
  • TF-IDF Text Vectorization: Convert text into numerical feature vectors
  • Naive Bayes Machine Learning Model: Efficient probabilistic classification algorithm
  • Accuracy Evaluation: Quantitative metrics for model performance
  • Confusion Matrix Visualization: Intuitively display classification results
  • Real-time Email Prediction: Support instant detection of new emails
  • Model Persistence: Save trained models using Pickle
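
The persistence and real-time prediction features listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy emails, the `predict_email` helper, and the choice to bundle the vectorizer and classifier into one scikit-learn pipeline are all assumptions made for the example.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data (not from the project's dataset): 0 = safe, 1 = phishing.
texts = [
    "meeting agenda for monday attached",
    "lunch on friday with the project team",
    "verify your password now to avoid account suspension",
    "urgent click this link to claim your prize",
]
labels = [0, 0, 1, 1]

# Assumption: vectorizer and classifier are bundled in one pipeline so that
# a single pickle contains everything needed at prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Serialize to bytes and restore; pickle.dump(model, file) would write to disk.
blob = pickle.dumps(model)
restored = pickle.loads(blob)


def predict_email(clf, text):
    """Hypothetical helper: classify one raw email string."""
    return "phishing" if clf.predict([text])[0] == 1 else "safe"


new_email = "urgent verify your account password"
print(predict_email(restored, new_email))
```

Pickling the fitted vectorizer together with the classifier matters because the classifier only understands the vocabulary learned during training. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.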

Section 06

Technology Stack Selection

The project uses a classic combination from the Python ecosystem:

  • Python: Core programming language
  • Pandas: Data processing and cleaning
  • Scikit-Learn: Machine learning algorithm implementation
  • Matplotlib: Visualization chart generation
  • Pickle: Model serialization and deserialization

Section 07

Dataset Structure

The system uses a dataset containing email text and corresponding labels for training:

Field          Description
text_combined  Email body content
label          Classification label (0 = safe email, 1 = phishing email)
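
As a rough sketch of how a file with this schema could be loaded and checked with Pandas: the in-memory CSV, its rows, and the `load_emails` helper below are hypothetical stand-ins, since the article does not give the real file name or loading code.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's CSV file; the rows are invented.
SAMPLE_CSV = io.StringIO(
    "text_combined,label\n"
    '"Team lunch is moved to Friday at noon",0\n'
    '"Verify your account now or it will be suspended",1\n'
    '"Minutes from the budget meeting attached",0\n'
    '"You won a prize, click this link to claim it",1\n'
)


def load_emails(source):
    """Load the dataset and keep only rows with text and a 0/1 label."""
    df = pd.read_csv(source)
    missing = {"text_combined", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Drop incomplete rows, then make sure labels are integers 0 or 1.
    df = df.dropna(subset=["text_combined", "label"]).copy()
    df["label"] = df["label"].astype(int)
    if not df["label"].isin([0, 1]).all():
        raise ValueError("label must be 0 (safe) or 1 (phishing)")
    return df.reset_index(drop=True)


emails = load_emails(SAMPLE_CSV)
print(emails["label"].value_counts().to_dict())
```

Checking the class balance at load time is useful because accuracy alone can be misleading when one class dominates the dataset.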

Section 08

Processing Flow

The entire detection process follows a standard machine learning workflow:

  1. Data Loading: Read the email dataset from CSV files
  2. Text Preprocessing: Clean and standardize email text content
  3. Feature Engineering: Convert text to numerical features using TF-IDF
  4. Data Splitting: Split the dataset into training and test sets
  5. Model Training: Train the classifier using the Naive Bayes algorithm
  6. Performance Evaluation: Calculate accuracy and generate a confusion matrix
  7. Real-time Prediction: Classify new emails
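
The steps above can be sketched end to end with scikit-learn. This is an illustrative sketch under stated assumptions, not the project's code: the toy corpus replaces the CSV data, `preprocess` is a deliberately simple hypothetical cleaning step, and `MultinomialNB` is assumed as the Naive Bayes variant. One deliberate deviation from the listed order: the TF-IDF vectorizer is fitted after the split, on the training set only, so that test-set statistics do not leak into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Step 1 (stand-in): invented toy corpus instead of the CSV file.
safe = [
    "Meeting agenda for Monday attached",
    "Lunch on Friday with the project team",
    "Please review the quarterly report draft",
    "Your invoice for last month is attached",
    "Reminder: team standup moved to 10am",
    "Notes from yesterday's design review",
]
phish = [
    "Verify your password now to avoid suspension",
    "Urgent: click this link to claim your prize",
    "Your account is locked, confirm your login details",
    "You have won a free gift card, click here",
    "Update your bank details immediately via this link",
    "Security alert: confirm your password to restore access",
]
texts = safe + phish
labels = [0] * len(safe) + [1] * len(phish)  # 0 = safe, 1 = phishing


def preprocess(text):
    """Step 2: lowercase and collapse whitespace (hypothetical cleaning)."""
    return " ".join(text.lower().split())


cleaned = [preprocess(t) for t in texts]

# Step 4: stratified split keeps both classes in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, stratify=labels, random_state=42
)

# Step 3: fit TF-IDF on training text only, then reuse it for test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 5: train the Naive Bayes classifier.
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# Step 6: accuracy and confusion matrix (rows = true, columns = predicted).
y_pred = clf.predict(X_test_vec)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f"accuracy: {acc:.2f}")
print(cm)

# Step 7: classify a new, unseen email with the same vectorizer.
new_email = "Click here to confirm your password"
new_pred = clf.predict(vectorizer.transform([preprocess(new_email)]))[0]
print("phishing" if new_pred == 1 else "safe")
```

On a corpus this small the printed accuracy says nothing about real performance; the 97.82% figure in the article refers to the project's own test set. For the visualization step, `sklearn.metrics.ConfusionMatrixDisplay` can plot `cm` with Matplotlib.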