
Machine Learning-Based Phishing Email Detection System: TF-IDF and Naive Bayes Achieve 97.82% Accuracy

This article introduces a phishing email detection system built on TF-IDF text vectorization and a Naive Bayes classifier. The system reports a classification accuracy of 97.82% on the test dataset and supports real-time prediction for new emails.

Phishing Email Detection, Machine Learning, Naive Bayes, TF-IDF, Cybersecurity, Text Classification, Python, Scikit-Learn

Published 2026-06-09 20:45 | Recent activity 2026-06-09 20:48 | Estimated read 5 min

Section 01

Introduction: Machine Learning-Based Phishing Email Detection System: TF-IDF and Naive Bayes Achieve 97.82% Accuracy

Section 02

Original Author and Source

Original Author/Maintainer: Anurag Upadhyay
Source Platform: GitHub
Original Title: Phishing-Email-Detector
Original Link: https://github.com/anurag21112006/Phishing-Email-Detector
Publication Date: June 9, 2026

Section 03

Background and Motivation

Email remains the primary vector for phishing attacks. Phishing emails threaten the privacy and security of individual users and are also a main entry point for corporate data breaches. Commonly cited industry figures put the share of cyberattacks that start with a phishing email at over 90%. Traditional rule-based filters struggle to keep up with evolving phishing techniques, because each new wording or lure requires a new hand-written rule. Using machine learning to identify phishing emails automatically has therefore become an important research direction in cybersecurity.

Section 04

Project Overview

This project is a machine learning-based phishing email detection system that automatically classifies each email as either "safe" or "phishing". It combines Natural Language Processing (NLP) techniques with a Naive Bayes classifier and flags potentially malicious emails by analyzing features of their content.

Section 05

Core Features

Automatic Email Classification: Label each email as safe or phishing
TF-IDF Text Vectorization: Convert text into numerical feature vectors
Naive Bayes Machine Learning Model: Efficient probabilistic classification algorithm
Accuracy Evaluation: Quantitative metrics for model performance
Confusion Matrix Visualization: Intuitively display classification results
Real-time Email Prediction: Support instant detection of new emails
Model Persistence: Save trained models using Pickle
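
The persistence feature can be illustrated with a minimal sketch. It assumes the vectorizer and classifier are bundled into one scikit-learn pipeline and that MultinomialNB is the Naive Bayes variant; the article states neither. The training texts and the file name phishing_model.pkl are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training examples (1 = phishing, 0 = safe).
texts = [
    "Verify your account password now to avoid suspension",
    "Team meeting moved to Thursday, agenda attached",
    "Click the link to claim your prize and confirm your bank details",
    "The quarterly report is attached for your review",
]
labels = [1, 0, 1, 0]

# Assumption: vectorizer and classifier are saved together so that the
# reloaded model uses exactly the vocabulary it was trained with.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "phishing_model.pkl"  # hypothetical file name
    with open(model_path, "wb") as fh:
        pickle.dump(pipeline, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# The restored pipeline should classify a new email like the original one.
new_email = ["Please confirm your password to avoid account suspension"]
original_pred = pipeline.predict(new_email)
restored_pred = restored.predict(new_email)
same_prediction = bool((original_pred == restored_pred).all())
print(same_prediction)
```

Pickle files can execute arbitrary code when loaded, so a saved model should only be unpickled if it comes from a trusted source.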

Section 06

Technology Stack Selection

The project uses a classic combination from the Python ecosystem:

Python: Core programming language
Pandas: Data processing and cleaning
Scikit-Learn: Machine learning algorithm implementation
Matplotlib: Visualization chart generation
Pickle: Model serialization and deserialization

Section 07

Dataset Structure

The system uses a dataset containing email text and corresponding labels for training:

Field	Description
text_combined	Email body content
label	Classification label (0 = safe email, 1 = phishing email)
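
A short sketch of loading and checking a dataset with this schema follows. Only the column names text_combined and label come from the table above; the in-memory CSV, the load_emails helper, and the validation rules are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the project's dataset file.
SAMPLE_CSV = io.StringIO(
    "text_combined,label\n"
    '"Meeting moved to 3pm, see the updated agenda attached",0\n'
    '"Verify your account now or it will be suspended",1\n'
    '"Lunch on Friday?",0\n'
)


def load_emails(source) -> pd.DataFrame:
    """Read the dataset and check that it matches the documented schema."""
    df = pd.read_csv(source)
    missing = {"text_combined", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Drop rows without text, then make sure every label is 0 or 1.
    df = df.dropna(subset=["text_combined"])
    if not df["label"].isin([0, 1]).all():
        raise ValueError("label must be 0 (safe) or 1 (phishing)")
    return df


emails = load_emails(SAMPLE_CSV)
label_counts = emails["label"].value_counts().to_dict()
print(label_counts)
```

Checking the class balance at load time matters here because plain accuracy can look high on a skewed dataset even when few phishing emails are caught.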

Section 08

Processing Flow

The entire detection process follows a standard machine learning workflow:

Data Loading: Read the email dataset from CSV files
Text Preprocessing: Clean and standardize email text content
Feature Engineering: Convert text to numerical features using TF-IDF
Data Splitting: Split the dataset into training and test sets
Model Training: Train the classifier using the Naive Bayes algorithm
Performance Evaluation: Calculate accuracy and generate a confusion matrix
Real-time Prediction: Classify new emails
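
The steps above can be sketched end to end as follows. This is a minimal illustration, not the project's code: the eight example emails, the preprocess and predict_email helpers, the choice of MultinomialNB, and all parameter values are assumptions. Unlike the listed order, the sketch splits the data before fitting TF-IDF so that test emails do not influence the vocabulary or IDF weights.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy dataset using the documented columns (1 = phishing).
data = pd.DataFrame({
    "text_combined": [
        "Urgent: verify your account password now to avoid suspension",
        "You have won a prize, click the link to claim your reward",
        "Your bank account is locked, confirm your password immediately",
        "Click this link to verify your payment details and claim a refund",
        "Team meeting moved to Thursday, agenda attached",
        "Here are the notes from the project review meeting",
        "Can we reschedule lunch to Friday next week",
        "The quarterly report is attached for your review",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})


def preprocess(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


texts = data["text_combined"].map(preprocess)

# Split first, keeping the class ratio the same in both parts.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, data["label"], test_size=0.25, random_state=42,
    stratify=data["label"],
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f"accuracy: {accuracy:.2f}")
print(cm)


def predict_email(text: str) -> int:
    """Classify one new email: 0 = safe, 1 = phishing."""
    features = vectorizer.transform([preprocess(text)])
    return int(model.predict(features)[0])


print(predict_email("Confirm your password to unlock your account"))
```

With a dataset this small the printed accuracy is meaningless; the sketch only shows how the stages connect. In the confusion matrix, rows are true labels and columns are predicted labels, so the bottom-left cell counts phishing emails that were missed.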

