# Machine Learning-Based Phishing Email Detection System: TF-IDF and Naive Bayes Achieve 97.82% Accuracy

> This article introduces a phishing email detection system that combines TF-IDF text vectorization with a Naive Bayes classifier. It reports a classification accuracy of 97.82% on the test dataset and supports real-time prediction for new emails.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-09T12:45:34.000Z
- Last activity: 2026-06-09T12:48:59.803Z
- Popularity: 159.9
- Keywords: phishing email detection, machine learning, Naive Bayes, TF-IDF, cybersecurity, text classification, Python, Scikit-Learn
- Page link: https://www.zingnex.cn/en/forum/thread/tf-idf97-82
- Canonical: https://www.zingnex.cn/forum/thread/tf-idf97-82
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer**: Anurag Upadhyay
- **Source Platform**: GitHub
- **Original Title**: Phishing-Email-Detector
- **Original Link**: https://github.com/anurag21112006/Phishing-Email-Detector
- **Publication Date**: June 9, 2026

## Background and Motivation

Email remains the primary vector for phishing attacks. Phishing emails threaten the privacy and security of individual users and are also a main entry point for corporate data breaches. Widely cited industry estimates suggest that over 90% of cyberattacks begin with a phishing email. Traditional rule-based filters struggle to keep up with evolving phishing techniques, so using machine learning to identify phishing emails automatically has become an important research direction in cybersecurity.

## Project Overview

This project is a machine learning-based phishing email detection system that automatically classifies each email as either a "safe email" or a "phishing email". The system combines Natural Language Processing (NLP) techniques with a Naive Bayes classifier and identifies potentially malicious emails by analyzing features of their content.

## Core Features

- **Automatic Email Classification**: Label each email as safe or phishing
- **TF-IDF Text Vectorization**: Convert text into numerical feature vectors
- **Naive Bayes Machine Learning Model**: Efficient probabilistic classification algorithm
- **Accuracy Evaluation**: Quantitative metrics for model performance
- **Confusion Matrix Visualization**: Intuitively display classification results
- **Real-time Email Prediction**: Support instant detection of new emails
- **Model Persistence**: Save trained models using Pickle
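
As a rough illustration of how the persistence and real-time prediction features might fit together, the sketch below trains a tiny TF-IDF plus Naive Bayes pipeline, saves it with `pickle`, and reloads it to classify a new message. The toy training texts, the `predict_email` helper, the `phishing_model.pkl` file name, and the choice of `MultinomialNB` as the Naive Bayes variant are all assumptions made for this example; the original repository may organize this differently.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (invented for illustration); 0 = safe, 1 = phishing.
texts = [
    "Meeting moved to 3pm, see agenda attached",
    "Lunch on Friday with the project team?",
    "Quarterly report draft is ready for review",
    "Verify your account now or it will be suspended",
    "You won a prize, click this link to claim your reward",
    "Urgent: confirm your password to avoid account suspension",
]
labels = [0, 0, 0, 1, 1, 1]

# Bundling the vectorizer and classifier means one pickle restores both.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Hypothetical file name, written to a temporary directory for this demo.
model_path = os.path.join(tempfile.mkdtemp(), "phishing_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Only unpickle files you trust: pickle can execute code on load.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)


def predict_email(text, clf=loaded_model):
    """Return 'phishing' or 'safe' for a single email body."""
    return "phishing" if clf.predict([text])[0] == 1 else "safe"


print(predict_email("Click this link to verify your account password"))
```

Saving the whole pipeline, rather than the classifier alone, keeps the fitted vocabulary and the model in sync when the file is loaded later.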

## Technology Stack Selection

The project uses a classic combination from the Python ecosystem:

- **Python**: Core programming language
- **Pandas**: Data processing and cleaning
- **Scikit-Learn**: Machine learning algorithm implementation
- **Matplotlib**: Visualization chart generation
- **Pickle**: Model serialization and deserialization

## Dataset Structure

The system uses a dataset containing email text and corresponding labels for training:

| Field | Description |
|------|------|
| text_combined | Email body content |
| label | Classification label (0 = safe email, 1 = phishing email) |
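
A dataset with this layout could be loaded and sanity-checked roughly as follows. This is a minimal sketch: the in-memory CSV rows are invented stand-ins for the real file, and the cleaning steps (dropping empty rows, casting labels to integers) are assumptions rather than the project's documented behavior.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the rows below are invented for illustration.
csv_data = io.StringIO(
    "text_combined,label\n"
    "Team meeting moved to 3pm,0\n"
    "Verify your account now to avoid suspension,1\n"
    ",1\n"
    "Quarterly report draft attached,0\n"
)

df = pd.read_csv(csv_data)

# Drop rows with a missing body or label, then normalize the column types.
df = df.dropna(subset=["text_combined", "label"])
df["text_combined"] = df["text_combined"].astype(str).str.strip()
df["label"] = df["label"].astype(int)

# Sanity check: labels should only be 0 (safe) or 1 (phishing).
assert set(df["label"].unique()) <= {0, 1}

# Class balance is worth inspecting before training.
print(df["label"].value_counts())
```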

## Processing Flow

The entire detection process follows a standard machine learning workflow:

1. **Data Loading**: Read the email dataset from CSV files
2. **Text Preprocessing**: Clean and standardize email text content
3. **Feature Engineering**: Convert text to numerical features using TF-IDF
4. **Data Splitting**: Split the dataset into training and test sets
5. **Model Training**: Train the classifier using the Naive Bayes algorithm
6. **Performance Evaluation**: Calculate accuracy and generate a confusion matrix
7. **Real-time Prediction**: Classify new emails
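
The steps above can be sketched end to end with Scikit-Learn. The example is illustrative only: the sixteen sample emails are invented, `MultinomialNB`, the 75/25 split, and the preprocessing rules are assumptions, and the toy data will not reproduce the reported 97.82% accuracy.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 1. Data loading: in-memory stand-in for reading the CSV dataset.
safe = [
    "Meeting agenda attached for Monday",
    "Lunch with the team on Friday",
    "Please review the quarterly report draft",
    "Invoice for last month is attached",
    "Notes from the project call today",
    "Can we reschedule the design review to Tuesday",
    "Happy birthday, hope you have a great day",
    "The conference schedule has been published",
]
phish = [
    "Verify your account now or it will be suspended",
    "Click this link to claim your free prize",
    "Urgent: confirm your password immediately",
    "Your bank account is locked, login here to restore access",
    "You have won a lottery, send your details to claim",
    "Security alert: update your payment information now",
    "Final warning: verify your identity via this link",
    "Reset your password now to avoid account suspension",
]
df = pd.DataFrame({"text_combined": safe + phish, "label": [0] * 8 + [1] * 8})

# 2. Text preprocessing: lowercase and collapse runs of whitespace.
df["text_combined"] = (
    df["text_combined"].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
)

# 4. Data splitting (done before fitting TF-IDF, see the note below).
X_train, X_test, y_train, y_test = train_test_split(
    df["text_combined"], df["label"],
    test_size=0.25, random_state=42, stratify=df["label"],
)

# 3. Feature engineering: learn the vocabulary and IDF weights on training text.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 5. Model training.
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# 6. Performance evaluation: rows of cm are true labels, columns are predictions.
y_pred = clf.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("accuracy:", accuracy)
print(cm)

# 7. Real-time prediction: reuse the fitted vectorizer on unseen text.
new_email = ["urgent: verify your password at this link"]
prediction = int(clf.predict(vectorizer.transform(new_email))[0])
print("prediction:", prediction)
```

The sketch fits the vectorizer on the training split only, a common way to keep test vocabulary from leaking into the features. The original project may instead vectorize before splitting, as the step order above suggests.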
