# SMS Spam Detection System Based on Naive Bayes

> A spam detection project that reaches 98.39% accuracy with a Multinomial Naive Bayes classifier and a Bag-of-Words model, a classic application of NLP to text classification.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-09T00:15:41.000Z
- Last activity: 2026-06-09T00:21:54.217Z
- Popularity: 152.9
- Keywords: SMS spam detection, Naive Bayes, NLP, text classification, Bag-of-Words model, machine learning, CountVectorizer, SMS, classifier
- Page link: https://www.zingnex.cn/en/forum/thread/sms
- Canonical: https://www.zingnex.cn/forum/thread/sms

---

## Introduction: Core Overview of SMS Spam Detection System Based on Naive Bayes

This project is an SMS spam detection system built on a Multinomial Naive Bayes classifier and a Bag-of-Words model (CountVectorizer), and it reports an accuracy of 98.39%. It is a classic case of NLP text classification.
The project is maintained by soniachaoued on GitHub (repository link: https://github.com/soniachaoued/sms-spam-detector). It covers the complete text classification workflow and is a suitable starting point for beginners in NLP and machine learning.

## Project Background and Dataset Introduction

### Project Background
The original project is published on GitHub by soniachaoued under the repository name 'sms-spam-detector' and was updated in June 2026.

### Dataset Introduction
The project uses the **SMS Spam Collection Dataset**, which is distributed through the UCI (University of California, Irvine) Machine Learning Repository. It contains 5,572 English SMS messages labeled either 'ham' (normal messages) or 'spam' (spam messages), covering real spam types such as promotions, scams, and advertisements. It is one of the standard benchmark datasets for spam detection algorithms.

## Detailed Technical Solution: Feature Extraction and Classification Algorithm

### Feature Extraction: CountVectorizer (Bag-of-Words)
Text data must be converted into numerical form before a classifier can use it. The project uses the Bag-of-Words model:
1. Build vocabulary: Scan all training texts to extract unique words;
2. Vectorization: Represent each message as a vector in which each dimension corresponds to one vocabulary word and holds that word's count in the message;
3. Output: Sparse matrix (rows = messages, columns = vocabulary words, values = word counts).

Advantages of Bag-of-Words: it is simple, intuitive, and computationally efficient. Disadvantages: it ignores word order and semantics. For spam detection, which depends heavily on keywords, it still performs well.
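The three steps above can be sketched on a toy corpus. This is a minimal illustration, not the project's own code: the two example messages are invented, and only scikit-learn's `CountVectorizer` with its default settings is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented messages standing in for the real dataset.
messages = [
    "Free prize! Claim your free prize now",
    "Are we still meeting for lunch today",
]

# With default settings, CountVectorizer lowercases the text and keeps
# tokens made of two or more word characters.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)  # SciPy sparse matrix of counts

vocab = vectorizer.vocabulary_          # maps each word to its column index
print("matrix shape (messages, vocabulary words):", X.shape)
print("count of 'free' in message 0:", X[0, vocab["free"]])
print("count of 'free' in message 1:", X[1, vocab["free"]])
```

Each row is one message and each column is one vocabulary word, so the repeated word "free" shows up as a count of 2 in the first row and 0 in the second.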

### Classification Algorithm: Multinomial Naive Bayes
Multinomial Naive Bayes is chosen as the classifier. It applies Bayes' theorem under the assumption that features are conditionally independent given the class, calculates the posterior probability that a message belongs to 'ham' or 'spam', and selects the category with the higher probability.
Reasons it suits text classification:
- Performs well on high-dimensional sparse data;
- Trains quickly and has low computational complexity;
- Remains reliable on small datasets;
- Natively supports multi-class classification (only two classes are needed here);
- Is designed for discrete count features (e.g., word frequencies).
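To make the posterior comparison concrete, here is a from-scratch sketch of the multinomial Naive Bayes decision rule in plain Python. It is an illustration under stated assumptions rather than the project's implementation: the training messages are invented, tokens are split on whitespace, and add-one (Laplace) smoothing with `alpha=1.0` is assumed, which matches the default of scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter

# Invented training messages for illustration only.
train = [
    ("win a free prize now", "spam"),
    ("free cash claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
    ("call me when you are free", "ham"),
]

# Count words per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])


def log_score(text, label, alpha=1.0):
    """Unnormalized log posterior: log P(label) + sum of log P(word | label)."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen during training are ignored
        smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(smoothed)
    return score


def predict(text):
    """Return the class with the higher log posterior."""
    return max(("ham", "spam"), key=lambda label: log_score(text, label))


print(predict("claim your free prize"))
print(predict("lunch tomorrow"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.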

## Performance Evaluation Results

The project achieves an accuracy of **98.39%** on the test set, which corresponds to an error rate of 1.61%, or fewer than 2 misclassified messages per 100 on average. That is a strong result for such a lightweight model.

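Notes on using accuracy as the evaluation metric:
- The SMS Spam Collection is imbalanced: 'ham' messages far outnumber 'spam' messages (roughly 87% to 13%), so a classifier that always predicts 'ham' would already score about 87%;
- The two types of error (normal messages misjudged as spam vs. missed spam messages) may not cost the same in practice, and blocking a legitimate message is often the more expensive mistake;
- Accuracy therefore gives a useful summary of overall performance, but it is best read alongside per-class precision and recall.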
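A quick way to put any accuracy figure in context is to compare it with a majority-class baseline. The sketch below does that arithmetic in plain Python. The class counts (4,825 ham and 747 spam) are the figures commonly reported for the 5,572-message version of the dataset and are an assumption here, since the project write-up does not state them. The confusion-matrix counts are purely hypothetical and are not the project's results.

```python
# Assumed class counts for the 5,572-message dataset (not stated in the project).
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# A "classifier" that labels every message as ham is correct on every ham message.
baseline_accuracy = n_ham / total
print(f"majority-class baseline accuracy: {baseline_accuracy:.2%}")

# Hypothetical confusion-matrix counts for a 1,000-message test set,
# chosen only to illustrate the arithmetic.
tp, fn = 120, 14  # spam correctly flagged / spam missed
tn, fp = 864, 2   # ham correctly kept / ham wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)

print(f"accuracy:       {accuracy:.2%}")
print(f"spam precision: {spam_precision:.2%}")
print(f"spam recall:    {spam_recall:.2%}")
```

In this made-up example the overall accuracy is high while spam recall is noticeably lower, which is the kind of gap a single accuracy number can hide.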

## Project Workflow and Tech Stack

### Project Workflow
1. Data loading: Read the CSV-formatted SMS dataset;
2. Data preprocessing: Text cleaning and label encoding;
3. Feature engineering: Convert text to numerical features using CountVectorizer;
4. Model training: Learn parameters on the training set using MultinomialNB;
5. Performance evaluation: Calculate accuracy on the test set;
6. Prediction application: Classify new messages.
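The six steps above can be sketched end to end as follows. This is a minimal sketch, not the repository's notebook: the tiny in-memory DataFrame stands in for the CSV file, the column names `label` and `text`, the cleaning steps, and the 75/25 split are all assumptions, and the accuracy printed on twelve invented messages says nothing about the real dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

ham_texts = [
    "Are we still meeting for lunch today",
    "Can you call me when you get home",
    "See you at the office tomorrow morning",
    "Thanks for dinner last night",
    "Running late, be there in ten minutes",
    "Did you finish the report yet",
]
spam_texts = [
    "Win a FREE prize now",
    "Free cash, claim your reward today",
    "URGENT! Your account has won a prize",
    "Claim your free ringtone, text now",
    "Congratulations, you won a free holiday",
    "Limited offer: free entry, call now",
]

# 1. Data loading: in-memory stand-in for pd.read_csv(...) on the real file.
df = pd.DataFrame({
    "label": ["ham"] * len(ham_texts) + ["spam"] * len(spam_texts),
    "text": ham_texts + spam_texts,
})

# 2. Preprocessing: light text cleaning and label encoding (ham=0, spam=1).
df["text"] = df["text"].str.lower().str.strip()
df["y"] = df["label"].map({"ham": 0, "spam": 1})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["y"], test_size=0.25, random_state=0, stratify=df["y"]
)

# 3. Feature engineering: fit the vocabulary on training text only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# 4. Model training.
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# 5. Performance evaluation.
accuracy = accuracy_score(y_test, model.predict(X_test_counts))
print(f"test accuracy on the toy data: {accuracy:.2f}")

# 6. Prediction application: classify a new message.
new_pred = model.predict(vectorizer.transform(["free prize waiting, claim now"]))
print("new message predicted as:", "spam" if new_pred[0] == 1 else "ham")
```

Fitting the vectorizer on the training split only, and reusing it with `transform` for the test split and for new messages, keeps test-set vocabulary from leaking into training.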

### Tech Stack
| Technology | Purpose |
|------|------|
| Python | Programming language |
| Pandas | Data loading and processing |
| Scikit-learn | Machine learning (CountVectorizer, MultinomialNB) |
| Jupyter Notebook | Development and documentation |

## Extension and Improvement Directions

Although the current accuracy already exceeds 98%, the project can still be improved in the following directions:

### Feature Engineering Improvements
- Use TF-IDF instead of raw word counts to reduce the weight of words that appear in most messages;
- Add N-gram features (bi-grams, tri-grams) to capture local word order;
- Introduce character-level N-grams to identify obfuscated spellings in spam messages (e.g., 'f r e e').
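The three feature ideas above map onto existing scikit-learn options, sketched here on invented strings. Treat the parameter choices (`ngram_range=(1, 2)` for words, `(2, 3)` for characters) as illustrative assumptions, not tuned values, and the obfuscated spelling `fr33` as a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

messages = ["call me now", "call me later", "free prize call"]

# TF-IDF: words that occur in many messages get a lower IDF weight.
tfidf = TfidfVectorizer()
tfidf.fit(messages)
idf_call = tfidf.idf_[tfidf.vocabulary_["call"]]    # appears in all 3 messages
idf_prize = tfidf.idf_[tfidf.vocabulary_["prize"]]  # appears in 1 message
print(f"idf('call') = {idf_call:.3f}, idf('prize') = {idf_prize:.3f}")

# Word n-grams: unigrams plus bigrams keep some local word order.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["free prize now"])
print(sorted(bigram_vectorizer.vocabulary_))

# Character n-grams: obfuscated spellings still share fragments with the
# original word even when they share no whole-word token.
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(2, 3)).build_analyzer()
shared = set(char_analyzer("free")) & set(char_analyzer("fr33"))
print("character n-grams shared by 'free' and 'fr33':", shared)
```

Character n-grams enlarge the feature space considerably, so in practice they are usually combined with a frequency cut-off; that trade-off is outside the scope of this sketch.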

### Model Upgrades
- Try algorithms like SVM, Random Forest, and Gradient Boosting;
- Use deep learning (LSTM, BERT) to capture semantic information;
- Combine multiple models in an ensemble to improve robustness.
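Swapping the classifier is straightforward when the vectorizer and the model are wrapped in one pipeline. The sketch below tries a few scikit-learn classifiers and a simple hard-voting ensemble on invented messages. The particular candidates, their hyperparameters, and the toy data are assumptions for illustration; deep learning models such as LSTM or BERT need other libraries and are not shown.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training messages for illustration only.
texts = [
    "win a free prize now",
    "free cash claim your reward",
    "urgent claim your prize today",
    "free entry text now to win",
    "are we meeting for lunch today",
    "call me when you get home",
    "see you at the office tomorrow",
    "thanks for dinner last night",
]
labels = ["spam"] * 4 + ["ham"] * 4

candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Hard voting: each model casts one vote and the majority label wins.
    "voting": VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("svm", LinearSVC()),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="hard",
    ),
}

predictions = {}
for name, classifier in candidates.items():
    # Same Bag-of-Words features for every candidate, so only the model differs.
    model = make_pipeline(CountVectorizer(), classifier)
    model.fit(texts, labels)
    predictions[name] = model.predict(["claim your free prize now"])[0]
    print(f"{name}: {predictions[name]}")
```

On real data the candidates would be compared with cross-validated metrics rather than a single prediction.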

### Engineering Optimization
- Add cross-validation for more reliable model evaluation;
- Calculate precision, recall, and F1 score for comprehensive evaluation;
- Plot ROC curves and confusion matrices to visualize error distribution;
- Implement model persistence for easy deployment.
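Most of the engineering items above are a few lines each in scikit-learn. The following sketch covers cross-validation, per-class precision/recall/F1, a confusion matrix, and persistence with `joblib`. The toy messages, the choice of 3 folds, and the file name `spam_model.joblib` are assumptions for illustration. ROC plotting is left out because it needs a plotting library.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented messages for illustration only.
texts = [
    "are we still meeting for lunch today",
    "can you call me when you get home",
    "see you at the office tomorrow morning",
    "thanks for dinner last night",
    "running late be there in ten minutes",
    "did you finish the report yet",
    "win a free prize now",
    "free cash claim your reward today",
    "urgent your account has won a prize",
    "claim your free ringtone text now",
    "congratulations you won a free holiday",
    "limited offer free entry call now",
]
labels = ["ham"] * 6 + ["spam"] * 6

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Cross-validation: one accuracy score per fold instead of a single split.
scores = cross_val_score(pipeline, texts, labels, cv=3)
print("accuracy per fold:", scores)

# Out-of-fold predictions give an honest confusion matrix and per-class report.
oof_predictions = cross_val_predict(pipeline, texts, labels, cv=3)
cm = confusion_matrix(labels, oof_predictions, labels=["ham", "spam"])
print(cm)
print(classification_report(labels, oof_predictions))

# Persistence: save the fitted pipeline (vectorizer and model together), then reload it.
pipeline.fit(texts, labels)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_model.joblib")
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

print(restored.predict(["free prize, claim now"]))
```

Saving the whole pipeline rather than the classifier alone means the deployed model always uses the same vocabulary it was trained with.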

## Practical Application Value and Conclusion

### Practical Application Value
Spam detection is one of the most successful applications of NLP, with scenarios including:
- Mobile systems (built-in filtering on iOS/Android);
- Carrier services (real-time filtering at SMS gateways);
- Enterprise communications (message review on internal platforms);
- User protection (reducing financial losses from scams).

Naive Bayes is still widely used in production environments because it is efficient and interpretable. Lightweight models of this kind are well suited to real-time processing of very large message volumes.

### Conclusion
This project demonstrates the typical paradigm for solving a practical problem with machine learning: define the problem → select data → design features → train the model → evaluate performance. The classic combination of Naive Bayes and Bag-of-Words still performs very well in spam detection, and it gives beginners a solid foundation before they move on to more complex NLP techniques.
