Zing Forum


Multi-category Text Sentiment Detection System Based on Machine Learning: A Complete Practice from TF-IDF to Sentiment Classification

This article introduces a multi-category text sentiment detection project implemented using machine learning and TF-IDF technology, covering the complete workflow of data preprocessing, feature extraction, model training, and evaluation, and compares the performance of three algorithms: Naive Bayes, SVM, and Logistic Regression.

Tags: Sentiment Analysis, Machine Learning, TF-IDF, Natural Language Processing, Text Classification, Twitter Data, Logistic Regression, SVM, Naive Bayes
Published 2026-05-17 15:45 · Recent activity 2026-05-17 15:48 · Estimated read 7 min

Section 01

[Introduction] Complete Practice of Multi-category Text Sentiment Detection System Based on Machine Learning

This article introduces an open-source multi-category sentiment detection project that uses traditional machine learning techniques (TF-IDF feature extraction + Naive Bayes/SVM/Logistic Regression models) to extract sentiment information from Twitter texts and classify them into three categories: positive, negative, and neutral. The project covers the entire workflow of data preprocessing, feature extraction, model training, and evaluation, comparing the performance of the three algorithms. Logistic Regression performs the best (accuracy 60.41%), providing a complete example for beginners in sentiment analysis.


Section 02

Project Background and Core Objectives

Sentiment analysis is an important branch of NLP that aims to identify subjective information in text. Unlike binary classification, multi-category sentiment detection must handle finer-grained emotions (such as happiness and sadness). The project uses Twitter data because its language is informal and full of abbreviations and slang, which makes classification challenging. The core objective is to convert raw tweets into quantifiable sentiment labels through machine learning, providing a basis for sentiment-trend analysis and user-behavior research.


Section 03

Data Preprocessing and Feature Extraction Methods

Dataset Characteristics and Merging Strategy

The project uses the tweet_emotions.csv dataset, which originally contains fine-grained emotions such as happiness and love; these are merged into three categories (positive, negative, and neutral) to reduce semantic overlap and class imbalance.
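The merge step can be sketched as a simple label mapping. Note that the exact fine-grained label inventory and the mapping below are illustrative assumptions, not taken from the project's code:

```python
# Collapse the dataset's fine-grained emotion labels into three coarse
# classes. The label names and groupings here are assumptions for
# illustration; the real tweet_emotions.csv inventory may differ.
LABEL_MAP = {
    "happiness": "positive",
    "love": "positive",
    "fun": "positive",
    "sadness": "negative",
    "hate": "negative",
    "anger": "negative",
    "neutral": "neutral",
    "empty": "neutral",
}

def merge_labels(labels):
    """Map each fine-grained label to positive/negative/neutral."""
    # Unknown labels fall back to "neutral" in this sketch.
    return [LABEL_MAP.get(lab, "neutral") for lab in labels]
```

Merging before training gives each coarse class more examples, which is what reduces the class imbalance mentioned above.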

Text Preprocessing Workflow

  1. Cleaning: Remove URLs, special characters, numbers, and extra spaces;
  2. Tokenization and Lemmatization: Use NLTK for tokenization, and lemmatization to unify vocabulary forms (e.g., running→run);
  3. Stopword Filtering: Remove high-frequency meaningless words (e.g., the, is).
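The three steps above can be sketched as a single preprocessing function. To keep the sketch dependency-free it uses a tiny hand-rolled stopword list and plain whitespace tokenization in place of NLTK's stopword corpus and tokenizer, and it omits the lemmatization step (in the project, NLTK's WordNetLemmatizer would be applied to each token):

```python
import re

# Tiny stand-in stopword list; the project uses NLTK's full English list.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def preprocess(tweet: str) -> list[str]:
    """Clean, tokenize, and filter a raw tweet (simplified sketch)."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # 1. strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # 1. strip digits/punctuation
    tokens = text.split()                       # 2. whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 3. drop stopwords
```

For example, `preprocess("The cat runs to http://x.co fast!!")` keeps only the content words `cat`, `runs`, `fast`.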

TF-IDF Feature Extraction

TF-IDF is used to measure the importance of vocabulary. Compared to the bag-of-words model, it can reduce the weight of common words and increase the weight of sentiment words (e.g., amazing, terrible), making it suitable for sentiment analysis.
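The weighting idea can be made concrete with a minimal from-scratch computation. This sketch uses the classic unsmoothed formula tf × log(N/df); scikit-learn's TfidfVectorizer, which the project presumably uses, applies a smoothed variant plus normalization, so its numbers would differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenized documents.

    tf  = term count / document length
    idf = log(N / df), the classic unsmoothed variant
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights
```

On the two documents `["amazing", "movie"]` and `["terrible", "movie"]`, the shared word "movie" gets weight 0 (idf = log(2/2) = 0) while the sentiment words "amazing" and "terrible" keep positive weight, which is exactly the down-weighting of common words described above.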


Section 04

Comparison Experiments of Three Models and Evaluation Results

Model Comparison

  • Naive Bayes: Baseline method, computationally efficient, accuracy 39.10%;
  • SVM: Performs well in high-dimensional space, but the linear kernel fails to capture nonlinear relationships, accuracy 39.32%;
  • Logistic Regression: Maps linear combinations to the probability space, performs best, accuracy 60.41%.
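The three-way comparison can be sketched with scikit-learn pipelines. The toy corpus below merely stands in for the preprocessed tweets (the 39-60% accuracies quoted above come from the real Twitter dataset, not from this data), and the hyperparameters are defaults, not the project's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative stand-in for the preprocessed tweets.
texts = ["love this so much", "great amazing day", "hate this awful thing",
         "terrible bad experience", "it is a phone", "meeting at noon"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM (linear)": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, clf in models.items():
    # Each pipeline vectorizes with TF-IDF, then fits one classifier.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    print(name, "->", pipe.predict(["what an amazing day"])[0])
```

Sharing one TfidfVectorizer configuration across the three pipelines keeps the feature space fixed, so any accuracy difference is attributable to the classifier alone.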

Evaluation and Analysis

Metrics such as accuracy, precision, recall, F1 score, and confusion matrix are used. The sentiment merging strategy improves performance (clearer class boundaries), and TF-IDF effectively captures the distribution of sentiment keywords.
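The per-class metrics mentioned above reduce to counts from the confusion matrix. A minimal one-vs-rest sketch (scikit-learn's `classification_report` computes the same quantities for every class at once):

```python
def prf1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, one-vs-rest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, with true labels `["pos", "pos", "neg", "neu"]` and predictions `["pos", "neg", "neg", "pos"]`, the "pos" class gets precision 0.5, recall 0.5, and F1 0.5.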


Section 05

Project Summary and Key Insights

This project demonstrates the application potential of traditional machine learning in sentiment analysis. The combination of TF-IDF and Logistic Regression achieves an accuracy of 60.41%. Although it is not as good as deep learning models, it has advantages such as fast training, low resource consumption, and strong interpretability, making it suitable for resource-constrained scenarios or as a baseline model. It provides a clear technical route and reproducible code for NLP beginners, and is a high-quality learning resource for understanding the entire workflow of text classification.


Section 06

Limitations and Future Optimization Directions

  1. Deep Learning Methods: Introduce LSTM/BERT to capture sequence information and context dependencies;
  2. Word Embedding Technology: Replace TF-IDF with Word2Vec/GloVe to capture semantic relationships;
  3. Data Balance: Use SMOTE or class weight adjustment to solve sample imbalance;
  4. Fine-grained Sentiment Recognition: Try to distinguish specific emotions such as anger and fear.

Section 07

Practical Value and Application Scenarios

The project's technical solution can be applied to:

  • Brand Public Opinion Monitoring: Real-time tracking of users' sentiment tendencies towards brands;
  • Customer Service Optimization: Automatically classify feedback sentiment and prioritize handling negative complaints;
  • Content Recommendation: Recommend matching content based on user sentiment;
  • Mental Health Screening: Identify potential negative emotion patterns and provide early warnings.