Zing Forum


IMDb Movie Review Sentiment Analysis: A Complete NLP Practice from Text Cleaning to Hybrid Models

This article details Harshil1335's IMDb movie review sentiment analysis project, which demonstrates a complete NLP pipeline: data cleaning, TF-IDF feature extraction, a comparison of multiple machine learning models, and an innovative hybrid model combining Logistic Regression and SVM, achieving approximately 89% classification accuracy.

Natural Language Processing · NLP · Sentiment Analysis · Text Classification · TF-IDF · Machine Learning · Logistic Regression · SVM · Naive Bayes · IMDb Dataset
Published 2026-05-09 16:56 · Recent activity 2026-05-09 17:00 · Estimated read 5 min

Section 01

[Introduction] Overview of the IMDb Sentiment Analysis Project

The open-source imdb-sentiment-analysis-nlp project by GitHub user Harshil1335 demonstrates a complete NLP sentiment analysis pipeline, covering data cleaning, TF-IDF feature extraction, comparison of multiple machine learning models, and an innovative hybrid model. It achieves approximately 89% accuracy on the IMDb movie review dataset, providing a reproducible learning example for NLP beginners.


Section 02

Project Background and Dataset Introduction

The project aims to automatically identify the sentiment of movie reviews (binary classification: positive/negative). It uses the classic IMDb movie review dataset: 25,309 samples in total, split 50/50 between positive and negative reviews, with 80% (20,247 samples) used for training and 20% (5,062 samples) for testing. The balanced class distribution prevents the model from learning a class bias.
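
The 20,247/5,062 figures match a standard stratified 80/20 split (scikit-learn rounds the test fraction up to the nearest whole sample). The project's exact code is not shown in the article; a minimal sketch with placeholder data:

```python
from sklearn.model_selection import train_test_split

# Dummy records standing in for the 25,309 balanced IMDb reviews
# (the real inputs would be the review texts and their pos/neg labels).
n = 25309
X = list(range(n))
y = [i % 2 for i in range(n)]  # balanced 50/50 classes

# A stratified 80/20 split preserves the positive/negative balance
# in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 20247 5062 -- the article's split sizes
```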


Section 03

Text Preprocessing and Feature Engineering Methods

Text Preprocessing: performed with the NLTK library, including lowercase conversion, tokenization, removal of HTML tags, punctuation, and special characters, and stopword filtering (e.g., "the", "is") so that the features focus on sentiment-bearing words. Feature Engineering: the Bag-of-Words model (simple, but ignores word order) is compared with TF-IDF (term frequency × inverse document frequency, which highlights high-value words), settling on a 10,000-dimensional TF-IDF feature space to balance information content against model complexity.
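
The cleaning and vectorization steps above can be sketched as follows. The project itself uses NLTK; to stay self-contained, this sketch substitutes a simple regex cleaner and scikit-learn's built-in English stopword list, and the sample reviews are invented for illustration:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def clean_review(text: str) -> str:
    """Lowercase, then strip HTML tags, punctuation, and special characters."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()

reviews = [
    "This movie was <br /> absolutely wonderful!",
    "A dull, predictable plot. Terrible acting.",
]
cleaned = [clean_review(r) for r in reviews]

# Cap the vocabulary at 10,000 terms, matching the article's
# 10,000-dimensional TF-IDF feature space.
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english")
X = vectorizer.fit_transform(cleaned)
print(X.shape)  # (2, vocabulary size, capped at 10000)
```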


Section 04

Model Performance Comparison and Experimental Evidence

Four algorithms were trained and evaluated:

  1. Naive Bayes: Bag-of-Words (84.67%), TF-IDF (86.59%)
  2. Linear SVM (TF-IDF): 88.34%
  3. Logistic Regression (TF-IDF): 88.96% (best single model)
  4. Hybrid Model (Logistic Regression + SVM): 88.82%, MCC (Matthews correlation coefficient) 0.7764, FDR (false discovery rate) 0.1155; performance is balanced and reliable.
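
Both MCC and FDR are derived from the confusion matrix. A quick sketch with hypothetical prediction vectors (not the project's actual outputs):

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical labels and predictions; the real vectors would come from
# the hybrid model on the 5,062-review test set.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc = matthews_corrcoef(y_true, y_pred)
fdr = fp / (fp + tp)  # false discovery rate: fraction of positive calls that are wrong
print(f"MCC={mcc:.4f}, FDR={fdr:.4f}")  # MCC=0.5000, FDR=0.2500
```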

Result table:

| Algorithm           | Feature      | Accuracy | Precision | Recall | F1   |
|---------------------|--------------|----------|-----------|--------|------|
| Naive Bayes         | Bag-of-Words | 84.67%   | 0.85      | 0.85   | 0.85 |
| Naive Bayes         | TF-IDF       | 86.59%   | 0.87      | 0.87   | 0.87 |
| Linear SVM          | TF-IDF       | 88.34%   | 0.88      | 0.88   | 0.88 |
| Logistic Regression | TF-IDF       | 88.96%   | 0.89      | 0.89   | 0.89 |
| Hybrid Model        | TF-IDF       | 88.82%   | 0.89      | 0.89   | 0.89 |
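
The article does not specify how the hybrid model combines its two learners; one plausible reading is a soft-voting ensemble that averages the predicted probabilities of Logistic Regression and a linear-kernel SVM. A sketch on synthetic stand-in data (the real input would be the 10,000-dimensional TF-IDF matrix):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the TF-IDF features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Soft voting averages the two models' predicted probabilities;
# probability=True enables Platt scaling so the SVM can emit probabilities.
hybrid = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(kernel="linear", probability=True, random_state=42)),
    ],
    voting="soft",
)
hybrid.fit(X_tr, y_tr)
pred = hybrid.predict(X_te)
print(f"accuracy: {accuracy_score(y_te, pred):.3f}")
print(f"MCC:      {matthews_corrcoef(y_te, pred):.3f}")
```

On the synthetic data the exact scores will differ from the article's 88.82% / 0.7764, since those were measured on the real IMDb test set.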

Section 05

Summary of Experimental Results and Key Findings

  1. Feature Quality First: TF-IDF improves accuracy by 2-3 percentage points compared to Bag-of-Words; good features are more important than complex models.
  2. Linear Models Are Efficient: linear models such as SVM and Logistic Regression perform excellently on this task; more complex non-linear models are unnecessary.
  3. Ensemble Strategy Is Effective: Although the hybrid model does not outperform the best single model, its performance is more balanced.

Section 06

Application Value and Expansion Directions

Application Value: the project gives NLP beginners a complete, reproducible learning example covering text preprocessing, feature engineering, and model evaluation. Expansion Directions: multilingual sentiment analysis, fine-grained rating prediction, aspect-level sentiment analysis, deployment as a real-time review-monitoring API, etc.