Zing Forum

Content-based Filtering Movie Recommendation System: Build a Personalized Movie Assistant with Python and Machine Learning

This article introduces an open-source, content-based movie recommendation system built with Python, Streamlit, and machine learning techniques. It recommends similar movies by analyzing movie metadata (genres, keywords, cast, crew, and so on), helping users discover films that match their personal taste.

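Tags: Recommendation Systems, Machine Learning, Python, Content Filtering, Streamlit, Movie Recommendation, Natural Language Processing, Cosine Similarity, Open-Source Project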
Published 2026-06-10 10:15 | Recent activity 2026-06-10 10:19 | Estimated read 7 min

Section 03

Introduction: Daily Value of Recommendation Systems

With streaming platforms everywhere, "What should I watch tonight?" has become a daily dilemma for many people. Platforms such as Netflix, Amazon Prime, and Disney+ keep adding new content, and the sheer volume often leaves users with choice paralysis. A good recommendation system improves the user experience and also helps platforms increase user retention and viewing time.

The open-source project Movie-Recommended-System, introduced in this article, is a lightweight yet fully functional implementation of a content-based movie recommendation system. It uses a pure Python tech stack, with Streamlit providing the interactive web interface, so developers can quickly grasp the core principles of recommendation systems and extend the project from there.



Section 04

Project Overview and Technical Architecture

This is a content-based filtering movie recommendation system. Unlike collaborative filtering, which relies on user behavior data such as ratings and viewing history, content-based filtering computes similarity entirely from the attributes of the movies themselves. It therefore copes better with "cold start" scenarios: a new movie can be recommended as soon as its metadata is available, and a new user only needs to pick one movie they like, with no viewing history required.


Section 05

Core Tech Stack

Technology     | Purpose
Python         | Main development language
Streamlit      | Quickly build interactive web interfaces
Pandas / NumPy | Data processing and numerical computation
Scikit-Learn   | Machine learning library providing CountVectorizer and cosine similarity calculation
NLTK           | Natural language processing, text cleaning and tokenization
Pickle         | Serialization storage for models and data

Section 06

System Architecture Design

The project adopts a classic three-tier architecture:

  1. Data Layer: Preprocessed movie dataset (movies.pkl) and precomputed similarity matrix (similarity.pkl)
  2. Algorithm Layer: Text vectorization based on bag-of-words model and cosine similarity calculation
  3. Presentation Layer: Streamlit-driven single-page web application providing movie selection and detail display
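
As a rough illustration of how the data and algorithm layers might fit together, the sketch below loads the two pickle files and looks up the nearest neighbours of a title. It is not the project's actual code: the `title` column, the `load_artifacts` and `recommend` helpers, and the toy 3x3 matrix are all assumptions made for this example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_artifacts(data_dir):
    """Data layer: load the movie table and the precomputed similarity matrix."""
    data_dir = Path(data_dir)
    with open(data_dir / "movies.pkl", "rb") as f:
        movies = pickle.load(f)
    with open(data_dir / "similarity.pkl", "rb") as f:
        similarity = pickle.load(f)
    return movies, similarity


def recommend(title, movies, similarity, top_n=5):
    """Algorithm layer: return up to top_n titles most similar to `title`."""
    # Assumes a "title" column; the real dataset may name it differently.
    positions = np.flatnonzero((movies["title"] == title).to_numpy())
    if len(positions) == 0:
        return []
    idx = int(positions[0])
    scores = np.asarray(similarity[idx])
    # Sort by descending similarity and skip the selected movie itself.
    order = [int(i) for i in np.argsort(-scores, kind="stable") if i != idx]
    return [movies.iloc[i]["title"] for i in order[:top_n]]


# Toy stand-ins for movies.pkl and similarity.pkl (hypothetical data).
toy_movies = pd.DataFrame({"title": ["A", "B", "C"]})
toy_similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "movies.pkl", "wb") as f:
        pickle.dump(toy_movies, f)
    with open(Path(tmp) / "similarity.pkl", "wb") as f:
        pickle.dump(toy_similarity, f)
    movies, similarity = load_artifacts(tmp)

print(recommend("A", movies, similarity, top_n=2))  # ['B', 'C']
```

In the presentation layer, a Streamlit widget such as `st.selectbox` could supply the `title` argument and the returned list could be rendered on the page; that wiring is left out here so the sketch runs without a web server.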


Section 07

Feature Engineering and Data Fusion

The quality of a recommendation system depends largely on the design of its input features. This project combines several kinds of movie metadata:

  • Genres: Tags like action, comedy, sci-fi, etc.
  • Keywords: Thematic words related to the plot
  • Cast: Information about main actors
  • Crew: Key personnel such as directors and screenwriters
  • Overview: Text description of the movie

These features are merged into a single text field that serves as a "feature signature" for each movie. The advantage of this fusion strategy is that it captures the multi-faceted nature of a movie. For example, Inception is not only a sci-fi film but also involves dreams, suspense, and action, and it is directed by Christopher Nolan and stars Leonardo DiCaprio.
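
The fusion step could look roughly like the following sketch. The column names, the one-row sample, the `tags` field name, and the trick of removing spaces inside multi-word names (so that "Christopher Nolan" becomes one token) are assumptions for illustration; the project's real preprocessing may differ.

```python
import pandas as pd

# Hypothetical, already-parsed metadata; real column names and formats may differ.
movies = pd.DataFrame({
    "title": ["Inception"],
    "genres": [["Action", "Science Fiction"]],
    "keywords": [["dream", "suspense"]],
    "cast": [["Leonardo DiCaprio"]],
    "crew": [["Christopher Nolan"]],
    "overview": ["An illustrative overview sentence about dreams."],
})

LIST_COLUMNS = ["genres", "keywords", "cast", "crew"]


def collapse_names(values):
    """Remove inner spaces so multi-word names become single tokens."""
    return [value.replace(" ", "") for value in values]


def build_tags(row):
    """Join the overview words and all list features into one lowercase string."""
    tokens = row["overview"].split()
    for column in LIST_COLUMNS:
        tokens.extend(collapse_names(row[column]))
    return " ".join(tokens).lower()


movies["tags"] = movies.apply(build_tags, axis=1)
print(movies.loc[0, "tags"])
```

Collapsing names into single tokens is a design choice rather than a requirement: without it, a bag-of-words model would treat "Christopher" and "Nolan" as unrelated words and could match other people who share only a first name.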


Section 08

Text Vectorization: Application of CountVectorizer

The project uses Scikit-Learn's CountVectorizer to convert text features into numerical vectors. The specific process is as follows:

  1. Text Cleaning: Remove stop words and punctuation, and convert all text to lowercase
  2. Tokenization: Split the combined text into individual words
  3. Word Counting: Build a bag-of-words model that counts how often each word occurs in each movie
  4. Vectorization: Represent each movie as a high-dimensional sparse vector
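
The steps above can be sketched with scikit-learn as follows. The three sample tag strings and the `max_features=5000` cap are assumptions chosen for the example, not values confirmed by the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical fused "tags" strings, one per movie.
tags = [
    "action sciencefiction dream christophernolan",
    "action sciencefiction space christophernolan",
    "romance comedy wedding",
]

# CountVectorizer lowercases, tokenizes, and drops English stop words,
# then counts how often each remaining word occurs in each string.
vectorizer = CountVectorizer(lowercase=True, stop_words="english", max_features=5000)
vectors = vectorizer.fit_transform(tags)  # sparse matrix, one row per movie

# Pairwise cosine similarity between all movie vectors.
similarity = cosine_similarity(vectors)

print(vectors.shape[0])                     # 3 rows, one per movie
print(similarity[0, 1] > similarity[0, 2])  # True: movies 0 and 1 share more words
```

The resulting square matrix is the kind of object that could be serialized as similarity.pkl, so the web application only needs a lookup at request time instead of recomputing vectors.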

Although CountVectorizer is simpler than TF-IDF or word embeddings such as Word2Vec, it can work well in this scenario, and it is cheap to compute and easy to understand and debug.
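
To make the difference from TF-IDF concrete, here is a small comparison using scikit-learn's `TfidfVectorizer` with its default settings. The sample strings are invented for the example, and the project itself is described as using plain counts, so this is only a sketch of the alternative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical tag strings: "action" occurs in every movie, the other words in one.
tags = [
    "action dream suspense",
    "action space travel",
    "action comedy wedding",
]

count_model = CountVectorizer()
counts = count_model.fit_transform(tags)

tfidf_model = TfidfVectorizer()
tfidf = tfidf_model.fit_transform(tags)

count_row = counts[0].toarray().ravel()
tfidf_row = tfidf[0].toarray().ravel()

# Raw counts treat the common word and the rare word identically.
print(count_row[count_model.vocabulary_["action"]]
      == count_row[count_model.vocabulary_["dream"]])   # True

# TF-IDF gives the word shared by every movie a lower weight than the rare one.
print(tfidf_row[tfidf_model.vocabulary_["action"]]
      < tfidf_row[tfidf_model.vocabulary_["dream"]])    # True
```

Because both classes expose the same `fit_transform` interface, swapping one for the other is a small change if the plain counts turn out to over-weight very common tags.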