Spotify Song Popularity Prediction: A Machine Learning Practice Based on Audio Features

A complete project using Python to analyze Spotify song data and build machine learning models for popularity prediction. Through exploratory data analysis and comparison of multiple regression algorithms, it reveals the key factors influencing song popularity.

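Tags: Spotify, machine learning, popularity prediction, music recommendation, random forest, regression analysis, EDA, audio features, Python, data science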
Published 2026-05-20 21:15 | Recent activity 2026-05-20 21:20 | Estimated read 7 min

Section 01

Introduction to the Spotify Song Popularity Prediction Project

This project uses Python to run exploratory data analysis (EDA) on Spotify song data and to train machine learning models that predict song popularity and reveal the factors behind it. Four regression algorithms are compared (linear regression, decision tree, random forest, and gradient boosting), and the random forest model performs best. The results can provide data support for music production and event planning.

Section 02

Project Background and Dataset Overview

Project Background

In the era of music streaming, understanding what drives song popularity matters to producers (who want to create competitive works) and to event planners (who want to raise audience engagement). The goal of this project is to analyze Spotify data, explore the factors that influence popularity, and build a prediction model.

Dataset

We use the Spotify Tracks Dataset from Kaggle (approximately 114,000 records, 20 fields). It contains the target variable popularity, audio features (such as danceability, energy, and loudness), and metadata (artist, genre, duration_ms, etc.). In this dataset, popularity depends on a combination of features rather than on any single one.

Section 03

Project Methods and Workflow

Team Division

Team GROUP6 has a clear division of labor: data engineers are responsible for the cleaning process; data quality analysts handle quality checks; EDA analysts develop the exploratory analysis notebooks; visualization analysts create charts; and all members participate in the modeling phase.

Key Workflow

  1. EDA: Explore the data structure, feature distributions, and key relationships (e.g., loudness vs. popularity, popularity by genre).
  2. Data Preprocessing: Drop unneeded columns, handle missing and duplicate values, deduplicate by Track ID, treat outliers with the IQR rule, and standardize features.
  3. Modeling: Test four regression algorithms (linear regression, decision tree, random forest, gradient boosting), evaluate them with MAE, MSE, RMSE, and R², and tune hyperparameters.
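
The preprocessing step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's actual code: the column names (track_id, loudness, energy, popularity) are assumed to match the Kaggle dataset, the toy rows are invented, and clipping outliers to the IQR fences is one common choice (the project may have dropped such rows instead).

```python
import pandas as pd


def preprocess(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Drop missing/duplicate rows, clip IQR outliers, standardize features."""
    # Remove rows with missing values, then keep the first row per track ID.
    out = df.dropna().drop_duplicates(subset="track_id").copy()
    for col in feature_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Assumption: clip (rather than drop) values beyond the 1.5 * IQR fences.
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    # Z-score standardization: zero mean and unit variance per feature.
    means = out[feature_cols].mean()
    stds = out[feature_cols].std(ddof=0)
    out[feature_cols] = (out[feature_cols] - means) / stds
    return out


# Invented toy rows: one duplicate track ID, one missing value, one loudness outlier.
toy = pd.DataFrame({
    "track_id": ["a", "b", "b", "c", "d", "e"],
    "loudness": [-5.0, -6.0, -6.0, -7.0, -60.0, -8.0],
    "energy": [0.8, 0.7, 0.7, 0.6, 0.1, None],
    "popularity": [70, 55, 55, 40, 5, 30],
})
clean = preprocess(toy, ["loudness", "energy"])
print(clean)
```

In a real pipeline the quantiles, means, and standard deviations would be computed on the training split only and then reused on the test split, so that no information leaks from the test data.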

Section 04

Analysis Results and Model Performance

EDA Findings

  • Louder songs tend to be more popular.
  • Songs with explicit content have slightly higher average popularity.
  • The pop-film, k-pop, and chill genres stand out for high popularity.
  • The star effect (popularity concentrating around well-known artists) is significant.
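
Comparisons of this kind are usually a few pandas one-liners. The sketch below shows the mechanics only: the column names (track_genre, explicit, loudness, popularity) are assumed to match the dataset, and the six rows are invented, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Invented toy rows; values are for illustration only.
toy = pd.DataFrame({
    "track_genre": ["k-pop", "k-pop", "chill", "chill", "opera", "opera"],
    "explicit": [False, True, False, True, False, False],
    "loudness": [-4.0, -5.0, -9.0, -8.0, -20.0, -18.0],
    "popularity": [70, 75, 60, 58, 20, 25],
})

# Average popularity per genre, highest first.
genre_mean = (
    toy.groupby("track_genre")["popularity"].mean().sort_values(ascending=False)
)

# Difference in mean popularity between explicit and non-explicit tracks.
mean_explicit = toy.loc[toy["explicit"], "popularity"].mean()
mean_non_explicit = toy.loc[~toy["explicit"], "popularity"].mean()
explicit_gap = mean_explicit - mean_non_explicit

# Pearson correlation between loudness and popularity.
loudness_corr = toy["loudness"].corr(toy["popularity"])

print(genre_mean)
print(f"explicit gap: {explicit_gap:.2f}, loudness corr: {loudness_corr:.2f}")
```

A correlation only summarizes a linear association, so a scatter plot of loudness against popularity is a useful companion check.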

Model Performance

Random forest regression performed best of the four models, which suggests it captured nonlinear relationships between features that a linear model cannot. Permutation importance analysis was then used to identify the audio features that most influence popularity.
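
A comparison like this can be sketched with scikit-learn as below. Everything here is an assumption made for illustration: the data is synthetic (one nonlinear feature, one linear feature, one pure-noise feature), the models use default hyperparameters apart from fixed seeds, and the scores it prints are not the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the audio features and the popularity target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 3))
y = 50 + 30 * np.sin(3 * X[:, 0]) + 10 * X[:, 1] + rng.normal(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and collect the four metrics on the held-out split.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_test, pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})

# Permutation importance: the drop in test score when one feature is shuffled.
forest = models["random_forest"]
imp = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```

Computing permutation importance on held-out data rather than on the training set keeps the ranking from rewarding features the model has merely memorized.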

Section 05

Key Findings

Key findings of the project:

  1. Songs with high energy and high loudness are more likely to be popular;
  2. Genres like pop-film, k-pop, and chill have higher average popularity;
  3. Songs with explicit content have slightly higher popularity (an effect tied to specific genres);
  4. Popularity is determined by a combination of multiple features, with no single decisive factor;
  5. The star effect remains important in music consumption.

Section 06

Practical Application Recommendations

For Music Producers

Use high-popularity songs as a reference: they tend to have higher energy, greater loudness, and a dynamic rhythm. Favor genres such as pop, k-pop, or dance-pop to match audience preferences.

For Event Planners

Choose songs with high energy, strong rhythm, or from popular genres to enhance the on-site atmosphere and audience engagement.

Section 07

Technical Highlights and Conclusion

Technical Highlights

  • End-to-end machine learning workflow: covers every stage from data collection to model evaluation;
  • Team collaboration: a clear division of labor, with all members taking part in modeling to ensure breadth and quality;
  • Multiple model comparison and interpretability analysis: focus on business insights rather than just accuracy.

Conclusion

This project demonstrates the application value of machine learning in the music industry. It covers the data science workflow from data cleaning through model evaluation and offers a reference for learners in related fields. As AI penetrates deeper into the creative industry, projects like this are likely to become more valuable.