Zing Forum

Dropout Risk Prediction Model for Online Learning Students Based on the OULAD Dataset

A machine learning model built on the OULAD dataset to predict student dropout risk in online learning environments and enable early academic intervention.

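Tags: Machine learning, online education, dropout prediction, OULAD dataset, logistic regression, learning analytics, educational data mining, Streamlit
Published 2026-05-14 14:25 | Recent activity 2026-05-14 14:30 | Estimated read 6 min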

Section 01

[Introduction] Core Overview of the Dropout Risk Prediction Model for Online Learning Students Based on the OULAD Dataset

This project uses machine learning to predict the dropout risk of students in online learning environments so that educators can intervene early. A logistic regression model trained on the Open University Learning Analytics Dataset (OULAD) achieved an overall accuracy of 76.4% on the test set and a recall of 67% for the dropout class. The project also includes an interactive Streamlit web application that gives educators real-time predictions to support resource allocation and intervention decisions.

Section 02

Project Background and Core Research Questions

Online education offers flexibility, but its dropout rates are significantly higher than those of traditional teaching. Identifying high-risk students and intervening in time is crucial for improving educational quality. Working with the OULAD dataset (which includes records of student behavior, demographics, and academic performance), this project asks one core research question: can student engagement, academic performance, and demographic information effectively predict dropout risk and be turned into actionable insights?

Section 03

Technical Implementation and Methodology

Data Processing: Merge multiple OULAD tables, focusing on three categories: demographics, learning engagement (e.g., VLE clicks), and assessment data.
Feature Engineering: Aggregate event-level data into student-level metrics (such as total clicks and median scores).
Missing Value Handling: Fill clicks/scores with 0; mark categorical variables with 'Unknown'.
Encoding Strategy: One-hot encoding for nominal variables, ordinal encoding for ordinal variables.
Target Transformation: Convert final_result into a binary dropout variable.
Model Selection: Logistic regression (standardized with StandardScaler, class_weight to balance classes).
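
The steps above can be sketched roughly as follows. This is a minimal illustration on made-up toy rows, not the project's actual code. The column names (id_student, sum_click, score, age_band, final_result) follow the public OULAD tables, but the choice of features, the mapping of only 'Withdrawn' to dropout, the position of 'Unknown' in the ordinal order, and class_weight="balanced" are all assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy stand-ins for three OULAD tables (values are invented for illustration).
student_info = pd.DataFrame({
    "id_student": [1, 2, 3, 4],
    "gender": ["M", "F", "F", "M"],
    "age_band": ["0-35", "35-55", None, "0-35"],
    "final_result": ["Pass", "Withdrawn", "Fail", "Withdrawn"],
})
student_vle = pd.DataFrame({"id_student": [1, 1, 2, 3], "sum_click": [10, 5, 2, 7]})
student_assessment = pd.DataFrame({"id_student": [1, 1, 1, 3],
                                   "score": [80.0, 60.0, 70.0, 40.0]})

# Feature engineering: collapse event-level rows to one row per student.
clicks = (student_vle.groupby("id_student")["sum_click"].sum()
          .rename("total_clicks").reset_index())
scores = (student_assessment.groupby("id_student")["score"].median()
          .rename("median_score").reset_index())

# Merge onto the demographic table; students with no events get NaN.
df = student_info.merge(clicks, on="id_student", how="left")
df = df.merge(scores, on="id_student", how="left")

# Missing values: 0 for clicks/scores, 'Unknown' for categorical variables.
numeric_cols = ["total_clicks", "median_score"]
df[numeric_cols] = df[numeric_cols].fillna(0)
df["age_band"] = df["age_band"].fillna("Unknown")

# Target transformation (assumption: only 'Withdrawn' counts as dropout).
df["dropout"] = (df["final_result"] == "Withdrawn").astype(int)

# Encoding, scaling, and the classifier in one pipeline.
age_order = [["Unknown", "0-35", "35-55", "55<="]]  # placing 'Unknown' first is arbitrary
preprocess = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("ordinal", OrdinalEncoder(categories=age_order), ["age_band"]),
    ("numeric", StandardScaler(), numeric_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

feature_cols = ["gender", "age_band"] + numeric_cols
model.fit(df[feature_cols], df["dropout"])
print(df[["id_student"] + numeric_cols + ["dropout"]])
```

Keeping the encoders and scaler inside the pipeline means the same transformations are applied at prediction time. One side effect of filling scores with 0 is that a student with no submissions looks the same as a student who scored zero.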

Section 04

Model Performance Evaluation Results

The model achieved an overall accuracy of 76.4% on the test set. The classification report shows: non-dropout class precision 0.84, recall 0.81, F1 0.82; dropout class precision 0.61, recall 0.67, F1 0.64. The confusion matrix is [[3619, 869], [669, 1362]]. Interpretation: A dropout recall of 67% means the model catches about two-thirds of the students who actually drop out (1362 of 2031), which is what matters for identifying at-risk students. A dropout precision of 0.61 means roughly 39% of flagged students (869 of 2231) are false positives, so the trade-off has to be weighed against intervention costs.
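
The reported figures can be re-derived from the confusion matrix alone. The short check below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, in the order non-dropout, dropout), which is consistent with the numbers in the classification report.

```python
# Confusion matrix as reported; rows = actual, columns = predicted (assumed layout),
# class order [non-dropout, dropout].
cm = [[3619, 869],
      [669, 1362]]

tn, fp = cm[0]   # actual non-dropout: correctly kept / wrongly flagged
fn, tp = cm[1]   # actual dropout: missed / correctly flagged
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of flagged students who really dropped out
recall = tp / (tp + fn)      # share of real dropouts that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```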

Section 05

Application Deployment and Educational Value

Application Deployment: An interactive web application built with Streamlit; the workflow is to train the model, save it with joblib, and then write app.py to launch the interface.
Tech Stack: Python 3.8+, Pandas/NumPy, Matplotlib/Seaborn, Scikit-learn, Joblib, Streamlit, Kagglehub.
Educational Value: Serves as an early warning system, helps optimize resource allocation, and provides a practical case study for learning analytics.
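
The save-then-serve workflow might look something like the sketch below. It is illustrative only: the training rows are invented, the two features are a simplification, and the file name dropout_model.joblib and the helper predict_dropout_risk are hypothetical names, not taken from the project.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (invented); the real project trains on OULAD features.
X_train = pd.DataFrame({
    "total_clicks": [900, 40, 650, 10, 720, 25],
    "median_score": [78, 0, 66, 0, 81, 30],
})
y_train = [0, 1, 0, 1, 0, 1]  # 1 = dropout

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced")),
])
pipeline.fit(X_train, y_train)

# Training step: persist the fitted pipeline so the app does not retrain.
# The file name is hypothetical; a temp directory keeps the sketch self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "dropout_model.joblib")
joblib.dump(pipeline, model_path)


def predict_dropout_risk(path, total_clicks, median_score):
    """Load the saved pipeline and return the predicted dropout probability."""
    model = joblib.load(path)
    row = pd.DataFrame([{"total_clicks": total_clicks,
                         "median_score": median_score}])
    # Column 1 of predict_proba corresponds to class 1 (dropout).
    return float(model.predict_proba(row)[0, 1])


risk = predict_dropout_risk(model_path, total_clicks=15, median_score=0)
print(f"Estimated dropout risk: {risk:.2f}")
```

In app.py, Streamlit widgets such as st.number_input and st.button would collect the inputs and pass them to a helper like this one, with the result shown via st.write; the app is then started with `streamlit run app.py`.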

Section 06

Limitations and Improvement Directions

Limitations: Class imbalance (non-dropout students are the majority); limited features (qualitative factors such as motivation are missing); generalization ability still to be verified.
Improvement Directions: Try ensemble learning (random forest or gradient boosting); add temporal-pattern and social-interaction features; explore deep learning (for large-scale data); integrate SHAP to improve interpretability.

Section 07

Project Summary

This project is a complete educational data mining case study, covering the entire process from data preprocessing to model deployment. The model achieves an accuracy of 76.4% and a dropout recall of 67%, and the Streamlit application lowers the barrier to use. Because the project is open source, others can extend and improve it, contributing to better online education.