Zing Forum

Data Science Salary Prediction: Real Lessons When Neural Networks Meet Small Datasets

An honest record of a machine learning experiment: predicting data science salaries with a PyTorch neural network and a random forest, and examining why both models performed poorly. The key lesson is that data quality matters more than algorithm selection.

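Tags: machine learning, salary prediction, PyTorch, random forest, overfitting, data quality, feature engineering, regression analysis, neural networks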
Published 2026-06-16 23:44 | Recent activity 2026-06-16 23:51 | Estimated read 6 min

Section 01

Introduction: Real Lessons from Data Science Salary Prediction

This project documents an experiment in predicting data science salaries with a PyTorch neural network and a random forest. The core lesson is that the models performed poorly because of data quality, not algorithm selection. By openly discussing its failures, the project offers real value for machine learning education.

Section 02

Project Background: The Significance of Honest Failure Cases

The machine learning field often showcases state-of-the-art (SOTA) results but rarely discloses failures. This project is noteworthy for honestly presenting both its code and the reasons its models performed poorly. That transparency is a reminder that machine learning is not just about tuning parameters and selecting algorithms; it is more about understanding data, identifying limitations, and evaluating results honestly.

Section 03

Technology Selection and Data Preprocessing

Project Objectives: Predict salaries using features such as experience level and company size.

Technology Selection:

  • Neural Network (PyTorch): Multi-layer fully connected network with Batch Normalization, trained with HuberLoss and the Adam optimizer for 10,000 epochs, with gradient clipping among other settings.
  • Random Forest (scikit-learn): A simpler model used for comparison.

Data Preprocessing:

  • Feature Engineering: Ordinal encoding (experience level, company size), one-hot encoding (employment type), and the remote work ratio retained as is.
  • Data Cleaning: Dropping high-cardinality columns (years of work, job title, etc.).
  • Leakage Prevention: StandardScaler is fit on the training data only.
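
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`experience_level`, `company_size`, `employment_type`, `remote_ratio`, `salary_in_usd`), the category orderings, and the tiny inline DataFrame are all assumptions invented for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data; column names and values are assumptions, not the project's dataset.
df = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EX", "MI", "SE", "EN", "SE"],
    "company_size": ["S", "M", "L", "M", "S", "L", "M", "M"],
    "employment_type": ["FT", "FT", "PT", "FT", "CT", "FT", "FT", "PT"],
    "remote_ratio": [0, 50, 100, 100, 0, 50, 100, 0],
    "salary_in_usd": [50_000, 80_000, 120_000, 200_000, 70_000, 130_000, 55_000, 110_000],
})

X = df.drop(columns="salary_in_usd")
y = df["salary_in_usd"]

# Split first so that nothing fitted later ever sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

encode = ColumnTransformer(
    [
        # Ordinal encoding with an explicit (assumed) order for each column.
        ("ordinal",
         OrdinalEncoder(categories=[["EN", "MI", "SE", "EX"], ["S", "M", "L"]]),
         ["experience_level", "company_size"]),
        # One-hot encoding; categories unseen during fit become all-zero rows.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["employment_type"]),
        # The remote work ratio is kept as a numeric column.
        ("keep", "passthrough", ["remote_ratio"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

scaler = StandardScaler()
# Encoder and scaler statistics come from the training rows only ...
X_train_scaled = scaler.fit_transform(encode.fit_transform(X_train))
# ... and are merely re-applied to the test rows, which avoids leakage.
X_test_scaled = scaler.transform(encode.transform(X_test))
```

Fitting the encoder and scaler after the split is the detail that matters here: calling `fit_transform` on the full table would let test-set means and variances leak into training.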

Section 04

Experimental Results and Failure Cause Analysis

Model Performance Comparison:

Model          | Training Set R² | Test Set R²
Neural Network | ~0.33           | -1.42
Random Forest  | 0.35            | 0.26

Failure Reasons:

  1. Insufficient feature information: Only 4 core features, too few to capture salary differences.
  2. Excessive salary variance: Salaries vary widely even among rows with identical feature values.
  3. Neural network overfitting: 10,000 epochs are too many for a small dataset; the test R² (-1.42) falls far below the training R² (~0.33), which indicates poor generalization.
  4. Random forest fares relatively better: Tree models often outperform deep learning on small tabular datasets, though a test R² of 0.26 is still weak.
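
A negative test R² has a concrete meaning: the model's squared error on the test set is larger than that of a constant prediction equal to the test-set mean. The sketch below applies the standard definition, R² = 1 - SS_res / SS_tot, to made-up numbers (not the project's data); the helper name `r2_score` is local to this example.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot


# Made-up salaries in thousands of USD, for illustration only.
y_true = np.array([60.0, 80.0, 100.0, 120.0, 140.0])

mean_baseline = np.full_like(y_true, y_true.mean())
bad_model = np.array([140.0, 120.0, 100.0, 80.0, 60.0])  # ranks every salary backwards

print(r2_score(y_true, mean_baseline))  # 0.0: exactly as good as the mean
print(r2_score(y_true, bad_model))      # -3.0: four times the squared error of the mean
```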

Section 05

Key Lesson: Data Quality Trumps Algorithm Selection

Core Insight: The upper limit of a model is set by the data, not the algorithm. According to the project, raising R² above 0.6 would require additional features: specific job titles, precise years of work, city/region, company name, and industry sector. Reminder: Before modeling, confirm that the available features are informative enough to support the prediction task.
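
One way to make this upper limit concrete: when many rows share identical feature values, no model can beat predicting the average salary of each feature combination, so the spread within each group caps the achievable R². The sketch below estimates that in-sample ceiling with pandas on a synthetic table. The column names, coefficients, and noise level are invented for illustration and are not the project's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000

# Synthetic data: two coarse features plus large unexplained salary noise.
df = pd.DataFrame({
    "experience_level": rng.integers(0, 4, size=n),  # 0..3, ordinal-encoded
    "company_size": rng.integers(0, 3, size=n),      # 0..2, ordinal-encoded
})
signal = 60_000 + 20_000 * df["experience_level"] + 5_000 * df["company_size"]
df["salary"] = signal + rng.normal(0, 40_000, size=n)  # noise the features cannot explain

features = ["experience_level", "company_size"]

# Best possible in-sample prediction from these features alone:
# the mean salary of each distinct feature combination.
group_mean = df.groupby(features)["salary"].transform("mean")

ss_res = ((df["salary"] - group_mean) ** 2).sum()
ss_tot = ((df["salary"] - df["salary"].mean()) ** 2).sum()
r2_ceiling = 1 - ss_res / ss_tot

print(f"In-sample R² ceiling for these features: {r2_ceiling:.2f}")
```

Because the group means are computed on the same rows they are scored on, this ceiling is slightly optimistic. If it already sits far below the target R², a better algorithm will not close the gap; only more informative features can.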

6

Section 06

Technical Highlights: Praiseworthy Code Practices

Despite poor results, the code has several good practices:

  • A custom PyTorch dataset class encapsulates the loading logic.
  • Batch Normalization after each layer stabilizes training.
  • Huber Loss reduces the influence of outliers.
  • Gradient clipping guards against exploding gradients.
  • R² is reported on both the training and test sets to diagnose overfitting.
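
These practices might fit together roughly as shown below. It is a sketch under assumptions, not the project's code: the class names `SalaryDataset` and `SalaryNet`, the layer sizes, learning rate, `delta`, clipping norm, epoch count, and the random tensors standing in for data are all hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class SalaryDataset(Dataset):
    """Hypothetical dataset wrapping pre-scaled features and targets."""

    def __init__(self, features, targets):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.targets = torch.as_tensor(targets, dtype=torch.float32).reshape(-1, 1)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]


class SalaryNet(nn.Module):
    """Fully connected network with BatchNorm after each hidden layer."""

    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
X = torch.randn(64, 4)                    # random stand-in for 4 scaled features
y = X.sum(dim=1) + 0.1 * torch.randn(64)  # random stand-in for a scaled target

dataset = SalaryDataset(X, y)
# drop_last avoids a trailing batch of size 1, which BatchNorm cannot train on.
loader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)

model = SalaryNet(n_features=4)
loss_fn = nn.HuberLoss(delta=1.0)  # quadratic for small errors, linear for large ones
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(5):  # deliberately few epochs for the sketch
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        # Rescale gradients so that their global norm is at most 1.0.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```

With a small dataset, tracking the validation loss each epoch and stopping once it stops improving would be one possible guard against the overfitting described earlier; that is an addition suggested here, not something the project is reported to do.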

Section 07

Advice for Learners

  1. Start with simple models: Build a baseline first (for example, a random forest or linear regression). If simple models perform poorly, complex models are unlikely to do better.
  2. Understand the data: Explore distributions, correlations, and limitations before modeling.
  3. Be honest about results: A negative R² is nothing to be ashamed of; pretending the result is good is. Failures often teach more than successes.
  4. Value feature engineering: Time spent collecting better features usually has a bigger impact than time spent tuning models.
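
A baseline-first workflow can be sketched with scikit-learn as below. The data is synthetic (`make_regression`), and the model choices and hyperparameters are assumptions for illustration. The point is only the pattern: compare training and test R² against a trivial baseline before reaching for anything more complex.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small, noisy tabular dataset.
X, y = make_regression(n_samples=300, n_features=4, noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (
        r2_score(y_train, model.predict(X_train)),  # training R²
        r2_score(y_test, model.predict(X_test)),    # test R²
    )
    train_r2, test_r2 = scores[name]
    print(f"{name:18s} train R² = {train_r2:.2f}  test R² = {test_r2:.2f}")
```

A large gap between the two columns for one model signals overfitting, and a test R² at or below the mean baseline signals that the features carry little usable information.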

Section 08

Conclusion: The Value of Failure

Although this project produced no results worthy of a top conference, it offers several valuable lessons:

  • Not all problems are suitable for deep learning.
  • Data quality is the fundamental limit on model performance.
  • Honest reporting of failures is a foundation of scientific progress.
  • Simple methods are often the best starting point.

For learners, studying this project is more educational than following tutorials that report 99% accuracy, because it shows the complexity of the real world and the importance of critical thinking.