Zing Forum

Megatonn: Technical Analysis of an AI-Driven Cross-City Salary Prediction Platform

A machine learning-based salary prediction platform that supports cross-city salary comparison analysis, using feature engineering, target encoding, and TF-IDF technologies, and provides an interactive visualization interface via Streamlit.

Tags: Salary Prediction, Machine Learning, Streamlit, Feature Engineering, Target Encoding, TF-IDF, Data Science, Python, Multi-City Comparison
Published 2026-05-27 05:23 | Recent activity 2026-05-27 05:26 | Estimated read 6 min

Section 03

Project Background and Application Scenarios

As global talent mobility increases, regional differences in salary levels have become a shared concern for job seekers and employers. A software engineer's salary in New York, London, Bangalore, or Singapore can differ several-fold, but quantifying that difference rigorously is hard. The Megatonn project addresses this problem with a machine learning-based cross-city salary prediction platform.

The project's core value is that users enter their career profile once and the system predicts the salary for that profile in different cities, helping them make better-informed career decisions. This is useful for professionals considering relocation, HR departments setting compensation strategy, and economists studying the labor market.

Section 04

System Architecture and Technology Stack

Megatonn follows a typical data science application architecture built on a pragmatic Python stack, organized into three layers: a frontend interface, backend inference, and feature engineering.

Section 05

Frontend Interface

  • Streamlit: A Python framework for quickly building data applications, supporting interactive components and visualization
  • Plotly Express: Generates interactive charts to display cross-city salary comparisons
  • Bilingual Support: Built-in English and Russian interfaces to meet international needs
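
To make the frontend pieces concrete, here is a minimal sketch of two helpers such an interface might rely on: a bilingual label lookup and a function that scores one profile against several cities and returns a chart-ready table. The label keys, the `label` and `comparison_frame` names, and the `predict_salary` callable are all hypothetical; the project's actual code may be organized quite differently.

```python
import pandas as pd

# Hypothetical UI strings; the real project's keys and wording are unknown.
LABELS = {
    "en": {"title": "Salary by city", "city": "City", "salary": "Predicted salary"},
    "ru": {"title": "Зарплата по городам", "city": "Город", "salary": "Прогноз зарплаты"},
}


def label(key: str, lang: str = "en") -> str:
    """Look up a UI string, falling back to English for unknown languages or keys."""
    return LABELS.get(lang, LABELS["en"]).get(key, LABELS["en"][key])


def comparison_frame(profile: dict, cities: list, predict_salary) -> pd.DataFrame:
    """Score one profile against every city; highest predicted salary first."""
    rows = [
        {"city": city, "salary": float(predict_salary({**profile, "city": city}))}
        for city in cities
    ]
    return pd.DataFrame(rows).sort_values("salary", ascending=False, ignore_index=True)
```

In a Streamlit script, the resulting frame could be passed to `plotly.express.bar(frame, x="city", y="salary")` and rendered with `st.plotly_chart`, with `label(...)` supplying the axis titles for the selected language.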

Section 06

Backend Inference

  • scikit-learn: Core machine learning library providing regression model support
  • joblib: Model serialization and deserialization
  • pandas / numpy: Data processing and numerical computation
  • scipy.sparse: Sparse matrix operations for efficient handling of high-dimensional features
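
The following sketch shows how these four libraries typically fit together at inference time: dense and sparse feature blocks are stacked with `scipy.sparse`, a scikit-learn regressor is fitted, and the model is round-tripped through `joblib`. The data is synthetic and `Ridge` is only a stand-in, since the article does not name the regression model Megatonn uses.

```python
import io

import joblib
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Synthetic data: 2 dense numeric columns plus a 50-column sparse block.
rng = np.random.default_rng(0)
dense = rng.random((20, 2))
sparse_block = sparse.random(20, 50, density=0.05, format="csr", random_state=0)

# Stack everything into one sparse matrix so high-dimensional features stay cheap.
X = sparse.hstack([sparse.csr_matrix(dense), sparse_block], format="csr")
y = dense[:, 0] * 1000 + 500

model = Ridge(alpha=1.0).fit(X, y)  # assumed model, for illustration only

# Serialize and restore; a real app would use a file path instead of a buffer.
buffer = io.BytesIO()
joblib.dump(model, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

predictions = restored.predict(X[:3])
```

The restored model produces the same predictions as the original, which is the property a separate training script and Streamlit app depend on.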

Section 07

Feature Engineering

  • TF-IDF: Vectorized representation of text features
  • MultiLabelBinarizer: Multi-label skill encoding
  • Target Encoding: An effective way to handle high-cardinality categorical variables
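
As an illustration of the multi-label skill encoding, the sketch below applies scikit-learn's `MultiLabelBinarizer` to a few made-up skill lists. The skill names are invented for the example, and using sparse output is an assumption that simply matches the sparse-matrix handling mentioned above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each profile carries a variable-length list of skills (example data).
skills = [["python", "sql"], ["python", "docker", "aws"], ["excel"]]

mlb = MultiLabelBinarizer(sparse_output=True)
skill_matrix = mlb.fit_transform(skills)  # one column per distinct skill

# Columns follow mlb.classes_, which is sorted alphabetically.
columns = list(mlb.classes_)
first_row = skill_matrix.toarray()[0].tolist()
```

Each row is a 0/1 indicator vector, so a profile with any number of skills maps to a fixed-width representation.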

Section 08

1. Multi-Dimensional Feature Engineering System

The feature engineering module combines several groups of features to account for the complexity of human resources data:

Basic Numerical Features:

  • Years of work experience (experience_years)
  • Number of hard skills, number of soft skills
  • Total number of skills, ratio of hard to soft skills
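
A minimal pandas sketch of these numerical features follows. Only `experience_years` is named in the article; the other column names are hypothetical, and the `+ 1` in the ratio's denominator is an assumed guard against profiles with no soft skills.

```python
import pandas as pd

df = pd.DataFrame({
    "experience_years": [2, 7, 12],
    "hard_skills": [["python", "sql"], ["java"], []],
    "soft_skills": [["communication"], [], ["leadership", "mentoring"]],
})

# .str.len() also works on list-valued columns and returns the list length.
df["n_hard_skills"] = df["hard_skills"].str.len()
df["n_soft_skills"] = df["soft_skills"].str.len()
df["n_skills_total"] = df["n_hard_skills"] + df["n_soft_skills"]

# Assumed smoothing: add 1 to the denominator to avoid division by zero.
df["hard_soft_ratio"] = df["n_hard_skills"] / (df["n_soft_skills"] + 1)
```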

Categorical Feature Encoding:

  • Frequency Encoding: Calculate the occurrence frequency of each category in the training set
  • Target Encoding: Replace the original category with the mean value of the target variable corresponding to the category
  • One-Hot Encoding: Handle low-cardinality categorical variables
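
The three encodings can be sketched with plain pandas as below. The data is invented, and the fallbacks for unseen categories (frequency 0, global mean salary) are assumptions rather than Megatonn's documented behavior. Statistics are computed on the training set only; in practice target encoding is often computed out-of-fold or with smoothing to limit target leakage.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["London", "London", "Singapore", "Bangalore"],
    "salary": [6000.0, 8000.0, 7000.0, 3000.0],
})
new = pd.DataFrame({"city": ["London", "Tokyo"]})  # "Tokyo" is unseen in training

# Statistics learned from the training set only.
freq = train["city"].value_counts(normalize=True)
target_mean = train.groupby("city")["salary"].mean()
global_mean = train["salary"].mean()

# Frequency encoding: share of training rows with this category.
new["city_freq"] = new["city"].map(freq).fillna(0.0)
# Target encoding: mean training salary for this category.
new["city_target"] = new["city"].map(target_mean).fillna(global_mean)

# One-hot encoding, suitable when the number of categories is small.
one_hot = pd.get_dummies(train["city"], prefix="city")
```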

Interaction Features:

  • Role-city combination (role_city)
  • Role-experience interaction (role_exp_interaction)
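
The article does not define how these interactions are built, so the following is one plausible sketch: `role_city` as a simple string cross, and `role_exp_interaction` as the role crossed with an experience band. The band boundaries and labels are assumptions made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "role": ["data scientist", "backend engineer"],
    "city": ["London", "Singapore"],
    "experience_years": [1, 9],
})

# Cross role and city into a single categorical value.
df["role_city"] = df["role"] + "_" + df["city"]

# Assumed experience bands: (-1, 2], (2, 5], (5, 10], (10, 100].
bands = pd.cut(
    df["experience_years"],
    bins=[-1, 2, 5, 10, 100],
    labels=["junior", "mid", "senior", "lead"],
)
df["role_exp_interaction"] = df["role"] + "_" + bands.astype(str)
```

The crossed columns are high-cardinality categoricals, so they would normally be passed through frequency or target encoding rather than one-hot encoding.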

Text Feature Extraction:

  • Word-level TF-IDF of job titles
  • Character-level TF-IDF of job titles (captures spelling variations and abbreviations)
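
To show why both granularities are useful, the sketch below fits a word-level and a character-level `TfidfVectorizer` on a few invented job titles and compares an abbreviated title with its full form. The n-gram ranges and the `char_wb` analyzer are assumptions; the article does not state the project's settings.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["Software Developer", "SW Dev", "Data Analyst"]

word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

X_word = word_tfidf.fit_transform(titles)
X_char = char_tfidf.fit_transform(titles)

# Concatenate both views into one sparse text-feature block.
X_text = sparse.hstack([X_word, X_char], format="csr")

# Rows are L2-normalized by default, so a dot product is the cosine similarity.
word_sim = X_word[0].multiply(X_word[1]).sum()
char_sim = X_char[0].multiply(X_char[1]).sum()
```

Here "Software Developer" and "SW Dev" share no word tokens, so their word-level similarity is zero, while shared character n-grams such as "dev" give a positive character-level similarity.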

Salary Anchoring Features:

  • City average salary (city_avg_salary)
  • Role average salary (role_avg_salary)
  • Role-city combination average salary (role_city_avg_salary)
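
The anchoring features amount to group means joined back onto each row, as in this sketch. The `add_anchor` helper is hypothetical, the data is invented, and falling back to the role average for unseen role-city pairs is an assumed policy. As with target encoding, the means should come from training data only.

```python
import pandas as pd

train = pd.DataFrame({
    "role": ["engineer", "engineer", "analyst", "analyst"],
    "city": ["London", "Bangalore", "London", "London"],
    "salary": [8000.0, 3000.0, 5000.0, 6000.0],
})


def add_anchor(frame, stats_source, keys, name):
    """Join the mean training salary for `keys` onto `frame` as column `name`."""
    means = stats_source.groupby(keys)["salary"].mean().rename(name).reset_index()
    return frame.merge(means, on=keys, how="left")


query = pd.DataFrame({"role": ["analyst", "analyst"], "city": ["London", "Bangalore"]})
query = add_anchor(query, train, ["city"], "city_avg_salary")
query = add_anchor(query, train, ["role"], "role_avg_salary")
query = add_anchor(query, train, ["role", "city"], "role_city_avg_salary")

# Assumed fallback: unseen role-city pairs use the role average instead.
query["role_city_avg_salary"] = query["role_city_avg_salary"].fillna(
    query["role_avg_salary"]
)
```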