Megatonn: Technical Analysis of an AI-Driven Cross-City Salary Prediction Platform

Megatonn is a machine learning-based salary prediction platform that supports cross-city salary comparison. It combines feature engineering, target encoding, and TF-IDF text features, and exposes an interactive visualization interface built with Streamlit.

Salary Prediction, Machine Learning, Streamlit, Feature Engineering, Target Encoding, TF-IDF, Data Science, Python, Multi-City Comparison

Published 2026-05-27 05:23
Recent activity 2026-05-27 05:26
Estimated read 6 min

Section 01

Introduction: Megatonn: Technical Analysis of an AI-Driven Cross-City Salary Prediction Platform

Section 02

Original Author and Source

Original Author/Maintainer: Marilaura-Alvarado
Source Platform: GitHub
Original Title: megatonn_prediction_app
Original Link: https://github.com/Marilaura-Alvarado/megatonn_prediction_app
Publication Date: May 26, 2026

Section 03

Project Background and Application Scenarios

As global talent mobility increases, regional differences in salary levels matter to both job seekers and employers. A software engineer's salary in New York, London, Bangalore, or Singapore may differ severalfold, but quantifying that difference rigorously is difficult. The Megatonn project addresses this problem by building a machine learning-based cross-city salary prediction platform.

The core value of the project is that users enter their career profile once, and the system predicts the salary for that profile in different cities, helping them make better-informed career decisions. This is useful for professionals considering relocation, HR departments setting compensation strategy, and economists studying the labor market.

Section 04

System Architecture and Technology Stack

Megatonn uses a typical data science application architecture built on a practical Python stack:

Section 05

Frontend Interface

Streamlit: A Python framework for quickly building data applications, supporting interactive components and visualization
Plotly Express: Generates interactive charts to display cross-city salary comparisons
Bilingual Support: Built-in English and Russian interfaces to meet international needs

Section 06

Backend Inference

scikit-learn: Core machine learning library providing regression model support
joblib: Model serialization and deserialization
pandas / numpy: Data processing and numerical computation
scipy.sparse: Sparse matrix operations for efficient handling of high-dimensional features
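How these backend pieces could fit together is sketched below. This is a minimal illustration, not the project's actual code: the toy data, the choice of Ridge as the regressor, and the file name salary_model.joblib are all assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Toy data (assumed): 4 profiles with 2 dense numeric columns plus a sparse
# block standing in for one-hot or TF-IDF output.
dense = np.array([[1.0, 3.0], [5.0, 6.0], [2.0, 4.0], [8.0, 7.0]])
sparse_block = sparse.csr_matrix(
    np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
)
y = np.array([50_000.0, 90_000.0, 60_000.0, 120_000.0])

# Stack everything into one CSR matrix so the high-dimensional block
# never has to be stored densely.
X = sparse.hstack([sparse.csr_matrix(dense), sparse_block], format="csr")

# Ridge is an assumed stand-in; the source only says "regression model".
model = Ridge(alpha=1.0).fit(X, y)

# Round-trip the fitted model through joblib, as an app might do at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "salary_model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

predictions = restored.predict(X)
```

Keeping the feature matrix sparse end to end matters mostly once TF-IDF and one-hot columns push the width into the thousands.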

Section 07

Feature Engineering

TF-IDF: Vectorized representation of text features
MultiLabelBinarizer: Multi-label skill encoding
Target Encoding: An effective way to handle high-cardinality categorical variables
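As a small illustration of the multi-label skill encoding, the sketch below runs scikit-learn's MultiLabelBinarizer on an invented list of skills. The skill names are made up for the example, and using sparse_output=True is an assumption about how such a matrix might be kept compatible with the other sparse features.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each profile lists a variable number of skills (invented sample data).
skills = [
    ["python", "sql"],
    ["python", "docker", "aws"],
    ["sql"],
]

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
skill_matrix = mlb.fit_transform(skills)

# Columns follow the sorted label order stored in classes_.
skill_names = list(mlb.classes_)
dense_view = skill_matrix.toarray()
```

Each row has a 1 in the column of every skill the profile lists, so the matrix width equals the number of distinct skills seen during fitting.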

Section 08

1. Multi-Dimensional Feature Engineering System

The project's feature engineering module is built around the complexity of human resources data and combines several families of features:

Basic Numerical Features:

Years of work experience (experience_years)
Number of hard skills, number of soft skills
Total number of skills, ratio of hard to soft skills
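The numeric features above can be derived with a few pandas operations. In the sketch below only experience_years comes from the source; the other column names, the sample rows, and the +1 in the ratio's denominator (to avoid division by zero) are assumptions.

```python
import pandas as pd

# Invented sample profiles; skills are stored as Python lists.
df = pd.DataFrame({
    "experience_years": [2, 7, 4],
    "hard_skills": [["python", "sql"], ["java", "aws", "docker"], []],
    "soft_skills": [["communication"], [], ["teamwork", "leadership"]],
})

# Skill counts (hypothetical column names).
df["n_hard_skills"] = df["hard_skills"].apply(len)
df["n_soft_skills"] = df["soft_skills"].apply(len)
df["n_skills_total"] = df["n_hard_skills"] + df["n_soft_skills"]

# Ratio of hard to soft skills; the +1 is an assumed guard so that
# profiles with no soft skills do not divide by zero.
df["hard_soft_ratio"] = df["n_hard_skills"] / (df["n_soft_skills"] + 1)
```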

Categorical Feature Encoding:

Frequency Encoding: Calculate the occurrence frequency of each category in the training set
Target Encoding: Replace the original category with the mean value of the target variable corresponding to the category
One-Hot Encoding: Handle low-cardinality categorical variables
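Frequency and target encoding can both be expressed as lookup tables learned from the training set. The following is a minimal sketch with invented data; the column names city_freq and city_target, and the fallbacks for unseen categories (0.0 and the global mean), are assumptions rather than the project's documented behavior.

```python
import pandas as pd

# Invented training data.
train = pd.DataFrame({
    "city": ["London", "London", "Singapore", "Bangalore"],
    "salary": [80_000, 100_000, 90_000, 30_000],
})

# Frequency encoding: share of training rows belonging to each category.
freq_map = train["city"].value_counts(normalize=True)

# Target encoding: mean salary of the training rows in each category.
target_map = train.groupby("city")["salary"].mean()
global_mean = train["salary"].mean()


def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the learned maps; unseen categories get assumed fallbacks."""
    out = frame.copy()
    out["city_freq"] = out["city"].map(freq_map).fillna(0.0)
    out["city_target"] = out["city"].map(target_map).fillna(global_mean)
    return out


# "New York" never appears in the training data.
new_rows = pd.DataFrame({"city": ["London", "New York"]})
encoded = encode(new_rows)
```

Applying a plain per-category mean back to the same training rows leaks the target into the features, so out-of-fold or smoothed variants are commonly preferred in practice. Whether Megatonn does this is not stated in the source.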

Interaction Features:

Role-city combination (role_city)
Role-experience interaction (role_exp_interaction)
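The source names these two interaction features but does not say how they are computed. One plausible construction, shown below purely as an assumption, concatenates role and city into a single categorical key and multiplies a numeric role encoding by years of experience. The role_target column and the sample rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "role": ["data scientist", "backend engineer"],
    "city": ["London", "Singapore"],
    "experience_years": [3, 8],
    # Hypothetical numeric encoding of the role (e.g. a target-encoded value).
    "role_target": [85_000.0, 95_000.0],
})

# Combined categorical key, which can then be frequency- or target-encoded.
df["role_city"] = df["role"] + "_" + df["city"]

# Assumed definition: numeric role encoding scaled by experience.
df["role_exp_interaction"] = df["role_target"] * df["experience_years"]
```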

Text Feature Extraction:

Word-level TF-IDF of job titles
Character-level TF-IDF of job titles (captures spelling variations and abbreviations)
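The value of pairing the two vectorizers is easiest to see on titles that mean the same thing but share no exact word. The sketch below uses scikit-learn's TfidfVectorizer with invented titles; the n-gram ranges and the char_wb analyzer are assumptions, not the project's documented settings.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented titles: the second is an abbreviated, misspelled variant of the first.
titles = [
    "Senior Developer",
    "Sr. Develper",
    "Data Scientist",
]

# Word-level features (unigrams and bigrams).
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
# Character n-grams taken inside word boundaries.
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

X_word = word_tfidf.fit_transform(titles)
X_char = char_tfidf.fit_transform(titles)

# Both blocks are sparse, so they can be stacked side by side.
X_text = sparse.hstack([X_word, X_char], format="csr")

# Rows are L2-normalised by default, so dot products are cosine similarities.
word_sim = (X_word @ X_word.T).toarray()[0, 1]
char_sim = (X_char @ X_char.T).toarray()[0, 1]
```

Here the first two titles have zero similarity at word level because no token matches exactly, while the character-level view still finds overlap through shared fragments such as "dev" and "vel".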

Salary Anchoring Features:

City average salary (city_avg_salary)
Role average salary (role_avg_salary)
Role-city combination average salary (role_city_avg_salary)
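The three anchoring features amount to group means computed on the training data and joined onto each profile. The sketch below is one possible way to build them with pandas; the sample data and the fallback to the role average for unseen role-city pairs are assumptions.

```python
import pandas as pd

# Invented training data.
train = pd.DataFrame({
    "role": ["engineer", "engineer", "analyst", "analyst"],
    "city": ["London", "Singapore", "London", "Bangalore"],
    "salary": [100_000, 90_000, 60_000, 20_000],
})

# Group means at three levels of granularity.
city_avg = train.groupby("city")["salary"].mean().rename("city_avg_salary")
role_avg = train.groupby("role")["salary"].mean().rename("role_avg_salary")
role_city_avg = (
    train.groupby(["role", "city"])["salary"].mean().rename("role_city_avg_salary")
)


def add_anchors(frame: pd.DataFrame) -> pd.DataFrame:
    """Left-join the learned averages onto new profiles."""
    out = frame.merge(city_avg.reset_index(), on="city", how="left")
    out = out.merge(role_avg.reset_index(), on="role", how="left")
    out = out.merge(role_city_avg.reset_index(), on=["role", "city"], how="left")
    # Assumed fallback: an unseen role-city pair uses the role average.
    out["role_city_avg_salary"] = out["role_city_avg_salary"].fillna(
        out["role_avg_salary"]
    )
    return out


# One profile evaluated in two cities; (analyst, Singapore) is unseen in training.
profiles = pd.DataFrame({
    "role": ["analyst", "analyst"],
    "city": ["London", "Singapore"],
})
anchored = add_anchors(profiles)
```

Because these columns are derived from the target itself, they would normally be computed from training rows only and then reused unchanged at prediction time.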
