# Megatonn: Technical Analysis of an AI-Driven Cross-City Salary Prediction Platform

> A machine learning-based salary prediction platform that supports cross-city salary comparison analysis, using feature engineering, target encoding, and TF-IDF technologies, and provides an interactive visualization interface via Streamlit.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-26T21:23:15.000Z
- Last activity: 2026-05-26T21:26:03.532Z
- Popularity: 161.9
- Keywords: salary prediction, machine learning, Streamlit, feature engineering, target encoding, TF-IDF, data science, Python, multi-city comparison
- Page link: https://www.zingnex.cn/en/forum/thread/megatonn-ai
- Canonical: https://www.zingnex.cn/forum/thread/megatonn-ai
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer**: Marilaura-Alvarado
- **Source Platform**: GitHub
- **Original Title**: megatonn_prediction_app
- **Original Link**: https://github.com/Marilaura-Alvarado/megatonn_prediction_app
- **Publication Date**: May 26, 2026

## Project Background and Application Scenarios

As global talent mobility increases, regional salary differences matter to both job seekers and employers. A software engineer's salary in New York, London, Bangalore, or Singapore can differ severalfold, yet quantifying that gap rigorously is difficult. Megatonn addresses this with a machine learning-based cross-city salary prediction platform.

The project's core value is that a user enters a career profile once and the system predicts the salary for that profile in each city, supporting better-informed career decisions. This is useful for professionals considering relocation, HR departments setting compensation strategy, and economists studying the labor market.

## System Architecture and Technology Stack

Megatonn follows a typical data science application architecture built on a pragmatic technology stack:

## Frontend Interface

- **Streamlit**: A Python framework for quickly building data applications, supporting interactive components and visualization
- **Plotly Express**: Generates interactive charts to display cross-city salary comparisons
- **Bilingual Support**: Built-in English and Russian interfaces to meet international needs

## Backend Inference

- **scikit-learn**: Core machine learning library providing regression model support
- **joblib**: Model serialization and deserialization
- **pandas / numpy**: Data processing and numerical computation
- **scipy.sparse**: Sparse matrix operations for efficient handling of high-dimensional features
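
The backend pieces listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy training rows, the `DictVectorizer` + `Ridge` pipeline, the file name `salary_model.joblib`, and the helper `predict_across_cities` are all assumptions made for the example. It shows a model being serialized with joblib, reloaded, and then asked to score one profile in several cities.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy training rows (hypothetical values, for illustration only).
train = [
    {"city": "New York", "role": "software engineer", "experience_years": 5, "salary": 150_000},
    {"city": "London", "role": "software engineer", "experience_years": 5, "salary": 90_000},
    {"city": "Bangalore", "role": "software engineer", "experience_years": 5, "salary": 35_000},
    {"city": "London", "role": "data analyst", "experience_years": 2, "salary": 55_000},
    {"city": "New York", "role": "data analyst", "experience_years": 2, "salary": 85_000},
]
X = [{k: v for k, v in row.items() if k != "salary"} for row in train]
y = [row["salary"] for row in train]

# DictVectorizer one-hot encodes string fields and returns a scipy.sparse matrix.
model = make_pipeline(DictVectorizer(sparse=True), Ridge(alpha=1.0))
model.fit(X, y)

# Serialize and reload the fitted pipeline, as an app would do at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "salary_model.joblib"  # hypothetical file name
    joblib.dump(model, path)
    loaded = joblib.load(path)


def predict_across_cities(model, profile, cities):
    """Score the same profile once per city and return a comparison table."""
    rows = [{**profile, "city": city} for city in cities]
    return pd.DataFrame({"city": cities, "predicted_salary": model.predict(rows)})


profile = {"role": "software engineer", "experience_years": 5}
cities = ["New York", "London", "Bangalore"]
comparison = predict_across_cities(loaded, profile, cities)
print(comparison)
```

The resulting table has one row per city, which is the shape a bar chart of cross-city salaries would need.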

## Feature Engineering

- **TF-IDF**: Vectorized representation of text features
- **MultiLabelBinarizer**: Multi-label skill encoding
- **Target Encoding**: An effective way to handle high-cardinality categorical variables
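
As a small sketch of the multi-label skill encoding mentioned above, the snippet below applies scikit-learn's `MultiLabelBinarizer` to a few made-up skill lists. The skill names are hypothetical, and how the project handles skills unseen at training time is not stated in the source; the last lines only show scikit-learn's default behavior.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each profile lists zero or more skills (hypothetical examples).
skills = [
    ["python", "sql"],
    ["python", "communication"],
    ["excel"],
]

# sparse_output=True keeps the indicator matrix in scipy.sparse format.
mlb = MultiLabelBinarizer(sparse_output=True)
skill_matrix = mlb.fit_transform(skills)

print(list(mlb.classes_))      # column order is the sorted skill vocabulary
print(skill_matrix.toarray())  # one row per profile, one column per skill

# By default, skills not seen during fit are ignored with a warning.
unseen = mlb.transform([["python", "rust"]])
print(unseen.toarray())
```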

## 1. Multi-Dimensional Feature Engineering System

The feature engineering module combines several feature families to reflect the complexity of human resources data:

**Basic Numerical Features**:
- Years of work experience (experience_years)
- Number of hard skills, number of soft skills
- Total number of skills, ratio of hard to soft skills
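
The numerical features above could be derived along these lines. Only `experience_years` is named in the source; the other column names, and the `+ 1` in the ratio denominator used to avoid division by zero, are assumptions for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "experience_years": [2, 7, 4],
    "hard_skills": [["python", "sql"], ["java", "aws", "docker"], []],
    "soft_skills": [["communication"], [], ["leadership"]],
})

# .str.len() also works on list-valued columns and returns each list's length.
df["n_hard_skills"] = df["hard_skills"].str.len()
df["n_soft_skills"] = df["soft_skills"].str.len()
df["n_skills_total"] = df["n_hard_skills"] + df["n_soft_skills"]

# Assumed smoothing: add 1 to the denominator so profiles with no soft skills
# do not produce an infinite ratio.
df["hard_soft_ratio"] = df["n_hard_skills"] / (df["n_soft_skills"] + 1)

print(df[["experience_years", "n_hard_skills", "n_soft_skills",
          "n_skills_total", "hard_soft_ratio"]])
```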

**Categorical Feature Encoding**:
- Frequency Encoding: Replace each category with its relative frequency in the training set
- Target Encoding: Replace each category with the mean of the target variable (salary) for that category
- One-Hot Encoding: Used for low-cardinality categorical variables
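
The three encodings can be illustrated with plain pandas. The data, the column names, and the fallbacks for unseen categories (frequency 0, global mean salary) are assumptions for this sketch; the source does not say whether the project applies smoothing or out-of-fold estimation, which are common ways to reduce target leakage.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["London", "London", "Bangalore", "New York"],
    "salary": [80_000, 90_000, 30_000, 150_000],
})

# Statistics are learned from the training set only.
city_freq = train["city"].value_counts(normalize=True)
city_target_mean = train.groupby("city")["salary"].mean()
global_mean = train["salary"].mean()


def encode(df):
    """Apply frequency and target encoding learned from the training set."""
    out = df.copy()
    out["city_freq"] = out["city"].map(city_freq).fillna(0.0)
    # Unseen cities fall back to the global mean (an assumption of this sketch).
    out["city_target"] = out["city"].map(city_target_mean).fillna(global_mean)
    return out


new_rows = pd.DataFrame({"city": ["London", "Singapore"]})
encoded = encode(new_rows)
print(encoded)

# One-hot encoding for a low-cardinality column.
one_hot = pd.get_dummies(train["city"], prefix="city")
print(one_hot.columns.tolist())
```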

**Interaction Features**:
- Role-city combination (role_city)
- Role-experience interaction (role_exp_interaction)
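
The source names these two interaction features but does not describe how they are built. The sketch below shows one plausible reading: `role_city` as a concatenated categorical key, and `role_exp_interaction` as the role combined with an experience band. The separator, the band edges, and the band labels are all assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "role": ["data scientist", "software engineer", "software engineer"],
    "city": ["London", "Bangalore", "New York"],
    "experience_years": [1, 3, 8],
})

# Role-city key: a single categorical value per (role, city) pair.
df["role_city"] = df["role"] + "|" + df["city"]

# Assumed experience bands: [0, 2) junior, [2, 5) mid, [5, inf) senior.
exp_band = pd.cut(
    df["experience_years"],
    bins=[0, 2, 5, np.inf],
    labels=["junior", "mid", "senior"],
    right=False,
)
df["role_exp_interaction"] = df["role"] + "|" + exp_band.astype(str)

print(df[["role_city", "role_exp_interaction"]])
```

Keys built this way are themselves high-cardinality categories, so they would typically be fed through frequency or target encoding rather than one-hot encoding.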

**Text Feature Extraction**:
- Word-level TF-IDF of job titles
- Character-level TF-IDF of job titles (captures spelling variations and abbreviations)
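
A rough sketch of the two text views, assuming scikit-learn's `TfidfVectorizer`: the job titles and the n-gram ranges are illustrative choices, not settings taken from the project. The two sparse matrices are stacked side by side with `scipy.sparse.hstack`.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Senior Software Engineer",
    "Sr. Software Eng",
    "Data Scientist",
    "Senior Data Scientist",
]

# Word-level view: whole tokens and token pairs.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
# Character-level view: n-grams inside word boundaries, which lets
# abbreviations such as "Eng" overlap with "Engineer".
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

X_word = word_tfidf.fit_transform(titles)
X_char = char_tfidf.fit_transform(titles)

# Concatenate both views column-wise into one sparse feature matrix.
X_text = hstack([X_word, X_char]).tocsr()
print(X_word.shape, X_char.shape, X_text.shape)

# Rows are L2-normalized by default, so the sum of elementwise products
# is the cosine similarity between the full title and its abbreviation.
char_similarity = X_char[0].multiply(X_char[1]).sum()
print(round(float(char_similarity), 3))
```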

**Salary Anchoring Features**:
- City average salary (city_avg_salary)
- Role average salary (role_avg_salary)
- Role-city combination average salary (role_city_avg_salary)
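
The anchoring features amount to group-wise salary means joined back onto each row. In the sketch below the feature names come from the source, while the data and the fallback chain for unseen combinations (role-city mean, then role mean, then global mean) are assumptions. The means are computed on training data only, since computing them on rows being predicted would leak the target.

```python
import pandas as pd

train = pd.DataFrame({
    "role": ["engineer", "engineer", "analyst", "analyst"],
    "city": ["London", "Bangalore", "London", "London"],
    "salary": [90_000, 30_000, 60_000, 70_000],
})

global_avg = train["salary"].mean()
city_avg = train.groupby("city")["salary"].mean()
role_avg = train.groupby("role")["salary"].mean()
role_city_avg = (
    train.groupby(["role", "city"])["salary"].mean()
    .rename("role_city_avg_salary")
    .reset_index()
)


def add_anchor_features(df):
    """Attach city, role, and role-city average salaries to each row."""
    out = df.merge(role_city_avg, on=["role", "city"], how="left")
    out["city_avg_salary"] = out["city"].map(city_avg).fillna(global_avg)
    out["role_avg_salary"] = out["role"].map(role_avg).fillna(global_avg)
    # Assumed fallback: unseen role-city pairs use the role average.
    out["role_city_avg_salary"] = out["role_city_avg_salary"].fillna(
        out["role_avg_salary"]
    )
    return out


queries = pd.DataFrame({
    "role": ["analyst", "engineer"],
    "city": ["Bangalore", "London"],
})
anchored = add_anchor_features(queries)
print(anchored)
```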
