# Fake Job Detector: A Real-Time Detection System for Fake Job Postings Based on NLP and Machine Learning

> Fake Job Detector is an open-source fake job posting detection tool that combines TF-IDF text vectorization, a logistic regression classifier, and a rule-based risk scoring system. It provides a user-friendly web interface via Streamlit to help job seekers identify potential recruitment scams.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Posted: 2026-05-13T05:26:10.000Z
- Last activity: 2026-05-13T05:34:47.849Z
- Popularity: 161.9
- Keywords: fake job detection, natural language processing, machine learning, TF-IDF, logistic regression, Streamlit, job search safety, text classification, risk scoring
- Page URL: https://www.zingnex.cn/en/forum/thread/fake-job-detector-nlp
- Canonical: https://www.zingnex.cn/forum/thread/fake-job-detector-nlp
- Markdown source: floors_fallback

---


## Problem Background: The Proliferation of Recruitment Scams

In the era of digital recruitment, fake job postings have become a serious problem plaguing job seekers worldwide. Scammers exploit job seekers' eagerness by posting seemingly legitimate job opportunities, when their actual intent is to defraud money, steal personal information, or lure victims into illegal activities. Common recruitment scam tactics include fake part-time jobs promising "thousands of yuan per day", demands for prepaid "training fees" or "deposits", contact through non-official channels (such as Telegram), and skipping the interview process entirely in favor of direct hiring.

For inexperienced job seekers, especially fresh graduates and career changers, identifying these scam messages is often very difficult. Traditional prevention methods rely on manual review and personal experience, but in the face of massive job postings, this approach is inefficient and prone to omissions. The Fake Job Detector project was created to address this pain point. It uses natural language processing and machine learning technologies to provide job seekers with an automated tool for identifying fake job postings.

## System Architecture: Multi-Layer Detection Strategy

Fake Job Detector adopts a three-layer detection strategy, combining statistical machine learning, text analysis, and heuristic rules to build a comprehensive fake job posting identification system.

### Layer 1: TF-IDF + Logistic Regression Classifier

The core of the system is a text classification model based on TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. TF-IDF converts text into numerical vectors, capturing the importance of words in the document. Compared to the simple bag-of-words model, TF-IDF can reduce the weight of common words and highlight distinguishing keywords.

The classifier uses logistic regression, a linear model that is computationally efficient and highly interpretable. The model is trained on a merged dataset of real and fake job postings to learn the feature patterns that distinguish the two types of text. After training, the model outputs a probability for each newly input job posting, indicating how likely it is to be fake.
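A minimal sketch of this pipeline with scikit-learn. The training texts and labels below are toy placeholders, not the project's actual dataset, and the vectorizer settings are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: 1 = fake posting, 0 = legitimate posting
texts = [
    "Earn 2000 per day, no interview required, contact on Telegram",
    "Pay a small training fee to start working from home immediately",
    "Software engineer role, 3+ years Python, on-site interviews",
    "Data analyst position, SQL required, standard hiring process",
]
labels = [1, 1, 0, 0]

# Vectorize text, then fit the linear classifier
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Probability that a new posting is fake
new_posting = "No interview needed, message us on Telegram to earn daily"
prob_fake = model.predict_proba(vectorizer.transform([new_posting]))[0, 1]
print(f"Probability of being fake: {prob_fake:.2f}")
```

Persisting the fitted vectorizer alongside the model matters: predictions must use the exact vocabulary and IDF weights learned at training time.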

### Layer 2: Rule-Based Risk Scoring

In addition to the machine learning model, the system implements a heuristic rule-based risk scoring mechanism. These rules are derived from known recruitment scam patterns, including:

- **Currency symbols and amount patterns**: Exaggerated income promises such as "₹2000/day" or "earn 500 yuan per day"
- **Bypassing formal processes**: Phrases like "no interview required" or "direct onboarding"
- **Non-official communication channels**: Guidance such as "contact via Telegram" or "add WeChat for details"
- **Suspicious job descriptions**: Overemphasis on keywords like "zero experience", "work from home", or "easy money"

Each rule carries a risk score, and the system sums the scores of all triggered rules into an overall risk score.
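One way to implement such a scorer is a table of regex patterns with weights. The rules and scores below are hypothetical examples matching the categories above; the project's actual rule set and weights may differ:

```python
import re

# Hypothetical (pattern, score) rule table; not the project's actual weights.
RISK_RULES = [
    (re.compile(r"[₹$¥€]\s?\d+\s*/\s*(day|week)", re.I), 30),        # exaggerated income
    (re.compile(r"no interview|direct (onboarding|hiring)", re.I), 25),  # skips formal process
    (re.compile(r"telegram|whatsapp|add wechat", re.I), 25),         # non-official channels
    (re.compile(r"zero experience|easy money", re.I), 10),           # suspicious keywords
]

def risk_score(text: str) -> int:
    """Sum the scores of all rules triggered by the posting text."""
    return sum(score for pattern, score in RISK_RULES if pattern.search(text))

posting = "No interview required! Earn ₹2000/day, contact us on Telegram."
print(risk_score(posting))  # 30 + 25 + 25 = 80
```

A threshold on the accumulated score (or a blend with the classifier's probability) then decides whether to flag the posting.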

### Layer 3: URL Content Crawling and Analysis

For job postings that contain links, the system can fetch the page content directly for analysis. It parses the page HTML with the BeautifulSoup library, extracts the job description text, and passes it to the classifier and risk scoring modules described above. Users can therefore verify external job posting links directly, not just manually entered text.
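A sketch of this fetch-and-extract step. The function names are illustrative, and the extraction simply strips scripts, styles, and markup rather than targeting any specific job board's layout:

```python
import requests
from bs4 import BeautifulSoup

def extract_posting_text(html: str) -> str:
    """Strip markup and return the visible text of a job posting page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements
    return " ".join(soup.get_text(separator=" ").split())

def fetch_posting_text(url: str, timeout: int = 10) -> str:
    """Download a posting URL and return its cleaned text for classification."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_posting_text(response.text)
```

The cleaned text can then be fed to `risk_score` and the classifier exactly like pasted text.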

## Data Layer

The project includes three main datasets:

- **fake_postings.csv**: Samples of known fake job postings
- **original_fake_jobs.csv**: Mixed job posting dataset
- **merged_fake_job_postings.csv**: Cleaned and merged training dataset

Data preprocessing includes text cleaning (removing HTML tags and special characters), tokenization, and stopword filtering. The project uses the NLTK library for basic natural language processing operations.
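The cleaning steps can be sketched as below. The project uses NLTK's full stopword list; the tiny inline set and regex tokenizer here are simplifications to keep the example self-contained:

```python
import re

# Small illustrative stopword set; the project uses NLTK's full English list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "no"}

def preprocess(text: str) -> list[str]:
    """Clean a raw posting: drop HTML tags and special characters,
    lowercase, tokenize, and filter stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # remove special characters
    tokens = text.lower().split()             # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>Earn $500/day! No interview for the job.</p>"))
# → ['earn', 'day', 'interview', 'job']
```

The resulting token lists are what the TF-IDF vectorizer consumes during training.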

## Model Layer

The trained models are persistently stored in pickle format:

- **lrmodel.pkl**: Trained logistic regression classifier
- **vectorizer.pkl**: Fitted TF-IDF vectorizer

This design allows the models to be trained once and reused multiple times without retraining for each prediction.
