Dual-Layer Adaptive Cybersecurity Detection System: Using NLP and Machine Learning to Combat Evolving Social Engineering Attacks

This article introduces a dual-layer adaptive system that combines a random forest classifier with an expert rule engine. It classifies emails, SMS, and chat messages into 7 types of social engineering attacks with a reported accuracy of 98.18%.

Tags: cybersecurity, NLP, machine-learning, phishing-detection, social-engineering, random-forest, adaptive-learning, text-classification, fraud-detection

Published 2026-05-23 01:45. Recent activity 2026-05-23 01:49. Estimated read: 8 min.

Section 01

Core Guide to the Dual-Layer Adaptive Cybersecurity Detection System

This article introduces the dual-layer adaptive cybersecurity detection system developed by the Usha Martin University team, which combines a random forest classifier and an expert rule engine. It can identify 7 types of social engineering attacks in emails, SMS, and other messages with an accuracy rate of 98.18%. The system has adaptive learning capabilities and can be continuously updated to respond to evolving attack patterns, providing a practical and interpretable solution for social engineering attack detection.

Section 02

Project Background and Motivation

In today's digital society, social engineering attack methods are constantly evolving (e.g., using psychological manipulation such as a sense of urgency or authority). Traditional detection systems based on rules or single machine learning models struggle to keep up. Therefore, the team developed this dual-layer adaptive system to address the challenge of rapidly changing attack patterns while achieving high accuracy and adaptive capabilities.

Section 03

System Architecture: Dual-Layer Collaborative Detection Mechanism

The core of the system is a dual-layer architecture:

Random Forest Classifier: Trained on 197,909 samples, it assigns each message to one of 7 categories (Safe plus six attack types: Phishing, Urgency Manipulation, Authority Impersonation, Financial Fraud, Malware/Suspicious Links, Credential Theft) with an accuracy of 98.18%.

Nine-Rule Expert Engine: Captures subtle attacks missed by ML. The rule table is as follows:

Rule Name	Detection Target	Severity	Score
Impersonation	Authority Impersonation	CRITICAL	85
Credential Theft	Urgent Request for OTP/Password	CRITICAL	90
Urgency Escalation	Time Pressure + Threat Combination	HIGH	70
Context Attack	Financial/Legal/Health/Work Bait	HIGH	65-70
Subtle Manipulation	Flattery, False Intimacy, Scarcity	MEDIUM-HIGH	20-60
Obfuscation	Distorted Text, Invisible Characters, URLs	HIGH-CRITICAL	65-90
Mixed Signal	Trust Vocabulary + Attack Vocabulary Combination	HIGH	70
Malware Install	Inducing App Installation / Fake Delivery Failure	HIGH	72
Safe Signals	Authentic Communication Patterns	LOW	-30

Decision Fusion: Threat Score = 0.5 × ML Score + 0.5 × Rule Engine Score. A rule of CRITICAL severity can override the ML prediction.
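
The fusion step can be sketched as follows. This is a minimal illustration, not the project's actual code. The article gives only the 50/50 weighting and the existence of a CRITICAL override, so several details here are assumptions: that both scores lie on a 0 to 100 scale, that the rule engine score is the clamped sum of the fired rules' scores, and that an override replaces the predicted category with the name of the highest-scoring CRITICAL rule. `RuleHit` and `fuse` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RuleHit:
    """One fired rule (hypothetical structure mirroring the rule table)."""
    name: str
    severity: str  # e.g. "LOW", "HIGH", "CRITICAL"
    score: float   # score from the rule table; Safe Signals is negative


def fuse(ml_score: float, ml_label: str, hits: List[RuleHit]) -> Tuple[float, str]:
    """Combine the ML score and the rule engine score with equal weights."""
    # Assumption: the rule engine score is the sum of fired rule scores,
    # clamped to the range [0, 100].
    rule_score = max(0.0, min(100.0, sum(h.score for h in hits)))
    threat = 0.5 * ml_score + 0.5 * rule_score

    # Assumption: a CRITICAL rule overrides the ML category with its own name.
    critical = [h for h in hits if h.severity == "CRITICAL"]
    label = max(critical, key=lambda h: h.score).name if critical else ml_label
    return threat, label


if __name__ == "__main__":
    hits = [RuleHit("Credential Theft", "CRITICAL", 90.0)]
    print(fuse(20.0, "Safe", hits))
```

In this sketch a message the classifier considers safe still ends up labelled Credential Theft when the corresponding CRITICAL rule fires, which is one plausible reading of "override".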

Section 04

Feature Engineering: Multi-Dimensional Text Understanding

The system constructs a 10,012-dimensional feature vector:

TF-IDF Features (10,000 dimensions): Implemented with scikit-learn, extracts unigrams/bigrams, and uses sublinear_tf to reduce the impact of high-frequency words.
Psychological Manipulation Features (8 dimensions): Includes credential phrases, authority markers, link urgency, etc. Each value is multiplied by 10 to increase its weight.
Attack Pattern Features (4 dimensions): Identifies technical methods such as obfuscation, mixed signals, subtle manipulation, and context attacks.
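
Two of the ideas above can be illustrated with the standard library alone. scikit-learn's `sublinear_tf` option replaces a raw term count tf with 1 + log(tf), and the handcrafted features are multiplied by 10 before being appended to the TF-IDF part. The sketch below is only an illustration under stated assumptions: the keyword patterns are invented, only three of the eight manipulation features are shown, and `manipulation_features` and `build_vector` are hypothetical names rather than functions from the project.

```python
import math
import re
from typing import List, Sequence

WEIGHT = 10.0  # the x10 scaling described in the article


def sublinear_tf(count: int) -> float:
    """Sublinear term-frequency scaling: 1 + ln(count) for count > 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def manipulation_features(text: str) -> List[float]:
    """Three illustrative binary cues, scaled by WEIGHT.

    The keyword lists are assumptions made for this example only.
    """
    lowered = text.lower()
    has_url = bool(re.search(r"https?://\S+", lowered))
    has_urgency = bool(re.search(r"\b(now|immediately|urgent)\b", lowered))
    raw = [
        float(bool(re.search(r"\b(password|otp|pin)\b", lowered))),  # credential phrase
        float(bool(re.search(r"\b(bank|police|ceo)\b", lowered))),   # authority marker
        float(has_url and has_urgency),                              # link urgency
    ]
    return [WEIGHT * value for value in raw]


def build_vector(tfidf_part: Sequence[float], text: str) -> List[float]:
    """Append the scaled handcrafted features to the TF-IDF features."""
    return list(tfidf_part) + manipulation_features(text)


if __name__ == "__main__":
    print(sublinear_tf(100))
    print(build_vector([0.2, 0.0], "Your bank needs your OTP now: http://x.example"))
```

The sublinear scaling means a word that occurs 100 times contributes roughly 5.6 rather than 100, which keeps very frequent words from dominating the vector.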

Section 05

Adaptive Learning Mechanism: Combating Attack Evolution

To address attack evolution, the system is designed with an adaptive mechanism:

Data Segmentation: The 197,909 samples are divided into T1 (training, 70%), T2a (drift set simulating new attacks, 15%), and T2b (testing, 15%).
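
A 70/15/15 segmentation of this kind might look like the sketch below. The article does not say whether the split is random, chronological, or grouped by source, so the seeded random shuffle is an assumption, and `split_t1_t2a_t2b` is a hypothetical helper name.

```python
import random
from typing import List, Tuple


def split_t1_t2a_t2b(n_samples: int, seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices into T1 (70%), T2a (15%), and T2b (remainder)."""
    indices = list(range(n_samples))
    # Assumption: a seeded random shuffle; the original split method is not stated.
    random.Random(seed).shuffle(indices)

    n_t1 = int(n_samples * 0.70)
    n_t2a = int(n_samples * 0.15)

    t1 = indices[:n_t1]
    t2a = indices[n_t1:n_t1 + n_t2a]
    t2b = indices[n_t1 + n_t2a:]  # whatever is left, roughly 15%
    return t1, t2a, t2b


if __name__ == "__main__":
    t1, t2a, t2b = split_t1_t2a_t2b(197_909)
    print(len(t1), len(t2a), len(t2b))
```

Giving T2b the remainder guarantees that every sample lands in exactly one partition even when the percentages do not divide the dataset evenly.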

Experimental Results: Performance improvement after adaptation:

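Metric	Before Adaptation (T1→T2b)	After Adaptation (T1+T2a→T2b)	Improvement
Accuracy	97.65%	98.18%	+0.53 pp
Precision	97.64%	98.18%	+0.54 pp
Recall	97.65%	98.18%	+0.53 pp
F1 Score	0.9763	0.9818	+0.0055

Improvements are given in percentage points (pp). These results suggest that retraining on newly collected samples helps the system keep pace with new attack patterns, with a gain of about half a percentage point on this split.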
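
The evaluation protocol (train on T1 and test on T2b, then retrain on T1+T2a and test on T2b again) can be sketched with a deliberately tiny stand-in model. Everything here is an assumption made for illustration: the toy token-set classifier replaces the random forest, the messages are invented, and the resulting numbers have nothing to do with the figures in the table.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, int]  # (message text, label) with 1 = attack, 0 = safe

# Invented toy data; T2A introduces vocabulary that T1 never saw.
T1: List[Sample] = [("verify password now", 1), ("lunch at noon tomorrow", 0)]
T2A: List[Sample] = [("scan this qr code and claim refund", 1), ("see you at the meeting", 0)]
T2B: List[Sample] = [
    ("claim refund via qr", 1),
    ("password reset required now", 1),
    ("meeting moved see you tomorrow", 0),
]


def train(samples: List[Sample]) -> Set[str]:
    """Toy model: the set of tokens seen only in attack messages."""
    attack_tokens: Set[str] = set()
    safe_tokens: Set[str] = set()
    for text, label in samples:
        (attack_tokens if label == 1 else safe_tokens).update(text.lower().split())
    return attack_tokens - safe_tokens


def predict(model: Set[str], text: str) -> int:
    """Flag a message as an attack if it contains any attack-only token."""
    return int(any(token in model for token in text.lower().split()))


def accuracy(model: Set[str], samples: List[Sample]) -> float:
    return sum(predict(model, text) == label for text, label in samples) / len(samples)


def evaluate_adaptation(t1: List[Sample], t2a: List[Sample], t2b: List[Sample]) -> Tuple[float, float]:
    """Return accuracy on T2b before (T1 only) and after (T1 + T2a) adaptation."""
    before = accuracy(train(t1), t2b)
    after = accuracy(train(t1 + t2a), t2b)
    return before, after


if __name__ == "__main__":
    print(evaluate_adaptation(T1, T2A, T2B))
```

The point of the design is that T2b is never used for training in either run, so the two accuracies are directly comparable.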

Section 06

Technical Implementation and Application Value

Tech Stack: Python 3.10+, scikit-learn 1.8.0, NLTK 3.9.4, Streamlit 1.35+, pandas, and NumPy.

Preprocessing Flow: Raw message → lowercase → URL marking → stopword filtering → lemmatization → feature extraction.

Web Deployment: A Streamlit interface shows the real-time threat score, risk level, attack category, detection evidence, and security recommendations.

Dataset: Integrates multi-source data such as CEAS_08, phishing_email, and Enron, covering channels such as email and SMS.

Insights: In the hybrid architecture, the ML model and the rules complement each other; the rule engine improves interpretability; and adaptive capability is key to combating evolving attacks.
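
The preprocessing flow described in this section can be sketched with the standard library. This is an approximation rather than the project's pipeline. The project uses NLTK, whereas the stopword set and the suffix-stripping `naive_lemma` below are crude stand-ins chosen so the example runs without downloading NLTK data. The placeholder `urltoken` is also an assumption.

```python
import re
from typing import List

URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Assumption: a tiny illustrative stopword set, not NLTK's full list.
STOPWORDS = {"the", "a", "an", "to", "is", "your", "of", "and", "at"}


def naive_lemma(token: str) -> str:
    """Crude stand-in for a real lemmatizer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(message: str) -> List[str]:
    """Raw message -> lowercase -> URL marking -> stopword filtering -> lemmatization."""
    text = message.lower()
    text = URL_RE.sub(" urltoken ", text)     # mark URLs with a placeholder token
    tokens = re.findall(r"[a-z0-9_]+", text)  # simple alphanumeric tokenizer
    return [naive_lemma(t) for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("Verify your account at http://bad.example/login NOW"))
```

Replacing each URL with a single placeholder keeps the presence of a link as a signal while preventing thousands of unique URLs from inflating the vocabulary.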

Section 07

Future Directions and Conclusion

Future Directions: Integrate more Hugging Face phishing datasets, explore deep learning models such as BERT, and develop a real-time streaming version.

Conclusion: Through carefully designed feature engineering, a dual-layer architecture, and an adaptive mechanism, the system provides a practical, interpretable, and evolvable detection platform that can serve as a reference case for security teams.

Project Information: Authors: Mohammad Kaif et al. Institution: Usha Martin University. License: MIT. Code Repository: https://github.com/kaif0102/Adaptive-Detection-of-Evolving-Language-Based-Cyber-Attacks
