Dual-Layer Adaptive Cybersecurity Detection System: Using NLP and Machine Learning to Combat Evolving Social Engineering Attacks

This article introduces a dual-layer adaptive system that combines a random forest classifier with an expert rule engine. It classifies emails, SMS, and chat messages into 7 categories (a Safe class plus six social engineering attack types) with a reported accuracy of 98.18%.

Tags: cybersecurity, NLP, machine-learning, phishing-detection, social-engineering, random-forest, adaptive-learning, text-classification, fraud-detection
Published 2026-05-23 01:45 | Recent activity 2026-05-23 01:49 | Estimated read 8 min

Section 01

Core Guide to the Dual-Layer Adaptive Cybersecurity Detection System

This article introduces the dual-layer adaptive cybersecurity detection system developed by the Usha Martin University team, which combines a random forest classifier with an expert rule engine. It assigns emails, SMS, and other messages to one of 7 categories (a Safe class plus six social engineering attack types) with a reported accuracy of 98.18%. The system can also be retrained on newly observed samples to respond to evolving attack patterns, which makes it a practical and interpretable approach to social engineering attack detection.


Section 02

Project Background and Motivation

Social engineering attacks keep evolving, and they rely on psychological manipulation such as manufactured urgency or appeals to authority. Traditional detection systems built on static rules or on a single machine learning model struggle to keep up. The team therefore developed this dual-layer adaptive system to handle rapidly changing attack patterns while keeping accuracy high.


Section 03

System Architecture: Dual-Layer Collaborative Detection Mechanism

The core of the system is a dual-layer architecture:

  1. Random Forest Classifier: Built on a 197,909-sample dataset, it assigns each message to one of 7 classes (Safe plus six attack types: Phishing, Urgency Manipulation, Authority Impersonation, Financial Fraud, Malware/Suspicious Links, Credential Theft) with an accuracy of 98.18%.
  2. Nine-Rule Expert Engine: Captures subtle attacks missed by ML. The rule table is as follows:
    | Rule Name | Detection Target | Severity | Score |
    |---|---|---|---|
    | Impersonation | Authority Impersonation | CRITICAL | 85 |
    | Credential Theft | Urgent Request for OTP/Password | CRITICAL | 90 |
    | Urgency Escalation | Time Pressure + Threat Combination | HIGH | 70 |
    | Context Attack | Financial/Legal/Health/Work Bait | HIGH | 65-70 |
    | Subtle Manipulation | Flattery, False Intimacy, Scarcity | MEDIUM-HIGH | 20-60 |
    | Obfuscation | Distorted Text, Invisible Characters, URLs | HIGH-CRITICAL | 65-90 |
    | Mixed Signal | Trust Vocabulary + Attack Vocabulary Combination | HIGH | 70 |
    | Malware Install | Inducing App Installation / Fake Delivery Failure | HIGH | 72 |
    | Safe Signals | Authentic Communication Patterns | LOW | -30 |

Decision Fusion: Threat Score = ML Score × 50% + Rule Engine Score × 50%. CRITICAL rules can override ML predictions.
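
The interaction between the rule engine and the fusion step can be illustrated with a minimal sketch. Only the rule names, severities, scores, and the 50/50 weighting come from the article. The regular expressions, the 0-100 scale for the ML score, the clamped sum used to aggregate rule hits, and the exact override behavior (a CRITICAL hit sets a floor on the threat score) are assumptions made for illustration, not the project's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str
    score: int
    required: tuple  # every regex here must match for the rule to fire


# Two of the nine rules; the patterns are hypothetical stand-ins.
RULES = [
    Rule("Credential Theft", "CRITICAL", 90,
         (r"\b(otp|password|pin)\b", r"\b(urgent|immediately|now)\b")),
    Rule("Malware Install", "HIGH", 72,
         (r"\b(install|download)\b.*\bapp\b",)),
]


def fired_rules(text):
    """Return the rules whose required patterns all match the message."""
    lowered = text.lower()
    return [r for r in RULES if all(re.search(p, lowered) for p in r.required)]


def assess(text, ml_score):
    """Blend an ML score (assumed 0-100) with the rule engine score."""
    hits = fired_rules(text)
    # Assumption: rule scores are summed and clamped to the 0-100 range.
    rule_score = max(0, min(100, sum(r.score for r in hits)))
    threat = 0.5 * ml_score + 0.5 * rule_score
    # Assumption: a CRITICAL hit overrides the blend by acting as a floor.
    critical = [r.score for r in hits if r.severity == "CRITICAL"]
    if critical:
        threat = max(threat, max(critical))
    return threat, [r.name for r in hits]


print(assess("URGENT: send your OTP now to keep your account", ml_score=10))
print(assess("Please install this app to reschedule delivery", ml_score=60))
print(assess("See you at lunch tomorrow", ml_score=4))
```

In the first call the classifier score is low, but the CRITICAL rule lifts the final threat score to the rule's own score, which is one plausible reading of "override".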

Section 04

Feature Engineering: Multi-Dimensional Text Understanding

The system constructs a 10,012-dimensional feature vector:

  • TF-IDF Features (10,000 dimensions): Implemented with scikit-learn; extracts unigrams and bigrams and uses sublinear_tf to dampen the impact of high-frequency words.
  • Psychological Manipulation Features (8 dimensions): Includes credential phrases, authority markers, link urgency, etc., multiplied by 10 to increase their weight relative to the TF-IDF features.
  • Attack Pattern Features (4 dimensions): Identifies technical methods such as obfuscation, mixed signals, subtle manipulation, and context attacks.
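
The three feature groups above can be assembled into one matrix roughly as follows. This is a sketch under assumptions: the TfidfVectorizer settings mirror the description, but the keyword checks are placeholder logic, the five manipulation features the article does not name are left as zeros, and the example messages are invented. On a toy corpus the TF-IDF block is far smaller than 10,000 columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "urgent: verify your password now at http://example.test/login",
    "lunch at noon tomorrow?",
    "your bank account is suspended, send the otp immediately",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=10_000,  # cap on the number of TF-IDF dimensions
    sublinear_tf=True,    # use 1 + log(tf) instead of raw term counts
)
tfidf = vectorizer.fit_transform(messages)


def manipulation_features(text):
    """8 psychological manipulation features (placeholder keyword checks)."""
    t = text.lower()
    feats = np.zeros(8)
    feats[0] = float("password" in t or "otp" in t)                  # credential phrases
    feats[1] = float("bank" in t or "official" in t)                 # authority markers
    feats[2] = float("http" in t and ("urgent" in t or "now" in t))  # link urgency
    # feats[3:] stay zero: the article does not name the remaining features.
    return feats * 10.0  # the x10 scaling described above


def attack_pattern_features(text):
    """4 attack pattern features: obfuscation, mixed signals,
    subtle manipulation, context attacks (only the first is sketched)."""
    feats = np.zeros(4)
    feats[0] = float("\u200b" in text)  # zero-width space as a crude obfuscation cue
    return feats


extra = np.vstack([
    np.concatenate([manipulation_features(m), attack_pattern_features(m)])
    for m in messages
])

# Final matrix: TF-IDF columns followed by the 8 + 4 handcrafted columns.
X = hstack([tfidf, csr_matrix(extra)]).tocsr()
print(X.shape)
```

Keeping the handcrafted columns sparse and stacking them with `hstack` avoids densifying the TF-IDF block, which matters at the 10,000-column scale the article describes.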

Section 05

Adaptive Learning Mechanism: Combating Attack Evolution

To address attack evolution, the system is designed with an adaptive mechanism:

  • Data Segmentation: The 197,909 samples are divided into T1 (training, 70%), T2a (drift set simulating new attacks, 15%), and T2b (testing, 15%).
  • Experimental Results: Performance on T2b before and after adaptation (improvements for percentage metrics are in percentage points):

    | Metric | Before Adaptation (T1→T2b) | After Adaptation (T1+T2a→T2b) | Improvement |
    |---|---|---|---|
    | Accuracy | 97.65% | 98.18% | +0.53 pp |
    | Precision | 97.64% | 98.18% | +0.54 pp |
    | Recall | 97.65% | 98.18% | +0.53 pp |
    | F1 Score | 0.9763 | 0.9818 | +0.0055 |

    These results suggest that retraining on newly observed samples helps the system keep its detection performance as attack patterns drift.
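
The evaluation protocol above (train on T1 and test on T2b, then retrain on T1 plus T2a and test on T2b again) can be sketched in a few lines. The split below slices the data in its stored order, which is an assumption, since the article does not say whether the split is sequential or random. `fit` and `score` are hypothetical callables standing in for model training and evaluation.

```python
def split_t1_t2a_t2b(samples):
    """Split into 70% T1, 15% T2a, and the remainder (about 15%) T2b."""
    n = len(samples)
    n_t1 = n * 70 // 100
    n_t2a = n * 15 // 100
    t1 = samples[:n_t1]
    t2a = samples[n_t1:n_t1 + n_t2a]
    t2b = samples[n_t1 + n_t2a:]
    return t1, t2a, t2b


def adaptation_gain(fit, score, t1, t2a, t2b):
    """Compare T1 -> T2b against T1 + T2a -> T2b on the same test set."""
    before = score(fit(list(t1)), t2b)
    after = score(fit(list(t1) + list(t2a)), t2b)
    return before, after, after - before


# Split sizes for the dataset size reported in the article.
t1, t2a, t2b = split_t1_t2a_t2b(range(197_909))
print(len(t1), len(t2a), len(t2b))  # 138536 29686 29687

# Toy stand-ins: the "model" memorizes what it has seen, and the "score"
# is the fraction of test items it recognizes.
toy_fit = lambda train: set(train)
toy_score = lambda model, test: sum(x in model for x in test) / len(test)
print(adaptation_gain(toy_fit, toy_score, [1, 2, 3], [4], [4, 1]))  # (0.5, 1.0, 0.5)

# Reported gains, in percentage points.
print(round(98.18 - 97.65, 2), round(98.18 - 97.64, 2))  # 0.53 0.54
```

Holding T2b fixed across both runs is what makes the before and after numbers comparable: only the training data changes.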

Section 06

Technical Implementation and Application Value

  • Tech Stack: Python 3.10+, scikit-learn 1.8.0, NLTK 3.9.4, Streamlit 1.35+, pandas/NumPy.
  • Preprocessing Flow: Raw message → lowercase → URL marking → stopword filtering → lemmatization → feature extraction.
  • Web Deployment: The Streamlit interface provides a real-time threat score, risk level, attack category, detection evidence, and security recommendations.
  • Dataset: Integrates multi-source data such as CEAS_08, phishing_email, and Enron, covering channels such as email and SMS.
  • Insights: The hybrid architecture (ML + rules) lets each layer cover the other's weaknesses; the rule engine improves interpretability; adaptive capability is key to combating evolving attacks.
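
The preprocessing flow listed in this section (lowercase → URL marking → stopword filtering → lemmatization) can be sketched with the standard library alone. This is an illustrative stand-in rather than the project's pipeline: the project uses NLTK, whereas the stopword set here is a tiny hard-coded sample, `naive_lemma` is a crude plural stripper and not a real lemmatizer, and the `urltoken` placeholder name is an assumption.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Tiny illustrative stopword set; NLTK's English list is much larger.
STOPWORDS = {"a", "an", "the", "to", "is", "your", "at", "and", "of"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(message):
    """Lowercase, mark URLs, drop stopwords, and lemmatize tokens."""
    text = message.lower()
    text = URL_RE.sub(" urltoken ", text)    # replace each URL with a marker
    tokens = re.findall(r"[a-z0-9]+", text)  # simple alphanumeric tokenizer
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_lemma(t) for t in tokens]


print(preprocess("Verify your ACCOUNTS at https://example.test/login NOW"))
# ['verify', 'account', 'urltoken', 'now']
```

Replacing URLs with a single marker token lets the downstream TF-IDF features learn that a link is present without memorizing individual addresses.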


Section 07

Future Directions and Conclusion

  • Future Directions: Integrate more Hugging Face phishing datasets, explore deep learning models such as BERT, and develop a real-time streaming version.
  • Conclusion: Through carefully designed feature engineering, a dual-layer architecture, and an adaptive mechanism, the system offers a practical, interpretable, and evolvable detection platform that can serve as a reference case for security teams.
  • Project Information: Authors: Mohammad Kaif et al. Institution: Usha Martin University. License: MIT. Code Repository: https://github.com/kaif0102/Adaptive-Detection-of-Evolving-Language-Based-Cyber-Attacks