Zing Forum

Application of Large Language Models in SOC Alert Classification and Priority Ranking: Potential and Limitations

An empirical study on the performance of mainstream large language models like GPT-4o and DeepSeek in SOC alert processing, revealing AI's potential in threat detection and challenges in priority ranking.

Tags: Large Language Models · Security Operations Center (SOC) · Alert Classification · Threat Detection · GPT-4o · DeepSeek · Cybersecurity · AI Security · Alert Fatigue
Published 2026-05-12 02:52 · Recent activity 2026-05-12 02:59 · Estimated read 9 min
Section 01

[Introduction] Study on the Potential and Limitations of Large Language Models in SOC Alert Processing

This study empirically investigates the performance of mainstream large language models (LLMs), such as GPT-4o and DeepSeek, on alert classification and priority ranking tasks in Security Operations Centers (SOCs). The results show that LLMs achieve high recall in alert classification but suffer from high false positive rates, and their performance on priority ranking falls well short of practical requirements. The study concludes that AI should serve as an assistive tool for SOC analysts: human-machine collaboration is needed to balance automation with human judgment and improve operational efficiency.

Section 02

Research Background and Motivation: The Challenge of SOC Alert Fatigue

In the digital transformation era, enterprises face complex cyber threats. As the security nerve center, SOCs need to process massive alerts every day—large enterprises handle an average of thousands of alerts daily, most of which are false positives, leading to "alert fatigue" that consumes analysts' time and may cause real threats to be missed. Traditional alert classification and priority ranking rely on manual experience, which is time-consuming, labor-intensive, and prone to subjective bias. With the rise of LLMs, the industry is exploring the integration of AI into SOC processes. This study aims to evaluate the actual performance of general-purpose LLMs in these tasks to provide references for security teams.

Section 03

Experimental Environment and Research Methods

Experimental Environment and Dataset Construction

A simulated SOC environment was built with core components including Wazuh (SIEM), Suricata (IDS), Windows Server 2019 domain environment, Windows 11 clients, Linux vulnerable application servers, and Kali Linux attack machines. Alerts were triggered using the Atomic Red Team framework plus manual attacks, and 178 alert records in JSON format were exported:

  • Classification dimension: 104 real alerts, 74 false positives
  • Priority dimension: 136 low, 12 medium, 25 high, 5 critical

Research Methods and Technical Route

Four phases:

  1. Preprocessing: Scripts clean and label data, merging real/false positive alerts into a unified dataset
  2. Model Inference: Test 7 mainstream models (GPT-4o series, DeepSeek-Chat/Reasoner) to complete classification (real/false positive) and priority ranking tasks
  3. Postprocessing: Standardize model results
  4. Evaluation: Analyze using metrics like accuracy, precision, recall, etc.
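
A minimal sketch of this four-phase pipeline. The alert field names and the `classify_alert` stub are assumptions for illustration; the study's actual prompts and record schema are not reproduced here, so the stub uses a trivial severity rule in place of a real LLM call:

```python
from typing import Dict, List

def classify_alert(alert: Dict) -> str:
    """Phase 2 stand-in: in the real pipeline this would send the alert
    JSON to an LLM (e.g. GPT-4o) and parse its verdict. A trivial
    severity rule keeps the sketch runnable without an API key."""
    return "real" if alert.get("severity", 0) >= 7 else "false_positive"

def evaluate(records: List[Dict]) -> Dict[str, float]:
    """Phase 4: accuracy, precision, and recall for the 'real' class."""
    tp = fp = fn = tn = 0
    for rec in records:
        pred, truth = classify_alert(rec), rec["label"]
        if pred == "real" and truth == "real":
            tp += 1
        elif pred == "real":
            fp += 1
        elif truth == "real":
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Tiny synthetic stand-in for the 178 exported alert records.
dataset = [
    {"severity": 9, "label": "real"},
    {"severity": 8, "label": "real"},
    {"severity": 3, "label": "false_positive"},
    {"severity": 7, "label": "false_positive"},
]
metrics = evaluate(dataset)
```

Swapping the stub for a real model call, and the synthetic list for the exported Wazuh JSON, would recover the study's phases 1 through 4.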

List of tested models:

Model Name          Version/Snapshot
GPT-4o              gpt-4o-2024-08-06
GPT-4.1             gpt-4.1-2025-04-14
GPT-4.5 Preview     gpt-4.5-preview-2025-02-27
GPT-4o mini         gpt-4o-mini-2024-07-18
GPT-4.1 mini        gpt-4.1-mini-2025-04-14
DeepSeek-Chat       DeepSeek-V3-0324
DeepSeek-Reasoner   DeepSeek-R1-0528

Section 04

Key Findings: Classification Has Potential, Priority Ranking Needs Improvement

Alert Classification: AI Shows Strong Potential

All models performed well in distinguishing real vs. false positive alerts:

  • Best Performance: GPT-4o mini achieved a recall rate of 95.19% (low missed alert rate)
  • Challenge: GPT-4o mini had a false positive rate of 72.97%, which may still lead to alert fatigue
  • Trade-off: In security scenarios, a missed alert costs more than a false positive, so optimizing for high recall is reasonable, but the false positive rate still needs to be brought down
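
The reported rates imply the following confusion counts for GPT-4o mini. These counts are inferred from the percentages and the dataset sizes above, not stated explicitly in the study:

```python
# Dataset sizes from the classification dimension.
real_alerts, benign_alerts = 104, 74

# Counts inferred from the reported rates: 95.19% recall, 72.97% FPR.
true_positives = round(0.9519 * real_alerts)           # 99 real alerts caught
false_positives = round(0.7297 * benign_alerts)        # 54 benign alerts flagged

recall = true_positives / real_alerts                  # 99 / 104 ≈ 0.9519
false_positive_rate = false_positives / benign_alerts  # 54 / 74 ≈ 0.7297
missed_threats = real_alerts - true_positives          # 5 real alerts missed
```

In other words, only 5 of 104 real alerts slip through, but 54 of 74 benign alerts still reach an analyst, which is where the residual fatigue comes from.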

Priority Ranking: A Clear Shortcoming of Current Models

All models performed poorly:

  • Best Performance: GPT-4.1 achieved a macro-average recall of 34.59% and an accuracy of 49.44%, better than uniform random guessing across four classes but below the 76% majority-class baseline
  • Reasons: priority judgments require organization-specific context, the priority labels themselves are subjective, and the data distribution is heavily imbalanced (low priority accounts for 76%)
  • Insight: AI priority recommendations need manual confirmation; human-machine collaboration is more reliable
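
One way to put the priority numbers in context, computed from the dataset distribution above. The baseline comparison is my own framing, not the study's:

```python
# Priority distribution from the dataset.
counts = {"low": 136, "medium": 12, "high": 25, "critical": 5}
total = sum(counts.values())  # 178

# Trivial baseline: always predict "low" (the 76% majority class).
baseline_accuracy = counts["low"] / total  # ≈ 0.764
# Its macro recall: 1.0 on "low", 0.0 on the other three classes.
baseline_macro_recall = (1.0 + 0.0 + 0.0 + 0.0) / 4  # 0.25

# GPT-4.1's 34.59% macro recall beats the 25% baseline, so it does learn
# something beyond the majority class, but its 49.44% accuracy sits well
# below the ~76.4% an always-"low" guesser would score.
```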

Section 05

Practical Insights and Application Recommendations

Insights for SOC Operations

Layered application strategy:

  1. Alert Pre-screening: Use AI's high recall feature to filter obvious false positives, reducing manual review volume
  2. Real Threat Confirmation: AI marks high-confidence real alerts for quick response; edge cases are reviewed manually
  3. Priority Assistance: AI outputs serve as references, combined with rules and manual judgment
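
The three-tier strategy can be sketched as a routing function. The 0.9 confidence threshold and the queue names are illustrative assumptions, not values from the study:

```python
def route_alert(ai_verdict: str, confidence: float) -> str:
    """Route one alert through the layered strategy: pre-screen obvious
    noise, fast-track confident detections, send the rest to a human."""
    if ai_verdict == "false_positive" and confidence >= 0.9:
        return "auto_dismiss_queue"    # tier 1: filter obvious false positives
    if ai_verdict == "real" and confidence >= 0.9:
        return "rapid_response_queue"  # tier 2: high-confidence real threats
    return "manual_review_queue"       # tier 3: edge cases stay with analysts
```

The model's priority output is deliberately kept out of the routing logic: per the findings above it is only a reference, so it would be attached as metadata for the analyst rather than used to auto-escalate.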

Technical Implementation Recommendations

  1. Start with small-scale pilots and expand gradually
  2. Establish a feedback loop to optimize models using analysts' judgments
  3. Focus on cost-effectiveness and select models with the best cost-performance ratio
  4. Maintain manual supervision; position AI as an assistant rather than a replacement
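
Recommendation 2's feedback loop needs model-versus-analyst disagreements captured somewhere. A minimal record shape for that, with field names that are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TriageFeedback:
    """Pairs the model's verdicts with the analyst's final judgment so that
    disagreements can later drive prompt tuning or model selection."""
    alert_id: str
    model_verdict: str     # "real" / "false_positive"
    analyst_verdict: str
    model_priority: str    # "low" / "medium" / "high" / "critical"
    analyst_priority: str

    @property
    def disagreement(self) -> bool:
        return (self.model_verdict != self.analyst_verdict
                or self.model_priority != self.analyst_priority)

# Hypothetical record: model and analyst agree the alert is real but
# disagree on priority, the kind of case the findings above suggest
# will be common.
record = TriageFeedback("alert-0042", "real", "real", "low", "high")
```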

Section 06

Research Limitations and Future Directions

Research Limitations

  1. The experimental environment is simulated, which may differ from actual production environments
  2. The dataset size is small (178 records), covering a limited range of security event types

Future Directions

  1. Expand to larger-scale and diverse alert datasets
  2. Explore the performance of models fine-tuned for the security domain
  3. Fuse multi-source data such as logs and network traffic for multi-modal analysis
  4. Study the real-time performance of models in streaming alert processing scenarios