# Application of Large Language Models in SOC Alert Classification and Priority Ranking: Potential and Limitations

> An empirical study on the performance of mainstream large language models like GPT-4o and DeepSeek in SOC alert processing, revealing AI's potential in threat detection and challenges in priority ranking.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-11T18:52:05.000Z
- Last activity: 2026-05-11T18:59:07.185Z
- Heat: 145.9
- Keywords: large language models, security operations center, SOC, alert classification, threat detection, GPT-4o, DeepSeek, cybersecurity, AI security, alert fatigue
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-c0deing-llm-soc-alert-triage
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-c0deing-llm-soc-alert-triage
- Markdown source: floors_fallback

---

## Introduction

This study empirically investigates how mainstream large language models (LLMs) such as GPT-4o and DeepSeek perform at alert classification and priority ranking in Security Operations Centers (SOCs). The results show that LLMs achieve high recall in alert classification but suffer from a high false positive rate, and that their performance on priority ranking falls well short of operational needs. The study concludes that AI should serve as an auxiliary tool for SOC analysts: human-machine collaboration is needed to balance automation against manual judgment and improve operational efficiency.

## Research Background and Motivation: The Challenge of SOC Alert Fatigue

In the digital transformation era, enterprises face complex cyber threats. As the security nerve center, SOCs need to process massive alerts every day—large enterprises handle an average of thousands of alerts daily, most of which are false positives, leading to "alert fatigue" that consumes analysts' time and may cause real threats to be missed. Traditional alert classification and priority ranking rely on manual experience, which is time-consuming, labor-intensive, and prone to subjective bias. With the rise of LLMs, the industry is exploring the integration of AI into SOC processes. This study aims to evaluate the actual performance of general-purpose LLMs in these tasks to provide references for security teams.

## Experimental Environment and Research Methods

### Experimental Environment and Dataset Construction
A simulated SOC environment was built with core components including Wazuh (SIEM), Suricata (IDS), Windows Server 2019 domain environment, Windows 11 clients, Linux vulnerable application servers, and Kali Linux attack machines. Alerts were triggered using the Atomic Red Team framework plus manual attacks, and 178 alert records in JSON format were exported:
- Classification dimension: 104 real alerts, 74 false positives
- Priority dimension: 136 low, 12 medium, 25 high, 5 critical
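Assuming a simple export schema, the preprocessing step of loading the JSON records and tallying the two label dimensions might look like the following sketch (the `label` and `priority` field names are placeholders, not the study's actual Wazuh export format):

```python
import json
from collections import Counter

def tally_alerts(alerts):
    """Count classification labels and priorities across alert records.

    The field names 'label' and 'priority' are assumptions about the
    export schema, not the study's actual Wazuh JSON format.
    """
    labels = Counter(a["label"] for a in alerts)
    priorities = Counter(a["priority"] for a in alerts)
    return labels, priorities

def load_alerts(path):
    """Read a JSON array of alert records from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# In-memory sample standing in for the exported file:
sample = [
    {"label": "real", "priority": "high"},
    {"label": "false_positive", "priority": "low"},
    {"label": "real", "priority": "low"},
]
labels, priorities = tally_alerts(sample)
```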

### Research Methods and Technical Route
Four phases:
1. **Preprocessing**: Scripts clean and label data, merging real/false positive alerts into a unified dataset
2. **Model Inference**: Test 7 mainstream models (GPT-4o series, DeepSeek-Chat/Reasoner) to complete classification (real/false positive) and priority ranking tasks
3. **Postprocessing**: Standardize model results
4. **Evaluation**: Analyze results with metrics such as accuracy, precision, and recall
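Phase 3 (postprocessing) has to map free-text model verdicts onto the two study labels before evaluation. A minimal sketch, with illustrative regex patterns rather than the study's actual normalization rules:

```python
import re

def standardize_verdict(text):
    """Map a free-text model verdict onto the study's two labels
    (phase 3, postprocessing). The patterns are illustrative guesses,
    not the study's actual normalization rules."""
    t = text.lower()
    if re.search(r"false[\s_-]*positive|benign", t):
        return "false_positive"
    if re.search(r"true[\s_-]*positive|real|malicious", t):
        return "real"
    return "unknown"  # flagged for manual inspection
```

Keeping an explicit `"unknown"` bucket avoids silently miscounting responses the parser cannot place.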

List of tested models:
| Model Name | Version/Snapshot |
|---------|----------|
| GPT-4o | gpt-4o-2024-08-06 |
| GPT-4.1 | gpt-4.1-2025-04-14 |
| GPT-4.5 Preview | gpt-4.5-preview-2025-02-27 |
| GPT-4o mini | gpt-4o-mini-2024-07-18 |
| GPT-4.1 mini | gpt-4.1-mini-2025-04-14 |
| DeepSeek-Chat | DeepSeek-V3-0324 |
| DeepSeek-Reasoner | DeepSeek-R1-0528 |
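All seven models can be driven through the same OpenAI-style chat-completions interface (DeepSeek serves a compatible API at its own base URL). Below is a hedged sketch of a per-alert prompt; the wording is invented for illustration, since the study's prompts are not reproduced here:

```python
import json

def build_triage_prompt(alert):
    """Compose a single-alert triage prompt. The wording is a plausible
    sketch; the study's actual prompts were not published here."""
    return (
        "You are a SOC analyst. Classify the following alert as 'real' "
        "or 'false_positive' and assign a priority of low, medium, high, "
        "or critical. Reply as JSON with keys 'classification' and "
        "'priority'.\n\n"
        "Alert:\n" + json.dumps(alert, indent=2)
    )

prompt = build_triage_prompt({"rule": "ET SCAN Nmap", "src_ip": "10.0.0.5"})

# Sending it through an OpenAI-compatible client (DeepSeek serves the
# same chat-completions interface at its own base_url) would look like:
#
#   from openai import OpenAI
#   client = OpenAI()  # or OpenAI(base_url="https://api.deepseek.com")
#   resp = client.chat.completions.create(
#       model="gpt-4o-mini-2024-07-18",
#       messages=[{"role": "user", "content": prompt}],
#   )
```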

## Key Findings: Classification Has Potential, Priority Ranking Needs Improvement

### Alert Classification: AI Shows Strong Potential
All models performed well in distinguishing real vs. false positive alerts:
- **Best Performance**: GPT-4o mini achieved a recall rate of 95.19% (low missed alert rate)
- **Challenge**: GPT-4o mini had a false positive rate of 72.97%, which may still lead to alert fatigue
- Trade-off: in security scenarios the cost of a missed alert outweighs that of a false positive, so favoring high recall is reasonable, but the false positive rate must still be kept in check
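Plugging the reported GPT-4o mini rates back into the dataset sizes (104 real alerts, 74 false positives) makes the trade-off concrete, assuming simple rounding to whole alerts:

```python
# Reconstruct GPT-4o mini's confusion matrix from the reported rates
# and the dataset sizes (104 real alerts, 74 false positives).
REAL, FALSE = 104, 74

tp = round(0.9519 * REAL)   # recall 95.19% -> 99 real alerts caught
fn = REAL - tp              # 5 real alerts missed
fp = round(0.7297 * FALSE)  # FPR 72.97% -> 54 false positives escalated
tn = FALSE - fp             # 20 false positives correctly dismissed

precision = tp / (tp + fp)             # ~0.647: 1 in 3 escalations is noise
accuracy = (tp + tn) / (REAL + FALSE)  # ~0.669 overall
```

So roughly a third of everything the model escalates is still noise, which is exactly why the authors warn that high recall alone does not cure alert fatigue.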

### Priority Ranking: A Clear Shortcoming of Current Models
All models performed poorly:
- **Best Performance**: GPT-4.1 reached a macro-average recall of 34.59% and an accuracy of 49.44% (about double the 25% chance level for four classes, yet below the 76% that always predicting "low" would score on this imbalanced set)
- **Reasons**: priority judgments require organization-specific context, the classification itself is partly subjective, and the class distribution is heavily imbalanced (low priority accounts for 76% of alerts)
- Insight: AI priority recommendations need manual confirmation; human-machine collaboration is more reliable
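The macro-average recall quoted above weights every priority class equally, which is why class imbalance hurts so much. A minimal illustration of the metric:

```python
from collections import defaultdict

def macro_recall(y_true, y_pred):
    """Macro-averaged recall: per-class recall, averaged with equal
    weight per class, so the 5 'critical' alerts count as much as the
    136 'low' ones."""
    hits = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / support[c] for c in support) / len(support)

# Toy illustration: always predicting 'low' nails the majority class
# but scores 0 recall on the three rare classes.
y_true = ["low", "low", "low", "medium", "high", "critical"]
y_pred = ["low"] * 6
macro_recall(y_true, y_pred)  # 0.25, while plain accuracy would be 0.5
```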

## Practical Insights and Application Recommendations

### Insights for SOC Operations
Layered application strategy:
1. **Alert Pre-screening**: Use AI's high recall feature to filter obvious false positives, reducing manual review volume
2. **Real Threat Confirmation**: AI marks high-confidence real alerts for quick response; edge cases are reviewed manually
3. **Priority Assistance**: AI outputs serve as references, combined with rules and manual judgment
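The three layers can be expressed as a routing rule over the model's verdict. The confidence thresholds (and the confidence field itself) below are hypothetical, since the study does not publish an operating point; they would need tuning against local data:

```python
def route_alert(classification, confidence,
                auto_dismiss=0.95, fast_track=0.90):
    """Route a model verdict through the layered triage strategy.

    Thresholds and the confidence score are hypothetical stand-ins,
    not values from the study.
    """
    if classification == "false_positive" and confidence >= auto_dismiss:
        return "pre-screen: suppress (sampled audit)"
    if classification == "real" and confidence >= fast_track:
        return "fast-track: queue for response"
    return "manual review"  # edge cases and priority calls stay human

route_alert("false_positive", 0.97)  # pre-screen: suppress (sampled audit)
```

Note that everything below the thresholds, including all priority decisions, falls through to `"manual review"`, matching the study's human-in-the-loop conclusion.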

### Technical Implementation Recommendations
1. Start with small-scale pilots and expand gradually
2. Establish a feedback loop to optimize models using analysts' judgments
3. Focus on cost-effectiveness and select models with the best cost-performance ratio
4. Maintain manual supervision; position AI as an assistant rather than a replacement

## Research Limitations and Future Directions

### Research Limitations
1. The experimental environment is simulated, which may differ from actual production environments
2. The dataset size is small (178 records), covering a limited range of security event types

### Future Directions
1. Expand to larger-scale and diverse alert datasets
2. Explore the performance of models fine-tuned for the security domain
3. Integrate multi-modal fusion of multi-source data such as logs and network traffic
4. Study the real-time performance of models in streaming alert processing scenarios
