# CTI Dataset: Open Source Release of a Hybrid Security Dataset for Cyber Threat Attribution

> A cybersecurity dataset with 26,930 records and 14 feature dimensions, integrating threat intelligence data such as attacker motivation, TTPs (Tactics, Techniques, and Procedures), malware families, toolchains, and target environments. It is specifically designed for cyber threat attribution research and machine learning classification tasks.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-18T07:45:25.000Z
- Last activity: 2026-05-18T07:50:07.634Z
- Popularity: 152.9
- Keywords: threat intelligence, cyber attribution, CTA, security dataset, machine learning, APT analysis, TTP, open data, cybersecurity research
- Page link: https://www.zingnex.cn/en/forum/thread/cti
- Canonical: https://www.zingnex.cn/forum/thread/cti

---

## Open Source Release of CTI Dataset: Empowering Cyber Threat Attribution Research and Machine Learning Applications

CTI-dataset is an open-source hybrid cybersecurity dataset with 26,930 records and 14 feature dimensions. It combines several kinds of threat intelligence, including attacker motivation, TTPs (Tactics, Techniques, and Procedures), and malware families. It is designed for cyber threat attribution research and machine learning classification tasks, and it aims to address the shortage of high-quality labeled attribution data.

## Challenges in Cyber Threat Attribution and Current Status of Data Scarcity

As cyber attacks grow more complex, cyber threat attribution (CTA), the task of identifying who is behind an attack, has become a key part of security defense. High-quality labeled data is scarce in the cybersecurity field, however. Attribution labels are especially hard to produce because they require combining multi-source intelligence with expert knowledge, and this scarcity limits how far AI techniques can be applied in this domain.

## Scale of CTI Dataset and 14-Dimensional Feature System

The CTI dataset contains 26,930 records with 14 structured fields, grouped into four dimensions:

1. Attacker Profile: motivation, skills, country of origin, sponsor.
2. Attack Techniques: TTPs, execution operations, tools, malware.
3. Target and Impact: target country, organization, application, first discovery time, attack result.
4. Attribution Label: the CTA field identifies the threat actor and serves as the label for supervised learning.

The data is distributed as CSV, which is easy to import into analysis tools.
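
The four groups above can be written down as a small schema, which is handy for checking a CSV header before analysis. This is a minimal sketch: the column names are hypothetical, since the post describes the 14 fields but does not list the actual CSV headers, so they would need to be adjusted to match the real file.

```python
# Hypothetical column names: the post describes the 14 fields but does not
# give the real CSV headers, so rename these to match the actual file.
SCHEMA = {
    "attacker_profile": ["motivation", "skills", "origin_country", "sponsor"],
    "attack_techniques": ["ttps", "execution_operations", "tools", "malware"],
    "target_and_impact": [
        "target_country",
        "target_organization",
        "target_application",
        "first_seen",
        "attack_result",
    ],
    "attribution_label": ["cta"],
}


def all_columns(schema):
    """Flatten the grouped schema into one ordered list of column names."""
    return [col for group in schema.values() for col in group]


def missing_columns(header, schema=SCHEMA):
    """Return the expected columns that are absent from a CSV header row."""
    present = {name.strip().lower() for name in header}
    return [col for col in all_columns(schema) if col not in present]


columns = all_columns(SCHEMA)
print(len(columns))  # 4 + 4 + 5 + 1 fields
```

Calling `missing_columns(header)` on the first row of the file gives a quick list of any expected fields that are absent or named differently.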

## Application Scenarios of CTI Dataset and Tool Integration Capabilities

Application scenarios include threat actor classification, intrusion detection research, cyber attribution modeling, AI-driven security analysis, malware prediction, threat hunting simulation, and security awareness training. Because the data is plain CSV, it works with the Python ecosystem (Pandas, Scikit-learn, etc.), can be ingested by SIEM platforms (Splunk, Elastic), and can be explored in visualization tools (Tableau, Grafana). Loading it takes only a few lines of code.
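
As an illustration of loading the CSV and checking how records are spread across attribution labels, here is a rough sketch using only the Python standard library. The header names and the in-memory sample are assumptions: the first data row reuses the example values quoted in this post, while the rows marked `Example...` are placeholders and not real dataset content.

```python
import csv
import io
from collections import Counter

# Stand-in for the real file. Header names are hypothetical, and the
# "Example..." rows are placeholders, not real records from the dataset.
SAMPLE_CSV = """motivation,malware,tools,cta
Political ideology,Keylogger,Empire,DeepPanda
ExampleMotivation,ExampleMalware,ExampleTool,ExampleActor
ExampleMotivation,ExampleMalware,ExampleTool,ExampleActor
"""


def load_records(handle):
    """Read CSV rows into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def label_counts(records, label_field="cta"):
    """Count how many records carry each attribution label."""
    return Counter(row[label_field] for row in records)


records = load_records(io.StringIO(SAMPLE_CSV))
counts = label_counts(records)
print(counts.most_common())
```

For the actual dataset, the same `load_records` function can be given a file opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was downloaded. Checking the label distribution first is worthwhile because heavily imbalanced classes affect how a classifier should be trained and evaluated.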

## Research Value of CTI Dataset and Typical Data Examples

Research value: the dataset lowers the barrier to entry for academic research (no commercial subscription is required), gives algorithms a common basis for evaluation, supports education and training, and lets researchers test whether new algorithms apply to attribution. A typical record: the motivation is political ideology, the malware is Keylogger, the tool is the Empire framework, and the attribution label is the APT group DeepPanda. This structured representation makes the data straightforward to analyze and to use for training.
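
To make the example record concrete, the sketch below represents it as a small data class and turns its non-label fields into `field=value` tokens, one simple way to prepare categorical data for a classifier. The class, its field names, and the tokenizing function are illustrative assumptions, and only the four fields mentioned in the example are modeled rather than all 14.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThreatRecord:
    """Hypothetical subset of one row; the real dataset has 14 fields."""

    motivation: str
    malware: str
    tool: str
    cta: str  # attribution label used as the training target


def to_feature_tokens(record):
    """Turn every non-label field into a normalized 'field=value' token."""
    fields = asdict(record)
    fields.pop("cta")  # keep the label out of the features
    return sorted(
        f"{name}={value.strip().lower()}" for name, value in fields.items()
    )


example = ThreatRecord(
    motivation="Political ideology",
    malware="Keylogger",
    tool="Empire",
    cta="DeepPanda",
)
tokens = to_feature_tokens(example)
print(tokens, "->", example.cta)
```

Tokens of this form can be fed to a one-hot or bag-of-words encoder, with `cta` as the target for supervised training.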

## Usage Restrictions and Ethical Guidelines of CTI Dataset

The dataset is explicitly limited to educational use, academic research, and defensive security research. Using it for malicious activity is strictly prohibited, a restriction that reflects the sensitivity of cybersecurity data and the ethical considerations involved in sharing it.

## Future Optimization Directions of CTI Dataset

Planned improvements include expanding the data scale, adding feature dimensions (e.g., time series and infrastructure information), improving label quality by adding confidence scores, aligning with standard frameworks such as MITRE ATT&CK, and establishing a real-time data update mechanism.