Zing Forum


CTI Dataset: Open Source Release of a Hybrid Security Dataset for Cyber Threat Attribution

A cybersecurity dataset with 26,930 records and 14 feature dimensions, integrating threat intelligence data such as attacker motivation, TTPs (Tactics, Techniques, and Procedures), malware families, toolchains, and target environments. It is specifically designed for cyber threat attribution research and machine learning classification tasks.

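Tags: threat intelligence, cyber attribution, CTA, security dataset, machine learning, APT analysis, TTP, open-source data, cybersecurity research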
Published 2026-05-18 15:45 | Recent activity 2026-05-18 15:50 | Estimated read 5 min

Section 01

Open Source Release of CTI Dataset: Empowering Cyber Threat Attribution Research and Machine Learning Applications

CTI-dataset is an open-source hybrid cybersecurity dataset with 26,930 records and 14 feature dimensions. It integrates multi-dimensional threat intelligence, including attacker motivation, TTPs (Tactics, Techniques, and Procedures), malware families, toolchains, and target environments. Designed for cyber threat attribution research and machine learning classification tasks, it addresses the shortage of high-quality labeled attribution data and supports scenarios ranging from threat actor classification to security awareness training.

Section 02

Challenges in Cyber Threat Attribution and Current Status of Data Scarcity

As cyber attacks grow more complex, cyber threat attribution (CTA), the task of identifying who is behind an attack, has become a key part of security defense. However, high-quality labeled data is scarce in cybersecurity. Attribution is particularly affected because labeling requires combining multi-source intelligence with expert knowledge, and this scarcity limits the application of AI techniques in the domain.

Section 03

Scale of CTI Dataset and 14-Dimensional Feature System

The CTI dataset contains 26,930 records with 14 structured fields, grouped into four dimensions:

  1. Attacker Profile: Motivation, skills, country of origin, sponsor;
  2. Attack Techniques: TTPs, execution operations, tools, malware;
  3. Target and Impact: Target country, organization, application, first discovery time, attack result;
  4. Attribution Label: The CTA field identifies the threat actor and serves as the target label for supervised learning.

The data is distributed as CSV, which is easy to import into analysis tools.
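
As a minimal sketch of how the 14-field layout could be checked on load, the snippet below reads CSV rows with Python's standard csv module and verifies the header. The column names are assumptions made for illustration: the article names the fields but not their exact CSV headers, so they would need to be adjusted to the real file. The single sample row reuses values mentioned in this article and leaves every other field blank.

```python
import csv
import io

# Hypothetical header names; the real CSV may spell these differently.
EXPECTED_COLUMNS = [
    # Attacker profile
    "motivation", "skills", "origin_country", "sponsor",
    # Attack techniques
    "ttps", "execution", "tools", "malware",
    # Target and impact
    "target_country", "target_organization", "target_application",
    "first_seen", "attack_result",
    # Attribution label
    "cta",
]


def load_records(handle):
    """Read CSV rows as dicts, failing early if an expected column is absent."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Build a one-row, in-memory stand-in for the real file.
row = dict.fromkeys(EXPECTED_COLUMNS, "")
row.update(motivation="political ideology", tools="Empire",
           malware="Keylogger", cta="DeepPanda")

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
writer.writerow(row)
buffer.seek(0)

records = load_records(buffer)
print(records[0]["cta"])
```

With the real dataset, the in-memory buffer would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.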

Section 04

Application Scenarios of CTI Dataset and Tool Integration Capabilities

Application scenarios include threat actor classification, intrusion detection research, cyber attribution modeling, AI-driven security analysis, malware prediction, threat hunting simulation, and security awareness training. In terms of compatibility, the CSV format works with the Python ecosystem (Pandas, Scikit-learn, etc.), SIEM platforms (Splunk, Elastic), and visualization tools (Tableau, Grafana), so the data can be loaded with a few lines of code.
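
To make the threat actor classification scenario concrete, here is a rough sketch of a lookup baseline written with only the Python standard library. It is not a trained machine learning model and not the dataset authors' method: it simply predicts the most frequent label seen for a given malware and tool combination, falling back to the overall majority label. The rows, field names, and the values "Ransomware", "ToolX", "ToolY", and "ActorB" are invented placeholders.

```python
from collections import Counter, defaultdict

# Invented toy rows; field names and most values are placeholders.
rows = [
    {"malware": "Keylogger", "tools": "Empire", "cta": "DeepPanda"},
    {"malware": "Keylogger", "tools": "Empire", "cta": "DeepPanda"},
    {"malware": "Keylogger", "tools": "Empire", "cta": "ActorB"},
    {"malware": "Ransomware", "tools": "ToolX", "cta": "ActorB"},
    {"malware": "Ransomware", "tools": "ToolX", "cta": "ActorB"},
    {"malware": "Ransomware", "tools": "ToolY", "cta": "ActorB"},
]

FEATURES = ("malware", "tools")
LABEL = "cta"


def fit(training_rows):
    """Count label frequencies per feature combination and overall."""
    table = defaultdict(Counter)
    overall = Counter()
    for r in training_rows:
        key = tuple(r[f] for f in FEATURES)
        table[key][r[LABEL]] += 1
        overall[r[LABEL]] += 1
    return table, overall


def predict(model, row):
    """Return the most common label for this combination, else the global majority."""
    table, overall = model
    key = tuple(row[f] for f in FEATURES)
    counts = table.get(key) or overall
    return counts.most_common(1)[0][0]


model = fit(rows)
seen = predict(model, {"malware": "Keylogger", "tools": "Empire"})
unseen = predict(model, {"malware": "Wiper", "tools": "ToolZ"})
print(seen, unseen)
```

A baseline like this is mainly useful as a floor: a Scikit-learn classifier trained on the full feature set should beat it, and if it does not, the features or labels deserve a closer look.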

Section 05

Research Value of CTI Dataset and Typical Data Examples

Research value: the dataset lowers the barrier to entry for academic research (no commercial subscription is required), offers a common benchmark for evaluating algorithms, supports education and training, and lets researchers verify the applicability of new algorithms. A typical record: the motivation is political ideology, the malware is a Keylogger, the tool is the Empire framework, and the attribution label is the APT group DeepPanda. This structured representation facilitates both analysis and algorithm training.
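
One way such a structured record could be prepared for algorithm training is one-hot encoding of its categorical fields. The sketch below uses only the Python standard library and is an illustration under assumptions: the field names are hypothetical, the first row mirrors the example record above, and the second row ("financial gain", "ToolX", "ActorB") is invented so that the vocabulary has more than one value per field.

```python
FIELDS = ("motivation", "malware", "tools")  # hypothetical column names

rows = [
    # Mirrors the example record described in the text.
    {"motivation": "political ideology", "malware": "Keylogger",
     "tools": "Empire", "cta": "DeepPanda"},
    # Invented placeholder row.
    {"motivation": "financial gain", "malware": "Keylogger",
     "tools": "ToolX", "cta": "ActorB"},
]


def build_vocab(data, fields):
    """Assign a vector position to every (field, value) pair, in order of first appearance."""
    vocab = {}
    for row in data:
        for field in fields:
            vocab.setdefault((field, row[field]), len(vocab))
    return vocab


def one_hot(row, fields, vocab):
    """Encode one record as a 0/1 vector; unseen values leave all positions at 0."""
    vector = [0] * len(vocab)
    for field in fields:
        index = vocab.get((field, row[field]))
        if index is not None:
            vector[index] = 1
    return vector


vocab = build_vocab(rows, FIELDS)
X = [one_hot(r, FIELDS, vocab) for r in rows]
y = [r["cta"] for r in rows]
print(X, y)
```

The resulting `X` and `y` have the shape most classifiers expect: one numeric vector per record and one attribution label per record.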

Section 06

Usage Restrictions and Ethical Guidelines of CTI Dataset

The dataset is explicitly limited to educational purposes, academic research, and defensive security research. Use for malicious activities is strictly prohibited, which reflects the sensitivity of sharing cybersecurity data and the ethical considerations involved.

Section 07

Future Optimization Directions of CTI Dataset

Planned improvements include expanding the data scale, enriching the feature dimensions (e.g., time series and infrastructure information), improving label quality by adding confidence scores, aligning with standard frameworks such as MITRE ATT&CK, and establishing a real-time data update mechanism.