
CTI Dataset: Open Source Release of a Hybrid Security Dataset for Cyber Threat Attribution

CTI Dataset is a cybersecurity dataset with 26,930 records and 14 feature dimensions. It integrates threat intelligence such as attacker motivation, TTPs (Tactics, Techniques, and Procedures), malware families, toolchains, and target environments, and is designed specifically for cyber threat attribution research and machine learning classification tasks.

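Tags: threat intelligence, cyber attribution, CTA, security dataset, machine learning, APT analysis, TTP, open-source data, cybersecurity research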

Published 2026-05-18 15:45 | Recent activity 2026-05-18 15:50 | Estimated read 5 min


Section 01

Open Source Release of CTI Dataset: Empowering Cyber Threat Attribution Research and Machine Learning Applications

CTI-dataset is an open-source hybrid cybersecurity dataset with 26,930 records and 14 feature dimensions. It integrates multi-dimensional threat intelligence, including attacker motivation, TTPs (Tactics, Techniques, and Procedures), and malware families. Designed specifically for cyber threat attribution research and machine learning classification tasks, it helps fill the gap in high-quality labeled attribution data and supports a range of application scenarios.

Section 02

Challenges in Cyber Threat Attribution and Current Status of Data Scarcity

As cyber attacks grow more complex, cyber threat attribution (CTA), the task of accurately identifying who is behind an attack, has become a key part of security defense. High-quality labeled data is scarce in the cybersecurity field, however. Attribution is especially hard to label because it requires combining multi-source intelligence with expert knowledge, and this scarcity limits how far AI techniques can be applied in the domain.

Section 03

Scale of CTI Dataset and 14-Dimensional Feature System

The CTI dataset contains 26,930 records with 14 structured fields, grouped into four major dimensions:

Attacker Profile: motivation, skills, country of origin, sponsor;
Attack Techniques: TTPs, execution operations, tools, malware;
Target and Impact: target country, organization, application, first discovery time, attack result;
Attribution Label: the CTA field, which identifies the threat actor and supports supervised learning.

The data is distributed in CSV format, which is easy to import into analysis tools.
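A minimal sketch of loading such a CSV and checking that all 14 fields are present, using only the Python standard library. The article lists the fields but not their exact header names, so every column name below is a hypothetical placeholder, and the two rows are an illustrative in-memory stand-in for the real file (the first row reuses the values from the article's example record, the second is made up).

```python
import csv
import io
from collections import Counter

# Hypothetical header names; the real file may spell these differently.
EXPECTED_COLUMNS = [
    "motivation", "skills", "origin_country", "sponsor",            # attacker profile
    "ttps", "execution", "tools", "malware",                        # attack techniques
    "target_country", "target_organization", "target_application",  # target
    "first_seen", "attack_result",                                   # impact
    "cta",                                                           # attribution label
]


def load_records(handle):
    """Read CSV rows as dicts and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def make_sample_csv():
    """Build a tiny in-memory CSV; unspecified fields are left empty."""
    rows = [
        {"motivation": "political ideology", "tools": "Empire",
         "malware": "Keylogger", "cta": "DeepPanda"},
        {"motivation": "financial gain", "tools": "ExampleTool",
         "malware": "ExampleMalware", "cta": "ExampleGroup"},
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


records = load_records(make_sample_csv())
label_counts = Counter(row["cta"] for row in records)
print(len(records), "records;", len(EXPECTED_COLUMNS), "columns;", dict(label_counts))
```

With the real file, `make_sample_csv()` would be replaced by `open(path, newline="", encoding="utf-8")`. Counting labels up front is a cheap way to spot class imbalance before training a classifier.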

Section 04

Application Scenarios of CTI Dataset and Tool Integration Capabilities

Application scenarios include threat actor classification, intrusion detection research, cyber attribution modeling, AI-driven security analysis, malware prediction, threat hunting simulation, and security awareness training. On the tooling side, the CSV format works with the Python ecosystem (Pandas, Scikit-learn, and similar libraries), SIEM platforms (Splunk, Elastic), and visualization tools (Tableau, Grafana). Sample code for loading the data quickly is also available.
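To make the threat actor classification scenario concrete, here is a small dependency-free sketch of a categorical Naive Bayes baseline with Laplace smoothing. It is not the dataset authors' method, only one simple way to predict the CTA label from categorical fields. The field names, the toy rows, and the group name `GroupB` are assumptions for illustration; on the real data one would more likely use Pandas and Scikit-learn, as the article suggests.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, feature_names, label_name):
    """Count label frequencies and per-label feature-value frequencies."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    vocab = defaultdict(set)             # feature -> set of values seen in training
    for row in rows:
        label = row[label_name]
        label_counts[label] += 1
        for feature in feature_names:
            value_counts[(label, feature)][row[feature]] += 1
            vocab[feature].add(row[feature])
    return label_counts, value_counts, vocab


def predict(model, row, feature_names):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for feature in feature_names:
            seen = value_counts[(label, feature)][row[feature]]
            # Laplace smoothing; the extra +1 reserves mass for unseen values.
            score += math.log((seen + 1) / (count + len(vocab[feature]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical field names and toy rows, for illustration only.
FEATURES = ["motivation", "malware", "tools"]
train_rows = [
    {"motivation": "political ideology", "malware": "Keylogger", "tools": "Empire", "cta": "DeepPanda"},
    {"motivation": "political ideology", "malware": "Keylogger", "tools": "ToolA", "cta": "DeepPanda"},
    {"motivation": "financial gain", "malware": "MalwareB", "tools": "ToolB", "cta": "GroupB"},
    {"motivation": "financial gain", "malware": "MalwareB", "tools": "Empire", "cta": "GroupB"},
]

model = train_naive_bayes(train_rows, FEATURES, "cta")
query = {"motivation": "political ideology", "malware": "Keylogger", "tools": "Empire"}
print(predict(model, query, FEATURES))
```

A baseline like this is mainly useful as a reference point: any more elaborate model trained on the dataset should at least beat it on a held-out split.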

Section 05

Research Value of CTI Dataset and Typical Data Examples

Research value: the dataset lowers the barrier to entry for academic research (no commercial subscription required), gives algorithms a common basis for evaluation, supports education and training, and offers a testbed for checking whether new algorithms are applicable. Typical data example: a record whose motivation is political ideology, which uses Keylogger malware and the Empire framework, and whose attribution label is the APT group DeepPanda. This structured representation makes the data straightforward to analyze and to use for algorithm training.
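As a rough illustration of how a structured record like this example can feed a learning algorithm, the sketch below one-hot encodes the categorical fields into a numeric vector. Only the four fields named in the example are modeled. The class name, field names, and the second record (`ExampleMalware`, `ExampleTool`, `ExampleGroup`) are hypothetical and exist only to give the vocabulary more than one value per field.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ThreatRecord:
    motivation: str
    malware: str
    tool: str
    cta: str  # attribution label, kept out of the feature vector


def feature_tokens(record):
    """Turn each non-label field into a 'field=value' token."""
    return [f"{key}={value}" for key, value in asdict(record).items() if key != "cta"]


def build_vocabulary(records):
    """Assign a stable column index to every token seen in the records."""
    tokens = sorted({token for record in records for token in feature_tokens(record)})
    return {token: index for index, token in enumerate(tokens)}


def one_hot(record, vocabulary):
    """Encode a record as a 0/1 vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for token in feature_tokens(record):
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = 1
    return vector


example = ThreatRecord("political ideology", "Keylogger", "Empire", "DeepPanda")
filler = ThreatRecord("financial gain", "ExampleMalware", "ExampleTool", "ExampleGroup")

vocabulary = build_vocabulary([example, filler])
vector = one_hot(example, vocabulary)
print(vector, "->", example.cta)
```

Keeping the label out of the vocabulary avoids leaking the answer into the features, which matters for any supervised evaluation built on the dataset.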

Section 06

Usage Restrictions and Ethical Guidelines of CTI Dataset

Use of the dataset is explicitly limited to education, academic research, and defensive security research. Any use for malicious activities is strictly prohibited, reflecting the sensitivity of cybersecurity data and the ethical considerations involved in sharing it.

Section 07

Future Optimization Directions of CTI Dataset

Planned improvements include expanding the data scale, enriching the feature dimensions (e.g., time series and infrastructure information), improving label quality (for example by adding confidence scores), aligning with standard frameworks such as MITRE ATT&CK, and establishing a real-time data update mechanism.
