Zing Forum

MACyber: Multi-Source Aligned Benchmark and Domain-Specific Large Language Model for Cybersecurity

The MACyber project provides the MACyber-INT multi-source aligned cybersecurity benchmark dataset and the MACyber-12B domain-specific large language model, covering seven key areas: network traffic, IoT, system logs, DNS, Web security, vulnerability intelligence, and threat intelligence. It offers a standardized toolset for evaluating AI models in the cybersecurity field.

Cybersecurity, Benchmark, Large Language Model, Threat Intelligence, RAG, AI Security, Vulnerability Detection, Intrusion Detection
Published 2026-05-26 17:45 | Recent activity 2026-05-26 17:49 | Estimated read 7 min

Section 01

Introduction / Original Post: MACyber: Multi-Source Aligned Benchmark and Domain-Specific Large Language Model for Cybersecurity

The MACyber project provides the MACyber-INT multi-source aligned cybersecurity benchmark dataset and the MACyber-12B domain-specific large language model, covering seven key areas: network traffic, IoT, system logs, DNS, Web security, vulnerability intelligence, and threat intelligence. It offers a standardized toolset for evaluating AI models in the cybersecurity field.

Section 02

Original Author and Source

  • Original Author/Maintainer: qcydm
  • Source Platform: GitHub
  • Original Title: MACyber: Multi-Source Aligned Cybersecurity Benchmark (MACyber-INT) and Large Language Model (MACyber-12B)
  • Original Link: https://github.com/qcydm/MACyber
  • Publication Date: May 26, 2026

Section 03

Project Overview

MACyber is a comprehensive open-source project focused on the cybersecurity domain, consisting of two core components: the MACyber-INT benchmark dataset and the MACyber-12B large language model. The project aims to address the lack of standardized evaluation tools for AI models in cybersecurity, providing researchers and practitioners with a structured framework for evaluating security intelligence data.

In today's digital age, cybersecurity threats are becoming increasingly complex, and traditional rule-based security systems struggle to handle new attack methods. Large language models have great potential for applications in cybersecurity, but there is a lack of targeted benchmarks to assess their real capabilities. The MACyber project fills this gap by constructing a comprehensive evaluation system covering seven key security areas through multi-source data alignment.

Section 04

Benchmark Architecture

The MACyber-INT benchmark comprises 31 datasets, organized into seven high-level security domains:

Section 05

Seven Key Security Domains

  1. Network Traffic Security: Covers threat detection at the network communication level, including abnormal traffic identification and intrusion detection.

  2. IoT Security: Addresses the specific security needs of IoT devices and evaluates a model's ability to identify IoT threats.

  3. System Log Security: Covers the discovery of potential security incidents and abnormal behavior through system log analysis.

  4. DNS Security Threat: Focuses on attack detection at the DNS level, including DNS tunneling and DDoS attacks.

  5. Web Security Threat: Covers attacks at the web application level, such as SQL injection, XSS, and CSRF.

  6. Vulnerability Intelligence: Evaluates a model's understanding of known vulnerabilities and its ability to identify new ones.

  7. Threat Intelligence: Covers comprehensive analysis of threat information, including attacker profiling and attack method identification.
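
Because the benchmark is organized by domain, results are naturally reported per domain rather than as one pooled number. The sketch below shows one common way to do that: per-domain accuracy followed by a macro average, so that a domain with many samples does not dominate the score. The source does not state how MACyber itself aggregates results, so the function names and the record layout here are assumptions made for illustration.

```python
from collections import defaultdict


def per_domain_accuracy(records):
    """Compute accuracy per domain.

    records: iterable of (domain, gold_label, predicted_label) tuples.
    This tuple layout is hypothetical, not the project's actual format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, gold, predicted in records:
        total[domain] += 1
        correct[domain] += int(gold == predicted)
    return {domain: correct[domain] / total[domain] for domain in total}


def macro_average(scores):
    """Average the per-domain scores, weighting every domain equally."""
    return sum(scores.values()) / len(scores)


records = [
    ("DNS Security Threat", "High", "High"),
    ("DNS Security Threat", "Benign", "Low"),
    ("Web Security Threat", "Medium", "Medium"),
]
scores = per_domain_accuracy(records)
print(scores)                 # {'DNS Security Threat': 0.5, 'Web Security Threat': 1.0}
print(macro_average(scores))  # 0.75
```

A pooled accuracy over the same three records would be 2/3; the macro average is 0.75 because each domain counts once.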

Section 06

Data Schema Design

MACyber uses a structured JSON data schema, where each sample includes the following key fields:

  • Metadata (meta): Contains category and subcategory information for data classification and retrieval
  • Feature Data (json): Stores specific security features, such as network traffic features and log fields
  • Label Information (label): Includes official threat labels and severity levels (Benign/Suspicious/Low/Medium/High)
  • Reasoning Process (reasoning): Provides evidence chains and analysis logic, which is a key feature of MACyber
  • Response Recommendations (response): Includes suggested response actions (No Action/Monitor/Block) and their justifications

This design provides a standard input-output format and also records an interpretable reasoning process, so evaluation can cover both the accuracy of the result and the soundness of the reasoning behind it.
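
To make the schema concrete, here is a minimal sketch of one sample and a structural check. The five top-level field names and the allowed severity and action values come from the list above. The nested keys (category, subcategory, threat, severity, action, justification), the feature names, and the example values are assumptions for illustration and may not match the actual dataset files.

```python
from typing import Any

# Allowed values are taken from the field descriptions above.
SEVERITY_LEVELS = {"Benign", "Suspicious", "Low", "Medium", "High"}
RESPONSE_ACTIONS = {"No Action", "Monitor", "Block"}
REQUIRED_FIELDS = ("meta", "json", "label", "reasoning", "response")


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one sample (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in sample]
    if problems:
        return problems
    # The nested key names below are hypothetical.
    if sample["label"].get("severity") not in SEVERITY_LEVELS:
        problems.append("label.severity is not a known severity level")
    if sample["response"].get("action") not in RESPONSE_ACTIONS:
        problems.append("response.action is not a known action")
    if not sample["reasoning"]:
        problems.append("reasoning is empty")
    return problems


# Invented example record; the values are not taken from MACyber-INT.
example = {
    "meta": {"category": "DNS Security Threat", "subcategory": "DNS tunneling"},
    "json": {"query_length": 212, "queries_per_minute": 340},
    "label": {"threat": "DNS tunneling", "severity": "High"},
    "reasoning": "Long, high-entropy subdomains queried at a sustained high rate.",
    "response": {"action": "Block", "justification": "Likely data exfiltration channel."},
}

print(validate_sample(example))  # []
```

A check like this only confirms that a record is well formed. Judging whether the reasoning field actually supports the label would need a separate, content-level evaluation.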

Section 07

MACyber-12B Model

The project also provides the MACyber-12B large language model, which is specifically trained for the cybersecurity domain. This model includes two important components:

Section 08

CyberLoRA

A low-rank adapter optimized for cybersecurity tasks. By injecting cybersecurity domain expertise into the base large model, it enhances the model's performance on security-related tasks.
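
For readers unfamiliar with low-rank adapters, the sketch below shows the generic LoRA formulation: the base weight W stays frozen and a trained low-rank product is added to it, W' = W + (alpha / r) * B * A. The source does not give CyberLoRA's rank, scaling factor, or target layers, so the dimensions and values here are illustrative assumptions, written in plain Python to stay dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, the generic LoRA update.

    w: d x k frozen base weight, b: d x r, a: r x k (r is the adapter rank).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, r):
    """Return (adapter parameters, full-matrix parameters) for one d x k layer."""
    return r * (d + k), d * k


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 base weight
b = [[1.0], [2.0]]            # 2 x 1 adapter matrix B
a = [[0.5, 0.5]]              # 1 x 2 adapter matrix A, so rank r = 1

print(lora_effective_weight(w, b, a, alpha=1.0))  # [[1.5, 0.5], [1.0, 2.0]]
print(trainable_params(4096, 4096, 8))            # (65536, 16777216)
```

The second call shows why adapters are cheap to train: for a hypothetical 4096 x 4096 layer at rank 8, the adapter trains 65,536 values instead of roughly 16.8 million.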