Zing Forum


Hybrid Multi-Agent Architecture: Enhancing CodeQL Static Analysis with LLM, 4x F1 Score Improvement

This cybersecurity master's thesis proposes an innovative three-agent hybrid architecture that combines large language models (LLMs) with the CodeQL static analysis tool. The Analyzer agent validates CodeQL results, the Suggestor agent identifies coverage gaps, and the Creator agent generates new queries. On a Python vulnerability dataset, this approach achieves a 4x improvement in F1 score from 0.11 to 0.43.

Tags: CodeQL, SAST, LLM, Static Analysis, Vulnerability Detection, Multi-Agent, DevSecOps, Security
Published 2026-04-10 17:00 · Recent activity 2026-04-10 17:20 · Estimated read 5 min

Section 01

【Introduction】Hybrid Multi-Agent Architecture: Core Breakthroughs in LLM-Enhanced CodeQL Static Analysis

This article proposes an innovative three-agent hybrid architecture that combines LLMs with CodeQL to address the limitations of traditional SAST tools. Through a closed loop formed by the Analyzer, Suggestor, and Creator agents, it achieves a 4x improvement in F1 score from 0.11 to 0.43 on a Python vulnerability dataset, while retaining CodeQL's determinism and auditability.


Section 02

【Background】Dilemmas of Static Analysis Tools and the Necessity of Hybrid Solutions

SAST tools like CodeQL have two major limitations: a lack of contextual reasoning, which leads to false positives, and an inability to detect new vulnerability patterns. Pure LLM approaches, meanwhile, face issues with reproducibility, cost, and DevSecOps integration. This motivates hybrid solutions that retain CodeQL's strengths while leveraging the LLM's reasoning capabilities.


Section 03

【Methodology】Design Details of the Three-Agent Hybrid Architecture

The system includes three specialized agents:

  1. Analyzer Agent: Runs CodeQL, parses its results, and uses an LLM to validate each alert (judging whether it is a true vulnerability from the surrounding source code context);
  2. Suggestor Agent: Analyzes CodeQL coverage gaps (false negatives) and generates structured improvement proposals (e.g., missing source/sink points);
  3. Creator Agent: Converts proposals into CodeQL queries and attempts compilation validation.

The design retains CodeQL's determinism while delegating contextual reasoning tasks to the LLM.
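As a rough sketch, the closed loop formed by the three agents can be wired together as below. The Alert type, the judge callback, and the stub query strings are illustrative assumptions, not the thesis implementation; in the real pipeline the judge call would be an LLM prompt over the alert's source context.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Alert:
    # Minimal stand-in for a parsed CodeQL alert.
    rule_id: str
    file: str
    line: int

def analyzer_agent(alerts: list[Alert], judge: Callable[[Alert], bool]) -> list[Alert]:
    # Analyzer: keep only alerts the LLM judge deems true positives.
    return [a for a in alerts if judge(a)]

def suggestor_agent(validated: list[Alert], ground_truth: list[Alert]) -> list[dict]:
    # Suggestor: each known vulnerability that survives validation nowhere
    # becomes a structured improvement proposal (e.g. a missing sink).
    found = set(validated)
    return [{"cwe": g.rule_id, "hint": f"missing sink near {g.file}:{g.line}"}
            for g in ground_truth if g not in found]

def creator_agent(proposals: list[dict]) -> list[str]:
    # Creator: turn each proposal into a draft CodeQL query (stubbed here
    # as a comment string; the real agent would also try to compile it).
    return [f"// draft query for {p['cwe']}: {p['hint']}" for p in proposals]
```

Keeping the judge as an injected callback means the orchestration itself stays deterministic and testable, with only the Analyzer's validation step touching the LLM.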

Section 04

【Evidence】Experimental Results and Performance Evaluation

Dataset: 27 Python vulnerability files covering CWE-78 (7), CWE-89 (10), and CWE-79 (10).

Performance results:

System            Precision   Recall   F1 Score
Analyzer Agent    0.667       0.320    0.432
Baseline CodeQL   0.167       0.080    0.108

The F1 score improved by approximately 4x (0.108 → 0.432).
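The table's F1 values can be checked directly from the reported precision and recall, since F1 is their harmonic mean:

```python
def f1(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

analyzer_f1 = f1(0.667, 0.320)           # ~0.432
baseline_f1 = f1(0.167, 0.080)           # ~0.108
improvement = analyzer_f1 / baseline_f1  # ~4.0x
```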
LLM-as-Judge evaluation: Suggestor proposals averaged 4.78/5 in quality, while Creator-generated queries averaged 3.0/5 (quality was lowest for CWE-78 generation).
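A minimal sketch of the LLM-as-Judge aggregation, assuming each artifact receives a 1–5 rubric score from a judge model and the reported quality is the mean. The example scores are invented for illustration, not the thesis data.

```python
from statistics import mean

def judge_quality(scores: list[int]) -> float:
    # Aggregate per-artifact rubric scores (1-5 scale) into a mean quality.
    assert all(1 <= s <= 5 for s in scores), "rubric is a 1-5 scale"
    return round(mean(scores), 2)
```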

Section 05

【Limitations and Outlook】Current Shortcomings and Future Directions

Limitations: generated queries require manual syntax adjustments; only three CWE types are covered; the dataset is small; only Python is supported. Future directions: improve the Creator's code generation capability; expand to more CWEs and programming languages; integrate into CI/CD pipelines; explore more efficient prompt engineering.


Section 06

【Industry Implications】Significance of Hybrid Architecture for Security Tool Development

  1. Hybrid is Better Than Replacement: LLM serves as an enhancement layer, retaining the auditability and interpretability of traditional tools;
  2. Agent Specialization: Agents with clear division of labor are more effective than general-purpose agents;
  3. Human-AI Collaboration: Generated queries need manual refinement, reflecting AI assistance rather than replacement;
  4. Integrability: Compatible with the CodeQL CLI, allowing seamless integration into existing DevSecOps workflows.
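As an integration sketch: a CI job can run `codeql database analyze` with SARIF output and feed the flattened alerts to the Analyzer agent. The subcommand and flags below follow the public `codeql` CLI, but the database path and query pack name are placeholders.

```python
def analyze_cmd(db_path: str, query_pack: str, out_path: str) -> list[str]:
    # Build a `codeql database analyze` invocation that writes SARIF,
    # the format parsed downstream by the Analyzer agent.
    return ["codeql", "database", "analyze", db_path, query_pack,
            "--format=sarif-latest", f"--output={out_path}"]

def alerts_from_sarif(sarif: dict) -> list[dict]:
    # Flatten SARIF results into simple alert records (rule, file, line).
    alerts = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = result["locations"][0]["physicalLocation"]
            alerts.append({
                "rule_id": result.get("ruleId", ""),
                "file": loc["artifactLocation"]["uri"],
                "line": loc["region"]["startLine"],
            })
    return alerts
```

Because SARIF is the interchange format, the same parsing step works whether the queries came from the standard CodeQL packs or from the Creator agent's drafts.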