Zing Forum

Model Evaluator: A Local LLM Reasoning Ability Evaluation Framework for the Security Domain

A local LLM evaluation tool designed for security scenarios. It tests Ollama-hosted local models across seven reasoning dimensions, scores their answers automatically with an LLM-as-Judge approach, and generates visual reports.

Tags: LLM evaluation, Ollama, security agents, reasoning ability, LLM-as-Judge, penetration testing, offline evaluation, model selection
Published 2026-05-19 07:54 | Recent activity 2026-05-19 08:21 | Estimated read 5 min

Section 01

Model Evaluator: Introduction to the Local LLM Reasoning Ability Evaluation Framework for the Security Domain

Model Evaluator is a local LLM evaluation tool designed for security scenarios. It tests Ollama-hosted local models across seven reasoning dimensions and uses an LLM-as-Judge approach to score answers automatically and generate visual reports. Its goal is to supply data for choosing models for security agents and penetration testing tools, where reasoning ability needs systematic evaluation before deployment.

Section 02

Project Background and Design Objectives

Before large language models are deployed in security-critical scenarios, their reasoning abilities need to be evaluated systematically. Unlike general-purpose LLM benchmarks, Model Evaluator focuses on the reasoning abilities that matter most in security work, such as abductive reasoning and hallucination resistance, and so provides a basis for selecting models for security agents and penetration testing tools.

Section 03

Core Architecture and Evaluation Methods

Architecture: The tool is built around two files. eval_harness.py runs the tests, performs LLM-as-Judge scoring, and generates reports; probe_builder.py adds custom security-scenario test cases.

Seven reasoning dimensions (weight in parentheses): Chain of Thought (1.5×), Abductive (2.0×), Analogical (1.5×), Counterfactual (1.0×), Causal Chain (1.5×), Hallucination Resistance (2.0×), and Self-Correction (1.0×). Abductive reasoning and hallucination resistance carry the highest weights.

LLM-as-Judge mechanism: The model under test answers each probe question, a judging model scores the answer against predefined criteria, and the scores are combined into a comprehensive report. Applying the same criteria to every answer is intended to keep scoring consistent.
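To make the weighting concrete, here is a minimal sketch of how per-dimension scores could be combined into one overall score. The weights are the ones listed above; the aggregation formula (a weighted mean on a 0-10 scale), the dictionary keys, and the function name `weighted_overall` are assumptions for illustration and are not taken from eval_harness.py.

```python
# Weights as listed in the article. The key names are hypothetical.
WEIGHTS = {
    "chain_of_thought": 1.5,
    "abductive": 2.0,
    "analogical": 1.5,
    "counterfactual": 1.0,
    "causal_chain": 1.5,
    "hallucination_resistance": 2.0,
    "self_correction": 1.0,
}


def weighted_overall(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each on a 0-10 scale.

    Assumption: the real tool may aggregate differently; this only
    shows what the listed weights imply under a weighted mean.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight
```

Under this reading, a one-point gain in abductive reasoning moves the overall score twice as much as a one-point gain in self-correction.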

Section 04

Fully Offline Design and Quick Start

Offline Design: All models run in the local Ollama environment. No external API calls are made and data stays on localhost, which helps meet enterprise security and compliance requirements.

Quick Usage:

  1. Environment preparation: install dependencies with pip install -r requirements.txt and pull the models to evaluate into Ollama (e.g., mistral, mixtral);
  2. Run evaluation: commands are available for comparing multiple models, testing specific dimensions, and specifying the judging model;
  3. Output: the run produces JSON results, CSV summaries, visual charts, and detailed reports.
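The offline answer-then-judge loop can be sketched as follows. This is an illustration under stated assumptions, not the code in eval_harness.py: it calls Ollama's local REST endpoint (`/api/generate` on the default port 11434), and the function names, the judge prompt wording, and the score-parsing rule are all hypothetical.

```python
import json
import re
import urllib.request
from typing import Callable

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def judge_answer(
    probe: str,
    rubric: str,
    candidate_model: str,
    judge_model: str,
    generate: Callable[[str, str], str] = ollama_generate,
) -> tuple[str, float]:
    """Ask the candidate model, then have the judge model score the answer 0-10."""
    answer = generate(candidate_model, probe)
    # Hypothetical judge prompt; the real tool's prompt is not shown in the article.
    judge_prompt = (
        f"Question:\n{probe}\n\nAnswer:\n{answer}\n\nRubric:\n{rubric}\n\n"
        "Reply with a single score from 0 to 10."
    )
    verdict = generate(judge_model, judge_prompt)
    # Simplification: take the first number in the verdict as the score.
    match = re.search(r"\d+(?:\.\d+)?", verdict)
    if match is None:
        raise ValueError(f"judge returned no numeric score: {verdict!r}")
    score = min(10.0, max(0.0, float(match.group())))
    return answer, score
```

Passing `generate` as a parameter keeps the loop testable without a running Ollama server, and because the only network target is localhost, no probe or answer leaves the machine.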

Section 05

Custom Probes and Score Interpretation

Custom Probes: Proprietary security-scenario test cases can be added through probe_builder.py, which supports saving example probes, creating probes interactively, and verifying probe format.

Score Interpretation: 8-10 (Excellent, production-ready), 6-8 (Good, needs prompt optimization), 4-6 (Average, needs fine-tuning), 0-4 (Poor, not suitable for security agents).

Key suggestion: Look first at the abductive reasoning and hallucination resistance scores, because they directly affect how reliable a security agent will be.
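The score bands can be expressed as a small lookup. This is a sketch: the article's ranges share their boundary values (4, 6, and 8 each appear in two bands), so the code assumes a boundary score counts toward the higher band. The helper `agent_readiness` is one possible reading of the key suggestion, not a feature of the tool, and the dictionary keys it uses are hypothetical.

```python
def interpret(score: float) -> str:
    """Map a 0-10 score to the article's rating bands.

    Assumption: boundary values (4, 6, 8) belong to the higher band.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 8:
        return "Excellent, production-ready"
    if score >= 6:
        return "Good, needs prompt optimization"
    if score >= 4:
        return "Average, needs fine-tuning"
    return "Poor, not suitable for security agents"


def agent_readiness(scores: dict[str, float]) -> str:
    """Rate a model by the weaker of its two highest-weighted dimensions.

    Hypothetical helper: it treats the lower of the abductive and
    hallucination-resistance scores as the limiting factor.
    """
    limiting = min(scores["abductive"], scores["hallucination_resistance"])
    return interpret(limiting)
```

For example, a model scoring 9 on abductive reasoning but 5 on hallucination resistance would be rated by the 5 under this reading, regardless of its other dimensions.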

Section 06

Applicable Scenarios and Project Value

Applicable Scenarios: Security tool selection, model iteration verification, private deployment evaluation, and security research.

Conclusion: Model Evaluator addresses a gap in LLM evaluation tooling for the security domain. It gives teams a data-driven basis for model decisions when building security agents and automated penetration testing tools.