# Model Evaluator: A Local LLM Reasoning Ability Evaluation Framework for the Security Domain

> A local LLM evaluation tool built for security scenarios. It tests models served locally by Ollama across seven reasoning dimensions, scores their answers automatically with an LLM-as-Judge setup, and generates visual reports.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T23:54:01.000Z
- Last activity: 2026-05-19T00:21:03.361Z
- Popularity: 141.6
- Keywords: LLM evaluation, Ollama, security agents, reasoning ability, LLM-as-Judge, penetration testing, offline evaluation, model selection
- Page link: https://www.zingnex.cn/en/forum/thread/model-evaluator-llm
- Canonical: https://www.zingnex.cn/forum/thread/model-evaluator-llm

---

## Model Evaluator: Introduction to the Local LLM Reasoning Ability Evaluation Framework for the Security Domain

Model Evaluator is a local LLM evaluation tool built for security scenarios. It tests models served locally by Ollama across seven reasoning dimensions, scores their answers automatically with an LLM-as-Judge setup, and generates visual reports. The goal is to supply data for choosing the model behind a security agent or penetration testing tool, where reasoning ability should be evaluated systematically before deployment.

## Project Background and Design Objectives

Large language models should have their reasoning abilities evaluated systematically before they are deployed in security-critical scenarios. Unlike general-purpose LLM benchmarks, Model Evaluator concentrates on the reasoning abilities that matter most in security work, such as abductive reasoning and hallucination resistance, and uses the results as a basis for selecting models for security agents and penetration testing tools.

## Core Architecture and Evaluation Methods

**Architecture**: The tool is built around two files. `eval_harness.py` runs the tests, performs LLM-as-Judge scoring, and generates the reports; `probe_builder.py` adds custom test cases for security scenarios.

**Seven Reasoning Dimensions**: Chain of Thought (1.5×), Abductive (2.0×), Analogical (1.5×), Counterfactual (1.0×), Causal Chain (1.5×), Hallucination Resistance (2.0×), and Self-Correction (1.0×). The multiplier in parentheses is the dimension's weight; abductive reasoning and hallucination resistance carry the highest weights.

**LLM-as-Judge Mechanism**: The model under test answers each probe question, a judge model scores the answer against predefined criteria, and the scores are compiled into a comprehensive report. Because every answer is scored against the same criteria by the same judge, results stay comparable across models.
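The answer-then-judge flow can be sketched in a few lines of Python. This is a minimal illustration, not the code in `eval_harness.py`: the function names, the judge prompt wording, and the score parsing are all assumptions made for the example. The only external interface used is Ollama's local `/api/generate` endpoint on its default port.

```python
import json
import re
import urllib.request
from typing import Callable

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming prompt to a local Ollama server and return its text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def parse_score(judge_reply: str) -> float:
    """Take the first non-negative number in the judge's reply, clamped to 0-10."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {judge_reply!r}")
    return max(0.0, min(10.0, float(match.group())))


def judge_probe(
    question: str,
    rubric: str,
    candidate: str,
    judge: str,
    generate: Callable[[str, str], str] = ollama_generate,
) -> float:
    """Ask the candidate model, then have the judge model score its answer."""
    answer = generate(candidate, question)
    # Hypothetical judge prompt; the real tool's wording is not documented here.
    judge_prompt = (
        "Score the answer from 0 to 10 against the rubric. "
        "Reply with the number only.\n"
        f"Question: {question}\nRubric: {rubric}\nAnswer: {answer}"
    )
    return parse_score(generate(judge, judge_prompt))
```

Passing `generate` as a parameter keeps the loop testable without a running Ollama server: a stub function that returns canned text can stand in for both the candidate and the judge.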

## Fully Offline Design and Quick Start

**Offline Design**: All models run in the local Ollama environment. No external API calls are made and data stays on localhost, which suits environments whose security and compliance policies forbid sending prompts or results to third parties.

**Quick Usage**:

1. Environment preparation: run `pip install -r requirements.txt` and pull the models to evaluate (e.g., mistral, mixtral);
2. Run evaluation: the available commands cover comparing multiple models, testing specific dimensions, and specifying the judge model;
3. Output: JSON results, CSV summaries, visual charts, and detailed reports.
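The offline guarantee depends on the Ollama endpoint actually pointing at the local machine. As a hedged illustration, a pre-flight check along the following lines could refuse to run against a non-loopback address. The function name `is_loopback_endpoint` is hypothetical and is not part of Model Evaluator.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_loopback_endpoint(url: str) -> bool:
    """Return True only if the URL's host is, or resolves solely to, a loopback address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        # Fast path: the host is an IP literal such as 127.0.0.1 or ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        pass  # Not an IP literal, so fall through to name resolution.
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    # Every resolved address must be loopback; strip any IPv6 zone suffix first.
    addresses = [info[4][0].split("%")[0] for info in infos]
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_loopback for address in addresses
    )
```

A harness could call this once at startup with its configured base URL and abort if it returns `False`, so that a misconfigured remote host is caught before any prompt is sent.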

## Custom Probes and Score Interpretation

**Custom Probes**: Proprietary security test cases can be added through `probe_builder.py`, which supports saving example probes, creating probes interactively, and verifying probe format.
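Format verification for a probe might look like the sketch below. The probe schema (`id`, `dimension`, `prompt`, `rubric`) and the dimension identifiers are assumptions made for illustration; the actual format used by `probe_builder.py` is not described in this post.

```python
# Hypothetical identifiers for the seven dimensions named in this post.
DIMENSIONS = {
    "chain_of_thought",
    "abductive",
    "analogical",
    "counterfactual",
    "causal_chain",
    "hallucination_resistance",
    "self_correction",
}

# Assumed probe schema: every field is a non-empty string.
REQUIRED_FIELDS = ("id", "dimension", "prompt", "rubric")


def validate_probe(probe: dict) -> list:
    """Return a list of human-readable problems; an empty list means the probe is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = probe.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"'{field}' must be a non-empty string")
    dimension = probe.get("dimension")
    # Only report an unknown dimension if the field itself was well-formed.
    if isinstance(dimension, str) and dimension.strip() and dimension not in DIMENSIONS:
        errors.append(f"unknown dimension: {dimension!r}")
    return errors
```

Returning every problem at once, rather than raising on the first one, lets an author fix a probe file in a single pass.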

**Score Interpretation**: 8-10 is Excellent (production-ready); 6-8 is Good (needs prompt optimization); 4-6 is Average (needs fine-tuning); 0-4 is Poor (not suitable for security agents). When comparing models, look first at the abductive reasoning and hallucination resistance scores, because they directly affect the reliability of a security agent.
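To make the weights and score bands concrete, here is a minimal sketch that combines per-dimension scores into one number and maps it to a band. Two things are assumptions, since the post does not specify them: that the overall score is a weighted mean of the seven dimension scores, and that each band includes its lower bound (so exactly 8 counts as Excellent). The dictionary keys are hypothetical identifiers.

```python
# Weights as listed in the post; the key names are illustrative.
WEIGHTS = {
    "chain_of_thought": 1.5,
    "abductive": 2.0,
    "analogical": 1.5,
    "counterfactual": 1.0,
    "causal_chain": 1.5,
    "hallucination_resistance": 2.0,
    "self_correction": 1.0,
}


def weighted_score(scores: dict) -> float:
    """Weighted mean of 0-10 dimension scores (assumed aggregation rule)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


def interpret(score: float) -> str:
    """Map a 0-10 score to the post's bands; lower bounds are assumed inclusive."""
    if score >= 8:
        return "Excellent"
    if score >= 6:
        return "Good"
    if score >= 4:
        return "Average"
    return "Poor"
```

Under these assumptions, a model that scores 9 on five dimensions but only 3 on abductive reasoning and hallucination resistance gets a weighted score of about 6.7, compared with about 7.3 for a plain average, which shows how the two heavily weighted dimensions pull the overall result down.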

## Applicable Scenarios and Project Value

**Applicable Scenarios**: Security tool selection, model iteration verification, private deployment evaluation, and security research.

**Conclusion**: Model Evaluator addresses the lack of LLM evaluation tools aimed at the security domain. It gives teams building security agents or automated penetration testing tools a data-driven basis for choosing a model.
