# llm-eval: A Self-Hosted Evaluation Framework for Local Large Language Models

> A self-hosted evaluation system designed specifically for local large language models. It supports multi-dimensional capability testing via llama.cpp's OpenAI-compatible endpoint, covering core abilities such as reasoning, programming, code quality, instruction following, long context, and writing. It also provides two difficulty levels (basic and difficult) and a comparison feature for enabling/disabling the thinking mode.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-13T20:07:40.000Z
- Last activity: 2026-05-13T20:20:06.231Z
- Popularity: 139.8
- Keywords: LLM evaluation, local models, llama.cpp, model comparison, reasoning tests, code generation, open-source tools
- Page URL: https://www.zingnex.cn/en/forum/thread/llm-eval-51906015
- Canonical: https://www.zingnex.cn/forum/thread/llm-eval-51906015
- Markdown source: floors_fallback

---

## llm-eval: Core Guide to the Self-Hosted Evaluation Framework for Local Large Language Models

llm-eval is a self-hosted evaluation framework designed specifically for local large language models. Built on llama.cpp's OpenAI-compatible endpoint, it supports multi-dimensional capability testing (reasoning, programming, code quality, and more), offers two difficulty levels (basic/difficult), and can compare the same model with the thinking mode enabled or disabled, helping developers evaluate model capabilities quickly and reliably in a local environment.

## Project Background and Core Objectives

llm-eval fills a gap in tooling for evaluating local LLMs. Its core objective is to help developers and researchers quickly and reliably assess the actual capabilities of different models in a local environment. Unlike cloud-API evaluation services, it runs fully offline, keeps data private, and gives users a trustworthy profile of a model's capabilities.

## Evaluation Methods and Test Capability Dimensions

### Core Design Philosophy
- Reproducible comparison: a fixed prompt set plus programmatic scoring keeps results consistent and comparable across runs (see the sketch after this list);
- Layered difficulty: Basic level (baseline check) and difficult level (to distinguish gaps between top models);
- Thinking mode comparison: Supports performance comparison of the same model with thinking mode enabled/disabled.
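
To make the reproducibility idea concrete, below is a minimal sketch of a fixed prompt set paired with a deterministic scorer. The item schema and the `score_item` helper are illustrative assumptions, not the framework's actual format.

```python
# Minimal sketch: a fixed prompt set plus a deterministic scorer so that
# repeated runs (and different models) remain directly comparable.
# The item schema is an illustrative assumption, not llm-eval's real format.
PROMPT_SET = [
    {"id": "math-001",
     "prompt": "What is 17 * 23? Answer with the number only.",
     "expected": "391"},
    {"id": "format-001",
     "prompt": "Return the word OK in uppercase, nothing else.",
     "expected": "OK"},
]

def score_item(item: dict, model_output: str) -> bool:
    """Exact-match check: no judgment calls, so the score is reproducible."""
    return model_output.strip() == item["expected"]
```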

### Test Capability Dimensions
Covers 7 core capabilities: reasoning, programming, code quality, instruction following, long context retrieval, writing, and tool calling.
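
For reference, the seven dimensions map naturally onto a tag set used to filter test cases; the identifiers below are assumptions for illustration and may not match the framework's internal names.

```python
# Illustrative capability tags; llm-eval's real identifiers may differ.
CAPABILITIES = (
    "reasoning",
    "programming",
    "code_quality",
    "instruction_following",
    "long_context",
    "writing",
    "tool_calling",
)

def select_cases(cases: list[dict], wanted: set[str]) -> list[dict]:
    """Filter the prompt set down to the requested capability dimensions."""
    return [c for c in cases if c.get("capability") in wanted]
```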

### Evaluation Mechanism
- Primarily programmatic scoring: automated verification of numeric answers, unit tests for generated code, format checks, and similar (sketched below);
- Supplementary rubric scoring: structured manual rubric scoring covers subjective dimensions such as writing.
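
Here is a hedged sketch of what such programmatic graders can look like: a numeric-value check, a JSON format check, and a unit-test runner for generated code. The function names and schemas are assumptions, not llm-eval's actual API.

```python
import json
import re
import subprocess
import sys
import tempfile

def check_numeric(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Value verification: compare the last number in the response against the target."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return bool(numbers) and abs(float(numbers[-1]) - expected) <= tol

def check_json_format(output: str, required_keys: set) -> bool:
    """Format check: the response must parse as JSON and contain the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_code(generated_code: str, test_code: str) -> bool:
    """Code grading: run the generated code together with a unit test in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```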

## Evaluation Results of Mainstream Models and Key Findings

The project has tested models such as Gemma-4-26B-A4B, Gemma-4-31B, Qwen3.6-35B-A3B, Qwen3.5-122B-A10B:
- The Gemma series passes roughly 98% of the basic-level tests; at the difficult level the 31B dense model pulls slightly ahead, while the 26B sparse model tends to overthink and its answers get truncated;
- Qwen3.6-35B-A3B ranks third;
- Qwen3.5-122B-A10B falls behind Qwen3.6-35B-A3B, a model with only about one-quarter of its parameter count, because of its aggressive Q3 quantization, underscoring how much the quantization strategy matters.

## Local Evaluation Usage Process

1. Start the model service using llama.cpp, enabling Jinja template support to obtain reasoning traces;
2. Run the evaluation script, specifying the model label, test capability scope, and thinking mode;
3. Use the report generation script to convert results into a comparison report.

The entire process can be completed locally and offline, protecting data privacy.
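
As a minimal client-side sketch, the snippet below sends one evaluation prompt to llama.cpp's OpenAI-compatible chat endpoint (llama-server's default port 8080 is assumed; the model label and URL are placeholders, not llm-eval's actual script).

```python
import requests

# Assumes llama-server is already running locally and exposing its
# OpenAI-compatible API on the default port 8080.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt: str, temperature: float = 0.0) -> str:
    """Send one evaluation prompt and return the model's reply text."""
    payload = {
        "model": "local-model",  # label only; llama-server answers with the loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    resp = requests.post(BASE_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("What is 17 * 23? Answer with the number only."))
```

Temperature 0 keeps sampling as deterministic as the backend allows, which matches the framework's reproducible-comparison goal.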

## Project Limitations and Usage Recommendations

### Limitations
The current version does not cover long-horizon agent loops, multi-step toolchains, multi-file collaboration, or open-ended composite tasks.

### Usage Recommendations
Strong evaluation results do not mean a model suits every scenario; users should weigh a model's applicability against their actual requirements.
