llm-eval: A Self-Hosted Evaluation Framework for Local Large Language Models

A self-hosted evaluation system designed specifically for local large language models. It supports multi-dimensional capability testing via llama.cpp's OpenAI-compatible endpoint, covering core abilities such as reasoning, programming, code quality, instruction following, long context, and writing. It also provides two difficulty levels (basic and difficult) and can compare the same model with its thinking mode enabled and disabled.

Tags: LLM evaluation, local models, llama.cpp, model comparison, reasoning tests, code generation, open-source tools
Published 2026-05-14 04:07 · Recent activity 2026-05-14 04:20 · Estimated read 5 min

Section 01

llm-eval: Core Guide to the Self-Hosted Evaluation Framework for Local Large Language Models

llm-eval is a self-hosted evaluation framework designed specifically for local large language models. Built on llama.cpp's OpenAI-compatible endpoint, it supports multi-dimensional capability testing (reasoning, programming, code quality, and more), offers two difficulty levels (basic and difficult), and can compare the same model with its thinking mode enabled and disabled, helping developers quickly and reliably evaluate model capabilities in a local environment.


Section 02

Project Background and Core Objectives

llm-eval fills a gap in local LLM evaluation tooling. Its core objective is to help developers and researchers quickly and reliably evaluate the actual capabilities of different models in a local environment. Unlike cloud-API evaluation solutions, it runs fully offline on local hardware, protecting data privacy while still producing a trustworthy profile of model capabilities.


Section 03

Evaluation Methods and Test Capability Dimensions

Core Design Philosophy

  • Reproducible comparison: Fixed prompt set + programmatic scoring to ensure consistent and comparable results;
  • Layered difficulty: Basic level (baseline check) and difficult level (to distinguish gaps between top models);
  • Thinking mode comparison: Supports performance comparison of the same model with thinking mode enabled/disabled.
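To make these principles concrete, here is a minimal sketch of how a fixed prompt set with difficulty levels and a thinking-mode toggle could be represented. All names and fields (TestCase, checker, run_suite) are illustrative assumptions for this article, not llm-eval's actual schema.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TestCase:
        case_id: str
        capability: str                  # e.g. "reasoning", "programming", "instruction_following"
        difficulty: str                  # "basic" or "difficult"
        prompt: str
        checker: Callable[[str], bool]   # programmatic pass/fail check on the model's answer

    # Fixed prompt set: the same cases are replayed for every model and setting.
    CASES = [
        TestCase(
            case_id="reasoning-001",
            capability="reasoning",
            difficulty="basic",
            prompt="A train travels 60 km in 40 minutes. What is its speed in km/h? Answer with a number only.",
            checker=lambda answer: "90" in answer,   # deliberately simple check for illustration
        ),
    ]

    def run_suite(ask, thinking: bool) -> dict:
        """Run the fixed set once for a given thinking setting; identical inputs keep runs comparable."""
        return {case.case_id: case.checker(ask(case.prompt, thinking=thinking))
                for case in CASES}

Because every case carries its own programmatic checker, two runs of run_suite (thinking on and thinking off) can be diffed directly to produce the enabled/disabled comparison described above.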

Test Capability Dimensions

The suite covers seven core capabilities: reasoning, programming, code quality, instruction following, long-context retrieval, writing, and tool calling.

Evaluation Mechanism

  • Primarily programmatic scoring: automated verification of numeric answers, unit testing of generated code, format checks, and similar;
  • Supplementary rubric scoring: dimensions such as writing are scored manually against a structured rubric.
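As an illustration of what such programmatic checks can look like, the helpers below sketch a value check, a JSON format check, and a unit-test run of generated code. They are assumptions written for this article, not code from the llm-eval repository.

    import json
    import subprocess
    import tempfile

    def check_numeric(answer: str, expected: float, tol: float = 1e-6) -> bool:
        """Value verification: compare the last number in the answer against the expected value."""
        tokens = [t for t in answer.replace(",", " ").split()
                  if t.replace(".", "", 1).replace("-", "", 1).isdigit()]
        return bool(tokens) and abs(float(tokens[-1]) - expected) <= tol

    def check_json_format(answer: str, required_keys: set) -> bool:
        """Format check: the answer must be valid JSON containing the required keys."""
        try:
            data = json.loads(answer)
        except ValueError:
            return False
        return isinstance(data, dict) and required_keys <= data.keys()

    def check_code(generated_code: str, test_snippet: str) -> bool:
        """Code unit test: run the generated code plus an assert-based test in a subprocess."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code + "\n\n" + test_snippet)
            path = f.name
        try:
            proc = subprocess.run(["python", path], capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0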

Section 04

Evaluation Results of Mainstream Models and Key Findings

The project has tested models such as Gemma-4-26B-A4B, Gemma-4-31B, Qwen3.6-35B-A3B, and Qwen3.5-122B-A10B:

  • The Gemma series passes roughly 98% of the basic-level tests; at the difficult level the 31B dense model pulls slightly ahead, while the 26B sparse model tends to overthink, which causes its answers to be truncated;
  • Qwen3.6-35B-A3B ranks third;
  • Qwen3.5-122B-A10B lags behind Qwen3.6-35B-A3B, a model with roughly one-quarter as many parameters, because of its aggressive Q3 quantization, underscoring how much the quantization strategy matters.

Section 05

Local Evaluation Usage Process

  1. Start the model service using llama.cpp, enabling Jinja template support to obtain reasoning traces;
  2. Run the evaluation script, specifying the model label, test capability scope, and thinking mode;
  3. Use the report generation script to convert the results into a comparison report.

The entire process can be completed offline on the local machine, protecting data privacy.
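Below is a minimal sketch of what steps 1 and 2 might look like in practice, assuming a llama.cpp server started with Jinja template support (a standard llama-server option) and a plain HTTP client for its OpenAI-compatible endpoint. The model label, payload fields, and the thinking-toggle convention are illustrative assumptions, not llm-eval's actual interface.

    # Step 1 (shell): start the model service with Jinja template support, e.g.
    #   llama-server -m ./models/model.gguf --port 8080 --jinja
    #
    # Step 2 (Python): send each test prompt to the OpenAI-compatible endpoint.
    import requests

    BASE_URL = "http://localhost:8080/v1/chat/completions"

    def ask(prompt: str, thinking: bool) -> str:
        """Send one prompt; the thinking toggle shown here is a hypothetical convention."""
        if not thinking:
            # How thinking mode is disabled depends on the model and chat template;
            # appending "/no_think" is one convention some templates recognize.
            prompt = prompt + " /no_think"
        payload = {
            "model": "local-model",  # label used to tag results in the report
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,      # keep outputs as deterministic as possible for reproducibility
        }
        resp = requests.post(BASE_URL, json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

Combined with the run_suite sketch from Section 03, this is enough to produce two result sets per model (thinking on and off) that a report script can turn into a side-by-side comparison.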

Section 06

Project Limitations and Usage Recommendations

Limitations

The current version does not test long-horizon agent loops, multi-step toolchains, multi-file collaboration, or open-ended composite tasks.

Usage Recommendations

Strong evaluation results do not mean a model is suitable for every scenario; users should judge a model's applicability against their actual needs.