Zing Forum

Open-source LLM Automated Evaluation Framework: A Local Benchmarking Solution Without API Keys

This article introduces an open-source LLM automated evaluation framework that supports comprehensive assessment of models like LLaMA, Mistral, and Phi-2 in terms of reasoning ability, latency, throughput, and memory usage. It enables automated continuous benchmarking and leaderboard updates via GitHub Actions.

Tags: LLM · Evaluation Benchmarks · Open-source Models · HuggingFace · GitHub Actions · Automated Testing · Model Leaderboard · Performance Evaluation
Published 2026-04-12 12:41 · Recent activity 2026-04-12 13:24 · Estimated read 6 min

Section 01

Introduction to the Open-source LLM Automated Evaluation Framework: A Local Benchmarking Solution Without API Keys

This article presents an open-source LLM automated evaluation framework that supports comprehensive assessment of models such as LLaMA, Mistral, and Phi-2 in reasoning ability, latency, throughput, and memory usage. Built on HuggingFace Transformers and running locally, it requires no commercial API keys. Through GitHub Actions, it enables automated continuous benchmarking and leaderboard updates, addressing issues in open-source model evaluation like environmental differences, inconsistent standards, redundant work, and lack of transparency.


Section 02

Project Background and Motivation

With the explosive growth of open-source large language models, developers face difficulties in model selection. While commercial API services offer standardized evaluations, open-source model evaluation has many challenges: performance inconsistencies due to environmental differences, inconsistent evaluation standards, resource waste from repeated tool building, and lack of credibility due to irreproducible results. This framework aims to provide a complete automated benchmarking solution that runs locally without API keys.


Section 03

Core Evaluation Metrics

The framework evaluates models from four dimensions:

  1. Reasoning Ability Score: assessed through 10 keyword-matching tasks (arithmetic, logic, common sense, sequence reasoning, etc.); the score is the fraction of tasks answered correctly.
  2. Latency Performance: the time to generate up to 50 tokens, reported as average, P50 (median), and P90 latency.
  3. Token Throughput: tokens generated per second, averaged over 3 independent runs.
  4. Memory Usage: the RSS increment (MB) measured before and after model loading.
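The three computed metrics above can be sketched as small helper functions. This is a minimal illustration of the scoring and aggregation logic described, not the project's actual code; the function names are assumptions.

```python
import statistics

def reasoning_score(answers, expected_keywords):
    """Fraction of tasks whose answer contains the expected keyword (case-insensitive)."""
    hits = sum(1 for ans, kw in zip(answers, expected_keywords)
               if kw.lower() in ans.lower())
    return hits / len(expected_keywords)

def latency_stats(samples_ms):
    """Average, P50, and P90 latency (ms) from repeated generation timings."""
    ordered = sorted(samples_ms)
    def pct(p):
        # Nearest-rank style percentile over the sorted samples.
        return ordered[int(p * (len(ordered) - 1))]
    return {"avg": statistics.mean(ordered), "p50": pct(0.50), "p90": pct(0.90)}

def tokens_per_second(tokens_generated, elapsed_s):
    """Throughput of a single generation run."""
    return tokens_generated / elapsed_s
```

With enough repeated runs, the percentile indexing above converges to the usual P50/P90 definitions; a production harness would likely use `statistics.quantiles` or NumPy instead.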

Section 04

Technical Architecture and Automation Mechanism

  • Project Structure: CI workflows, the main evaluation script, leaderboard-generation scripts, a model registry, result files, etc.
  • Inference Engine: HuggingFace Transformers; supports CPU/GPU, runs at zero cost, and is controllable, privacy-safe, and easy to extend.
  • Model Classification: ci_safe (e.g., distilgpt2), ci_borderline (e.g., gpt2-medium), local_only (e.g., Phi-2, Mistral-7B).
  • GitHub Actions Automation: triggered by code changes, a weekly schedule (Sundays at 2 AM UTC), or manual dispatch; automatically commits result files (raw data, leaderboard JSON, and Markdown).
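The three-tier model classification could be expressed as a simple registry that CI and local runs filter differently. The dictionary layout and helper name below are illustrative assumptions, not the project's actual API:

```python
# Hypothetical registry mirroring the ci_safe / ci_borderline / local_only tiers.
MODEL_REGISTRY = {
    "distilgpt2":      {"tier": "ci_safe",       "backend": "transformers"},
    "gpt2-medium":     {"tier": "ci_borderline", "backend": "transformers"},
    "microsoft/phi-2": {"tier": "local_only",    "backend": "transformers"},
    "mistral:7b":      {"tier": "local_only",    "backend": "ollama"},
}

def models_for_run(registry, allowed_tiers):
    """Select which models to benchmark in a given environment (CI vs. local)."""
    return [name for name, cfg in registry.items()
            if cfg["tier"] in allowed_tiers]
```

A CI job would pass `{"ci_safe"}` (or add `ci_borderline` on beefier runners), while a local run on a GPU workstation could include `local_only` models as well.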


Section 05

Local Usage and Community Contribution

Local Usage:

  • Basic evaluation: after installing dependencies, run run_benchmark.py (CI-safe models) to generate the leaderboard.
  • Large model evaluation: e.g., Phi-2 (requires 6 GB of memory) or Mistral 7B (requires Ollama).

Community Contribution: fork the repository → add a model configuration → run the evaluation locally → submit a PR to add the model to the leaderboard.
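The leaderboard-generation step, which turns raw results into the committed Markdown file, might look roughly like this. The field names and function are assumptions for illustration:

```python
def render_leaderboard(results):
    """Sort raw benchmark results by reasoning score and emit a Markdown table."""
    rows = sorted(results, key=lambda r: r["reasoning_score"], reverse=True)
    lines = ["| Model | Reasoning | Tokens/s | Memory (MB) |",
             "|---|---|---|---|"]
    for r in rows:
        lines.append("| {model} | {reasoning_score:.2f} | {tokens_per_s:.1f} "
                     "| {memory_mb:.0f} |".format(**r))
    return "\n".join(lines)
```

In the described workflow, GitHub Actions would write this string to the leaderboard Markdown file (alongside the raw JSON) and commit both back to the repository.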

Section 06

Application Scenarios

The framework is suitable for:

  1. Model Selection: Refer to the leaderboard to balance reasoning ability, speed, and memory usage.
  2. Performance Regression Testing: CI automated continuous evaluation to detect performance degradation in a timely manner.
  3. Hardware Selection: Memory usage data helps assess hardware compatibility.
  4. Academic Research: Standardized metrics and reproducible results provide a reliable data foundation.
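For the regression-testing scenario, the core check is a comparison of current scores against a stored baseline. A minimal sketch, assuming per-model score dictionaries and a hypothetical tolerance parameter:

```python
def detect_regressions(baseline, current, tolerance=0.05):
    """Return models whose reasoning score dropped by more than `tolerance`
    relative to the stored baseline results."""
    return sorted(m for m, score in current.items()
                  if m in baseline and baseline[m] - score > tolerance)
```

A CI step could run this after each benchmark and fail the job (or open an issue) when the returned list is non-empty.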

Section 07

Limitations and Future Improvement Directions

Current Limitations: reasoning scoring relies on keyword matching, generation is limited to short texts (≤50 tokens), and CI runs on a single hardware environment.

Future Improvements: introduce more complex tasks (multi-step reasoning, code generation), support long-text evaluation, collect data across multiple hardware configurations to build performance-prediction models, and integrate more inference backends (vLLM, TensorRT-LLM).