Zing Forum

Comprehensive Evaluation Framework for Open-Source Large Language Models: Automated Benchmarking Based on LLM-as-a-Judge

A reusable open-source LLM evaluation framework that supports automated benchmarking across multi-dimensional tasks including reasoning, programming, multilingual capabilities, security, and structured generation, combining performance metrics with LLM-as-a-Judge quality scores.

Tags: LLM evaluation, benchmarking, model comparison, LLM-as-a-Judge, performance testing, open-source models, automated evaluation
Published 2026-05-29 15:40 | Recent activity 2026-05-29 15:53 | Estimated read 7 min

Section 01

Comprehensive Open-Source LLM Evaluation Framework: Core Value & Guide

This article introduces a reusable open-source LLM evaluation framework that automates benchmarking across five task dimensions: reasoning, programming, multilingual capability, security, and structured generation. The framework combines performance metrics (latency, time to first token, throughput) with LLM-as-a-Judge quality scores, giving developers and researchers data to support model selection. The project compares 3 open-source models using a standardized process and presents the results in an interactive dashboard.


Section 02

Project Background & Motivation

With the rapid development of open-source large language models, developers face a model selection problem: models differ in latency, response quality, multilingual capability, and more, while official benchmarks rarely reflect real-world needs. Existing evaluation tools have four limitations: narrow test coverage, no unified standards, high manual cost, and performance metrics reported separately from quality metrics. This project builds a reusable framework that addresses these issues with standardized prompts, an LLM-as-a-Judge mechanism, and an interactive dashboard.


Section 03

Core Evaluation Dimensions & Methodology

Core Dimensions: The framework defines 5 key dimensions: reasoning ability (logic, mathematics, common sense), programming ability (code generation, algorithm implementation), structured output (JSON Schema compliance), multilingual ability (Hindi, Gujarati, Hinglish), and security (jailbreak resistance, prompt injection defense).

Methodology:

  • Test design: 5 prompts per dimension and 3 temperature settings, totaling 225 runs (25 prompts × 3 models × 3 temperatures);
  • Performance metrics: TTFT (Time to First Token), total latency, throughput, and estimated cost are collected for each run;
  • LLM-as-a-Judge: llama-3.3-70b-versatile (temperature 0.0) scores each response on correctness, instruction following, clarity, and completeness, plus an overall score (1-10 points).
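The methodology above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the temperature values, the `Run` record, the helper names, and the throughput definition (completion tokens divided by generation time after the first token) are all assumptions, since the article only states that 3 temperatures are used and that TTFT, latency, and throughput are collected.

```python
from dataclasses import dataclass
from itertools import product

DIMENSIONS = ["reasoning", "programming", "structured_output", "multilingual", "security"]
MODELS = ["llama-3.1-8b-instant", "qwen/qwen3-32b", "openai/gpt-oss-120b"]
TEMPERATURES = [0.0, 0.5, 1.0]  # assumed values; the article only says "3 temperatures"
PROMPTS_PER_DIMENSION = 5
JUDGE_CRITERIA = ("correctness", "instruction_following", "clarity", "completeness", "overall")


@dataclass(frozen=True)
class Run:
    """One benchmark run: a single prompt sent to one model at one temperature."""
    model: str
    dimension: str
    prompt_index: int
    temperature: float


def build_run_matrix() -> list[Run]:
    """Enumerate every (model, dimension, prompt, temperature) combination."""
    combos = product(MODELS, DIMENSIONS, range(PROMPTS_PER_DIMENSION), TEMPERATURES)
    return [Run(model, dim, idx, temp) for model, dim, idx, temp in combos]


def ttft_ms(request_sent: float, first_token_at: float) -> float:
    """Time to first token in milliseconds, from timestamps in seconds."""
    return (first_token_at - request_sent) * 1000.0


def throughput_tps(completion_tokens: int, first_token_at: float, finished_at: float) -> float:
    """Tokens per second over the generation phase (assumed definition)."""
    generation_time = finished_at - first_token_at
    if generation_time <= 0:
        return 0.0
    return completion_tokens / generation_time


def validate_judge_scores(scores: dict) -> dict[str, float]:
    """Check that the judge returned every criterion on the 1-10 scale."""
    validated = {}
    for name in JUDGE_CRITERIA:
        if name not in scores:
            raise ValueError(f"missing criterion: {name}")
        value = float(scores[name])
        if not 1.0 <= value <= 10.0:
            raise ValueError(f"{name} out of range: {value}")
        validated[name] = value
    return validated
```

Calling `len(build_run_matrix())` gives 225, matching the 25 prompts × 3 models × 3 temperatures stated above. Validating the judge's output before aggregation is one way to keep a malformed judge response from silently skewing the averages.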

Section 04

Experimental Results & Key Findings

Model Comparison: Three models were evaluated: llama-3.1-8b-instant, qwen/qwen3-32b, and openai/gpt-oss-120b.

Model                | Average Latency | Time to First Token | Throughput | Quality Score
llama-3.1-8b-instant | 667ms ✅        | 219ms               | 213t/s ✅  | 8.62/10
qwen/qwen3-32b       | 3564ms ❌       | 1421ms              | 201t/s     | 8.70/10
openai/gpt-oss-120b  | 1248ms          | 398ms               | 130t/s     | 9.36/10

Key Insights:
  1. Speed: Llama3.1-8B has an average latency of 667ms, about 5.3x faster than Qwen3-32B (3564ms);
  2. Quality: GPT-OSS 120B has an overall score of 9.36/10, with full marks in reasoning and programming tasks;
  3. Cost-effectiveness of structured output: Llama3.1-8B and GPT-OSS tied for full marks, with Llama3.1-8B being roughly 2x faster;
  4. Security: Qwen3-32B scored the highest (8.80) and GPT-OSS the lowest (8.13), so a larger model is not automatically a safer one;
  5. Cost: Llama3.1-8B costs far less than GPT-OSS while reaching about 92% of its quality score (8.62 vs. 9.36).
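The ratios behind these insights can be recomputed directly from the reported averages. The snippet below is a small worked check, not part of the project: the `RESULTS` dictionary and the two helper functions are hypothetical names, and the GPT-OSS quality score of 9.36 is taken from insight 2.

```python
# Reported averages from the comparison (latency in ms, quality on a 1-10 scale).
RESULTS = {
    "llama-3.1-8b-instant": {"latency_ms": 667, "quality": 8.62},
    "qwen/qwen3-32b": {"latency_ms": 3564, "quality": 8.70},
    "openai/gpt-oss-120b": {"latency_ms": 1248, "quality": 9.36},
}


def latency_ratio(slower: str, faster: str) -> float:
    """How many times longer the slower model takes on average."""
    return RESULTS[slower]["latency_ms"] / RESULTS[faster]["latency_ms"]


def quality_ratio(model: str, reference: str) -> float:
    """Quality score of `model` as a fraction of the reference model's score."""
    return RESULTS[model]["quality"] / RESULTS[reference]["quality"]


if __name__ == "__main__":
    print(round(latency_ratio("qwen/qwen3-32b", "llama-3.1-8b-instant"), 2))       # 5.34
    print(round(latency_ratio("openai/gpt-oss-120b", "llama-3.1-8b-instant"), 2))  # 1.87
    print(round(quality_ratio("llama-3.1-8b-instant", "openai/gpt-oss-120b"), 2))  # 0.92
```

These figures use the overall averages only; per-dimension ratios, such as the structured-output comparison in insight 3, would need the per-dimension latencies, which the table does not list.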

Section 05

Technical Implementation & Fairness Assurance

  • Project Structure: Includes files such as prompts.json (prompts), benchmark_runner.py (main runner), and dashboard.html (interactive dashboard);
  • Tech Stack: Python 3.10+, Groq SDK, python-dotenv, Chart.js, and native HTML/CSS/JS;
  • Usage Flow: Install dependencies → Configure API keys → Run tests → View dashboard (the runner supports resuming from breakpoints and handles rate limits);
  • Fairness: All models run on the same Groq LPU hardware with standardized prompts, 3 temperature samples, and a single llama-3.3-70b-versatile judge model, so results are comparable.
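The article does not show how resuming and rate-limit handling are implemented. One common approach, sketched below under stated assumptions, is to append each finished run to a JSONL checkpoint file and retry rate-limited calls with exponential backoff. Everything here is hypothetical: `RateLimited` stands in for whatever error the provider SDK raises, and `load_completed`, `call_with_backoff`, and `run_all` are illustrative names rather than functions from benchmark_runner.py.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable


class RateLimited(Exception):
    """Hypothetical stand-in for the provider's rate-limit error."""


def load_completed(checkpoint: Path) -> set[str]:
    """Return the IDs of runs already recorded in the checkpoint file."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(encoding="utf-8") as fh:
        return {json.loads(line)["run_id"] for line in fh if line.strip()}


def call_with_backoff(call: Callable[[], dict], max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry a rate-limited call, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_retries must be at least 1")


def run_all(run_ids: Iterable[str], execute: Callable[[str], dict],
            checkpoint: Path, sleep: Callable[[float], None] = time.sleep) -> int:
    """Execute only the runs missing from the checkpoint; return how many ran."""
    done = load_completed(checkpoint)
    executed = 0
    with checkpoint.open("a", encoding="utf-8") as fh:
        for run_id in run_ids:
            if run_id in done:
                continue  # finished in an earlier session, skip it
            result = call_with_backoff(lambda: execute(run_id), sleep=sleep)
            fh.write(json.dumps({"run_id": run_id, "result": result}) + "\n")
            fh.flush()  # persist progress so an interruption loses at most one run
            executed += 1
    return executed
```

Writing one line per completed run means an interrupted 225-run benchmark can restart without repeating finished calls, which matters when each call consumes rate-limited API quota.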