Reading

model-speed-test: A Comprehensive Evaluation Tool for LLMs with OpenAI-Compatible APIs

An open-source LLM benchmarking tool that supports comprehensive evaluation of speed, visual understanding, tool calling, and reasoning capabilities for any OpenAI-compatible API, helping developers objectively compare the performance of different models and providers.

LLM基准测试OpenAI API性能评测工具调用视觉模型开源工具模型选型

Published 2026-06-13 23:10Recent activity 2026-06-13 23:21Estimated read 7 min

model-speed-test: A Comprehensive Evaluation Tool for LLMs with OpenAI-Compatible APIs

Section 01

[Main Post] model-speed-test: Guide to the Comprehensive Evaluation Tool for LLMs with OpenAI-Compatible APIs

Core Points

model-speed-test is an open-source LLM benchmarking tool that supports comprehensive evaluation of speed, visual understanding, tool calling, and reasoning capabilities for any OpenAI-compatible API, helping developers objectively compare the performance of different models and providers.

Original Author & Source

Original Author/Maintainer: 1chenmm
Source Platform: GitHub
Original Link: https://github.com/1chenmm/model-speed-test
Release Time/Update Time: 2026-06-13T15:10:11Z

Section 02

Project Background & Core Features

Project Overview

model-speed-test focuses on LLM performance evaluation, with the design goal of providing objective and reproducible benchmark results. Unlike tools that only focus on generation speed, it uses a multi-dimensional evaluation system, measuring model capabilities from four key aspects: inference speed, visual understanding, tool calling, and logical reasoning—closer to real-world application scenarios.

Core Features

The project's biggest feature is supporting any OpenAI-compatible API endpoint, including OpenAI services, third-party providers like Azure, and locally deployed inference servers such as vLLM and TGI. It allows horizontal comparison using the same set of standards, making it highly versatile.

Section 03

Detailed Evaluation Methods & Dimensions

Speed Test

Using tokens per second (TPS) as the metric, it measures the text generation throughput of the model. It supports configuring different concurrency levels and input/output lengths to simulate real-scenario load patterns.

Visual Understanding Evaluation

Evaluates the model's accuracy in understanding image content and response speed. By sending image-containing inputs, it checks the accuracy and completeness of descriptions, testing the quality of the visual encoder and multi-modal fusion efficiency.

Tool Calling Test

Simulates real scenarios and evaluates three aspects: calling accuracy (correctly identifying tools and generating formatted parameters), parameter extraction precision (extracting structured parameters from natural language), and calling timing judgment (only calling tools when necessary).

Reasoning Capability Evaluation

Through math calculation, logical reasoning, and common sense judgment questions, it distinguishes between memory models and reasoning models, helping developers assess whether a model is suitable for specific scenarios (e.g., math tutoring).

Section 04

Use Cases & Practical Recommendations

Applicable Scenarios

Technical decision-makers: Data-driven selection to avoid marketing misinformation;
Operations engineers: Regular benchmarking to detect service degradation in time;
Researchers: Standardized results for paper citations and peer comparisons.

Practical Recommendations

Establish a fixed test baseline: Test the main model weekly with the same parameters and record TPS trends;
Customize test cases based on business scenarios: Add business-related samples to get targeted evaluation results.

Section 05

Technical Architecture & Extensibility

Architecture Design

Adopts a modular design where the four test dimensions are relatively independent. They can be enabled/disabled on demand, lowering the barrier to use and facilitating the expansion of new dimensions.

Technical Implementation

Developed based on Python with clear dependency management and simple deployment. Test results are output in a structured format, making it easy to integrate into CI/CD pipelines or data visualization platforms.

Section 06

Summary & Industry Significance

Summary

The emergence of model-speed-test reflects the shift of the LLM ecosystem from 'wild growth' to 'rational evaluation'. It provides an objective performance benchmark and advocates a data-driven selection culture.

Industry Significance

It is recommended that developers include it in their technical research process, obtain first-hand data through actual testing instead of relying on vendor promotions or community reputation, maintain an objective understanding of model capabilities, and make correct technical decisions.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

libmlxforge: An Embedded MLX LLM Inference Engine for Apple Silicon

libmlxforge is an embeddable MLX large language model (LLM) inference engine designed specifically for Apple Silicon. It provides a unified C ABI interface, supports calls from Node.js, Swift, and Rust, and features continuous batching, streaming output, JSON-constrained structured output, and embedding vector generation.

Recent activity 2026-06-09 17:23