Zing Forum


GPUSCALE: A Benchmarking Platform for LLM Inference in Large-Scale GPU Selection and Rental

GPUSCALE is a GPU benchmarking project for large-scale AI workloads, designed to provide data support for GPU procurement and rental decisions. The project supports local GPUs and cloud GPU services (Vast.ai, RunPod), and collects key metrics including tokens per second, first token latency, VRAM usage, and power consumption through standardized containerized testing processes.

Tags: GPU benchmarking · LLM inference · Cloud GPU · Vast.ai · RunPod · Performance optimization · Hardware selection · llama.cpp · vLLM
Published 2026-04-16 05:35 · Recent activity 2026-04-16 05:51 · Estimated read 8 min

Section 01

GPUSCALE Project Introduction: A Benchmarking Platform for LLM Inference in Large-Scale GPU Selection and Rental

GPUSCALE is a GPU benchmarking project for large-scale AI workloads, aiming to provide data support for GPU procurement and rental decisions. The project supports local GPUs and cloud GPU services (Vast.ai, RunPod), and collects key metrics such as tokens per second, first token latency, VRAM usage, and power consumption through standardized containerized testing processes. It helps AI service providers and researchers make informed decisions and provides a reference benchmark for the design of new accelerators.


Section 02

Project Background and Motivation

As LLMs see widespread adoption across industries, GPUs have become the core resource of AI infrastructure. However, the market offers a wide range of GPU models and cloud rental services, and developers and enterprises lack reliable performance reference data: existing benchmarks are either too simplistic or not tailored to LLM inference scenarios. GPUSCALE aims to establish a public GPU performance database, similar to Blender Open Data, that provides trustworthy results for AI-oriented GPU tasks and supports large-scale procurement/rental decisions as well as the design of new accelerators.


Section 03

Architecture Design and Core Components

GPUSCALE adopts a modular architecture, consisting of four core components:

  1. S3-Attach: Manages private model weights (e.g., Meta's original Llama weights) stored in Wasabi S3 buckets; public models are pulled directly from the HuggingFace Hub.
  2. Virt-Runner: The test execution engine, responsible for infrastructure configuration, containerized testing, result collection, and resource release, supporting cloud (Vast.ai/RunPod) and local GPU testing.
  3. DBOps: A CLI tool that validates, formats, and submits results to the Supabase database to ensure data integrity.
  4. Results-Disp: A public leaderboard that displays results and supports multi-dimensional filtering and comparison.
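The flow between these four components can be sketched as a simple pipeline. This is a minimal illustration only; the class and function names below are hypothetical stand-ins, not GPUSCALE's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkJob:
    """One benchmark request flowing through the four components."""
    model: str                     # HuggingFace repo id or private S3 key
    gpu: str                       # target GPU, e.g. "RTX 4090"
    engine: str                    # "llama.cpp" or "vllm"
    results: dict = field(default_factory=dict)

def s3_attach(job: BenchmarkJob) -> str:
    """Resolve model weights: private weights from S3, public from HF Hub."""
    if job.model.startswith("s3://"):
        return f"downloaded {job.model} from Wasabi bucket"
    return f"pulled {job.model} from HuggingFace Hub"

def virt_runner(job: BenchmarkJob) -> None:
    """Provision a GPU (cloud or local), run the containerized test,
    collect metrics, then release the resources."""
    job.results = {"tokens_per_sec": 0.0, "ttft_ms": 0.0, "peak_vram_mb": 0}

def dbops_submit(job: BenchmarkJob) -> bool:
    """Validate and format results before writing to the Supabase database."""
    required = {"tokens_per_sec", "ttft_ms", "peak_vram_mb"}
    return required <= job.results.keys()

job = BenchmarkJob(model="meta-llama/Llama-3-8B", gpu="RTX 4090", engine="vllm")
source = s3_attach(job)
virt_runner(job)
ok = dbops_submit(job)  # only validated results reach the public leaderboard
```

The key design point this sketch captures is that DBOps sits between the runner and the leaderboard, so malformed results never reach the public database.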

Section 04

Benchmarking Methodology

Containerized Standardization

All tests are executed in standardized Docker containers, with fixed inference engines (llama.cpp, vLLM), CUDA versions, and metric tools to ensure a consistent software stack.
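As an illustration of what such a pinned, containerized invocation might look like, the sketch below builds a `docker run` command with fixed image tags. The image names and tags are hypothetical, not GPUSCALE's published images:

```python
def docker_run_cmd(engine: str, model_path: str) -> list[str]:
    """Build a docker run command with a pinned engine image.

    Image tags here are illustrative; a real setup would pin exact
    digests to guarantee an identical software stack on every host.
    """
    images = {
        "llama.cpp": "gpuscale/llamacpp-bench:cuda12.4-v1",  # hypothetical tag
        "vllm": "gpuscale/vllm-bench:cuda12.4-v1",           # hypothetical tag
    }
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{model_path}:/models:ro",   # mount weights read-only
        images[engine],
    ]

cmd = docker_run_cmd("vllm", "/data/models/llama-3-8b")
```

Pinning the engine, CUDA version, and metric tools inside one image is what lets results from different hosts and cloud providers be compared directly.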

Inference Engine Selection

  • llama.cpp: Suitable for CPU/GPU inference, GGUF models, single-GPU consumer hardware; lightweight and ideal for edge deployment.
  • vLLM: Optimized specifically for GPUs, supports full-weight/GPTQ models and multi-GPU setups, providing production-grade performance.
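The rule of thumb above can be expressed as a small selection helper. This is a sketch of the decision rule only, not GPUSCALE code:

```python
def pick_engine(model_format: str, gpu_count: int) -> str:
    """Choose an inference engine per the rule of thumb above:
    GGUF on a single (consumer) GPU -> llama.cpp;
    full-weight or GPTQ models, including multi-GPU -> vLLM."""
    if model_format == "gguf" and gpu_count <= 1:
        return "llama.cpp"
    if model_format in ("full", "gptq"):
        return "vllm"
    raise ValueError(f"unsupported combination: {model_format!r}, {gpu_count} GPUs")
```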

Key Performance Metrics

Metric Category         | Specific Metric                      | Data Source
------------------------|--------------------------------------|------------------
Throughput              | Tokens per second (generation phase) | Engine statistics
Latency                 | Time to First Token (TTFT)           | Engine statistics
Processing Speed        | Prompt evaluation rate               | Engine statistics
VRAM Usage              | Peak VRAM consumption                | nvidia-smi
Power Consumption       | GPU power draw / TDP                 | nvidia-smi
Utilization             | Average and peak GPU utilization     | nvidia-smi
Thermal Characteristics | GPU temperature                      | nvidia-smi
Overall                 | Total benchmark runtime              | Testing framework
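The nvidia-smi-sourced metrics can be sampled with its CSV query mode. The query field names below are real nvidia-smi options, but the surrounding functions are an illustrative sketch, not GPUSCALE's collector:

```python
import subprocess

SMI_FIELDS = "memory.used,power.draw,utilization.gpu,temperature.gpu"

def parse_smi_line(line: str) -> dict:
    """Parse one CSV line produced by
    `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    vram_mb, power_w, util_pct, temp_c = (float(v) for v in line.split(","))
    return {"vram_mb": vram_mb, "power_w": power_w,
            "util_pct": util_pct, "temp_c": temp_c}

def sample_gpu(index: int = 0) -> dict:
    """Take one sample from the given GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={SMI_FIELDS}",
         "--format=csv,noheader,nounits", "-i", str(index)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return parse_smi_line(out)
```

Peak VRAM consumption is then simply the maximum `vram_mb` over samples taken throughout the benchmark run; average and peak utilization follow the same pattern.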

Standardized Workloads

A standardized set of prompts with fixed parameters is used, and workload definitions and parameters are stored as metadata to ensure result comparability.
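A workload definition of this kind might be stored as structured metadata alongside each result. The field names and values below are illustrative, not GPUSCALE's actual schema:

```python
# Hypothetical workload definition: fixed prompts plus fixed generation
# parameters, stored as metadata so results remain comparable over time.
WORKLOAD = {
    "workload_id": "chat-short-v1",   # hypothetical identifier
    "prompts": [
        "Explain the difference between latency and throughput.",
        "Summarize the plot of a well-known novel in three sentences.",
    ],
    "params": {
        "max_new_tokens": 256,
        "temperature": 0.0,           # greedy decoding for reproducibility
        "seed": 42,
    },
}
```

Recording the parameters rather than just the prompt text matters: two runs with different `max_new_tokens` or sampling settings are not comparable even on identical prompts.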


Section 05

Special Considerations for Local Testing

Cloud instances run Linux, and containerization ensures a consistent environment; local testing is affected by the operating system, kernel, and drivers, so metadata needs to be recorded:

  • Operating system and distribution (e.g., Ubuntu 24.04, Windows 11 + WSL2)
  • Kernel version (e.g., 6.8.0-45-generic)
  • Host NVIDIA driver version (e.g., 550.54.14)
  • Docker runtime version (e.g., nvidia-container-toolkit 1.16.1)

This metadata is stored along with the results, making it possible to distinguish results produced in different local environments.
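Collecting this host metadata can be automated with the standard library plus an nvidia-smi query. A minimal sketch (the driver lookup is simplified to GPU 0):

```python
import platform
import shutil
import subprocess

def host_metadata() -> dict:
    """Gather the environment fields that should accompany local results."""
    meta = {
        "os": platform.system(),         # e.g. "Linux"
        "os_version": platform.version(),
        "kernel": platform.release(),    # e.g. "6.8.0-45-generic"
        "driver": None,
    }
    # Host NVIDIA driver version, if nvidia-smi is on PATH.
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        meta["driver"] = out.stdout.strip() or None
    return meta
```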

Section 06

Practical Application Value

GPUSCALE provides data support for AI infrastructure decisions:

  1. Procurement Decisions: Compare the performance of GPU models under LLM workloads to select cost-effective configurations.
  2. Rental Optimization: Compare the performance and price of cloud service provider instances to find configurations suitable for specific scenarios.
  3. Capacity Planning: Predict the GPU resources required for different scale deployments based on performance data.
  4. Technology Selection: Evaluate the performance differences between llama.cpp and vLLM on specific hardware.
  5. Trend Tracking: Establish a historical database to track the evolution of GPU performance.
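For rental optimization in particular, measured throughput and hourly price combine into a single cost figure. The numbers below are made up for illustration:

```python
def cost_per_million_tokens(tokens_per_sec: float, price_per_hour: float) -> float:
    """USD cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Hypothetical comparison: 120 tok/s at $0.40/h vs 95 tok/s at $0.25/h.
a = cost_per_million_tokens(120, 0.40)   # ~$0.93 per million tokens
b = cost_per_million_tokens(95, 0.25)    # ~$0.73 -> cheaper despite lower speed
```

This is exactly the kind of comparison the leaderboard enables: the faster instance is not always the cheaper one per token.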

Section 07

Summary and Outlook

GPUSCALE provides a trustworthy reference for GPU selection in LLM inference scenarios through a systematic benchmarking methodology and an open collaboration model. Containerized and standardized processes ensure comparable and reproducible results, while the modular architecture supports flexible expansion. As AI workloads grow, this platform will play an important role in hardware selection and infrastructure planning. The community can jointly contribute data, improve methodologies, and establish a comprehensive and authoritative AI GPU performance database.