BenchFlow: A Reproducible LLM Inference Benchmarking Framework for OpenShift

This article introduces the BenchFlow project, a control plane for LLM inference benchmarking specifically designed for OpenShift environments. It delves into its architectural design, single-cluster and multi-cluster deployment modes, Tekton pipeline integration, and collaboration mechanism with GuideLLM, providing a reference for teams needing model performance evaluation in Kubernetes environments.

Tags: LLM benchmarking · OpenShift · Kubernetes · Tekton · Kueue · GuideLLM · MLOps · GPU scheduling
Published 2026-03-31 18:40 · Recent activity 2026-03-31 18:51 · Estimated read: 6 min

Section 01

[Introduction] BenchFlow: A Reproducible Control Plane Framework for LLM Inference Benchmarking in OpenShift Environments

This article introduces the BenchFlow project, a control plane for LLM inference benchmarking specifically designed for OpenShift environments. It addresses issues like poor environmental consistency and difficult resource scheduling in traditional benchmarking. Built on cloud-native components such as Tekton and Kueue, it supports single/multi-cluster deployment and matrix experiments, integrates GuideLLM and MLflow, and provides a reproducible and traceable solution for model performance evaluation in Kubernetes environments.


Section 02

Background: Core Challenges of LLM Benchmarking in OpenShift Environments

With the widespread production deployment of LLMs, accurate and reproducible inference performance evaluation has become critical. Traditional manual scripts struggle to ensure environmental consistency, and managing concurrent experiments is challenging. On enterprise-grade K8s platforms like OpenShift, additional challenges include:

  1. Environmental consistency: How to ensure the same test environment every time?
  2. Resource scheduling: How to coordinate competition for GPU resources?
  3. Result tracking: How to systematically record and compare performance data?
  4. Multi-cluster management: How to unify testing when workloads are distributed across multiple clusters?

BenchFlow is a control plane framework designed to address these issues.

Section 03

Methodology and Architecture: Core Design and Deployment Modes of BenchFlow

BenchFlow is positioned as a "packaged control plane" rather than scattered scripts, providing full lifecycle management of experiments. Core dependencies include Tekton (execution pipelines), Kueue (resource scheduling), GuideLLM (load/metrics), and MLflow (experiment tracking). The core abstraction is RunPlan (an immutable execution plan to ensure reproducibility), which is converted into a Tekton PipelineRun during execution. Deployment modes:

  • Single cluster: Install all components via bflow bootstrap --single-cluster, with Kueue managing GPU admission;
  • Multi-cluster: The management cluster runs the control plane, while target clusters only need basic K8s/GPU support, with resource tracking via a remote capacity controller.
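The RunPlan abstraction above can be sketched as a frozen dataclass: the plan is fixed at creation time and deterministically maps to a Tekton PipelineRun. The field names and naming scheme here are illustrative assumptions, not BenchFlow's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of an immutable RunPlan; field names are
# illustrative assumptions, not BenchFlow's real schema.
@dataclass(frozen=True)  # frozen=True forbids mutation after creation
class RunPlan:
    model: str            # model to benchmark
    gpu_count: int        # GPUs requested, admitted via Kueue
    concurrency: int      # GuideLLM load level
    cluster: str = "local"  # target cluster (multi-cluster mode)

    def to_pipelinerun_name(self) -> str:
        # A deterministic name ties the resulting Tekton PipelineRun
        # back to the exact plan that produced it.
        return f"benchflow-{self.model}-{self.gpu_count}g-c{self.concurrency}"

plan = RunPlan(model="llama-3-8b", gpu_count=1, concurrency=8)
print(plan.to_pipelinerun_name())  # benchflow-llama-3-8b-1g-c8
```

Because the dataclass is frozen, any attempt to mutate a plan after submission raises an error, which is one simple way to get the reproducibility guarantee the article describes.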

Section 04

Key Features: Matrix Experiments, GuideLLM Integration, and Result Tracking

BenchFlow supports matrix experiments: after users define parameter lists, it automatically generates a sub-execution for every combination (for example, the Cartesian product of different models, batch sizes, and concurrency levels), with parallelism managed by Kueue. For GuideLLM integration, BenchFlow delegates load generation and metric collection to GuideLLM, automatically setting GUIDELLM_OUTPUT_DIR so that results land in a consistent location. For result tracking, completed results are pushed to MLflow, supporting performance comparison, historical tracking, and version association, with real-time monitoring via Grafana.
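The matrix expansion described above amounts to a Cartesian product over the user's parameter lists. A minimal sketch (the parameter names are illustrative, not BenchFlow's API):

```python
from itertools import product

# Hypothetical parameter matrix; keys and values are illustrative.
matrix = {
    "model": ["llama-3-8b", "mistral-7b"],
    "batch_size": [1, 8],
    "concurrency": [4, 16],
}

def expand_matrix(matrix: dict) -> list[dict]:
    """Generate one sub-execution config per parameter combination."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*matrix.values())]

runs = expand_matrix(matrix)
print(len(runs))   # 2 * 2 * 2 = 8 sub-executions
print(runs[0])     # {'model': 'llama-3-8b', 'batch_size': 1, 'concurrency': 4}
```

Each generated dict would become one sub-execution (one PipelineRun); Kueue then decides how many of the eight run concurrently given available GPU quota.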


Section 05

Known Limitations and Future Improvement Directions

BenchFlow is currently in the experimental phase and has the following limitations:

  1. No cluster-level locks: Concurrent runs that modify cluster state are prone to race conditions;
  2. Serial execution of the llm-d matrix: This limits the efficiency of large-scale parameter sweeps;
  3. Parent-execution cancellation gap: Cancelling a parent execution does not reach its still-queued sub-executions, which must be handled separately.

These limitations provide clear directions for community contributions and targeted future optimization.
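To make the parent-cancellation limitation concrete, the sketch below (entirely hypothetical helper names and states, not BenchFlow code) shows why a cleanup pass is needed: running children die with the parent, but queued children must be found and cancelled separately.

```python
# Hypothetical illustration of the parent-cancellation gap: after a
# parent run is cancelled, sub-executions still queued (e.g. in Kueue)
# are not cancelled automatically and need a separate cleanup pass.

def children_needing_manual_cancel(child_runs: list[dict]) -> list[str]:
    """Return names of sub-executions left queued after a parent cancel."""
    return [r["name"] for r in child_runs if r["state"] == "queued"]

children = [
    {"name": "run-a", "state": "running"},  # stops with the parent
    {"name": "run-b", "state": "queued"},   # must be cancelled by hand
    {"name": "run-c", "state": "queued"},
]
print(children_needing_manual_cancel(children))  # ['run-b', 'run-c']
```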

Section 06

Technical Insights and Summary

BenchFlow embodies cloud-native best practices: declarative configuration (GitOps-friendly), layered architecture (separation of control/execution planes), resource-aware scheduling (Kueue integration), and observability-first (Grafana + MLflow). Summary: BenchFlow provides a structured solution for LLM inference benchmarking on OpenShift, establishing reproducible and traceable practices, and is an important reference framework for the LLM Ops toolchain.