Prefill-intensive LLM Inference Auto-tuning: An Analysis of the heavy-prefill-bench Benchmark Suite

An in-depth interpretation of the heavy-prefill-bench project, exploring how automated parameter sweeps and cost-normalized metrics improve the throughput and cost-effectiveness of long-context LLM inference.

LLM Inference · Prefill Optimization · SGLang · Benchmark · GPU · Throughput · Cost Efficiency · vLLM
Published 2026-04-27 07:45 · Recent activity 2026-04-27 07:50 · Estimated read 6 min

Section 01

Introduction: heavy-prefill-bench, a Benchmark Suite for Auto-tuning Prefill-intensive LLM Inference

In long-context Large Language Model (LLM) inference, the prefill phase (processing the input prompt) often becomes the performance bottleneck. This article analyzes the open-source benchmark suite heavy-prefill-bench, which uses automated parameter scanning and cost-normalized metrics to optimize the throughput and cost-effectiveness of long-context inference. It supports frameworks such as SGLang, helping teams find the best combination of hardware, model, and configuration.


Section 02

Background: The Necessity of Prefill Optimization and Limitations of Traditional Benchmarks

Modern LLM applications (such as code completion, document Q&A, and RAG) feature long inputs, short outputs, batch processing, and cost sensitivity. Traditional benchmarks mostly target short contexts or balanced input-output ratios, so they poorly reflect real long-context production loads. heavy-prefill-bench is designed to fill this gap.


Section 03

Core Methods: Auto-tuner and Key Design

Auto-tuner Features

  • Parameter Scanning: Systematically sweep the chunked_prefill_size parameter to find the optimal throughput point (a minimal sketch of the sweep loop follows this list);
  • Zero HTTP Overhead: Drive SGLang's built-in offline modules directly, avoiding network-layer interference;
  • Automatic GPU Detection: Identify the GPU model via nvidia-smi and embed it in the results to ensure traceability.
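
What such a sweep loop might look like, assuming a caller-supplied run_once helper that drives SGLang's offline engine in-process; the names and structure here are illustrative, not heavy-prefill-bench's actual code:

```python
import subprocess
import time

def detect_gpu() -> str:
    # Query the GPU model name via nvidia-smi (first device) so every
    # result row records the hardware it was produced on.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        text=True,
    )
    return out.strip().splitlines()[0]

def sweep(chunk_sizes, prompts, run_once):
    # run_once(chunk_size, prompts) -> total input tokens processed.
    # It is assumed to call SGLang's offline engine directly, in-process,
    # so no HTTP layer sits between the benchmark and the scheduler.
    gpu = detect_gpu()
    results = []
    for size in chunk_sizes:
        start = time.perf_counter()
        tokens = run_once(size, prompts)
        elapsed = time.perf_counter() - start
        results.append({
            "gpu": gpu,
            "chunked_prefill_size": size,
            "input_tokens_per_sec": tokens / elapsed,
        })
    return results
```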

Configuration System

Includes workload definition (input/output length, number of requests, etc.), model and quantization strategies, and hardware cost tracking (GPU hourly cost, etc.).
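
A plausible shape for such a configuration is sketched below; the field names are assumptions for illustration, not the suite's actual schema:

```python
# Illustrative configuration; field names are assumptions, not
# heavy-prefill-bench's actual schema.
config = {
    "workload": {
        "input_len": 8192,    # long prompts make the run prefill-heavy
        "output_len": 128,    # short completions
        "num_requests": 256,
    },
    "model": {
        "path": "Qwen/Qwen2.5-7B-Instruct",
        "quantization": "bf16",   # or "fp8"
    },
    "hardware": {
        "gpu": "H100",
        "provider": "example-cloud",   # assumed provider name
        "hourly_cost_usd": 2.50,       # assumed price; feeds tokens_per_dollar
    },
}
```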

Key Metrics

  • Request-level: requests_per_sec, etc.;
  • Token-level: input_tokens_per_sec, etc.;
  • Cost efficiency: tokens_per_dollar (the core metric for cross-hardware and cross-provider comparison; computed as in the sketch below).
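
The cost metric itself is simple arithmetic over elapsed GPU time and an hourly price; a minimal sketch (the $0.40/hr figure below is an assumed rental price, not from the article):

```python
def tokens_per_dollar(total_tokens: int, elapsed_sec: float,
                      hourly_cost_usd: float) -> float:
    # Cost-normalized throughput: tokens processed per dollar of GPU time.
    cost_usd = (elapsed_sec / 3600.0) * hourly_cost_usd
    return total_tokens / cost_usd

# Example: sustaining 6,000 tokens/sec for one hour on a $0.40/hr GPU
# yields 6000 * 3600 / 0.40 = 54 million tokens per dollar.
print(tokens_per_dollar(6_000 * 3600, 3600.0, 0.40))  # 54000000.0
```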

Section 04

Insights from Measured Data: Impact of Hardware and Parameters

Consumer GPU Memory Wall

An RTX4090 (24GB) running Qwen2.5-7B (bf16) triggers OOM once chunked_prefill_size exceeds 8192, making memory capacity the binding constraint.

Sweet Spot Parameter Differences

  • For Qwen2.5-7B/14B (bf16) on H100, larger chunks lead to higher throughput, with 32768 being optimal;
  • For Qwen2.5-32B (fp8), the opposite is true—2048 is the sweet spot.

Cost Efficiency Comparison

RTX4090 running Phi-4-mini can reach about 54 million tokens per dollar, while H100 running Qwen2.5-7B achieves about 15 million tokens per dollar. Consumer GPUs are more cost-effective for small model scenarios.
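
A back-of-envelope check shows why: under assumed rental prices (not from the article), the reported figures imply roughly the following raw throughputs, so the H100's modest speed edge does not offset its much higher price for a small model:

```python
# Assumed hourly prices, for illustration only.
for name, tok_per_dollar, hourly_usd in [
    ("RTX4090 / Phi-4-mini", 54e6, 0.40),
    ("H100 / Qwen2.5-7B",    15e6, 2.50),
]:
    implied_tok_per_sec = tok_per_dollar * hourly_usd / 3600
    print(f"{name}: ~{implied_tok_per_sec:,.0f} tokens/sec implied")
# RTX4090 / Phi-4-mini: ~6,000 tokens/sec implied
# H100 / Qwen2.5-7B: ~10,417 tokens/sec implied
```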


Section 05

Engineering Practice Recommendations: Key Points for Tuning and Deployment

  • Cost Tracking: Record complete pricing information (provider, instance type, etc.) to avoid apples-to-oranges comparisons;
  • Load Adaptation: Scan with representative workload configurations instead of generic benchmarks;
  • OOM Prevention: Reserve 10-15% GPU memory headroom to avoid production interruptions (a minimal check sketch follows this list);
  • Quantization Trade-off: fp8 reduces memory pressure but may shift the optimal chunk size; verify with actual testing.
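
One way to monitor that headroom, assuming nvidia-smi is available on the host (a sketch, not part of the suite):

```python
import subprocess

def gpu_headroom_fraction(index: int = 0) -> float:
    # Fraction of GPU memory still free, read from nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = (float(x) for x in out.strip().split(","))
    return 1.0 - used / total

# Warn after server warm-up if less than 15% of memory remains free.
if gpu_headroom_fraction() < 0.15:
    print("WARNING: <15% GPU memory headroom; OOM risk under load spikes")
```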

Section 06

Extension Integration and Conclusion: From Experience-driven to Data-driven

Extension Support

  • Frameworks: Integration interfaces are reserved for vLLM and TensorRT-LLM;
  • Output: CSV/JSONL formats for easy downstream analysis (a minimal writer sketch follows this list);
  • Metadata: Complete configurations and pricing are written to JSON to support trend analysis.
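
A minimal writer for the JSONL results plus a metadata side file; file names and structure here are illustrative, not the suite's actual layout:

```python
import json

def write_results(rows, config, results_path="results.jsonl",
                  meta_path="run_meta.json"):
    # One JSON object per benchmark row, appended so repeated runs
    # accumulate into a single analyzable stream.
    with open(results_path, "a") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Full configuration and pricing metadata, kept alongside the
    # results so later trend analysis can reconstruct each run.
    with open(meta_path, "w") as f:
        json.dump(config, f, indent=2)
```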

Conclusion

heavy-prefill-bench promotes the shift of LLM inference optimization from experience-driven to data-driven. Through systematic scanning and cost normalization, it becomes an essential tool for production tuning of long-context applications.