
AutoInfer: A Hardware-Adaptive Inference Optimization Framework for Large Language Models

Inference optimization for large language models is often reduced to chasing the highest token generation speed, ignoring the quality loss caused by quantization. AutoInfer introduces the concept of quality-adjusted throughput and uses Bayesian optimization to automatically find the optimal balance between speed and quality, enabling each GPU to deliver its best effective performance.

Tags: LLM inference optimization, Bayesian optimization, quantization, GPU acceleration, llama.cpp, model deployment, performance tuning, Pareto optimization
Published 2026-03-28 21:13 · Recent activity 2026-03-28 21:20 · Estimated read: 7 min

Section 01

AutoInfer: Core Guide to the Hardware-Adaptive LLM Inference Optimization Framework

AutoInfer is a hardware-adaptive inference optimization framework for large language models, designed to address the problem of overemphasizing token generation speed while ignoring quality loss in inference optimization. It introduces the quality-adjusted throughput (tok/s × quality_score) metric and uses Bayesian optimization to automatically find the optimal balance between speed and quality, allowing each GPU to maximize its performance.


Section 02

Myths of Inference Optimization: The Pitfall of Speed-First and the Dilemma of Manual Parameter Tuning

In real-world deployments of large language models, inference optimization often falls into the trap of fixating on token generation speed (tok/s) while neglecting output quality. For example, an IQ2_M quantized model running at 21.6 tok/s may, because of perplexity degradation, deliver worse effective performance than the Q3_K_M version at 12.3 tok/s. Manual parameter tuning, meanwhile, is not reproducible: the optimal configuration shifts with hardware model, quantization level, and driver version, forcing a tedious new search after every change.
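To make that comparison concrete, here is a minimal sketch of the quality-adjusted throughput arithmetic. The quality scores below are illustrative assumptions (AutoInfer derives them from measured perplexity), not published measurements:

```python
def quality_adjusted_throughput(tok_per_s: float, quality_score: float) -> float:
    """Quality-adjusted throughput = tok/s x quality_score."""
    return tok_per_s * quality_score

# Hypothetical quality scores for illustration only.
iq2_m  = quality_adjusted_throughput(21.6, 0.55)   # fast, but quality degraded
q3_k_m = quality_adjusted_throughput(12.3, 0.97)   # slower, near-lossless

print(f"IQ2_M:  {iq2_m:.2f}")    # 11.88
print(f"Q3_K_M: {q3_k_m:.2f}")   # 11.93 -- the slower quant wins
```

Under these assumed scores, the "slow" Q3_K_M configuration edges out the "fast" IQ2_M one once quality is priced in, which is exactly the bias the metric is designed to correct.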


Section 03

Quality-Adjusted Throughput: A New Optimization Metric for Balancing Speed and Quality

AutoInfer proposes quality-adjusted throughput as the optimization target, computed as tok/s × quality_score, which makes the speed-quality trade-off explicit. The quality score is derived from perplexity (lower perplexity indicates higher generation quality), and Pareto frontier analysis finds either the maximum throughput under a given quality threshold or the best quality achievable at a target speed.
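A minimal sketch of the Pareto frontier analysis described above; the configuration list and its quality values are hypothetical illustrations, not AutoInfer's measured results:

```python
def pareto_frontier(configs):
    """Keep configs not dominated in (speed, quality): a config is dominated
    if another is at least as fast AND at least as high quality, and strictly
    better in one of the two."""
    frontier = []
    for c in configs:
        dominated = any(
            o["tok_s"] >= c["tok_s"] and o["quality"] >= c["quality"]
            and (o["tok_s"] > c["tok_s"] or o["quality"] > c["quality"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier

def best_under_quality(configs, min_quality):
    """Maximum tok/s among configs meeting the quality threshold."""
    ok = [c for c in configs if c["quality"] >= min_quality]
    return max(ok, key=lambda c: c["tok_s"]) if ok else None

# Illustrative numbers; quality scores here are assumptions.
configs = [
    {"name": "IQ2_M",  "tok_s": 21.6, "quality": 0.55},
    {"name": "IQ3_S",  "tok_s": 16.0, "quality": 0.90},
    {"name": "Q3_K_M", "tok_s": 12.3, "quality": 0.97},
]
print([c["name"] for c in pareto_frontier(configs)])
print(best_under_quality(configs, 0.95)["name"])  # Q3_K_M
```

All three sample points sit on the frontier (each trades speed against quality), so the choice reduces to the threshold query: demanding quality ≥ 0.95 selects Q3_K_M.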


Section 04

Full Process of Bayesian Optimization-Driven Automatic Parameter Search

The core of AutoInfer is a parameter search framework built on Bayesian optimization. The process comprises five stages:

1. Hardware profiling: automatically detect GPU memory, system RAM, CPU core count, and storage speed to establish a baseline.
2. Parameter space definition: cover the number of GPU layers to offload, batch size, micro-batch size, CPU thread count, KV cache quantization type, Flash Attention on/off, and more, with hardware constraints applied.
3. Bayesian optimization search: use the Optuna TPE sampler to explore efficiently across 50+ trials.
4. Comprehensive evaluation: measure both speed and perplexity, with support for multiple backends.
5. Pareto analysis: generate the quality-speed trade-off curve and select the optimal operating point.
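The search stages above can be sketched as a loop over a constrained parameter space. AutoInfer itself uses Optuna's TPE sampler and a real benchmark harness; the sketch below substitutes plain random sampling and a stubbed objective so it stays self-contained, and every parameter name, range, and formula in it is an illustrative assumption:

```python
import random

# Illustrative parameter space; real bounds come from hardware profiling
# (e.g. the GPU-layer cap is derived from detected VRAM).
SPACE = {
    "n_gpu_layers":  range(0, 49),
    "batch_size":    [128, 256, 512, 1024],
    "n_threads":     range(1, 17),
    "kv_cache_type": ["f16", "q8_0", "q4_0"],
    "flash_attn":    [False, True],
}

def sample(rng):
    """Draw one configuration uniformly at random (stand-in for TPE)."""
    return {k: rng.choice(list(v)) for k, v in SPACE.items()}

def objective(cfg):
    """Stand-in for a real benchmark run: returns tok/s x quality_score.
    A real trial would launch the inference backend and measure both."""
    tok_s = 5 + 0.3 * cfg["n_gpu_layers"] + (2 if cfg["flash_attn"] else 0)
    quality = 0.97 if cfg["kv_cache_type"] == "f16" else 0.93
    return tok_s * quality

rng = random.Random(42)
trials = [sample(rng) for _ in range(50)]   # "50+ trials" from the process
best = max(trials, key=objective)
print(f"best QAT: {objective(best):.2f}")
```

Swapping the sampler for `optuna.samplers.TPESampler` is what lets the real framework exploit the non-linear parameter interactions instead of sampling blindly.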


Section 05

700+ Experiments Validate: Key Findings on Quantization and Parameter Interactions

AutoInfer ran more than 700 experiments on the Qwen3.5-35B-A3B model (covering the Q3_K_M, IQ2_M, and IQ3_S quantization levels), revealing key parameter interactions: offloading more GPU layers usually improves speed, but performance drops sharply as the memory limit is approached; larger batches raise throughput at the cost of latency; and the effect of Flash Attention varies by configuration. Bayesian optimization learns these non-linear relationships automatically, with no hand-written rules.


Section 06

Guide to Using AutoInfer's Command-Line Tool

AutoInfer provides a straightforward command-line interface. A typical workflow:

1. Hardware profiling: autoinfer profile prints a hardware summary; add --json --storage for a detailed report.
2. Optimization: autoinfer optimize --model models/Qwen3.5-35B-A3B-Q3_K_M.gguf --bench ./target/release/bench --corpus benchmarks/wikitext_sample.txt --trials 50 --target-quality 0.95 --output results.tsv
3. Analysis: autoinfer analyze results_phase9.tsv results_phase10.tsv results_phase11.tsv generates Pareto curves and configuration recommendations.


Section 07

Multi-Scenario Application Value of AutoInfer

AutoInfer serves multiple audiences: individual users are spared manual parameter tuning and can get the most out of consumer GPUs; enterprise deployments gain a reproducible optimization process that reduces operational burden; and model developers can read deployment characteristics off the Pareto curves to guide quantization strategy and architecture design.


Section 08

Conclusion: From Experience-Driven to Data-Driven Inference Optimization

AutoInfer represents the shift of LLM inference optimization from experience-driven to data-driven, automatically finding the optimal configuration through systematic experiments and Bayesian optimization. It introduces quality-adjusted throughput to correct speed bias and helps find the balance between speed and quality. As LLMs evolve, such tools will become a key part of infrastructure, promoting the widespread application of LLMs.