# AutoInfer: A Hardware-Adaptive Inference Optimization Framework for Large Language Models

> Inference optimization for large language models is often simplified to pursuing the highest token generation speed, ignoring the quality loss caused by quantization. AutoInfer introduces the concept of quality-adjusted throughput and uses Bayesian optimization to automatically find the optimal balance between speed and quality, enabling each GPU to maximize its performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-28T13:13:23.000Z
- Last activity: 2026-03-28T13:20:02.182Z
- Popularity: 161.9
- Keywords: large language models, inference optimization, Bayesian optimization, quantization, GPU acceleration, llama.cpp, model deployment, performance tuning, Pareto optimization
- Page link: https://www.zingnex.cn/en/forum/thread/autoinfer
- Canonical: https://www.zingnex.cn/forum/thread/autoinfer
- Markdown source: floors_fallback

---

## AutoInfer: Core Guide to the Hardware-Adaptive LLM Inference Optimization Framework

AutoInfer is a hardware-adaptive inference optimization framework for large language models. It targets a common failure in inference tuning: maximizing token generation speed while ignoring the quality lost to quantization. It introduces quality-adjusted throughput (tok/s × quality_score) as its metric and uses Bayesian optimization to search for the configuration that best balances speed and quality on a given GPU.

## Myths of Inference Optimization: The Pitfall of Speed-First and the Dilemma of Manual Parameter Tuning

In real deployments, inference tuning tends to fixate on token generation speed (tok/s) and neglect output quality. For example, an IQ2_M quantized model running at 21.6 tok/s may be less useful overall than the Q3_K_M version at 12.3 tok/s once its perplexity degradation is taken into account. Manual tuning is also hard to reproduce: the best configuration depends on the hardware model, quantization level, and driver version, so the search has to be repeated whenever any of them changes.
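As a rough check on this example, the break-even point under the tok/s × quality_score metric can be worked out from the two quoted speeds alone. The source does not give the quality scores of either model, so the sketch below only reports the ratio at which the two configurations would tie.

```python
# Speeds quoted in the text, in tokens per second.
iq2_m_speed = 21.6
q3_k_m_speed = 12.3

# Under tok/s x quality_score, the two configurations tie when
#   iq2_m_speed * q_iq2 == q3_k_m_speed * q_q3
# so the break-even quality ratio q_iq2 / q_q3 is:
break_even_ratio = q3_k_m_speed / iq2_m_speed

print(f"IQ2_M comes out ahead only if it keeps more than "
      f"{break_even_ratio:.1%} of Q3_K_M's quality score")
```

In other words, the faster model wins on the combined metric only if it retains more than roughly 57% of the slower model's quality score.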

## Quality-Adjusted Throughput: A New Optimization Metric for Balancing Speed and Quality

AutoInfer optimizes for quality-adjusted throughput, calculated as tok/s × quality_score, so that speed and quality are traded off explicitly. The quality score is derived from perplexity: lower perplexity indicates better generation quality, so it should yield a higher score. Pareto frontier analysis then identifies either the maximum throughput achievable above a given quality threshold or the best-quality configuration that meets a target speed.
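The metric can be illustrated with a minimal sketch. The source does not say how perplexity is converted into `quality_score`, so the mapping below (baseline perplexity divided by measured perplexity, capped at 1.0) is an assumption made for illustration, as are the `Trial` type, the helper names, and all of the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    name: str
    tok_per_s: float
    perplexity: float


def quality_score(perplexity: float, baseline_perplexity: float) -> float:
    """ASSUMED mapping: baseline / measured perplexity, capped at 1.0."""
    if perplexity <= 0 or baseline_perplexity <= 0:
        raise ValueError("perplexity must be positive")
    return min(1.0, baseline_perplexity / perplexity)


def quality_adjusted_throughput(trial: Trial, baseline_perplexity: float) -> float:
    """tok/s x quality_score, as defined in the text."""
    return trial.tok_per_s * quality_score(trial.perplexity, baseline_perplexity)


def best_under_quality_floor(trials, baseline_perplexity, min_quality):
    """Fastest trial whose quality score meets the floor, or None if none does."""
    eligible = [
        t for t in trials
        if quality_score(t.perplexity, baseline_perplexity) >= min_quality
    ]
    return max(eligible, key=lambda t: t.tok_per_s, default=None)


# Illustrative numbers only; these are not AutoInfer measurements.
BASELINE_PPL = 6.0
trials = [
    Trial("fast-low-quality", tok_per_s=20.0, perplexity=12.0),
    Trial("balanced", tok_per_s=12.0, perplexity=6.25),
    Trial("slow-high-quality", tok_per_s=8.0, perplexity=6.0),
]

for t in trials:
    print(t.name, round(quality_adjusted_throughput(t, BASELINE_PPL), 2))

winner = best_under_quality_floor(trials, BASELINE_PPL, min_quality=0.95)
print("fastest with quality >= 0.95:", winner.name)
```

With these made-up values, the fastest trial does not have the highest quality-adjusted throughput, which is the situation the metric is meant to expose.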

## Full Process of Bayesian Optimization-Driven Automatic Parameter Search

The core of AutoInfer is a parameter search framework based on Bayesian optimization. The process has five steps:

1. Hardware Profiling: automatically detect GPU memory, RAM, CPU core count, and storage speed to establish a baseline.
2. Parameter Space Definition: cover the number of layers offloaded to the GPU, batch size, micro-batch size, CPU thread count, KV cache quantization type, whether Flash Attention is enabled, and so on, subject to hardware constraints.
3. Bayesian Optimization Search: use the Optuna TPE sampler for efficient exploration over 50+ trials.
4. Comprehensive Evaluation: measure speed and perplexity, with support for multiple backends.
5. Pareto Analysis: generate a quality-speed trade-off curve and select the optimal operating point from it.
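Steps 2 and 5 of this process can be sketched in a few lines. This is not AutoInfer's code: the `Config` fields, the candidate values, and the `max_gpu_layers` and `cpu_cores` inputs (which a profiling step would supply) are hypothetical, and plain random sampling stands in for the Optuna TPE sampler named in the text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    gpu_layers: int
    batch_size: int
    micro_batch_size: int
    threads: int
    kv_cache_type: str
    flash_attention: bool


def sample_config(rng: random.Random, max_gpu_layers: int, cpu_cores: int) -> Config:
    """Draw one configuration that respects the (hypothetical) hardware limits.

    Random sampling is a stand-in; the text describes a TPE sampler instead.
    """
    batch = rng.choice([128, 256, 512, 1024, 2048])
    # Constraint: the micro-batch may not exceed the batch.
    micro = rng.choice([b for b in (64, 128, 256, 512) if b <= batch])
    return Config(
        gpu_layers=rng.randint(0, max_gpu_layers),
        batch_size=batch,
        micro_batch_size=micro,
        threads=rng.randint(1, cpu_cores),
        kv_cache_type=rng.choice(["f16", "q8_0", "q4_0"]),  # illustrative values
        flash_attention=rng.choice([True, False]),
    )


def pareto_front(points):
    """Return the (tok_per_s, perplexity) points that no other point dominates.

    A point dominates another if it is at least as fast, has perplexity no
    higher, and is strictly better on at least one of the two.
    """
    front = []
    for i, (speed, ppl) in enumerate(points):
        dominated = any(
            s2 >= speed and p2 <= ppl and (s2 > speed or p2 < ppl)
            for j, (s2, p2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((speed, ppl))
    return sorted(front)


rng = random.Random(0)
configs = [sample_config(rng, max_gpu_layers=40, cpu_cores=8) for _ in range(5)]

# Made-up (tok/s, perplexity) pairs, standing in for real benchmark results.
measured = [(20.0, 9.0), (12.0, 6.5), (10.0, 7.0), (8.0, 6.4)]
print(pareto_front(measured))
```

Here the point `(10.0, 7.0)` drops out because `(12.0, 6.5)` is both faster and lower in perplexity; the remaining three points form the trade-off curve from which an operating point would be chosen.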

## Findings from 700+ Experiments: Quantization and Parameter Interactions

AutoInfer ran more than 700 experiments on the Qwen3.5-35B-A3B model across the Q3_K_M, IQ2_M, and IQ3_S quantization levels. They revealed several parameter interactions:

- Offloading more layers to the GPU usually improves speed, but performance drops as memory usage approaches the limit.
- Larger batches raise throughput but add latency.
- The effect of Flash Attention varies by configuration.

Bayesian optimization learns these non-linear relationships from the trial results, without manually preset rules.

## Guide to Using AutoInfer's Command-Line Tool

AutoInfer provides a command-line interface. A typical workflow has three steps:

1. Hardware Profiling: `autoinfer profile` prints a hardware summary; adding `--json --storage` produces a detailed report.
2. Optimization: `autoinfer optimize --model models/Qwen3.5-35B-A3B-Q3_K_M.gguf --bench ./target/release/bench --corpus benchmarks/wikitext_sample.txt --trials 50 --target-quality 0.95 --output results.tsv`
3. Analysis: `autoinfer analyze results_phase9.tsv results_phase10.tsv results_phase11.tsv` generates Pareto curves and configuration recommendations.
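When the optimization step has to be repeated for several model files, it may help to assemble the command programmatically. The sketch below uses only the flags shown in the workflow above; the helper name `build_optimize_cmd` and the output file naming scheme are assumptions for illustration, not part of AutoInfer.

```python
import shlex
from pathlib import Path


def build_optimize_cmd(model: Path, output: str, trials: int = 50,
                       target_quality: float = 0.95) -> list:
    """Assemble an `autoinfer optimize` argv from the flags quoted in the text."""
    return [
        "autoinfer", "optimize",
        "--model", model.as_posix(),
        "--bench", "./target/release/bench",
        "--corpus", "benchmarks/wikitext_sample.txt",
        "--trials", str(trials),
        "--target-quality", str(target_quality),
        "--output", output,
    ]


model = Path("models/Qwen3.5-35B-A3B-Q3_K_M.gguf")
# Hypothetical naming scheme: one results file per model file.
cmd = build_optimize_cmd(model, output=f"results_{model.stem}.tsv")
print(shlex.join(cmd))

# To run it for real (requires autoinfer on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when model paths contain spaces.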

## Multi-Scenario Application Value of AutoInfer

AutoInfer is useful in several scenarios:

- Individual users avoid manual parameter tuning and get the best available performance out of consumer GPUs.
- Enterprise deployments gain a reproducible optimization process, which reduces operational burden.
- Model developers can read deployment characteristics off the Pareto curves and use them to guide quantization strategy and architecture design.

## Conclusion: From Experience-Driven to Data-Driven Inference Optimization

AutoInfer represents a shift in LLM inference optimization from experience-driven to data-driven practice: configurations are found through systematic experiments and Bayesian optimization rather than hand tuning. Its quality-adjusted throughput metric corrects the bias toward raw speed and makes the speed-quality trade-off explicit. As LLMs evolve, tools of this kind are likely to become a standard part of deployment infrastructure and to make LLMs practical on a wider range of hardware.
