
AutoInfer: A Hardware-Adaptive Inference Optimization Framework for Large Language Models

Inference optimization for large language models is often reduced to chasing the highest token generation speed, ignoring the quality loss caused by quantization. AutoInfer introduces the concept of quality-adjusted throughput and uses Bayesian optimization to automatically find the optimal balance between speed and quality, enabling each GPU to deliver its best effective performance.

Tags: LLM inference optimization, Bayesian optimization, quantization, GPU acceleration, llama.cpp, model deployment, performance tuning, Pareto optimization
Published 2026-03-28 21:13 · Recent activity 2026-03-28 21:20 · Estimated read: 7 min

Section 01

AutoInfer: Core Guide to the Hardware-Adaptive LLM Inference Optimization Framework

AutoInfer is a hardware-adaptive inference optimization framework for large language models, designed to address the problem of overemphasizing token generation speed while ignoring quality loss in inference optimization. It introduces the quality-adjusted throughput (tok/s × quality_score) metric and uses Bayesian optimization to automatically find the optimal balance between speed and quality, allowing each GPU to maximize its performance.


Section 02

Myths of Inference Optimization: The Pitfall of Speed-First and the Dilemma of Manual Parameter Tuning

In real-world deployments of large language models, inference optimization often falls into the trap of fixating on token generation speed (tok/s) while neglecting output quality. For example, an IQ2_M quantized model running at 21.6 tok/s may, because of perplexity degradation, deliver worse effective performance than the Q3_K_M version at 12.3 tok/s. Manual parameter tuning, meanwhile, is not reproducible: the optimal configuration shifts with hardware model, quantization level, and driver version, forcing a tedious new search after every change.
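To make that comparison concrete, here is a minimal sketch of the quality-adjusted throughput arithmetic. The quality scores below are illustrative assumptions (AutoInfer derives them from measured perplexity), not published measurements:

```python
def quality_adjusted_throughput(tok_per_s: float, quality_score: float) -> float:
    """Quality-adjusted throughput = tok/s x quality_score."""
    return tok_per_s * quality_score

# Hypothetical quality scores for illustration only.
iq2_m  = quality_adjusted_throughput(21.6, 0.55)   # fast, but quality degraded
q3_k_m = quality_adjusted_throughput(12.3, 0.97)   # slower, near-lossless

print(f"IQ2_M:  {iq2_m:.2f}")    # 11.88
print(f"Q3_K_M: {q3_k_m:.2f}")   # 11.93 -- the slower quant wins
```

Under these assumed scores, the "slow" Q3_K_M configuration edges out the "fast" IQ2_M one once quality is priced in, which is exactly the bias the metric is designed to correct.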


Section 03

Quality-Adjusted Throughput: A New Optimization Metric for Balancing Speed and Quality

AutoInfer proposes quality-adjusted throughput as the optimization target, computed as tok/s × quality_score, which makes the speed-quality trade-off explicit. The quality score is derived from perplexity (lower perplexity indicates higher generation quality), and Pareto frontier analysis finds either the maximum throughput under a given quality threshold or the best quality achievable at a target speed.
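A minimal sketch of the Pareto frontier analysis described above; the configuration list and its quality values are hypothetical illustrations, not AutoInfer's measured results:

```python
def pareto_frontier(configs):
    """Keep configs not dominated in (speed, quality): a config is dominated
    if another is at least as fast AND at least as high quality, and strictly
    better in one of the two."""
    frontier = []
    for c in configs:
        dominated = any(
            o["tok_s"] >= c["tok_s"] and o["quality"] >= c["quality"]
            and (o["tok_s"] > c["tok_s"] or o["quality"] > c["quality"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier

def best_under_quality(configs, min_quality):
    """Maximum tok/s among configs meeting the quality threshold."""
    ok = [c for c in configs if c["quality"] >= min_quality]
    return max(ok, key=lambda c: c["tok_s"]) if ok else None

# Illustrative numbers; quality scores here are assumptions.
configs = [
    {"name": "IQ2_M",  "tok_s": 21.6, "quality": 0.55},
    {"name": "IQ3_S",  "tok_s": 16.0, "quality": 0.90},
    {"name": "Q3_K_M", "tok_s": 12.3, "quality": 0.97},
]
print([c["name"] for c in pareto_frontier(configs)])
print(best_under_quality(configs, 0.95)["name"])  # Q3_K_M
```

All three sample points sit on the frontier (each trades speed against quality), so the choice reduces to the threshold query: demanding quality ≥ 0.95 selects Q3_K_M.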


Section 04

Full Process of Bayesian Optimization-Driven Automatic Parameter Search

The core of AutoInfer is a parameter search framework built on Bayesian optimization. The process comprises five stages:

1. Hardware profiling: automatically detect GPU memory, system RAM, CPU core count, and storage speed to establish a baseline.
2. Parameter space definition: cover the number of GPU layers to offload, batch size, micro-batch size, CPU thread count, KV cache quantization type, Flash Attention on/off, and more, with hardware constraints applied.
3. Bayesian optimization search: use the Optuna TPE sampler to explore efficiently across 50+ trials.
4. Comprehensive evaluation: measure both speed and perplexity, with support for multiple backends.
5. Pareto analysis: generate the quality-speed trade-off curve and select the optimal operating point.
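The search stages above can be sketched as a loop over a constrained parameter space. AutoInfer itself uses Optuna's TPE sampler and a real benchmark harness; the sketch below substitutes plain random sampling and a stubbed objective so it stays self-contained, and every parameter name, range, and formula in it is an illustrative assumption:

```python
import random

# Illustrative parameter space; real bounds come from hardware profiling
# (e.g. the GPU-layer cap is derived from detected VRAM).
SPACE = {
    "n_gpu_layers":  range(0, 49),
    "batch_size":    [128, 256, 512, 1024],
    "n_threads":     range(1, 17),
    "kv_cache_type": ["f16", "q8_0", "q4_0"],
    "flash_attn":    [False, True],
}

def sample(rng):
    """Draw one configuration uniformly at random (stand-in for TPE)."""
    return {k: rng.choice(list(v)) for k, v in SPACE.items()}

def objective(cfg):
    """Stand-in for a real benchmark run: returns tok/s x quality_score.
    A real trial would launch the inference backend and measure both."""
    tok_s = 5 + 0.3 * cfg["n_gpu_layers"] + (2 if cfg["flash_attn"] else 0)
    quality = 0.97 if cfg["kv_cache_type"] == "f16" else 0.93
    return tok_s * quality

rng = random.Random(42)
trials = [sample(rng) for _ in range(50)]   # "50+ trials" from the process
best = max(trials, key=objective)
print(f"best QAT: {objective(best):.2f}")
```

Swapping the sampler for `optuna.samplers.TPESampler` is what lets the real framework exploit the non-linear parameter interactions instead of sampling blindly.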


Section 05

700+ Experiments Validate: Key Findings on Quantization and Parameter Interactions

AutoInfer ran more than 700 experiments on the Qwen3.5-35B-A3B model (covering the Q3_K_M, IQ2_M, and IQ3_S quantization levels), revealing key parameter interactions: offloading more GPU layers usually improves speed, but performance drops sharply as the memory limit is approached; larger batches raise throughput at the cost of latency; and the effect of Flash Attention varies by configuration. Bayesian optimization learns these non-linear relationships automatically, with no hand-written rules.


Section 06

Guide to Using AutoInfer's Command-Line Tool

AutoInfer provides a straightforward command-line interface. A typical workflow:

1. Hardware profiling: autoinfer profile prints a hardware summary; add --json --storage for a detailed report.
2. Optimization: autoinfer optimize --model models/Qwen3.5-35B-A3B-Q3_K_M.gguf --bench ./target/release/bench --corpus benchmarks/wikitext_sample.txt --trials 50 --target-quality 0.95 --output results.tsv
3. Analysis: autoinfer analyze results_phase9.tsv results_phase10.tsv results_phase11.tsv generates Pareto curves and configuration recommendations.


Section 07

Multi-Scenario Application Value of AutoInfer

AutoInfer serves multiple audiences: individual users are spared manual parameter tuning and can get the most out of consumer GPUs; enterprise deployments gain a reproducible optimization process that reduces operational burden; and model developers can read deployment characteristics off the Pareto curves to guide quantization strategy and architecture design.


Section 08

Conclusion: From Experience-Driven to Data-Driven Inference Optimization

AutoInfer represents the shift of LLM inference optimization from experience-driven to data-driven, automatically finding the optimal configuration through systematic experiments and Bayesian optimization. It introduces quality-adjusted throughput to correct speed bias and helps find the balance between speed and quality. As LLMs evolve, such tools will become a key part of infrastructure, promoting the widespread application of LLMs.