Real-World LLM Inference Testing on Consumer Hardware: Quantization Isn't Always Better

An open-source LLM inference cost benchmark for consumer hardware (CUDA/Apple Silicon/CPU) reveals counterintuitive results where quantization may backfire on Apple Silicon.

Tags: benchmark, llm, inference, quantization, apple-silicon, transformers, consumer-hardware

Published 2026-06-12 01:45 | Recent activity 2026-06-12 01:48 | Estimated read 8 min

Section 01

[Introduction] Real-World LLM Inference Testing on Consumer Hardware: Counterintuitive Quantization Results on Apple Silicon

The transformers-laptop-bench project by Valerio Maggio (GitHub: https://github.com/leriomaggio/transformers-laptop-bench) is an open-source benchmark of LLM inference cost on consumer hardware (CUDA, Apple Silicon, CPU). Its core finding runs against common intuition: on Apple Silicon, quantization did not improve performance in these tests. It sharply reduced inference speed and even increased peak memory usage. The benchmark measures time-to-first-token, total latency, throughput, and peak memory, with the aim of giving ordinary users real reference data for running LLMs locally.

Section 02

Background: Why Do We Need LLM Inference Benchmarks for Consumer Hardware?

As open-source LLMs mature, more developers want to run models locally, yet most benchmarks target data center hardware and offer little real, reproducible data for consumer laptops. This project aims to provide an honest, reproducible benchmark framework that shows the real cost of running open-source instruction-tuned models locally. It covers three backends: CUDA, Apple Silicon (MPS), and CPU. The metrics measured are time-to-first-token, total generation latency, throughput, and peak memory usage.

Section 03

Testing Methods and Core Metrics Explanation

Measured Metrics: Time-to-First-Token (TTFT, p50/p95), total generation latency (p50/p95), throughput (tokens/s), peak memory, and model loading time (recorded separately).
Test Design: Greedy decoding, a fixed number of output tokens, warm-up runs (excluded from the results), set random seeds, and statistics computed over multiple measured runs.
Memory Measurement Honesty: On CUDA, torch.cuda.max_memory_allocated is used, which counts only tensor memory in VRAM. On MPS and CPU, the process RSS is sampled with psutil, which also includes the interpreter, loaded libraries, and so on. Memory figures therefore cannot be compared directly across backends.
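The timing part of this design can be sketched as follows. This is a minimal illustration, not the project's code: `fake_stream` is a hypothetical stand-in for a real streaming generation call, the percentile uses the nearest-rank definition, and throughput is taken as total tokens over total wall time, which is only one possible definition.

```python
import math
import time
from typing import Callable, Iterator


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def time_one_run(stream_tokens: Callable[[], Iterator[str]]) -> tuple[float, float, int]:
    """Return (ttft_s, total_s, n_tokens) for one streamed generation."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_tokens():
        if ttft is None:
            ttft = time.perf_counter() - start  # time until the first token arrives
        count += 1
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total, count


def benchmark(stream_tokens: Callable[[], Iterator[str]],
              warmup: int = 2, runs: int = 5) -> dict[str, float]:
    for _ in range(warmup):  # warm-up runs are executed but not recorded
        time_one_run(stream_tokens)
    results = [time_one_run(stream_tokens) for _ in range(runs)]
    ttfts = [r[0] for r in results]
    totals = [r[1] for r in results]
    tokens = sum(r[2] for r in results)
    return {
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p95": percentile(ttfts, 95),
        "latency_p50": percentile(totals, 50),
        "latency_p95": percentile(totals, 95),
        "throughput_tok_s": tokens / sum(totals),
    }


def fake_stream(n_tokens: int = 8, delay_s: float = 0.001) -> Iterator[str]:
    """Hypothetical stand-in for a model that yields tokens one at a time."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


stats = benchmark(fake_stream)
print({name: round(value, 4) for name, value in stats.items()})
```

Reporting p50 and p95 rather than a mean keeps a single slow run from hiding in the average, which matters on laptops where thermal throttling and background processes add noise.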

Section 04

Counterintuitive Finding: Real-World Data of Quantization Backfiring on Apple Silicon

Test results for SmolLM2-1.7B-Instruct (128-token output) on Apple M3 Pro show:

Configuration	Time-to-First-Token (p50)	Throughput (tokens/s)	Peak Memory (MB)
bf16	0.063s	28.2	3302
int8	0.237s	4.6	3594
int4	0.893s	1.1	3706
bf16 is clearly the fastest configuration. Quantizing to int8 or int4 cuts throughput sharply (int8 is about 6x slower than bf16, int4 about 26x slower), and peak memory rises instead of falling (by roughly 9% for int8 and 12% for int4).
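The size of the gap is easy to recompute from the table. The short script below only restates the published numbers relative to the bf16 baseline; the dictionary keys are illustrative names, not fields from the project.

```python
# Values copied from the table above (Apple M3 Pro, 128-token output).
results = {
    "bf16": {"ttft_s": 0.063, "tok_s": 28.2, "peak_mb": 3302},
    "int8": {"ttft_s": 0.237, "tok_s": 4.6, "peak_mb": 3594},
    "int4": {"ttft_s": 0.893, "tok_s": 1.1, "peak_mb": 3706},
}
base = results["bf16"]


def relative_to_bf16(name: str) -> dict[str, float]:
    """Express one configuration relative to the bf16 baseline."""
    row = results[name]
    return {
        "slowdown": base["tok_s"] / row["tok_s"],      # throughput ratio
        "ttft_ratio": row["ttft_s"] / base["ttft_s"],  # first-token delay ratio
        "extra_mb": row["peak_mb"] - base["peak_mb"],  # additional peak memory
    }


for name in ("int8", "int4"):
    rel = relative_to_bf16(name)
    print(f"{name}: {rel['slowdown']:.1f}x slower throughput, "
          f"{rel['ttft_ratio']:.1f}x TTFT, +{rel['extra_mb']} MB peak memory")
# int8: 6.1x slower throughput, 3.8x TTFT, +292 MB peak memory
# int4: 25.6x slower throughput, 14.2x TTFT, +404 MB peak memory
```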

Section 05

Reason Analysis: Why Does Quantization Perform Poorly on Apple Silicon?

Reasons for poor quantization performance on Apple Silicon:

Lack of dedicated kernels: The quanto weight-only quantization scheme has no optimized kernels for the MPS backend, so every matrix multiplication must first dequantize the weights back to bf16.
Computational overhead: This on-the-fly dequantization adds work to every forward pass, and working memory stays at bf16 size, so quantization brings neither a speed nor a memory advantage.
Additional int4 burden: int4 relies on C++ extensions that run partially on the CPU, which slows inference down further.
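The dequantize-then-multiply path described above can be illustrated in plain Python. This is a conceptual sketch only and not quanto's implementation: it assumes simple symmetric per-row int8 quantization, and all function names are made up for the example. The point is that the multiply still runs in floating point on a full-size temporary, so the int8 storage adds a conversion step without removing any arithmetic.

```python
def quantize_row(row: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization of one weight row: w is approximately q * scale."""
    max_abs = max(abs(w) for w in row) or 1.0  # avoid a zero scale for all-zero rows
    scale = max_abs / 127
    return [round(w / scale) for w in row], scale


def dequantize_row(q: list[int], scale: float) -> list[float]:
    """Rebuild a floating-point row from its int8 values and scale."""
    return [v * scale for v in q]


def linear_weight_only(x: list[float], q_rows: list[list[int]],
                       scales: list[float]) -> list[float]:
    """Compute y = W x with W stored as int8 but multiplied in floating point."""
    out = []
    for q, scale in zip(q_rows, scales):
        w = dequantize_row(q, scale)  # full-precision temporary: the extra work
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out


weights = [[0.5, -1.0, 0.25], [1.27, 0.0, -0.635]]
x = [1.0, 2.0, 3.0]

quantized = [quantize_row(row) for row in weights]
q_rows = [q for q, _ in quantized]
scales = [s for _, s in quantized]

exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
approx = linear_weight_only(x, q_rows, scales)
print("exact: ", exact)
print("approx:", approx)
```

A backend with fused low-precision kernels can skip the temporary and multiply the packed weights directly, which is where quantization speedups usually come from.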

Section 06

Practical Recommendations and Benchmark Insights

Recommendations for Apple Silicon Users:

If the model fits in memory at bf16, prefer bf16.
Use quanto quantization only to run models that would not otherwise fit in memory, not to speed up models that already run.
Don't quantize blindly; measure on your own hardware first.

Benchmark Insights:

Cross-platform comparisons call for caution, because backend implementation details affect performance.
Memory is measured differently on each backend, so the same label can stand for different quantities.
The value of open-source benchmarks lies in reproducible real data, not leaderboard scores.
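To make the memory caveat concrete, here is a minimal sketch of sampled peak tracking, the style of measurement used for RSS. It is an illustration under stated assumptions rather than the project's code: `PeakSampler` and `fake_reader` are hypothetical names, and the reader is a stub so the example runs without extra dependencies.

```python
import threading
import time
from typing import Callable


class PeakSampler:
    """Poll a memory reader in a background thread and keep the maximum seen."""

    def __init__(self, read_bytes: Callable[[], int], interval_s: float = 0.01):
        self._read = read_bytes
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak = 0

    def _run(self) -> None:
        while not self._stop.is_set():
            self.peak = max(self.peak, self._read())
            self._stop.wait(self._interval)  # sleep, but wake early on stop

    def __enter__(self) -> "PeakSampler":
        self.peak = self._read()
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> bool:
        self._stop.set()
        self._thread.join()
        self.peak = max(self.peak, self._read())  # one final sample
        return False


# Stub reader standing in for a real process-memory query.
fake_usage = {"bytes": 100}


def fake_reader() -> int:
    return fake_usage["bytes"]


with PeakSampler(fake_reader, interval_s=0.001) as sampler:
    fake_usage["bytes"] = 500  # simulated spike while the workload runs
    time.sleep(0.05)
    fake_usage["bytes"] = 200
    time.sleep(0.01)

print(sampler.peak)
```

In a real run the reader could be `lambda: psutil.Process().memory_info().rss`, which reports the whole process. A sampler can miss spikes shorter than its polling interval, whereas `torch.cuda.max_memory_allocated()` returns an exact allocator-tracked peak for tensors only. The two numbers answer different questions, which is why they should not be placed side by side.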

Section 07

Project Technical Details: Supported Models and Runtime Environment

Supported Models: Default is HuggingFaceTB/SmolLM2-1.7B-Instruct; alternative is Qwen/Qwen2.5-1.5B-Instruct.
Runtime Environment: Python 3.13, PyTorch 2.12.0, Transformers 5.11.0, optimum-quanto 0.2.7.
Configuration Flexibility: Default parameters are configured via TOML files and can be overridden on the command line; available backends are detected automatically.

Section 08

Conclusion: The Value of Honest Measurement and Project Significance

The transformers-laptop-bench project not only provides a practical benchmark tool but also demonstrates the value of honest measurement in machine learning engineering:

Platform differences are critical; CUDA optimization strategies may not apply to other platforms.
Performance optimization needs to be based on actual data, not theoretical inference.
Transparent methodology is more important than pretty numbers.

This project provides a reliable starting point for developers running LLMs locally, helping them make data-driven hardware and configuration decisions.
