Zing Forum

Real-World LLM Inference Testing on Consumer Hardware: Quantization Isn't Always Better

An open-source LLM inference cost benchmark for consumer hardware (CUDA/Apple Silicon/CPU) reveals counterintuitive results where quantization may backfire on Apple Silicon.

Tags: benchmark, llm, inference, quantization, apple-silicon, transformers, consumer-hardware
Published 2026-06-12 01:45 | Recent activity 2026-06-12 01:48 | Estimated read 8 min

Section 01

[Introduction] Real-World LLM Inference Testing on Consumer Hardware: Counterintuitive Quantization Results on Apple Silicon

The transformers-laptop-bench project by Valerio Maggio (https://github.com/leriomaggio/transformers-laptop-bench) is an open-source benchmark of LLM inference cost on consumer hardware (CUDA, Apple Silicon, and CPU). Its core finding runs against common intuition: on Apple Silicon, the quantization scheme tested (quanto weight-only quantization) does not improve performance. It sharply reduces inference speed and even increases peak memory usage. The benchmark measures time-to-first-token, total latency, throughput, and peak memory, with the aim of giving ordinary users real data on running LLMs locally.

Section 02

Background: Why Do We Need LLM Inference Benchmarks for Consumer Hardware?

Open-source LLMs are developing rapidly and developers want to run them locally, but most benchmarks target data center hardware, so real, reproducible data for consumer laptops is scarce. This project provides an honest, reproducible benchmark framework that shows what it actually costs to run open-source instruction-tuned models locally on three backends: CUDA, Apple Silicon (MPS), and CPU. It measures time-to-first-token, total generation latency, throughput, and peak memory usage.

Section 03

Testing Methods and Core Metrics Explanation

Measured metrics: time-to-first-token (TTFT, p50/p95), total generation latency (p50/p95), throughput (tokens/s), and peak memory. Model loading time is recorded separately.

Test design: greedy decoding, a fixed number of output tokens, warm-up runs that are excluded from the results, fixed random seeds, and statistics computed over multiple measured runs.

Honest memory measurement: on CUDA, torch.cuda.max_memory_allocated is used, which counts only tensor memory on the GPU. On MPS and CPU, the process RSS is sampled with psutil, which also includes the interpreter, loaded libraries, and so on. Memory figures therefore cannot be compared directly across backends.
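The timing side of this design can be sketched in a few lines. This is a minimal illustration and not the project's actual code: `run_benchmark`, `percentile`, and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for a model's streaming generation call so the sketch runs without any model installed.

```python
import math
import time
from typing import Callable, Iterable


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def run_benchmark(
    stream_tokens: Callable[[], Iterable[int]],
    runs: int = 10,
    warmup: int = 2,
) -> dict[str, float]:
    """Time a token stream; warm-up runs are executed but not recorded."""
    for _ in range(warmup):
        for _ in stream_tokens():
            pass

    ttfts, totals, rates = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _ in stream_tokens():
            if first_token_at is None:
                # Time-to-first-token: delay until the first token arrives.
                first_token_at = time.perf_counter() - start
            n_tokens += 1
        total = time.perf_counter() - start
        if first_token_at is None:
            raise ValueError("stream produced no tokens")
        ttfts.append(first_token_at)
        totals.append(total)
        rates.append(n_tokens / total)

    return {
        "ttft_p50_s": percentile(ttfts, 50),
        "ttft_p95_s": percentile(ttfts, 95),
        "latency_p50_s": percentile(totals, 50),
        "latency_p95_s": percentile(totals, 95),
        "tokens_per_s_p50": percentile(rates, 50),
    }


def fake_stream(n_tokens: int = 16, delay_s: float = 0.001):
    """Hypothetical stand-in for a model's streaming generation call."""
    for token_id in range(n_tokens):
        time.sleep(delay_s)
        yield token_id


results = run_benchmark(fake_stream, runs=5, warmup=1)
print(results)
```

Reporting p50 and p95 rather than a mean keeps a single slow run from hiding in the average, which is presumably why the benchmark reports both.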

Section 04

Counterintuitive Finding: Real-World Data of Quantization Backfiring on Apple Silicon

Test results for SmolLM2-1.7B-Instruct (128-token output) on Apple M3 Pro show:

Configuration   TTFT p50 (s)   Throughput (tokens/s)   Peak memory (MB)
bf16            0.063          28.2                    3302
int8            0.237          4.6                     3594
int4            0.893          1.1                     3706

bf16 is the fastest configuration. Quantization causes a sharp drop in speed: int8 throughput is about 6x lower than bf16 and int4 throughput about 26x lower. Peak memory rises instead of falling.
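The size of the regression is easier to see as ratios against the bf16 baseline. The short script below only redoes the arithmetic on the figures reported above; `relative_to_baseline` is a hypothetical helper name.

```python
# Values copied from the reported results (SmolLM2-1.7B-Instruct, Apple M3 Pro).
results = {
    "bf16": {"ttft_s": 0.063, "tokens_per_s": 28.2, "peak_mb": 3302},
    "int8": {"ttft_s": 0.237, "tokens_per_s": 4.6, "peak_mb": 3594},
    "int4": {"ttft_s": 0.893, "tokens_per_s": 1.1, "peak_mb": 3706},
}


def relative_to_baseline(rows: dict, baseline: str = "bf16") -> dict:
    """Express each configuration relative to the baseline configuration."""
    base = rows[baseline]
    summary = {}
    for name, row in rows.items():
        summary[name] = {
            # How many times longer the first token takes.
            "ttft_ratio": row["ttft_s"] / base["ttft_s"],
            # How many times lower the throughput is.
            "slowdown": base["tokens_per_s"] / row["tokens_per_s"],
            # Change in peak memory, in percent of the baseline.
            "memory_change_pct": 100 * (row["peak_mb"] - base["peak_mb"]) / base["peak_mb"],
        }
    return summary


summary = relative_to_baseline(results)
for name, row in summary.items():
    print(
        f"{name}: TTFT x{row['ttft_ratio']:.1f}, "
        f"throughput /{row['slowdown']:.1f}, "
        f"memory {row['memory_change_pct']:+.1f}%"
    )
```

On these figures, int8 takes roughly 3.8x longer to the first token with about 8.8% more peak memory, and int4 roughly 14.2x longer with about 12.2% more.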

Section 05

Reason Analysis: Why Does Quantization Perform Poorly on Apple Silicon?

Three factors explain the result:

  1. Lack of dedicated kernels: the quanto weight-only quantization scheme has no optimized kernels for the MPS backend, so every matrix multiplication must first dequantize the weights back to bf16.
  2. Computational overhead: this on-the-fly dequantization costs time on each step, and the working memory stays at bf16 size, so there is neither a speed nor a memory advantage.
  3. Additional int4 burden: the int4 path relies on C++ extensions that run partially on the CPU, which slows inference further.
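The first two points can be illustrated with a toy example. The sketch below is not quanto's implementation: it is a pure-Python illustration of weight-only int8 quantization, with hypothetical function names, showing that without a kernel that multiplies the integer codes directly, every forward call has to rebuild full-precision weights before it can do the arithmetic.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric weight-only quantization: int8 codes plus one scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Rebuild approximate full-precision weights from the stored codes."""
    return [c * scale for c in codes]


def linear_without_int8_kernel(x: list[float], codes: list[int], scale: float) -> float:
    """Dot product when no kernel can consume the int8 codes directly."""
    # Extra work on every call: a temporary full-precision copy of the
    # weights is created, so the compute-time footprint is not reduced.
    weights = dequantize(codes, scale)
    return sum(xi * wi for xi, wi in zip(x, weights))


weights = [0.5, -1.0, 0.25, 0.75]
x = [1.0, 2.0, 3.0, 4.0]
codes, scale = quantize_int8(weights)

exact = sum(xi * wi for xi, wi in zip(x, weights))
approx = linear_without_int8_kernel(x, codes, scale)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The stored codes are smaller than the original weights, but the dequantized copy is recreated on every call. That matches the pattern described above: extra time per step and no reduction in working memory.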

Section 06

Practical Recommendations and Benchmark Insights

Recommendations for Apple Silicon users:

  • If the model fits in memory at bf16, prefer bf16.
  • Use quanto quantization only to run models that would otherwise not fit in memory, not to speed up models that already run.
  • Don't quantize blindly; measure on your own hardware.

Benchmark insights:

  • Cross-platform comparisons need caution, because backend implementation details affect performance.
  • Memory is measured differently on each backend, so similar-looking numbers can mean different things.
  • The value of open-source benchmarks lies in reproducible real data, not leaderboard scores.
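To make the point about memory measurement concrete, here is a minimal sketch of a sampled peak-memory reading of the kind used on MPS and CPU. `PeakSampler` is a hypothetical class, not the project's code, and the demo uses a fake reader so it runs without extra dependencies. In real use one would pass something like `lambda: psutil.Process().memory_info().rss`, whereas on CUDA the peak comes from the allocator via `torch.cuda.max_memory_allocated()`.

```python
import threading
import time
from typing import Callable


class PeakSampler:
    """Poll a memory reader in a background thread and keep the maximum."""

    def __init__(self, read_bytes: Callable[[], int], interval_s: float = 0.01):
        self._read_bytes = read_bytes
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_bytes = 0

    def _run(self) -> None:
        while not self._stop.is_set():
            self.peak_bytes = max(self.peak_bytes, self._read_bytes())
            # Sleep until the next sample, or wake early when stopping.
            self._stop.wait(self._interval_s)

    def __enter__(self) -> "PeakSampler":
        self.peak_bytes = self._read_bytes()
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()
        # Take one last reading so the final state is never missed.
        self.peak_bytes = max(self.peak_bytes, self._read_bytes())


# Fake reader for demonstration only; it stands in for a real RSS reading.
fake_usage = {"bytes": 100}

with PeakSampler(lambda: fake_usage["bytes"], interval_s=0.001) as sampler:
    fake_usage["bytes"] = 500  # simulated spike during "generation"
    time.sleep(0.1)
    fake_usage["bytes"] = 200

print(sampler.peak_bytes)
```

A sampled peak can miss spikes shorter than the polling interval and counts everything in the process, while an allocator-reported peak counts only tensors. This is another reason the figures from different backends are not interchangeable.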

Section 07

Project Technical Details: Supported Models and Runtime Environment

Supported models: the default is HuggingFaceTB/SmolLM2-1.7B-Instruct; the alternative is Qwen/Qwen2.5-1.5B-Instruct.

Runtime environment: Python 3.13, PyTorch 2.12.0, Transformers 5.11.0, optimum-quanto 0.2.7.

Configuration flexibility: default parameters are set in TOML files and can be overridden on the command line; available backends are detected automatically.

Section 08

Conclusion: The Value of Honest Measurement and Project Significance

The transformers-laptop-bench project not only provides a practical benchmark tool but also demonstrates the value of honest measurement in machine learning engineering:

  1. Platform differences are critical; CUDA optimization strategies may not apply to other platforms.
  2. Performance optimization needs to be based on actual data, not theoretical inference.
  3. Transparent methodology is more important than pretty numbers.

This project provides a reliable starting point for developers running LLMs locally, helping them make data-driven hardware and configuration decisions.