# Real-World LLM Inference Testing on Consumer Hardware: Quantization Isn't Always Better

> An open-source LLM inference cost benchmark for consumer hardware (CUDA/Apple Silicon/CPU) reveals counterintuitive results where quantization may backfire on Apple Silicon.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T17:45:52.000Z
- Last activity: 2026-06-11T17:48:09.921Z
- Popularity: 158.0
- Keywords: benchmark, llm, inference, quantization, apple-silicon, transformers, consumer-hardware
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-leriomaggio-transformers-laptop-bench
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-leriomaggio-transformers-laptop-bench

---

## [Introduction] Real-World LLM Inference Testing on Consumer Hardware: Counterintuitive Quantization Results on Apple Silicon

The transformers-laptop-bench project by Valerio Maggio (GitHub: https://github.com/leriomaggio/transformers-laptop-bench) is an open-source benchmark of LLM inference cost on consumer hardware (CUDA, Apple Silicon, CPU). Its core finding is that on Apple Silicon, quantization did not improve performance: it significantly reduced inference speed and even increased measured memory usage, contrary to common intuition. The tests cover time-to-first-token, total latency, throughput, and peak memory, and aim to give ordinary users real data for running LLMs locally.

## Background: Why Do We Need LLM Inference Benchmarks for Consumer Hardware?

With the rapid development of open-source LLMs, developers want to run models locally, but most benchmarks focus on data center hardware, lacking real, reproducible data for consumer laptops. This project aims to provide an honest, reproducible benchmark framework to help users understand the real costs of running open-source instruction-tuned models locally, covering three backends: CUDA, Apple Silicon (MPS), and CPU. Metrics measured include time-to-first-token, total generation latency, throughput, and peak memory usage.

## Testing Methods and Core Metrics Explanation

- **Measured Metrics**: time-to-first-token (TTFT, p50/p95), total generation latency (p50/p95), throughput (tokens/s), peak memory, and model loading time (recorded separately).
- **Test Design**: greedy decoding, a fixed number of output tokens, warm-up runs that are excluded from the results, fixed random seeds, and statistics computed over multiple measured runs.
- **Memory Measurement Honesty**: on CUDA, `torch.cuda.max_memory_allocated` is used, which counts only tensor VRAM; on MPS and CPU, RSS sampled with psutil is used, which also includes the interpreter, libraries, and so on. Memory figures therefore cannot be compared directly across backends.
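The timing side of this design can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual harness: `stream_tokens` is a hypothetical callable that yields one item per generated token, `fake_stream` stands in for a real model, the warm-up and run counts are arbitrary, and computing throughput as total tokens over total time is an assumption.

```python
import time


def percentile(values, q):
    """Linear-interpolation percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def time_one_run(stream_tokens, max_new_tokens):
    """Return (ttft_s, total_s, n_tokens) for a single generation run."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream_tokens(max_new_tokens):
        if ttft is None:
            # Time until the first token arrives.
            ttft = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    return ttft, total, n_tokens


def benchmark(stream_tokens, max_new_tokens=128, warmup=2, runs=10):
    """Discard warm-up runs, then summarise the measured runs."""
    for _ in range(warmup):
        time_one_run(stream_tokens, max_new_tokens)  # not recorded
    samples = [time_one_run(stream_tokens, max_new_tokens) for _ in range(runs)]
    ttfts = [s[0] for s in samples]
    totals = [s[1] for s in samples]
    n_tokens = sum(s[2] for s in samples)
    return {
        "ttft_p50_s": percentile(ttfts, 50),
        "ttft_p95_s": percentile(ttfts, 95),
        "total_p50_s": percentile(totals, 50),
        "total_p95_s": percentile(totals, 95),
        # Assumption: throughput = all generated tokens / all generation time.
        "throughput_tok_s": n_tokens / sum(totals),
    }


def fake_stream(max_new_tokens):
    """Hypothetical stand-in for a streaming generate call."""
    for i in range(max_new_tokens):
        time.sleep(0.001)  # pretend each token takes about 1 ms
        yield i


print(benchmark(fake_stream, max_new_tokens=16, warmup=1, runs=5))
```

With a real model, `fake_stream` would be replaced by a generator that wraps the model's streaming output with greedy decoding and a fixed token budget.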

## Counterintuitive Finding: Real-World Data of Quantization Backfiring on Apple Silicon

Test results for SmolLM2-1.7B-Instruct (128-token output) on an Apple M3 Pro:

| Configuration | Time-to-First-Token (p50, s) | Throughput (tokens/s) | Peak Memory (MB) |
|---------------|------------------------------|-----------------------|------------------|
| bf16          | 0.063                        | 28.2                  | 3302             |
| int8          | 0.237                        | 4.6                   | 3594             |
| int4          | 0.893                        | 1.1                   | 3706             |

bf16 is the fastest configuration. Quantization sharply reduces speed: int8 throughput is about 6x lower than bf16 and int4 about 26x lower, while time-to-first-token grows by roughly 3.8x and 14x respectively. Peak memory also rises instead of falling, by 292 MB for int8 and 404 MB for int4.
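The ratios behind that comparison follow directly from the table. The short script below recomputes them; the numbers are copied from the table, while the dictionary layout and function name are only illustrative.

```python
# Values copied from the M3 Pro results table.
RESULTS = {
    "bf16": {"ttft_s": 0.063, "tok_s": 28.2, "peak_mb": 3302},
    "int8": {"ttft_s": 0.237, "tok_s": 4.6, "peak_mb": 3594},
    "int4": {"ttft_s": 0.893, "tok_s": 1.1, "peak_mb": 3706},
}


def relative_to_baseline(results, baseline="bf16"):
    """Express each configuration relative to the baseline row."""
    base = results[baseline]
    return {
        name: {
            "ttft_x": row["ttft_s"] / base["ttft_s"],      # higher is worse
            "slowdown_x": base["tok_s"] / row["tok_s"],    # higher is worse
            "extra_mb": row["peak_mb"] - base["peak_mb"],  # positive is worse
        }
        for name, row in results.items()
    }


for name, rel in relative_to_baseline(RESULTS).items():
    print(
        f"{name}: TTFT {rel['ttft_x']:.1f}x, "
        f"throughput {rel['slowdown_x']:.1f}x slower, "
        f"memory {rel['extra_mb']:+d} MB"
    )
```

For int8 this gives a 6.1x throughput slowdown and +292 MB; for int4, 25.6x and +404 MB.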

## Reason Analysis: Why Does Quantization Perform Poorly on Apple Silicon?

Three factors explain the poor results:

1. **Lack of dedicated kernels**: The quanto weight-only quantization scheme has no optimized kernels for the MPS backend, so each matrix multiplication first dequantizes the weights back to bf16.
2. **Computational overhead**: Dequantizing at every step adds work to each matrix multiplication, and working memory stays at bf16 size, so quantization brings neither a speed nor a memory advantage.
3. **Additional int4 burden**: int4 relies on C++ extensions that run partially on the CPU, which slows inference further.
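The first two points can be illustrated with a toy example in pure Python. This is not quanto's code; it is a simplified sketch of symmetric int8 weight-only quantization, with all function names invented for illustration. It shows that, without a fused kernel, the quantized path does everything the float path does plus a dequantization pass and a full-size temporary.

```python
def quantize_row(row):
    """Symmetric int8 quantization of one weight row: returns (ints, scale)."""
    max_abs = max(abs(w) for w in row) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    return [round(w / scale) for w in row], scale


def dequantize_row(q_row, scale):
    """Rebuild approximate float weights from int8 values and their scale."""
    return [q * scale for q in q_row]


def matvec(weights, x):
    """Plain float matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def matvec_weight_only(q_weights, scales, x):
    """Fallback path with no fused kernel: dequantize first, then multiply."""
    # Extra pass over every weight, plus a temporary as large as the float weights.
    dense = [dequantize_row(q, s) for q, s in zip(q_weights, scales)]
    return matvec(dense, x)


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

quantized = [quantize_row(row) for row in weights]
q_weights = [q for q, _ in quantized]
scales = [s for _, s in quantized]

print("float path:    ", matvec(weights, x))
print("quantized path:", matvec_weight_only(q_weights, scales, x))
```

The two outputs agree closely, but the quantized path only saves storage at rest; the compute and the working set are at least as large as in the float path.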

## Practical Recommendations and Benchmark Insights

**Recommendations for Apple Silicon Users**:
- If the model fits in memory in bf16, prefer bf16.
- Use quanto quantization only to run models that would otherwise not fit in memory, not to speed up models that already run.
- Don't quantize blindly; measure on your own hardware first.
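As a rough aid for the first two recommendations, the sketch below estimates weight storage per precision and picks the highest precision that fits a memory budget. The bytes-per-parameter figures are the nominal storage sizes; the `headroom` factor and function names are assumptions made for illustration. The estimate covers stored weights only, and the measurements above show that actual peak memory on MPS can be higher, so it is a starting point for measurement rather than a substitute.

```python
# Nominal storage size per parameter for each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_bytes(n_params, precision):
    """Approximate bytes needed to store the weights alone."""
    return n_params * BYTES_PER_PARAM[precision]


def choose_precision(n_params, budget_bytes, headroom=1.2):
    """Return the highest precision whose weights fit, or None if none do.

    `headroom` is a hypothetical safety factor for activations, KV cache,
    and runtime overhead; tune it from real measurements.
    """
    for precision in ("bf16", "int8", "int4"):
        if estimate_weight_bytes(n_params, precision) * headroom <= budget_bytes:
            return precision
    return None


# A 1.7B-parameter model against an 8 GB budget: bf16 fits, so no quantization.
print(choose_precision(1.7e9, 8e9))
# A 7B-parameter model against a 12 GB budget: bf16 does not fit, int8 does.
print(choose_precision(7e9, 12e9))
```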

**Benchmark Insights**:
- Cross-platform comparisons need caution; backend implementation details affect performance.
- Memory is measured differently on each backend, so similar-looking numbers can reflect different things.
- The value of open-source benchmarks lies in reproducible real data, not leaderboard scores.
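To make the memory caveat concrete, here is a minimal sketch of a sampled-RSS peak tracker of the kind described for the MPS and CPU backends. It is an assumption about how such sampling could be done, not the project's implementation: the class name, the sampling interval, and the injectable `probe` are all illustrative. The default probe uses psutil, which is a third-party package and is imported only when that probe is called.

```python
import threading


def default_rss_probe():
    """Current process RSS in bytes, read via psutil (imported lazily)."""
    import psutil

    return psutil.Process().memory_info().rss


class PeakSampler:
    """Context manager that polls a memory probe in a background thread."""

    def __init__(self, probe=default_rss_probe, interval_s=0.01):
        self.probe = probe
        self.interval_s = interval_s
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.peak = max(self.peak, self.probe())
            self._stop.wait(self.interval_s)  # sleep, but wake early on stop

    def __enter__(self):
        self.peak = self.probe()
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        # One last reading so very short workloads are not missed entirely.
        self.peak = max(self.peak, self.probe())
        return False


# Usage (requires psutil):
#     with PeakSampler() as sampler:
#         run_generation()  # hypothetical workload
#     print(sampler.peak / 1e6, "MB")
```

Because sampling happens at intervals, short spikes between samples can be missed, and RSS includes the interpreter and loaded libraries. This is one reason such a figure is not comparable with `torch.cuda.max_memory_allocated`, which tracks tensor allocations only.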

## Project Technical Details: Supported Models and Runtime Environment

- **Supported Models**: Default is HuggingFaceTB/SmolLM2-1.7B-Instruct; alternative is Qwen/Qwen2.5-1.5B-Instruct.
- **Runtime Environment**: Python 3.13, PyTorch 2.12.0, Transformers 5.11.0, optimum-quanto 0.2.7.
- **Configuration Flexibility**: Default parameters are configured via TOML files and can be overridden on the command line; available backends are detected automatically.

## Conclusion: The Value of Honest Measurement and Project Significance

The transformers-laptop-bench project not only provides a practical benchmark tool but also demonstrates the value of honest measurement in machine learning engineering:

1. Platform differences are critical; optimization strategies that work on CUDA may not carry over to other platforms.
2. Performance optimization needs to be based on measured data, not theoretical reasoning.
3. Transparent methodology is more important than pretty numbers.

The project gives developers running LLMs locally a reliable starting point for data-driven hardware and configuration decisions.
