[Introduction] llm-bench: MLX vs GGUF Inference Performance Benchmark Framework for Apple Silicon
llm-bench is a benchmarking tool built specifically for Apple Silicon that systematically compares the inference performance of two model formats: MLX (Apple's native framework) and GGUF (the cross-platform format used by llama.cpp). It measures multiple dimensions, including prompt processing speed, generation speed, memory usage, and output quality, so that developers can make data-driven decisions between the two stacks. Tools like this reflect the growing maturity of the local AI ecosystem on Apple Silicon.
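The core of any such comparison is timing a generation call and deriving a tokens-per-second figure. The sketch below shows one minimal way to do this; `generate_fn` is a hypothetical stand-in for whichever backend is under test (for example, a wrapper around `mlx_lm` generation or llama.cpp's Python bindings), not part of llm-bench's actual API.

```python
import time


def benchmark_generation(generate_fn, prompt, max_tokens):
    """Time one generation call and derive throughput.

    generate_fn is a hypothetical callable standing in for a
    backend's generate API; it takes (prompt, max_tokens) and
    returns the list of generated tokens.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return {
        "tokens_generated": len(tokens),
        "elapsed_s": elapsed,
        # Guard against a zero-duration call on very fast clocks.
        "tokens_per_s": len(tokens) / elapsed if elapsed > 0 else 0.0,
    }


# Example with a dummy backend that just echoes placeholder tokens:
stats = benchmark_generation(lambda p, n: ["tok"] * n, "hello", 8)
print(stats["tokens_generated"], round(stats["tokens_per_s"]))
```

In a real run you would call this once per backend with an identical prompt and token budget, and repeat several times to average out warm-up and cache effects.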