
Apple Silicon Edge Device LLM Inference Optimization: A Comparative Study of CoreML Quantization Techniques

A systematic study on Apple Silicon edge devices comparing the impact of FP16, INT8, and INT4 quantization on Phi-4 Mini and Mistral 7B models, revealing unexpected memory overhead and precision trade-offs of quantization in edge inference.

Tags: LLM quantization, CoreML, Apple Silicon, edge inference, INT4, INT8, Phi-4, Mistral

Published 2026-06-14 06:44 | Recent activity 2026-06-14 06:56 | Estimated read 10 min

Section 01

Introduction

Research Overview

This study was conducted by Mohamed Mostafa Fawzi Ahmed of Cairo University and published on June 13, 2026 (GitHub project: llm-edge-coreml). It compares CoreML quantization techniques for LLM inference on Apple Silicon edge devices, focusing on how three precision schemes (FP16, INT8, INT4) affect the Phi-4 Mini (3.8B) and Mistral 7B models.

Key Takeaways

Counterintuitive Memory Behavior: Without the Neural Engine, the quantized Phi-4 Mini models (INT8/INT4) peak at about 51% more memory than FP16, because CoreML keeps both the compressed weights and a dequantized FP32 buffer (dual storage);
Quantization Benefits and Costs: INT4 cuts disk size by about 72% (to 28% of the FP16 size), but quantization lowers inference throughput by 16-23% (16% for INT8, 23% for INT4); the accuracy cost is minimal (Mistral7B INT4 scores only 0.5 percentage points below INT8 on MMLU);
Platform Limitations: Mistral7B cannot run full inference on macOS because the stateful KV-cache API is missing; Phi-4 Mini cannot complete the multi-token MMLU evaluation because of CoreML Python API limitations.

Section 02

Research Background and Motivation

As LLMs spread across application scenarios, deploying them efficiently on resource-constrained edge devices has become a key challenge. Apple Silicon is an important edge inference platform thanks to its unified memory architecture and Neural Engine. Quantization is usually assumed to reduce memory use and improve speed, but how it actually behaves on Apple Silicon has not been clearly established. That is the question this study addresses.

Section 03

Experimental Design

Test Environment

Device: MacBook Pro 14-inch (2021)
Chip: Apple M1 Pro
Memory: 16GB unified memory
System: macOS Tahoe 26

Test Models

Model	Parameter Count	Features
Phi-4 Mini	~3.8B	Microsoft open-source small model, suitable for edge deployment
Mistral7B	7B	High-performance open-source model, challenging benchmark for edge deployment

Quantization Schemes

The study compares three precisions: FP16 (baseline), INT8 (common compression), and INT4 (extreme compression).
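To make the precision levels concrete, here is a minimal sketch of symmetric linear quantization in plain Python. It is a generic textbook scheme with one shared scale per tensor. The source does not describe the exact algorithm used in the CoreML conversion, so the function names and sample weights below are illustrative assumptions, not the study's pipeline.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes using one shared scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -0.41, 0.07, -1.27, 0.33]

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: codes={codes} max abs error={max_err:.4f}")
```

Fewer bits mean fewer representable levels, so the INT4 round trip loses more detail than INT8. That is the accuracy cost the study measures with MMLU.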

Evaluation Metrics

Disk usage (MB), peak memory (GB), inference throughput (tok/s), MMLU accuracy (%).

Section 04

Key Findings

Quantization and Disk Space

Quantization shrinks the model files substantially:

Model	Quantization	Disk Size	Size Relative to FP16
Phi-4 Mini	FP16	7673 MB	100%
Phi-4 Mini	INT8	3840 MB	50%
Phi-4 Mini	INT4	2159 MB	28%
Mistral7B	FP16	13826 MB	100%
Mistral7B	INT8	6917 MB	50%
Mistral7B	INT4	3890 MB	28%
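The percentages in the table can be re-derived from the reported sizes. The sketch below computes each size relative to FP16, the matching reduction, and the ideal ratio implied by bit width alone (8/16 and 4/16). The disk sizes are copied from the table; the helper name `size_report` is illustrative.

```python
# Disk sizes in MB, copied from the table above.
DISK_MB = {
    "Phi-4 Mini": {"FP16": 7673, "INT8": 3840, "INT4": 2159},
    "Mistral7B": {"FP16": 13826, "INT8": 6917, "INT4": 3890},
}
BITS = {"FP16": 16, "INT8": 8, "INT4": 4}


def size_report(sizes):
    """Return {precision: (relative size, reduction, ideal bit-width ratio)}."""
    base = sizes["FP16"]
    report = {}
    for precision, mb in sizes.items():
        relative = mb / base
        ideal = BITS[precision] / BITS["FP16"]
        report[precision] = (relative, 1 - relative, ideal)
    return report


for model, sizes in DISK_MB.items():
    for precision, (relative, reduction, ideal) in size_report(sizes).items():
        print(f"{model} {precision}: {relative:.1%} of FP16, "
              f"{reduction:.1%} smaller, ideal {ideal:.0%}")
```

INT4 lands near 28% of the FP16 size rather than the ideal 25%. A gap of this kind usually comes from extra data stored beside the 4-bit codes, such as scale factors, but the source does not state the cause.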

Unexpected Memory Overhead

Quantization did not reduce peak memory; it increased it:

Model	Quantization	Peak Memory
Phi-4 Mini	FP16	16.26 GB
Phi-4 Mini	INT8	24.55 GB
Phi-4 Mini	INT4	24.60 GB

Reason: On the CPU/GPU path, CoreML has to keep both the compressed weights and a dequantized FP32 buffer in memory (dual storage).
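The "about 51%" figure follows directly from the peak-memory table. This short sketch computes the extra memory in GB and as a fraction of the FP16 baseline, using only the reported Phi-4 Mini numbers. The helper name is illustrative.

```python
# Peak memory in GB for Phi-4 Mini, copied from the table above.
PEAK_GB = {"FP16": 16.26, "INT8": 24.55, "INT4": 24.60}


def overhead_vs_fp16(peaks):
    """Return {precision: (extra GB, fractional increase)} relative to FP16."""
    base = peaks["FP16"]
    return {
        precision: (gb - base, (gb - base) / base)
        for precision, gb in peaks.items()
        if precision != "FP16"
    }


for precision, (extra_gb, fraction) in overhead_vs_fp16(PEAK_GB).items():
    print(f"{precision}: +{extra_gb:.2f} GB ({fraction:.0%} above FP16)")
```

INT8 and INT4 peak within 0.05 GB of each other even though their files differ by a factor of almost two. That fits the explanation that a dequantized buffer, not the compressed weights, dominates peak memory.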

Inference Speed

Phi-4 Mini speed:

Quantization	Speed (tok/s)
FP16	3.93
INT8	3.30
INT4	3.02

Relative to FP16, throughput drops by 16% with INT8 and 23% with INT4.
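To put the throughput numbers in user-facing terms, the sketch below derives the slowdown from the table and converts each rate into the time needed for a response. The 256-token response length is a hypothetical value chosen for illustration, not a figure from the study.

```python
# Throughput in tokens per second for Phi-4 Mini, copied from the table above.
TOK_PER_S = {"FP16": 3.93, "INT8": 3.30, "INT4": 3.02}


def slowdown(rates, baseline="FP16"):
    """Fractional throughput loss of each precision relative to the baseline."""
    base = rates[baseline]
    return {p: 1 - r / base for p, r in rates.items() if p != baseline}


def seconds_for(num_tokens, rate):
    """Wall-clock seconds to generate num_tokens at a steady rate."""
    return num_tokens / rate


RESPONSE_TOKENS = 256  # hypothetical response length

for precision, rate in TOK_PER_S.items():
    print(f"{precision}: {seconds_for(RESPONSE_TOKENS, rate):.1f} s "
          f"for {RESPONSE_TOKENS} tokens")
for precision, loss in slowdown(TOK_PER_S).items():
    print(f"{precision}: {loss:.0%} slower than FP16")
```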

Impact on Precision

Mistral7B MMLU accuracy:

Quantization	MMLU Accuracy
INT8	51.1%
INT4	50.6%

INT4 loses only 0.5 percentage points relative to INT8, a minimal accuracy cost.

Section 05

Platform Limitations and Findings

Mistral7B macOS Inference Limitation: According to the study, the MLModel.newState() API required for Apple's stateful KV-cache is available only on iOS, not macOS, so the 7B model cannot run full inference on macOS.
Phi-4 Mini MMLU Test Limitation: The study reports that the CoreML Python API exposes only a single-token inference interface and cannot support an MMLU evaluation that needs multi-token generation, so no MMLU results are reported for Phi-4 Mini.
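Why a KV-cache matters so much for a 7B model can be shown with a simplified cost model. The sketch below counts how many token positions pass through the model during generation, with and without a cache. It is a conceptual illustration under stated assumptions, not CoreML code, and the prompt and output lengths are hypothetical.

```python
def positions_processed(prompt_len, new_tokens, use_kv_cache):
    """Count token positions run through the model to emit new_tokens tokens."""
    if new_tokens <= 0:
        return 0
    if use_kv_cache:
        # The prompt is processed once; each later step feeds only
        # the most recently generated token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-runs the whole sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))


PROMPT_LEN = 512   # hypothetical prompt length
NEW_TOKENS = 128   # hypothetical number of generated tokens

cached = positions_processed(PROMPT_LEN, NEW_TOKENS, use_kv_cache=True)
uncached = positions_processed(PROMPT_LEN, NEW_TOKENS, use_kv_cache=False)
print(f"with cache: {cached} positions")
print(f"without cache: {uncached} positions ({uncached / cached:.0f}x more)")
```

In this model the uncached cost grows quadratically with output length, so the lack of a stateful cache makes multi-token generation with a large model impractical even when single-token inference works.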

Section 06

Practical Recommendations

1. Quantization Strategy Selection

Storage-constrained: INT4 cuts disk usage the most, which suits distribution and on-device storage;
Memory-constrained: Check whether the Neural Engine is available first; when only the CPU/GPU path is used, FP16 needs less peak memory;
Precision-sensitive: The accuracy gap between INT8 and INT4 is minimal, so INT4 offers the better trade-off given its smaller files.
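The three rules above can be written as a small decision helper. This is only a sketch that restates the article's recommendations. The function name, flags, and advice strings are illustrative assumptions, and the figures quoted in the strings come from the study's tables.

```python
def recommend(storage_constrained=False, memory_constrained=False,
              neural_engine_available=False, precision_sensitive=False):
    """Return a list of advice strings for the given deployment constraints."""
    advice = []
    if storage_constrained:
        advice.append("INT4: smallest files on disk (about 28% of FP16)")
    if memory_constrained:
        if neural_engine_available:
            advice.append("Measure peak memory on the Neural Engine path first")
        else:
            advice.append("FP16: quantized models peaked about 51% higher on CPU/GPU")
    if precision_sensitive:
        advice.append("INT4 or INT8: MMLU differed by 0.5 points on Mistral7B")
    return advice


# Example: a memory-limited device running on CPU/GPU only.
for line in recommend(memory_constrained=True, neural_engine_available=False):
    print(line)
```

Returning a list rather than a single answer keeps conflicting constraints visible. For example, a device that is both storage- and memory-limited gets both the INT4 and the FP16 note, and the final choice stays with the developer.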

2. Platform Adaptation

For models with 7B or more parameters, check KV-cache support on every target platform;
For performance-critical scenarios, native Swift applications are recommended because they get more complete API support.

3. Evaluation Methods

Edge benchmarks need to track disk size, memory, and speed together;
Distinguish what the API can actually do (single-token inference vs multi-token generation) before designing an evaluation.
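One way to keep the three measurements together is to store them in a single record per configuration. The sketch below times a token-generation callable with `time.perf_counter` and stores the result next to disk and memory figures measured elsewhere. The `BenchmarkResult` type, the `measure_throughput` helper, and the dummy step function are hypothetical stand-ins, not part of the study's code.

```python
import time
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    precision: str
    disk_mb: float
    peak_memory_gb: float
    tokens_per_second: float


def measure_throughput(step, num_tokens):
    """Call step() once per token and return tokens per second."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        step()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed


def dummy_step():
    # Stand-in for one single-token model call.
    return sum(range(1000))


result = BenchmarkResult(
    precision="FP16",
    disk_mb=7673,            # from the disk table
    peak_memory_gb=16.26,    # from the memory table
    tokens_per_second=measure_throughput(dummy_step, 50),
)
print(result)
```

Reporting all three fields in one record makes trade-offs like the one in this study, smaller on disk but larger in memory, hard to overlook.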

Section 07

Research Value and Significance

Correcting Quantization Misconceptions: The study challenges the assumption that quantization always reduces memory and shows why platform-specific behavior matters;
Verifying INT4 Practicality: INT4 costs very little accuracy on a 7B model, which gives edge deployments supporting data;
Documenting Platform Limitations: The study records the practical limitations of CoreML on macOS, helping developers set reasonable expectations;
Reproducibility: Complete code and data are provided so the results can be verified and extended.

Section 08

Conclusion

LLM edge deployment requires joint optimization of models, frameworks, and hardware. This study shows that the intuition "quantization = better" does not always hold on Apple Silicon. Developers need to weigh the following together:

Target hardware computing units (CPU/GPU/Neural Engine);
Storage and memory constraints;
Task precision sensitivity;
Platform API function completeness.

Only by weighing these factors together can the best deployment decision be made.
