Zing Forum

Apple Silicon Edge Device LLM Inference Optimization: A Comparative Study of CoreML Quantization Techniques

A systematic study on Apple Silicon edge devices comparing the impact of FP16, INT8, and INT4 quantization on Phi-4 Mini and Mistral 7B models, revealing unexpected memory overhead and precision trade-offs of quantization in edge inference.

Tags: LLM, Quantization, CoreML, Apple Silicon, Edge Inference, INT4, INT8, Phi-4, Mistral
Published 2026-06-14 06:44 | Recent activity 2026-06-14 06:56 | Estimated read 10 min

Section 01

[Main Post/Introduction] Apple Silicon Edge Device LLM Inference Optimization: A Comparative Study of CoreML Quantization Techniques

Research Overview

This study was conducted by Mohamed Mostafa Fawzi Ahmed of Cairo University and published on June 13, 2026 (GitHub project: llm-edge-coreml). It compares CoreML quantization techniques for LLM inference on Apple Silicon edge devices, measuring how three precision schemes (FP16, INT8, INT4) affect the Phi-4 Mini (3.8B) and Mistral 7B models.

Key Takeaways

  1. Counterintuitive Memory Phenomenon: Without the Neural Engine, quantized models (INT8/INT4) use about 51% more peak memory than FP16, because CoreML holds the weights twice (compressed weights plus a dequantized FP32 buffer);
  2. Quantization Benefits and Costs: INT4 shrinks the model on disk by 72% relative to FP16, but quantized inference is 16-23% slower on Phi-4 Mini (INT8: 16%, INT4: 23%), with minimal precision loss (Mistral 7B INT4 scores only 0.5 percentage points below INT8 on MMLU);
  3. Platform Limitations: The study reports that Mistral 7B cannot run full inference on macOS because the stateful KV-cache API is missing there, and that Phi-4 Mini cannot complete the multi-token MMLU evaluation because of CoreML Python API limitations.

Section 02

Research Background and Motivation

As LLMs spread into more scenarios, deploying them efficiently on resource-constrained edge devices has become a key challenge. Apple Silicon is an important platform for edge inference thanks to its unified memory architecture and Neural Engine. Quantization is usually assumed to reduce memory and improve speed, but how it actually behaves on Apple Silicon has not been clear. That question is the core issue of this study.


Section 03

Experimental Design

Test Environment

  • Device: MacBook Pro 14-inch (2021)
  • Chip: Apple M1 Pro
  • Memory: 16GB unified memory
  • System: macOS Tahoe 26

Test Models

Model | Parameter Count | Features
Phi-4 Mini | ~3.8B | Microsoft open-source small model, suitable for edge deployment
Mistral 7B | 7B | High-performance open-source model, challenging benchmark for edge deployment

Quantization Schemes

Compare three precisions: FP16 (baseline), INT8 (common compression), INT4 (extreme compression).
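
To make the difference between the integer schemes concrete, here is a minimal sketch of symmetric linear quantization in plain Python. This is an assumption for illustration only: the source does not say which CoreML quantization mode or granularity the study used, and the weight values and function names below are made up.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.40, -1.00, 0.26, 0.73, -0.12]   # made-up example values
err_int8 = max_abs_error(weights, 8)
err_int4 = max_abs_error(weights, 4)
print(f"INT8 max error: {err_int8:.4f}, INT4 max error: {err_int4:.4f}")
```

With fewer bits the grid of representable values is coarser, so the round-trip error grows. That is the precision cost the MMLU comparison later tries to measure.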

Evaluation Metrics

Disk usage (MB), peak memory (GB), inference throughput (tok/s), MMLU accuracy (%).
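
Two of these metrics can be collected with the standard library alone. The sketch below is not the study's harness: `fake_step` is a hypothetical stand-in for one decode step, and treating "peak memory" as the process's peak resident set size is an assumption.

```python
import resource
import sys
import time


def peak_rss_gb():
    """Peak resident set size of this process, in GB (10^9 bytes)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    rss_bytes = rss if sys.platform == "darwin" else rss * 1024
    return rss_bytes / 1e9


def tokens_per_second(step, n_tokens):
    """Call `step` once per token and return the measured throughput."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def fake_step():
    # Hypothetical stand-in for one decode step; a real run would call the model.
    time.sleep(0.001)


throughput = tokens_per_second(fake_step, n_tokens=20)
peak_gb = peak_rss_gb()
print(f"{throughput:.1f} tok/s, peak RSS {peak_gb:.3f} GB")
```

Measuring both in the same run matters here, because the findings show that disk size, peak memory, and speed do not move together.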


Section 04

Key Findings

Quantization and Disk Space

Quantization shrinks the model on disk substantially: INT8 halves the FP16 size, and INT4 cuts it by 72%.

Model | Quantization | Disk Size | Size Relative to FP16
Phi-4 Mini | FP16 | 7673 MB | 100%
Phi-4 Mini | INT8 | 3840 MB | 50%
Phi-4 Mini | INT4 | 2159 MB | 28%
Mistral 7B | FP16 | 13826 MB | 100%
Mistral 7B | INT8 | 6917 MB | 50%
Mistral 7B | INT4 | 3890 MB | 28%
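
The percentages follow directly from the reported disk sizes. The short sketch below recomputes them. The `effective_bits` helper rests on an assumption the source does not confirm: that the FP16 package is almost entirely 16-bit weights.

```python
# Disk sizes in MB, copied from the table above.
disk_mb = {
    "Phi-4 Mini": {"FP16": 7673, "INT8": 3840, "INT4": 2159},
    "Mistral 7B": {"FP16": 13826, "INT8": 6917, "INT4": 3890},
}


def relative_size(model, precision):
    """Size as a percentage of the same model's FP16 package."""
    sizes = disk_mb[model]
    return 100 * sizes[precision] / sizes["FP16"]


def reduction(model, precision):
    """Percentage of disk space saved relative to FP16."""
    return 100 - relative_size(model, precision)


def effective_bits(model, precision):
    """Rough bits per weight, ASSUMING the FP16 package is all 16-bit weights."""
    return 16 * relative_size(model, precision) / 100


for model in disk_mb:
    for precision in ("INT8", "INT4"):
        print(model, precision,
              f"{relative_size(model, precision):.0f}% of FP16,",
              f"~{effective_bits(model, precision):.1f} bits/weight")
```

Under that assumption, INT4 lands near 4.5 bits per weight rather than 4. One plausible reading is that the remainder is quantization metadata such as scales, though the source does not say.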

Unexpected Memory Overhead

Quantization did not reduce peak memory. It increased it by about 51%:

Model | Quantization | Peak Memory
Phi-4 Mini | FP16 | 16.26 GB
Phi-4 Mini | INT8 | 24.55 GB
Phi-4 Mini | INT4 | 24.60 GB

Reason: On the CPU/GPU path (without the Neural Engine), CoreML keeps both the compressed weights and a dequantized FP32 buffer in memory (dual storage).
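
The roughly 51% figure can be checked against the Phi-4 Mini measurements. A small sketch using only the numbers reported above:

```python
# Peak memory in GB for Phi-4 Mini, copied from the table above.
peak_gb = {"FP16": 16.26, "INT8": 24.55, "INT4": 24.60}


def overhead_vs_fp16(precision):
    """Return (extra GB, extra percent) relative to the FP16 baseline."""
    extra = peak_gb[precision] - peak_gb["FP16"]
    return extra, 100 * extra / peak_gb["FP16"]


for precision in ("INT8", "INT4"):
    extra_gb, extra_pct = overhead_vs_fp16(precision)
    print(f"{precision}: +{extra_gb:.2f} GB ({extra_pct:.0f}% above FP16)")
```

INT8 and INT4 land within 0.05 GB of each other even though their disk sizes differ by almost a factor of two. That fits the dual-storage explanation, in which the dequantized buffer, not the compressed weights, dominates peak memory.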

Inference Speed

Phi-4 Mini speed:

Quantization | Speed (tok/s)
FP16 | 3.93
INT8 | 3.30
INT4 | 3.02

Quantization reduces throughput by 16% (INT8) to 23% (INT4) relative to FP16.
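
The slowdown range comes straight from the throughput figures. The sketch below recomputes it and translates throughput into wall-clock time. The 256-token response length is an arbitrary example, not a value from the study.

```python
# Phi-4 Mini throughput in tokens per second, copied from the table above.
speed = {"FP16": 3.93, "INT8": 3.30, "INT4": 3.02}


def slowdown_pct(precision):
    """Throughput loss relative to FP16, in percent."""
    return 100 * (1 - speed[precision] / speed["FP16"])


def seconds_for(n_tokens, precision):
    """Time to generate n_tokens at the measured throughput."""
    return n_tokens / speed[precision]


n_tokens = 256  # arbitrary example length, not from the study
for precision in speed:
    print(f"{precision}: {slowdown_pct(precision):.0f}% slower than FP16, "
          f"{seconds_for(n_tokens, precision):.0f} s for {n_tokens} tokens")
```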

Impact on Precision

Mistral 7B MMLU accuracy:

Quantization | MMLU Accuracy
INT8 | 51.1%
INT4 | 50.6%

INT4 only loses 0.5 percentage points, with minimal precision cost.
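
A drop in percentage points is not the same as a relative drop, so it helps to compute both. A short sketch from the two reported scores:

```python
# Mistral 7B MMLU accuracy in percent, copied from the table above.
mmlu = {"INT8": 51.1, "INT4": 50.6}

# Absolute difference, in percentage points.
point_drop = mmlu["INT8"] - mmlu["INT4"]

# Relative difference, as a percentage of the INT8 score.
relative_drop = 100 * point_drop / mmlu["INT8"]

print(f"{point_drop:.1f} percentage points, {relative_drop:.2f}% relative")
```

No FP16 MMLU score is reported, so these numbers describe only the step from INT8 to INT4, not the total loss against full precision.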

Section 05

Platform Limitations and Findings

  1. Mistral 7B macOS Inference Limitation: According to the study, the MLModel.newState() API required for Apple's stateful KV-cache is available only on iOS, not macOS, so 7B models cannot run full inference on macOS.
  2. Phi-4 Mini MMLU Test Limitation: According to the study, the CoreML Python API exposes only a single-token inference interface and cannot support an MMLU evaluation that needs multi-token generation, so no MMLU data is reported for Phi-4 Mini.

Section 06

Practical Recommendations

1. Quantization Strategy Selection

  • Storage-constrained: INT4 cuts disk usage by 72%, which suits distribution and storage;
  • Memory-constrained: Check whether the Neural Engine will be used; FP16 needs less peak memory when only the CPU/GPU is used;
  • Precision-sensitive: The precision gap between INT8 and INT4 is minimal, so INT4 gives the better size-to-accuracy trade-off.
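
The three rules above can be written down as a small lookup. This is only a sketch of the recommendations as stated. The function name, the priority labels, and the "measure on device" outcome are hypothetical and do not come from the study.

```python
def recommend(priority, neural_engine_available=False):
    """Return a precision suggestion for one dominant constraint.

    priority: "storage", "memory", or "precision" (hypothetical labels).
    """
    if priority == "storage":
        # INT4 is about 28% of the FP16 size on disk.
        return "INT4"
    if priority == "memory":
        # On the CPU/GPU path, quantized models peaked about 51% above FP16.
        # With the Neural Engine the study gives no figures, so measure first.
        return "measure on device" if neural_engine_available else "FP16"
    if priority == "precision":
        # INT4 trailed INT8 by only 0.5 percentage points on MMLU.
        return "INT4"
    raise ValueError(f"unknown priority: {priority!r}")


print(recommend("storage"))
print(recommend("memory", neural_engine_available=False))
print(recommend("memory", neural_engine_available=True))
```

Real deployments usually face several constraints at once, so a lookup like this is a starting point for measurement rather than a substitute for it.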

2. Platform Adaptation

  • For 7B+ models, check KV-cache support on every target platform before committing to a deployment;
  • For performance-critical scenarios, build a native Swift application to get more complete API support.

3. Evaluation Methods

  • Edge benchmarks should measure disk usage, memory, and speed together, because the three do not move in the same direction;
  • Separate API limitations from model limitations, for example single-token inference versus multi-token generation.

Section 07

Research Value and Significance

  1. Challenges a Quantization Misconception: Shows that quantization does not necessarily reduce memory, and that platform-specific behavior matters;
  2. Verifies INT4 Practicality: INT4 loses little precision relative to INT8 on a 7B model, which provides data to support edge deployment;
  3. Documents Platform Limitations: Records the limitations of CoreML on macOS encountered in practice, helping developers set reasonable expectations;
  4. Reproducibility: Provides complete code and data for verification and extension.

Section 08

Conclusion

LLM edge deployment requires co-optimizing models, frameworks, and hardware. This study shows that the intuition "quantization = better" does not always hold on Apple Silicon. To reach a sound deployment decision, developers need to weigh all of the following:

  • Target hardware compute units (CPU/GPU/Neural Engine);
  • Storage and memory constraints;
  • Task precision sensitivity;
  • Platform API completeness.