Section 01
[Main Post/Introduction] Apple Silicon Edge Device LLM Inference Optimization: A Comparative Study of CoreML Quantization Techniques
Research Overview
This study was conducted by Mohamed Mostafa Fawzi Ahmed of Cairo University and published on June 13, 2026 (GitHub project: llm-edge-coreml). It compares CoreML quantization techniques for LLM inference on Apple Silicon edge devices, focusing on how three weight-precision schemes (FP16, INT8, INT4) affect the Phi-4 Mini (3.8B) and Mistral 7B models.
Key Takeaways
- Counterintuitive Memory Phenomenon: When the Neural Engine is not used, quantized models (INT8/INT4) consume about 51% more memory than FP16, because the weights are stored twice: once in compressed form and once in a dequantized FP32 buffer;
- Quantization Benefits and Costs: INT4 achieves a 72% compression rate, significantly reducing disk usage, but inference speed drops by 16-23%; accuracy loss is minimal (Mistral 7B INT4 drops only 0.5% relative to INT8);
- Platform Limitations: Mistral 7B cannot run inference on macOS because a KV-cache API is missing; Phi-4 Mini cannot complete the multi-token MMLU evaluation because of CoreML Python API limitations.
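To give a feel for where a compression figure near 72% can come from, here is a minimal back-of-the-envelope sketch. It is not taken from the study: it assumes the 72% is measured against FP16 weights and that INT4 weights carry one FP16 scale per block of 32 weights. The block size, the scale width, and the function names are all hypothetical choices made for illustration.

```python
from typing import Optional


def bits_per_weight(weight_bits: int,
                    block_size: Optional[int] = None,
                    scale_bits: int = 16) -> float:
    """Effective storage bits per weight.

    If block_size is given, one scale of scale_bits is assumed to be
    stored per block (an illustrative assumption, not the study's setup).
    """
    if block_size is None:
        return float(weight_bits)
    return weight_bits + scale_bits / block_size


def size_reduction_vs_fp16(weight_bits: int,
                           block_size: Optional[int] = None,
                           scale_bits: int = 16) -> float:
    """Fraction of on-disk weight size saved relative to 16-bit weights."""
    return 1.0 - bits_per_weight(weight_bits, block_size, scale_bits) / 16.0


def weight_size_gb(num_params: float, bits: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    phi4_mini_params = 3.8e9  # parameter count quoted in the overview

    # Hypothetical schemes: (label, weight bits, block size)
    for label, bits, block in [("FP16", 16, None),
                               ("INT8", 8, None),
                               ("INT4 (block=32)", 4, 32)]:
        eff = bits_per_weight(bits, block)
        saved = size_reduction_vs_fp16(bits, block)
        size = weight_size_gb(phi4_mini_params, eff)
        print(f"{label:16s} {eff:5.2f} bits/weight  "
              f"{saved:6.1%} smaller than FP16  ~{size:.2f} GB")
```

Under these assumptions, INT4 costs 4.5 bits per weight and comes out about 71.9% smaller than FP16, which is in the neighborhood of the reported 72%. Ideal 4-bit storage with no scale overhead would give 75%, so the gap is consistent with some per-block metadata, though the study's actual quantization configuration may differ.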