# Apple Silicon Edge Device LLM Inference Optimization: A Comparative Study of CoreML Quantization Techniques

> A systematic study on Apple Silicon edge devices comparing the impact of FP16, INT8, and INT4 quantization on Phi-4 Mini and Mistral 7B models, revealing unexpected memory overhead and precision trade-offs of quantization in edge inference.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-13T22:44:33.000Z
- Last activity: 2026-06-13T22:56:15.299Z
- Popularity: 161.8
- Keywords: LLM, quantization, CoreML, Apple Silicon, edge inference, INT4, INT8, Phi-4, Mistral
- Page link: https://www.zingnex.cn/en/forum/thread/apple-siliconllm-coreml
- Canonical: https://www.zingnex.cn/forum/thread/apple-siliconllm-coreml
- Markdown source: floors_fallback

---

## Introduction

### Research Overview
This study was conducted by Mohamed Mostafa Fawzi Ahmed from Cairo University and published on June 13, 2026 (GitHub project: [llm-edge-coreml](https://github.com/mohamedfawzidev/llm-edge-coreml)). It compares CoreML quantization techniques for LLM inference on Apple Silicon edge devices, focusing on how three precision schemes (FP16, INT8, INT4) affect the Phi-4 Mini (3.8B) and Mistral 7B models.

### Key Takeaways
1. **Counterintuitive Memory Phenomenon**: When the Neural Engine is not used, quantized models (INT8/INT4) use about 51% more peak memory than FP16, because CoreML holds both the compressed weights and a dequantized FP32 buffer (dual storage);
2. **Quantization Benefits and Costs**: INT4 cuts disk size by about 72% relative to FP16, but quantization lowers inference speed by 16-23% (measured on Phi-4 Mini). Precision loss is minimal: on MMLU, Mistral 7B INT4 scores only 0.5 percentage points below INT8;
3. **Platform Limitations**: Mistral 7B cannot run inference on macOS because the KV-cache API is missing; Phi-4 Mini cannot complete the multi-token MMLU evaluation because of CoreML Python API limitations.

## Research Background and Motivation

As LLMs spread into more scenarios, deploying them efficiently on resource-constrained edge devices has become a key challenge. Apple Silicon is an important platform for edge inference thanks to its unified memory architecture and Neural Engine. Quantization is usually assumed to reduce memory and improve speed, but how it actually performs on Apple Silicon is not yet clear. That question is the core issue of this study.

## Experimental Design

#### Test Environment
- Device: MacBook Pro 14-inch (2021)
- Chip: Apple M1 Pro
- Memory: 16GB unified memory
- System: macOS Tahoe 26

#### Test Models
| Model | Parameter Count | Features |
|-------|-----------------|----------|
| Phi-4 Mini | ~3.8B | Microsoft open-source small model, suitable for edge deployment |
| Mistral 7B | 7B | High-performance open-source model, challenging benchmark for edge deployment |

#### Quantization Schemes
The study compares three precisions: FP16 (baseline), INT8 (common compression), and INT4 (extreme compression).
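To make the difference between INT8 and INT4 concrete, here is a minimal sketch of symmetric per-tensor linear quantization in plain Python. It is an illustration only: the function names and sample weights are made up for this example, and it is not the quantizer CoreML or the study uses.

```python
def quantize_symmetric(weights, bits):
    """Quantize floats to signed integer codes with one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Illustrative weights, not taken from any model in the study.
weights = [0.82, -0.31, 0.05, -1.20, 0.47]
for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_error = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: codes={codes} max_error={max_error:.4f}")
```

Fewer bits mean a coarser grid of representable values, so the worst-case rounding error grows as the bit width shrinks. Production quantizers typically use finer granularity (per-channel or per-block scales) to limit that error.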

#### Evaluation Metrics
Disk usage (MB), peak memory (GB), inference throughput (tok/s), and MMLU accuracy (%).
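As a rough illustration of how a tok/s figure can be obtained, the sketch below times repeated calls to a single-token step and divides the token count by elapsed time. `measure_throughput` and `fake_generate_token` are hypothetical names for this example; the stand-in simply sleeps instead of running a model, and the study's actual harness may differ.

```python
import time


def measure_throughput(generate_token, num_tokens, warmup=2):
    """Return tokens per second for a callable that produces one token per call."""
    for _ in range(warmup):             # warm-up calls are excluded from timing
        generate_token()
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed


def fake_generate_token():
    """Hypothetical stand-in for a real single-token model call."""
    time.sleep(0.001)


print(f"{measure_throughput(fake_generate_token, num_tokens=20):.1f} tok/s")
```

Excluding warm-up calls matters on real hardware, where the first predictions often include model loading and compilation costs that would otherwise skew the average.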

## Key Findings

#### Quantization and Disk Space
Quantization shrinks the on-disk model substantially:

| Model | Quantization | Disk Size | Size vs. FP16 |
|-------|--------------|-----------|---------------|
| Phi-4 Mini | FP16 | 7673 MB | 100% |
| Phi-4 Mini | INT8 | 3840 MB | 50% |
| Phi-4 Mini | INT4 | 2159 MB | 28% |
| Mistral 7B | FP16 | 13826 MB | 100% |
| Mistral 7B | INT8 | 6917 MB | 50% |
| Mistral 7B | INT4 | 3890 MB | 28% |
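The percentages in the last column can be recomputed from the disk sizes. The short sketch below does that arithmetic and also prints the complementary reduction (about 72% for INT4); the helper name `size_vs_fp16` is introduced here for illustration.

```python
# Disk sizes in MB, copied from the table above.
disk_mb = {
    "Phi-4 Mini": {"FP16": 7673, "INT8": 3840, "INT4": 2159},
    "Mistral 7B": {"FP16": 13826, "INT8": 6917, "INT4": 3890},
}


def size_vs_fp16(sizes):
    """Return each variant's size as a percentage of the FP16 baseline."""
    baseline = sizes["FP16"]
    return {name: 100 * mb / baseline for name, mb in sizes.items()}


for model, sizes in disk_mb.items():
    for name, pct in size_vs_fp16(sizes).items():
        print(f"{model} {name}: {pct:.0f}% of FP16, {100 - pct:.0f}% smaller")
```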

#### Unexpected Memory Overhead
Quantization did not reduce peak memory; it increased it:

| Model | Quantization | Peak Memory |
|-------|--------------|-------------|
| Phi-4 Mini | FP16 | 16.26 GB |
| Phi-4 Mini | INT8 | 24.55 GB |
| Phi-4 Mini | INT4 | 24.60 GB |

Reason: on the CPU/GPU path, CoreML keeps both the compressed weights and a dequantized FP32 buffer in memory (dual storage).
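The dual-storage effect can be illustrated with simple byte accounting. The sketch below is a toy model that assumes every weight is held once in its stored precision and once more as a full FP32 copy. That assumption is a simplification introduced here, and the output is not meant to reproduce the measured peaks, which also include activations and runtime overhead.

```python
# Bytes needed to hold one weight at each precision.
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5, "FP32": 4.0}


def resident_weight_bytes(num_weights, precision, dequantized_to=None):
    """Bytes held in memory: stored weights plus an optional dequantized copy."""
    total = num_weights * BYTES_PER_WEIGHT[precision]
    if dequantized_to is not None:
        total += num_weights * BYTES_PER_WEIGHT[dequantized_to]
    return total


n = 1_000_000  # illustrative weight count, not a model from the study
fp16_only = resident_weight_bytes(n, "FP16")
int8_dual = resident_weight_bytes(n, "INT8", dequantized_to="FP32")
int4_dual = resident_weight_bytes(n, "INT4", dequantized_to="FP32")
print(f"FP16 only:   {fp16_only / 1e6:.1f} MB")
print(f"INT8 + FP32: {int8_dual / 1e6:.1f} MB")
print(f"INT4 + FP32: {int4_dual / 1e6:.1f} MB")
```

Under this assumption the FP32 copy dominates the total, which is consistent with the INT8 and INT4 peaks in the table being nearly identical: the size of the compressed weights matters little once a full-precision buffer is resident.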

#### Inference Speed
Phi-4 Mini throughput:

| Quantization | Speed (tok/s) |
|--------------|---------------|
| FP16 | 3.93 |
| INT8 | 3.30 |
| INT4 | 3.02 |

Relative to FP16, quantization reduces throughput by about 16% (INT8) to 23% (INT4).
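The 16-23% range follows directly from the measured speeds. The sketch below recomputes it and translates each rate into wall-clock time for a reply of 256 tokens; that reply length is an arbitrary value chosen for illustration, and a steady generation rate is assumed.

```python
speeds = {"FP16": 3.93, "INT8": 3.30, "INT4": 3.02}  # tok/s from the table above


def slowdown_pct(baseline, value):
    """Percentage drop in throughput relative to the baseline."""
    return 100 * (baseline - value) / baseline


def seconds_for(tokens, tok_per_s):
    """Wall-clock time to generate a given number of tokens at a steady rate."""
    return tokens / tok_per_s


for name, tps in speeds.items():
    drop = slowdown_pct(speeds["FP16"], tps)
    print(f"{name}: {drop:.1f}% slower, {seconds_for(256, tps):.0f} s for 256 tokens")
```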

#### Impact on Precision
Mistral 7B MMLU accuracy:

| Quantization | MMLU Accuracy |
|--------------|---------------|
| INT8 | 51.1% |
| INT4 | 50.6% |

INT4 loses only 0.5 percentage points relative to INT8, a minimal precision cost.

## Platform Limitations and Findings

1. **Mistral 7B macOS Inference Limitation**: According to the study, the `MLModel.newState()` API that Apple's stateful KV-cache requires is available only on iOS, not macOS, so the 7B model cannot perform full inference on macOS.
2. **Phi-4 Mini MMLU Test Limitation**: The study reports that the CoreML Python API exposes only a single-token inference interface and cannot support MMLU evaluation with multi-token generation, so no MMLU data is reported for Phi-4 Mini.
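For readers unfamiliar with the distinction, multi-token generation is conceptually a loop around a single-token step. The sketch below shows greedy decoding in plain Python with a hypothetical stand-in model (`toy_logits`). It is not the study's code and makes no claim about whether the exported CoreML models can be driven this way.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Generate tokens one at a time by repeatedly calling a single-step model."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)      # one forward pass per new token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break                            # stop once end-of-sequence is produced
    return ids[len(prompt_ids):]


def toy_logits(ids):
    """Hypothetical stand-in model: always favors (last token + 1) mod 5."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_generate(toy_logits, prompt_ids=[0], max_new_tokens=6, eos_id=4))
```

Without a KV-cache, each step in a loop like this has to reprocess the whole sequence so far, which is why stateful cache support matters for larger models.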

## Practical Recommendations

#### 1. Quantization Strategy Selection
- Storage-constrained: INT4 significantly reduces disk usage, which suits distribution and storage;
- Memory-constrained: check whether the Neural Engine is available; FP16 is the better choice when only the CPU/GPU is used;
- Precision-sensitive: the precision gap between INT8 and INT4 is minimal, so INT4 gives more compression for nearly the same accuracy.

#### 2. Platform Adaptation
- For 7B+ models, check KV-cache support on each target platform;
- For performance-critical scenarios, a native Swift application is recommended because it has more complete API support.

#### 3. Evaluation Methods
- Edge benchmarks should measure disk, memory, and speed together;
- Distinguish API capability limits (single-token vs. multi-token generation) from model limitations.

## Research Value and Significance

1. **Reveals Quantization Misconceptions**: It challenges the belief that "quantization necessarily reduces memory" and highlights the importance of platform-specific behavior;
2. **Verifies INT4 Practicality**: INT4 shows minimal precision loss on the 7B model, providing data to support edge deployment;
3. **Documents Platform Limitations**: It records the practical limitations of CoreML on macOS, helping developers set reasonable expectations;
4. **Reproducibility**: It provides complete code and data for verification and extension.

## Conclusion

LLM edge deployment requires jointly optimizing models, frameworks, and hardware. This study shows that the intuition "quantization = better" does not always hold on Apple Silicon. Developers need to weigh:
- Target hardware compute units (CPU/GPU/Neural Engine);
- Storage and memory constraints;
- Task precision sensitivity;
- Completeness of the platform APIs.

Only by weighing these factors together can developers make the best deployment decision.
