Zing Forum

Fiber-Inference: A Systematic Evaluation Study on Large Model Inference Performance of Apple M4 Chip

The Fiber-Inference project conducted a comprehensive evaluation of the five computing units of the Apple M4 chip, revealing performance differences between backends like ANE, AMX, and GPU in LLM inference, providing important references for edge AI deployment.

Apple Silicon · M4 chip · edge inference · LLM inference optimization · ANE · MLX · AMX · performance evaluation · mobile AI
Published 2026-04-05 21:09 · Last activity 2026-04-05 21:19 · Estimated read: 7 min

Section 01

[Introduction] Fiber-Inference: Core Summary of Systematic Evaluation on Large Model Inference Performance of Apple M4 Chip

The Fiber-Inference project systematically evaluated the five compute backends of the Apple M4 chip (CPU, GPU, ANE, AMX, and an MLX-optimized implementation) to address the hardware selection dilemma for edge large model inference. Across more than 200 measurements, the study revealed key findings: ANE reaches a throughput of 21,490 tokens/sec in the prefill phase; AMX is 1.8x faster than the GPU; and the MLX framework delivers a 2.2x speedup. These results provide important references for edge AI deployment.


Section 02

Research Background: Hardware Selection Dilemma for Edge Large Model Inference

With the popularization of LLM technology, the demand for efficient edge model operation is growing. Apple Silicon has become a popular choice due to its unified memory architecture and ANE, but developers face the challenge of choosing among multiple computing units (CPU, GPU, ANE, AMX) of the M4 chip. The Fiber-Inference project provides a data-driven answer to this problem through systematic performance evaluation.


Section 03

Research Methodology: Rigorous Hardware Evaluation Framework

The study uses a rigorous evaluation framework:

  • Compute Backends: CPU (high-performance cores), GPU, ANE, AMX, and an MLX-optimized implementation
  • Test Scenarios: Separate tests for prefill and decoding phases
  • Model Scale: 1B to 70B parameters
  • Data Scale: Over 200 sets of independent measurements

The study does not rely on a single metric, ensuring the comprehensiveness of the results.
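The prefill/decode measurement split described above can be reproduced with a simple timing harness. The sketch below is illustrative only; `fake_prefill` is a hypothetical stand-in for a real inference pass, not Fiber-Inference code:

```python
import time

def measure_throughput(step_fn, num_tokens, warmup=1, repeats=3):
    """Return tokens/sec for a callable that processes `num_tokens` tokens.

    step_fn: zero-arg callable standing in for one prefill or decode pass.
    """
    for _ in range(warmup):          # discard cold-start runs
        step_fn()
    times = []
    for _ in range(repeats):         # independent measurements, as in the study
        t0 = time.perf_counter()
        step_fn()
        times.append(time.perf_counter() - t0)
    best = min(times)                # least-noise estimate
    return num_tokens / best

# Hypothetical stand-in for a prefill pass over a 512-token prompt.
def fake_prefill():
    sum(i * i for i in range(200_000))

if __name__ == "__main__":
    tps = measure_throughput(fake_prefill, num_tokens=512)
    print(f"prefill throughput: {tps:.0f} tokens/sec")
```

Taking the minimum over several repeats, rather than the mean, is a common way to filter out scheduler noise when benchmarking.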

Section 04

Key Findings: Performance Differences and Application Scenarios of Five Computing Units

Key Findings Summary

  • ANE excels in the prefill phase: 21,490 tokens/sec
  • AMX is 1.8x faster than the GPU
  • The MLX framework achieves a 2.2x speedup

Characteristics of Each Computing Unit

  • CPU: Versatile and flexible, high precision, but limited parallelism
  • GPU: Strong parallel computing, mature ecosystem, but higher power consumption
  • ANE: High energy efficiency ratio, outstanding prefill performance, closed programming model
  • AMX: Easy to use, excellent performance, better energy efficiency ratio than GPU
  • MLX: Unified memory management, operator fusion optimization, hardware-aware scheduling

These characteristics determine the application scenarios of different units.
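These tradeoffs can be condensed into a small selection helper. The mapping below is a hypothetical summary of the article's characterizations, not an API from the project:

```python
# Hypothetical backend chooser reflecting the article's findings;
# names and priorities are illustrative.
def pick_backend(phase: str) -> str:
    """Suggest a compute backend for an inference phase, per the study."""
    if phase == "prefill":
        return "ANE"   # best prefill throughput in the evaluation
    if phase == "decode":
        return "AMX"   # strong decode performance and energy efficiency
    return "GPU"       # safe default for unusual workloads

for phase in ("prefill", "decode", "other"):
    print(f"{phase}: {pick_backend(phase)}")
```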


Section 05

Performance Analysis: Hardware Performance Differences Between Prefill and Decoding Phases

LLM inference is divided into prefill and decoding phases, with significant differences in hardware requirements:

Prefill Phase (Compute-Intensive)

  • Processes the complete input sequence; computation is heavy and highly parallel
  • ANE performs best here, benefiting from high memory bandwidth and parallel capability

Decoding Phase (Memory Bandwidth-Intensive)

  • Tokens are generated one at a time; each step must re-read the model weights, so memory bandwidth dominates
  • Performance gaps between units narrow, and quantization techniques can accelerate this phase

The characteristics of the two phases affect hardware selection strategies.
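Why decode is bandwidth-bound follows from back-of-envelope arithmetic: each generated token must stream all model weights from memory, so throughput is capped at roughly bandwidth divided by model size. The numbers below (a 7B model in FP16, ~120 GB/s) are illustrative assumptions, not figures from the study:

```python
def decode_ceiling_tokens_per_sec(model_gb, bandwidth_gb_s):
    """Upper bound on decode throughput when each token streams all weights."""
    return bandwidth_gb_s / model_gb

# Illustrative assumptions: a 7B-parameter model in FP16 (~14 GB of weights)
# on a memory system with ~120 GB/s of bandwidth.
ceiling = decode_ceiling_tokens_per_sec(14, 120)
print(f"decode ceiling: {ceiling:.1f} tokens/sec")  # ~8.6 tokens/sec
```

This is why decode differences between backends narrow: once every unit can saturate memory bandwidth, raw compute stops being the bottleneck.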


Section 06

Practical Insights: Guide to Computing Backend Selection for Edge LLM Deployment

Based on the research results, recommendations for edge LLM deployment:

Scenario 1: Ultimate Performance

  • Use ANE for prefill, AMX/MLX optimization for decoding
  • Combine with INT4/INT8 quantization to reduce bandwidth pressure
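The bandwidth relief from quantization is easy to estimate: weight bytes shrink roughly in proportion to bit width, and with them the memory traffic per decoded token. The sketch below is illustrative arithmetic (it ignores quantization scales and metadata), not the project's tooling:

```python
BITS = {"FP16": 16, "INT8": 8, "INT4": 4}

def weight_gigabytes(params_billion, fmt):
    """Approximate weight footprint in GB, ignoring quantization metadata."""
    return params_billion * BITS[fmt] / 8

# Hypothetical 7B-parameter model: bytes streamed per decoded token.
for fmt, bits in BITS.items():
    gb = weight_gigabytes(7, fmt)
    saving = BITS["FP16"] / bits
    print(f"{fmt}: {gb:4.1f} GB read per token ({saving:.0f}x vs FP16)")
```

Since decode throughput is bandwidth-bound, a 4x reduction in weight traffic (INT4) translates into roughly a 4x higher decode ceiling, before accounting for dequantization overhead.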

Scenario 2: Development Efficiency Priority

  • First choice: MLX (official framework, API-friendly)
  • Alternative: PyTorch Metal (low migration cost)

Scenario 3: Specific Model Architecture

  • Fall back to CPU/GPU when non-standard operators or dynamic shapes are involved

Selection should be based on specific needs.


Section 07

Summary and Outlook: Research Limitations and Future Directions

Core Summary

  1. No silver bullet: Different computing units have their own advantages; choose based on needs
  2. Great potential for software optimization: MLX's 2.2x speedup proves the value of framework optimization
  3. ANE's potential is underestimated: Outstanding prefill performance

Research Limitations

  • Targets only the M4 chip; conclusions may not transfer to other hardware
  • Limited test models, not covering all LLM architectures
  • Dependent on specific software versions

Future Directions

  • Multimodal model evaluation
  • Long context scenario analysis
  • Research on the impact of mixed precision

The project's paper and dataset have been open-sourced, providing a foundation for community research.