Zing Forum

Merlin: A Highly Efficient Small Language Model Built from Scratch for Apple Silicon

Merlin is a highly efficient small language model built from scratch specifically for Apple Silicon devices (MacBook Pro and iPhone). It uses PyTorch for training, MLX for inference, and custom Metal kernels. With int4 quantization and KV caching, it reaches an inference speed of 625 TPS at a peak memory usage of only 188 MB, fitting comfortably within the roughly 4GB memory budget of an iPhone.

Tags: Apple Silicon · on-device inference · small language models · MLX · Metal kernels · int4 quantization · KV cache · iPhone AI · PyTorch · open-source LLM
Published 2026-04-09 17:07 · Recent activity 2026-04-09 17:18 · Estimated read: 10 min

Section 01

Introduction / Main Post: Merlin: A Highly Efficient Small Language Model Built from Scratch for Apple Silicon

Merlin is a highly efficient small language model built from scratch specifically for Apple Silicon devices (MacBook Pro and iPhone). It uses PyTorch for training, MLX for inference, and custom Metal kernels. With int4 quantization and KV caching, it reaches an inference speed of 625 TPS at a peak memory usage of only 188 MB, fitting comfortably within the roughly 4GB memory budget of an iPhone.

Section 02

Project Background and Motivation

As large language models (LLMs) thrive in the cloud, demand for on-device AI inference is growing rapidly. Deploying LLMs on consumer devices, however, faces severe challenges: memory limits, compute bottlenecks, and power constraints. On mobile devices like the iPhone in particular, achieving smooth AI inference within a roughly 4GB memory budget is a formidable engineering challenge.

The Merlin project was born to tackle this problem. It is a small language model built from scratch and optimized specifically for the Apple Silicon ecosystem (MacBook Pro and iPhone). It runs inference entirely on-device, is both educational and practically usable, and is fully open source.

Section 03

Core Objectives and Design Philosophy

The core objectives of the Merlin project are clear and focused:

  1. Maximize inference TPS (tokens per second) on Apple Silicon: through deep optimization, fully exploit the compute of M-series and A-series chips.
  2. Minimize memory usage: keep the model running smoothly on resource-constrained devices via quantization and careful memory management.
  3. Train from scratch on real data: no reliance on pre-trained weights; fully independent training for model transparency and controllability.
  4. Custom Metal kernels: no dependence on default framework implementations; handwritten high-performance Metal kernels squeeze out every bit of hardware performance.
  5. iPhone as a first-class target: int4 quantization keeps the model within an approximately 4GB memory budget.

This design philosophy reflects a deep understanding of on-device AI: instead of pursuing large and comprehensive parameter counts, it focuses on achieving extreme efficiency and practicality under limited resource constraints.
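The arithmetic behind objective 5 can be sanity-checked in a few lines. The parameter count below is the "iphone" configuration quoted later in this post; the helper name is ours, and the formula ignores quantization scale/zero-point overhead:

```python
# Back-of-envelope weight-memory budget (a sketch, not Merlin's code).
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring scales/zero-points."""
    return params * bits_per_weight / 8 / 1024**3

iphone_params = 3.17e9  # "iphone" configuration from the architecture table
print(f"fp32: {weight_memory_gb(iphone_params, 32):.1f} GB")  # far over budget
print(f"fp16: {weight_memory_gb(iphone_params, 16):.1f} GB")  # still over
print(f"int4: {weight_memory_gb(iphone_params, 4):.1f} GB")   # fits in 4 GB
```

At 4 bits per weight, even the 3.17B-parameter configuration stores its weights in well under 4 GB, which is why int4 is the iPhone configuration.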

Section 04

Performance Benchmarks: Impressive Measured Data

Merlin has shown impressive performance in benchmarks. Consider the base model (~117 million parameters) benchmarked on an M4 MacBook Pro:

| Configuration | TPS (tokens/s) | Peak Memory |
| --- | --- | --- |
| fp32, no KV cache | 38.7 | 1536 MB |
| fp32 + KV cache | 242.6 | 802 MB |
| int4 + KV cache | 625.3 | 188 MB |

The most striking row is int4 quantization with KV cache: 625.3 TPS at just 188 MB of memory. The model therefore fits comfortably within the roughly 4GB memory budget of an iPhone while responding fast enough for real-time interactive applications.
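The jump from 38.7 to 242.6 TPS comes from the KV cache: without it, every decode step recomputes keys and values for the whole prefix; with it, each step appends one row and reuses the rest. A minimal single-head NumPy decode loop illustrates the idea (an illustration of the technique, not Merlin's inference code):

```python
import numpy as np

def attend(q, K, V):
    """q: (d,); K, V: (t, d). Softmax attention over the cached prefix."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
rng = np.random.default_rng(0)
for _ in range(4):                        # decode loop: one token per step
    k, v, q = rng.normal(size=(3, d))     # this step's key, value, query
    K_cache = np.vstack([K_cache, k])     # append one row instead of
    V_cache = np.vstack([V_cache, v])     # recomputing the whole prefix
    out = attend(q, K_cache, V_cache)     # only the new query attends
```

Per-token cost drops from recomputing O(n) keys and values each step to a single append plus one attention pass over the cache, which is exactly the regime the table above measures.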

Behind this achievement is the project team's deep engineering investment in quantization algorithms, memory management, and kernel optimization.
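A TPS measurement of the kind bench.py reports can be sketched generically. Here `generate` is a stand-in for the model's single-step decode, and the function name is ours, not the project's API:

```python
import time

def measure_tps(generate, n_tokens: int = 256) -> float:
    """Time a token-by-token decode callable and report tokens per second.

    `generate` is any zero-argument callable producing one token per call;
    a real harness would also warm up caches before timing.
    """
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        generate()
    dt = time.perf_counter() - t0
    return n_tokens / dt

# Usage with a trivial stand-in decode step:
tps = measure_tps(lambda: None, n_tokens=1000)
```

A real benchmark would pair this with a peak-memory probe on the inference framework's side to reproduce both columns of the table.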

Section 05

Model Architecture: Compact and Efficient

Merlin uses a GPT-style decoder-only Transformer architecture and provides four configurations to adapt to different usage scenarios:

| Configuration | Parameters | Embedding Dim | Attention Heads | Layers | Context Length |
| --- | --- | --- | --- | --- | --- |
| sanity | ~1.6M | 32 | 2 | 2 | 64 |
| experiment | ~21M | 256 | 8 | 8 | 512 |
| iphone | ~3.17B | 3072 | 24 | 20 | 4096 |
| macbook | ~7.19B | 4096 | 32 | 26 | 4096 |

From the 1.6-million-parameter micro configuration to the 7.19-billion-parameter desktop configuration, Merlin covers everything from rapid prototype validation to production deployment.
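The table's parameter counts can be roughly cross-checked, assuming tied embeddings, bias-free linears, a SwiGLU hidden size of about 8/3·d, and the GPT-2 vocabulary of 50,257 tokens. All of these constants are our assumptions: the estimates land close to the smaller configurations but undershoot the larger ones, so Merlin's exact hidden sizes evidently differ:

```python
# Rough parameter estimate for a GPT-style decoder-only Transformer with
# SwiGLU MLPs and tied embeddings. A sketch, not Merlin's exact formula.
VOCAB = 50257  # GPT-2 tokenizer vocabulary (tiktoken), per the tech stack

def approx_params(d: int, n_layers: int, vocab: int = VOCAB) -> int:
    attn = 4 * d * d                  # Q, K, V, O projections, no bias
    mlp = 3 * d * int(8 * d / 3)      # SwiGLU: gate, up, down at ~8/3 d
    embed = vocab * d                 # tied with the output head
    return n_layers * (attn + mlp) + embed

for name, d, layers in [("sanity", 32, 2), ("experiment", 256, 8),
                        ("iphone", 3072, 20), ("macbook", 4096, 26)]:
    print(f"{name}: ~{approx_params(d, layers) / 1e6:.1f}M")
```

The estimate also makes the weight-tying saving concrete: the shared embedding term `vocab * d` is counted once rather than twice.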

Section 06

Key Design Choices

Merlin has made a series of carefully balanced architectural decisions:

  1. RMSNorm instead of LayerNorm: drops the mean-subtraction step, giving faster computation and better hardware mapping.
  2. SwiGLU MLP: better loss at the same parameter count than a GELU MLP.
  3. Weight tying: the token embedding layer and the output head share weights, saving about 39 million parameters in the base configuration.
  4. Bias-free linear layers: fewer parameters and less computation; standard practice in modern Transformers.
  5. Pre-norm structure: normalization is applied before the attention and MLP blocks, inside each residual branch, for more stable training.

These design choices together form an efficient and stable integrated training-inference architecture.
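Two of these choices are easy to sketch. Below are reference NumPy implementations of the standard RMSNorm and SwiGLU formulas (Merlin trains in PyTorch; this is an illustration, not the project's code):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Unlike LayerNorm, no mean subtraction: scale by root-mean-square only.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU MLP: SiLU(x @ w_gate) gates (x @ w_up), then projects back down.
    silu = lambda z: z / (1 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

The missing mean subtraction is exactly what makes RMSNorm cheaper than LayerNorm per token, and SwiGLU's three matrices (versus two for a GELU MLP) are why its hidden width is usually shrunk to roughly 8/3·d to keep parameter counts comparable.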

Section 07

Tech Stack: Collaboration Between PyTorch and MLX

Merlin uses a layered tech stack design to fully leverage the advantages of each framework:

| Role | Tool |
| --- | --- |
| Training | PyTorch + CUDA + Triton (NVIDIA) |
| Inference (Mac) | MLX + custom Metal kernels |
| Inference (iPhone) | CoreML (planned) |
| Data | TinyStories, tokenized with tiktoken (GPT-2 encoding) |
| Observability | Weights & Biases |

PyTorch serves as the single source of truth for training, ensuring the stability and reproducibility of the training process. MLX is specifically used for inference, and together with handwritten Metal kernels, it achieves extreme performance. Weight conversion is explicit and verified through strict numerical consistency tests.

This clearly divided architecture design allows Merlin to enjoy the rich toolchain of the PyTorch ecosystem while fully utilizing the dedicated inference acceleration of Apple Silicon.
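An explicit conversion step of this shape is easy to picture: walk the PyTorch state_dict, cast each tensor to a plain float32 array, and rename keys for the inference side. The `RENAME` entries below are hypothetical, not Merlin's real parameter names, and the function is a sketch of the pattern rather than convert.py itself:

```python
import numpy as np

# Hypothetical PyTorch-name -> MLX-name mapping; Merlin's real keys differ.
RENAME = {"tok_emb.weight": "embedding.weight"}

def convert(state_dict: dict) -> dict:
    """Export a state_dict's tensors to plain float32 NumPy arrays,
    renaming keys where the inference-side module layout differs."""
    out = {}
    for name, tensor in state_dict.items():
        array = np.asarray(tensor, dtype=np.float32)  # works for CPU tensors
        out[RENAME.get(name, name)] = array
    return out
```

Keeping the mapping explicit, rather than relying on identical module names across frameworks, is what makes the conversion auditable and lets the numerical consistency tests pin down exactly where any mismatch enters.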

Section 08

Project Structure and Code Organization

Merlin's codebase has a clear structure and high modularity:

  • model.py: PyTorch Transformer implementation (for training)
  • infer.py: MLX inference implementation, including KV cache, int4 quantization, and custom kernels
  • train.py: Training loop (AdamW, gradient clipping, W&B logging, HF Hub checkpoints)
  • data.py: Tokenization and memory-mapped processing of the TinyStories dataset
  • convert.py: Weight conversion from PyTorch to MLX
  • bench.py: TPS and memory benchmark tests
  • test_e2e.py: PyTorch/MLX numerical consistency test (atol=1e-6, greedy token matching)
  • docs/: Detailed documentation on architecture, tech stack, training, and inference

Each module has a single responsibility and clear interfaces, making it easy to understand and extend.
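The parity check that test_e2e.py describes (logits agreeing within atol=1e-6 and greedy tokens matching exactly) can be sketched as follows; the function name is ours:

```python
import numpy as np

def check_parity(logits_a, logits_b, atol=1e-6):
    """Assert PyTorch/MLX-style numerical parity on a pair of logit arrays:
    elementwise agreement within atol, and identical greedy-argmax tokens."""
    assert np.allclose(logits_a, logits_b, atol=atol), "logits diverge"
    assert np.array_equal(np.argmax(logits_a, axis=-1),
                          np.argmax(logits_b, axis=-1)), "greedy tokens differ"
    return True
```

The argmax comparison matters on top of `allclose`: greedy decoding only needs the top token to survive conversion, so checking both catches tolerance drift and ranking flips separately.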