Zing Forum

Merlin: A Highly Efficient Small Language Model Built from Scratch for Apple Silicon

Merlin is a highly efficient small language model built from scratch specifically for Apple Silicon devices (MacBook Pro and iPhone). It uses PyTorch for training, MLX for inference, and custom Metal kernels. With int4 quantization and KV caching, it reaches an inference speed of 625 TPS at a peak memory usage of only 188 MB, fitting comfortably within the roughly 4GB memory budget of an iPhone.

Tags: Apple Silicon · on-device inference · small language models · MLX · Metal kernels · int4 quantization · KV cache · iPhone AI · PyTorch · open-source LLM
Published 2026-04-09 17:07 · Recent activity 2026-04-09 17:18 · Estimated read: 10 min

Section 01

Introduction / Main Post: Merlin: A Highly Efficient Small Language Model Built from Scratch for Apple Silicon

Merlin is a highly efficient small language model built from scratch specifically for Apple Silicon devices (MacBook Pro and iPhone). It uses PyTorch for training, MLX for inference, and custom Metal kernels. With int4 quantization and KV caching, it reaches an inference speed of 625 TPS at a peak memory usage of only 188 MB, fitting comfortably within the roughly 4GB memory budget of an iPhone.

Section 02

Project Background and Motivation

As large language models (LLMs) thrive in the cloud, demand for on-device AI inference is growing rapidly. Deploying LLMs on consumer devices, however, faces severe challenges: memory limits, compute bottlenecks, and power constraints. On mobile devices like the iPhone in particular, achieving smooth AI inference within a roughly 4GB memory budget is a formidable engineering challenge.

The Merlin project was born to tackle this problem. It is a small language model built from scratch and optimized specifically for the Apple Silicon ecosystem (MacBook Pro and iPhone). It runs inference entirely on-device, is both educational and practically usable, and is fully open source.

Section 03

Core Objectives and Design Philosophy

The core objectives of the Merlin project are clear and focused:

  1. Maximize inference TPS (tokens per second) on Apple Silicon: through deep optimization, fully exploit the compute of M-series and A-series chips.
  2. Minimize memory usage: keep the model running smoothly on resource-constrained devices via quantization and careful memory management.
  3. Train from scratch on real data: no reliance on pre-trained weights; fully independent training for model transparency and controllability.
  4. Custom Metal kernels: no dependence on default framework implementations; handwritten high-performance Metal kernels squeeze out every bit of hardware performance.
  5. iPhone as a first-class target: int4 quantization keeps the model within an approximately 4GB memory budget.

This design philosophy reflects a deep understanding of on-device AI: instead of pursuing large and comprehensive parameter counts, it focuses on achieving extreme efficiency and practicality under limited resource constraints.
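The arithmetic behind objective 5 can be sanity-checked in a few lines. The parameter count below is the "iphone" configuration quoted later in this post; the helper name is ours, and the formula ignores quantization scale/zero-point overhead:

```python
# Back-of-envelope weight-memory budget (a sketch, not Merlin's code).
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring scales/zero-points."""
    return params * bits_per_weight / 8 / 1024**3

iphone_params = 3.17e9  # "iphone" configuration from the architecture table
print(f"fp32: {weight_memory_gb(iphone_params, 32):.1f} GB")  # far over budget
print(f"fp16: {weight_memory_gb(iphone_params, 16):.1f} GB")  # still over
print(f"int4: {weight_memory_gb(iphone_params, 4):.1f} GB")   # fits in 4 GB
```

At 4 bits per weight, even the 3.17B-parameter configuration stores its weights in well under 4 GB, which is why int4 is the iPhone configuration.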

Section 04

Performance Benchmarks: Impressive Measured Data

Merlin has shown impressive performance in benchmarks. Consider the base model (~117 million parameters) benchmarked on an M4 MacBook Pro:

| Configuration | TPS (tokens/s) | Peak Memory |
| --- | --- | --- |
| fp32, no KV cache | 38.7 | 1536 MB |
| fp32 + KV cache | 242.6 | 802 MB |
| int4 + KV cache | 625.3 | 188 MB |

The most striking row is int4 quantization with KV cache: 625.3 TPS at just 188 MB of memory. The model therefore fits comfortably within the roughly 4GB memory budget of an iPhone while responding fast enough for real-time interactive applications.
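The jump from 38.7 to 242.6 TPS comes from the KV cache: without it, every decode step recomputes keys and values for the whole prefix; with it, each step appends one row and reuses the rest. A minimal single-head NumPy decode loop illustrates the idea (an illustration of the technique, not Merlin's inference code):

```python
import numpy as np

def attend(q, K, V):
    """q: (d,); K, V: (t, d). Softmax attention over the cached prefix."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
rng = np.random.default_rng(0)
for _ in range(4):                        # decode loop: one token per step
    k, v, q = rng.normal(size=(3, d))     # this step's key, value, query
    K_cache = np.vstack([K_cache, k])     # append one row instead of
    V_cache = np.vstack([V_cache, v])     # recomputing the whole prefix
    out = attend(q, K_cache, V_cache)     # only the new query attends
```

Per-token cost drops from recomputing O(n) keys and values each step to a single append plus one attention pass over the cache, which is exactly the regime the table above measures.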

Behind this achievement is the project team's deep engineering investment in quantization algorithms, memory management, and kernel optimization.
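A TPS measurement of the kind bench.py reports can be sketched generically. Here `generate` is a stand-in for the model's single-step decode, and the function name is ours, not the project's API:

```python
import time

def measure_tps(generate, n_tokens: int = 256) -> float:
    """Time a token-by-token decode callable and report tokens per second.

    `generate` is any zero-argument callable producing one token per call;
    a real harness would also warm up caches before timing.
    """
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        generate()
    dt = time.perf_counter() - t0
    return n_tokens / dt

# Usage with a trivial stand-in decode step:
tps = measure_tps(lambda: None, n_tokens=1000)
```

A real benchmark would pair this with a peak-memory probe on the inference framework's side to reproduce both columns of the table.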

Section 05

Model Architecture: Compact and Efficient

Merlin uses a GPT-style decoder-only Transformer architecture and provides four configurations to adapt to different usage scenarios:

| Configuration | Parameters | Embedding Dim | Attention Heads | Layers | Context Length |
| --- | --- | --- | --- | --- | --- |
| sanity | ~1.6M | 32 | 2 | 2 | 64 |
| experiment | ~21M | 256 | 8 | 8 | 512 |
| iphone | ~3.17B | 3072 | 24 | 20 | 4096 |
| macbook | ~7.19B | 4096 | 32 | 26 | 4096 |

From the 1.6-million-parameter micro configuration to the 7.19-billion-parameter desktop configuration, Merlin covers everything from rapid prototype validation to production deployment.
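The table's parameter counts can be roughly cross-checked, assuming tied embeddings, bias-free linears, a SwiGLU hidden size of about 8/3·d, and the GPT-2 vocabulary of 50,257 tokens. All of these constants are our assumptions: the estimates land close to the smaller configurations but undershoot the larger ones, so Merlin's exact hidden sizes evidently differ:

```python
# Rough parameter estimate for a GPT-style decoder-only Transformer with
# SwiGLU MLPs and tied embeddings. A sketch, not Merlin's exact formula.
VOCAB = 50257  # GPT-2 tokenizer vocabulary (tiktoken), per the tech stack

def approx_params(d: int, n_layers: int, vocab: int = VOCAB) -> int:
    attn = 4 * d * d                  # Q, K, V, O projections, no bias
    mlp = 3 * d * int(8 * d / 3)      # SwiGLU: gate, up, down at ~8/3 d
    embed = vocab * d                 # tied with the output head
    return n_layers * (attn + mlp) + embed

for name, d, layers in [("sanity", 32, 2), ("experiment", 256, 8),
                        ("iphone", 3072, 20), ("macbook", 4096, 26)]:
    print(f"{name}: ~{approx_params(d, layers) / 1e6:.1f}M")
```

The estimate also makes the weight-tying saving concrete: the shared embedding term `vocab * d` is counted once rather than twice.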

Section 06

Key Design Choices

Merlin has made a series of carefully balanced architectural decisions:

  1. RMSNorm instead of LayerNorm: drops the mean-subtraction step, giving faster computation and better hardware mapping.
  2. SwiGLU MLP: better loss at the same parameter count than a GELU MLP.
  3. Weight tying: the token embedding layer and the output head share weights, saving about 39 million parameters in the base configuration.
  4. Bias-free linear layers: fewer parameters and less computation; standard practice in modern Transformers.
  5. Pre-norm structure: normalization is applied before the attention and MLP blocks, inside each residual branch, for more stable training.

These design choices together form an efficient and stable integrated training-inference architecture.
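Two of these choices are easy to sketch. Below are reference NumPy implementations of the standard RMSNorm and SwiGLU formulas (Merlin trains in PyTorch; this is an illustration, not the project's code):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Unlike LayerNorm, no mean subtraction: scale by root-mean-square only.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU MLP: SiLU(x @ w_gate) gates (x @ w_up), then projects back down.
    silu = lambda z: z / (1 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

The missing mean subtraction is exactly what makes RMSNorm cheaper than LayerNorm per token, and SwiGLU's three matrices (versus two for a GELU MLP) are why its hidden width is usually shrunk to roughly 8/3·d to keep parameter counts comparable.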

Section 07

Tech Stack: Collaboration Between PyTorch and MLX

Merlin uses a layered tech stack design to fully leverage the advantages of each framework:

| Role | Tool |
| --- | --- |
| Training | PyTorch + CUDA + Triton (NVIDIA) |
| Inference (Mac) | MLX + custom Metal kernels |
| Inference (iPhone) | CoreML (planned) |
| Data | TinyStories, tokenized with tiktoken (GPT-2 encoding) |
| Observability | Weights & Biases |

PyTorch serves as the single source of truth for training, ensuring the stability and reproducibility of the training process. MLX is specifically used for inference, and together with handwritten Metal kernels, it achieves extreme performance. Weight conversion is explicit and verified through strict numerical consistency tests.

This clearly divided architecture design allows Merlin to enjoy the rich toolchain of the PyTorch ecosystem while fully utilizing the dedicated inference acceleration of Apple Silicon.
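An explicit conversion step of this shape is easy to picture: walk the PyTorch state_dict, cast each tensor to a plain float32 array, and rename keys for the inference side. The `RENAME` entries below are hypothetical, not Merlin's real parameter names, and the function is a sketch of the pattern rather than convert.py itself:

```python
import numpy as np

# Hypothetical PyTorch-name -> MLX-name mapping; Merlin's real keys differ.
RENAME = {"tok_emb.weight": "embedding.weight"}

def convert(state_dict: dict) -> dict:
    """Export a state_dict's tensors to plain float32 NumPy arrays,
    renaming keys where the inference-side module layout differs."""
    out = {}
    for name, tensor in state_dict.items():
        array = np.asarray(tensor, dtype=np.float32)  # works for CPU tensors
        out[RENAME.get(name, name)] = array
    return out
```

Keeping the mapping explicit, rather than relying on identical module names across frameworks, is what makes the conversion auditable and lets the numerical consistency tests pin down exactly where any mismatch enters.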

Section 08

Project Structure and Code Organization

Merlin's codebase has a clear structure and high modularity:

  • model.py: PyTorch Transformer implementation (for training)
  • infer.py: MLX inference implementation, including KV cache, int4 quantization, and custom kernels
  • train.py: Training loop (AdamW, gradient clipping, W&B logging, HF Hub checkpoints)
  • data.py: Tokenization and memory-mapped processing of the TinyStories dataset
  • convert.py: Weight conversion from PyTorch to MLX
  • bench.py: TPS and memory benchmark tests
  • test_e2e.py: PyTorch/MLX numerical consistency test (atol=1e-6, greedy token matching)
  • docs/: Detailed documentation on architecture, tech stack, training, and inference

Each module has a single responsibility and clear interfaces, making it easy to understand and extend.
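The parity check that test_e2e.py describes (logits agreeing within atol=1e-6 and greedy tokens matching exactly) can be sketched as follows; the function name is ours:

```python
import numpy as np

def check_parity(logits_a, logits_b, atol=1e-6):
    """Assert PyTorch/MLX-style numerical parity on a pair of logit arrays:
    elementwise agreement within atol, and identical greedy-argmax tokens."""
    assert np.allclose(logits_a, logits_b, atol=atol), "logits diverge"
    assert np.array_equal(np.argmax(logits_a, axis=-1),
                          np.argmax(logits_b, axis=-1)), "greedy tokens differ"
    return True
```

The argmax comparison matters on top of `allclose`: greedy decoding only needs the top token to survive conversion, so checking both catches tolerance drift and ranking flips separately.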