Section 01
Introduction / Main Post: Merlin: A Highly Efficient Small Language Model Built from Scratch for Apple Silicon
Merlin is a highly efficient small language model built from scratch specifically for Apple Silicon devices (MacBook Pro and iPhone). It uses PyTorch for training, MLX for inference, and custom Metal kernels. With int4 quantization and KV caching enabled, it reaches an inference speed of 625 tokens per second with a peak memory footprint of only 188 MB, fitting comfortably within the 4 GB memory budget of an iPhone.
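To give a feel for how int4 quantization shrinks weights to roughly a quarter of their fp16 size, here is a minimal sketch of group-wise 4-bit quantization in plain Python. This is an illustrative assumption about the general technique, not Merlin's actual implementation; the function names and the group size of 32 are hypothetical.

```python
# Hypothetical sketch of group-wise int4 quantization (illustrative only,
# not Merlin's actual code). Each group of weights shares one scale and
# one minimum, and every weight is stored as a 4-bit integer in 0..15.

def quantize_int4(weights, group_size=32):
    """Quantize a flat list of floats to 4-bit codes with per-group scale/min."""
    quantized, scales, mins = [], [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        g_min, g_max = min(group), max(group)
        # 15 levels between min and max; guard against a zero-range group.
        scale = (g_max - g_min) / 15 or 1.0
        quantized.append([round((w - g_min) / scale) for w in group])
        scales.append(scale)
        mins.append(g_min)
    return quantized, scales, mins

def dequantize_int4(quantized, scales, mins):
    """Reconstruct approximate floats from 4-bit codes and group metadata."""
    out = []
    for q_group, scale, g_min in zip(quantized, scales, mins):
        out.extend(q * scale + g_min for q in q_group)
    return out
```

The round-trip error is bounded by half the group scale, which is why group-wise (rather than per-tensor) scaling keeps small models usable after quantization.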