Cadence: An Experimental LLM Inference Framework for Apple Silicon Based on MPSGraph

Cadence is an experimental macOS local LLM inference project built using Swift, SwiftUI, and Metal Performance Shaders Graph, focusing on verifying the GPU implementation of core Transformer operators.

Tags: Swift · MPSGraph · Apple Silicon · Metal · Transformer · Local Inference · LLM · On-device AI
Published 2026-04-26 11:37 · Recent activity 2026-04-26 11:50 · Estimated read: 5 min

Section 01

Cadence: Experimental LLM Inference Framework for Apple Silicon Using MPSGraph

Cadence is an experimental project by Ostinato Labs, built with Swift, SwiftUI, and Metal Performance Shaders Graph (MPSGraph) for native LLM inference on macOS. It focuses on verifying core Transformer operators on the GPU and is currently in an early R&D phase, not a ready-to-use chat app. It serves as an operator testbed, a CPU-GPU validation tool, and a prototype for future local inference engines.

Section 02

Background & Project Positioning

With Apple Silicon's growing power, leveraging Metal for LLM inference has become a key focus for developers. Cadence is an early-stage R&D project, not a production-ready app. Its roles include:

  1. a testing ground for Metal/MPSGraph operator experiments;
  2. a tool for validating GPU outputs against CPU references;
  3. a skeleton prototype for future local inference engines.

It is not a complete chat app, a Qwen runtime, or a mature, well-tested project.

Section 03

Technical Architecture & Core Transformer Operators

Cadence uses the native Apple tech stack: Swift 5 (language), SwiftUI (UI), and MPSGraph (GPU acceleration). Key components:

  • Device management (MTLDevice, command queue, MPSGraphDevice) in Device.swift;
  • Tensor utilities (data conversion) in TensorUtils.swift.

Implemented Transformer operators (an illustrative RMSNorm sketch follows this list):

  • Attention: single-head, multi-head (with causal mask), and GQA;
  • RoPE: precompute cos/sin tables and apply them;
  • Normalization: RMSNorm (with debug values) and LayerNorm;
  • Activation: SwiGLU.

Tokenizer: ByteShadowMap, a byte-level reversible encoding that serves as the foundation for BPE.
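
To make the operator work concrete, here is a minimal sketch of how RMSNorm (y = x / sqrt(mean(x²) + ε) · weight) can be expressed with MPSGraph primitives. This is an assumed reconstruction, not Cadence's actual code; the function name, shapes, and epsilon default are illustrative.

```swift
import Metal
import MetalPerformanceShadersGraph

// Minimal RMSNorm sketch with MPSGraph ops. Illustrative only;
// not Cadence's actual implementation.
func rmsNorm(graph: MPSGraph,
             x: MPSGraphTensor,       // [batch, seq, dim]
             weight: MPSGraphTensor,  // [dim]
             eps: Double = 1e-6) -> MPSGraphTensor {
    // mean(x^2) over the feature axis; MPSGraph reductions keep the
    // reduced axis as size 1, so the result broadcasts back against x.
    let squared = graph.square(with: x, name: "x_squared")
    let meanSq  = graph.mean(of: squared, axes: [2], name: "mean_sq")
    let epsT    = graph.constant(eps, dataType: .float32)
    let rms     = graph.squareRoot(with: graph.addition(meanSq, epsT, name: nil),
                                   name: "rms")
    let normed  = graph.division(x, rms, name: "normed")
    // Elementwise scale by the learned weight, broadcast over [dim].
    return graph.multiplication(normed, weight, name: "rmsnorm_out")
}

// Example wiring with placeholder tensors:
let graph = MPSGraph()
let x = graph.placeholder(shape: [1, 8, 64], dataType: .float32, name: "x")
let w = graph.placeholder(shape: [64], dataType: .float32, name: "weight")
let y = rmsNorm(graph: graph, x: x, weight: w)
```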

Section 04

Validation & Testing Methods

Cadence uses manual test runners (not XCTest) compiled into the app, invoked via CadenceApp.init(). Current tests:

  • MatmulTest: CPU vs GPU consistency;
  • RMSNormTest, RoPETest (numeric/property), LayerNormTest, SWiGLUTest;
  • AttentionTest (single/multi/GQA), AttentionPerfTest (CPU-GPU performance);
  • ByteShadowMapTest (round-trip encoding).

This approach offers flexibility in early R&D; the sketch below illustrates the comparison pattern.
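
This is a plausible reconstruction of that pattern, not code from the repository; assertClose and the commented helper names are hypothetical.

```swift
import Foundation

// Sketch of the CPU-vs-GPU check used by the manual runners.
// `assertClose` is a hypothetical helper, not Cadence's actual API.
func assertClose(_ gpu: [Float], _ cpu: [Float],
                 tolerance: Float = 1e-4, label: String) {
    precondition(gpu.count == cpu.count, "\(label): element count mismatch")
    // Largest absolute elementwise difference between the two outputs.
    let maxDiff = zip(gpu, cpu).map { abs($0.0 - $0.1) }.max() ?? 0
    print(maxDiff <= tolerance
        ? "[PASS] \(label) (max diff \(maxDiff))"
        : "[FAIL] \(label) (max diff \(maxDiff) > tolerance \(tolerance))")
}

// Hypothetical usage inside a runner invoked from CadenceApp.init():
// let gpuOut = runMatmulOnGPU(a, b)   // MPSGraph path
// let cpuOut = matmulOnCPU(a, b)      // naive reference loop
// assertClose(gpuOut, cpuOut, label: "MatmulTest")
```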

Section 05

Current Limitations & Model Assets

Model assets: Qwen3.5-4B files (tokenizer config, vocab, merges, and partial safetensors) are present but not loaded or used, because the first safetensors shard is missing. Current gaps: the UI is still the default Hello World view; there is no safetensors reader, no tokenizer parsing, no end-to-end Transformer block, and no logits sampling or text generation; and the tests are not yet in XCTest.

Section 06

Future Development Directions

Next steps for Cadence:

  1. Add safetensors weight loading;
  2. Parse tokenizer vocab and BPE rules;
  3. Combine operators into full Transformer blocks;
  4. Add embedding layer, LM head, KV cache;
  5. Build an end-to-end pipeline (prompt → tokens → logits → sampling → text; a sketch follows this list);
  6. Migrate tests to XCTest and add benchmarks.
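
Step 5 can be pictured with the following hypothetical interfaces; none of these types exist in Cadence yet, and the greedy-sampling loop is only one possible decoding strategy.

```swift
// Hypothetical shape of the planned pipeline
// (prompt → tokens → logits → sampling → text). All names are placeholders.
protocol Tokenizer {
    func encode(_ text: String) -> [Int]
    func decode(_ tokens: [Int]) -> String
    var eosToken: Int { get }
}

protocol Model {
    /// Returns next-token logits over the vocabulary for the given context.
    func forward(_ tokens: [Int]) -> [Float]
}

func generate(model: Model, tokenizer: Tokenizer,
              prompt: String, maxTokens: Int = 64) -> String {
    var tokens = tokenizer.encode(prompt)              // text → token IDs
    for _ in 0..<maxTokens {
        let logits = model.forward(tokens)             // blocks + LM head
        // Greedy sampling: take the highest-logit token.
        guard let next = logits.indices.max(by: { logits[$0] < logits[$1] })
        else { break }
        if next == tokenizer.eosToken { break }
        tokens.append(next)
    }
    return tokenizer.decode(tokens)                    // token IDs → text
}
```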

Section 07

Project Significance & Value

Cadence's value:

  • Proves that Swift + MPSGraph can implement core Transformer operators, a step toward end-to-end on-device AI on Apple platforms;
  • Demonstrates a lightweight CPU-GPU comparison method for validating operator correctness;
  • Serves as an open-source resource for learning Metal/MPSGraph and LLM inference on Apple Silicon;
  • Offers a clear code structure and complete operator implementations, making it excellent learning material for developers interested in Apple-native LLM inference.