Zing Forum


Gemma4SwiftCore: A Pure Swift Inference Engine for Running Google Gemma 4 Natively on Apple Devices

Gemma4SwiftCore is the first pure Swift implementation of the Google Gemma 4 text decoder, running 100% locally on iPhone, iPad, and Mac with no Python runtime and no CoreML conversion.

Tags: Gemma 4 · Swift · Apple Silicon · MLX · Local Inference · iOS · macOS · LLM · On-Device AI
Published 2026/04/08 14:16 · Last activity 2026/04/08 14:19 · Estimated reading time 6 minutes
Section 01

Gemma4SwiftCore: First Pure Swift Gemma4 Inference Engine for Apple Devices

Gemma4SwiftCore is the first pure Swift implementation of the Google Gemma 4 text decoder, enabling 100% local inference on iPhone, iPad, and Mac without a Python runtime or CoreML conversion. It solves key issues in existing Apple-ecosystem solutions for Gemma 4 deployment and gives iOS/macOS developers a native path to integrating advanced LLM capabilities.

Section 02

Project Background & Motivation

When Google released Gemma 4 in April 2026, Apple's mlx-swift-lm v2.31.x lacked native support for it. Patching the Gemma 3 implementation to fit Gemma 4 failed at weight loading due to five key architectural differences. Additionally, swift-jinja 1.x silently mis-rendered the chat template, producing fluent but irrelevant responses. Gemma4SwiftCore was built to address these issues, with a full Swift port of the decoder and a chat-template bypass that keeps token sequences consistent with Python's mlx-lm.
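The bypass idea is easiest to see in a short sketch: instead of rendering a Jinja template, build the literal marker string directly, so the tokenized result can be diffed against the Python mlx-lm reference. The marker strings below follow earlier Gemma releases and are assumptions here, not confirmed Gemma 4 values:

```python
def format_user_turn(text,
                     bos="<bos>",
                     start="<start_of_turn>",
                     end="<end_of_turn>"):
    # Conceptual sketch of a chat-template bypass: concatenate
    # literal turn markers rather than rendering a Jinja template.
    # Marker strings are borrowed from earlier Gemma releases and
    # are illustrative assumptions, not confirmed Gemma 4 values.
    return f"{bos}{start}user\n{text}{end}\n{start}model\n"
```

Because the output is a fixed literal, two independent implementations that agree on the tokenizer must produce identical token IDs, which is the consistency property the project verifies against mlx-lm.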

Section 03

Core Technical Architecture

  1. Per-Layer Embedding (PLE): each decoder layer uses a small MLP to gate shared embedding vectors; the gated result is added as a third residual connection, capturing semantics at multiple granularities.
  2. Cross-Layer KV Sharing: the last 20 of 35 layers reuse K/V tensors from earlier layers, cutting memory via a 'donor table' and a global RoPE offset.
  3. Proportional RoPE: a custom Gemma4ProportionalRoPE class handles Gemma 4's partial-rotation RoPE, which mlx-swift-lm does not support.
  4. Chat Template Bypass: sidesteps the swift-jinja issues by building literal strings with turn markers, ensuring token IDs match Python's mlx-lm.
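The first three mechanisms can be sketched numerically. The Python/NumPy sketch below is purely illustrative: the single-linear-layer gate, the modulo donor mapping, and the rot_frac/base defaults are assumptions for clarity, not Gemma 4's actual parameters:

```python
import numpy as np

def ple_residual(hidden, shared_embed, w_gate):
    """Per-Layer Embedding sketch: a per-layer gate (one linear layer
    + sigmoid here, standing in for the small MLP) scales a shared
    embedding, which is added as an extra residual term."""
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_gate)))  # (seq, d) in [0, 1]
    return hidden + gate * shared_embed

def kv_donor_table(num_layers=35, num_unique=15):
    """Cross-layer KV sharing sketch: the last 20 of 35 layers reuse
    the K/V cache of an earlier 'donor' layer. The modulo mapping is
    illustrative only; the real pairing is model-specific."""
    return {layer: (layer if layer < num_unique else layer % num_unique)
            for layer in range(num_layers)}

def partial_rope(x, positions, rot_frac=0.5, base=10000.0):
    """Proportional (partial-rotation) RoPE sketch: rotate only the
    first rot_frac of each head dimension, pass the rest through."""
    d = x.shape[-1]
    rot = int(d * rot_frac)
    rot -= rot % 2                                   # rotated span is even
    half = rot // 2
    inv_freq = base ** (-np.arange(half) / half)     # (half,)
    ang = np.asarray(positions, dtype=float)[:, None] * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2, rest = x[..., :half], x[..., half:rot], x[..., rot:]
    return np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos, rest], axis=-1)
```

Note that a donor layer's cached K/V was computed at its own positions, which is why a global RoPE offset is needed when a later layer consumes them.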
Section 04

Performance & Real-Device Test Data

Tested on iPhone (Apple A-series, 7.4 GB RAM) with the mlx-community/gemma-4-e2b-it-4bit checkpoint:

  • Cold start (download + init): ~110 s (one-time).
  • Hot start: ~6 s.
  • Memory usage after load: 341-392 MB (well below the 2 GB target).
  • First audio block generation: 2.82 s (end-to-end TTS pipeline, including a 333-token system prompt).
  • Throughput: 12-14 tokens/sec.

These metrics enable smooth interactive experiences on consumer mobile devices.
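As a quick sanity check on the throughput figure, 12-14 tokens/sec implies the following wall-clock range for a 200-token reply (200 is the maxTokens value used in the integration steps later in this post):

```python
def gen_time_seconds(num_tokens, tokens_per_sec):
    # Simple back-of-envelope estimate: decode time scales linearly
    # with token count at a steady decode rate (prefill excluded).
    return num_tokens / tokens_per_sec

fastest = gen_time_seconds(200, 14)   # ~14.3 s at 14 tok/s
slowest = gen_time_seconds(200, 12)   # ~16.7 s at 12 tok/s
```

So a full 200-token response streams in roughly 14-17 seconds, which is consistent with the "smooth interactive experience" claim once tokens are displayed as they arrive.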
Section 05

Integration & Usage Guide

Distributed via Swift Package Manager. Key steps:

  1. Register the sidecar processor: await Gemma4Registration.registerIfNeeded().value
  2. Load the 4-bit weights from Hugging Face: let container = try await LLMModelFactory.shared.loadContainer(configuration: ModelConfiguration(id: Gemma4SwiftCore.verifiedModelId))
  3. Format the prompt with the bypass: let prompt = Gemma4PromptFormatter.userTurn("Tell me a short story about a curious little fox.")
  4. Stream generated tokens: let stream = try await container.generate(input: input, parameters: GenerateParameters(maxTokens: 200, temperature: 0.8, topP: 0.95))

Model weights (~1.5 GB) are cached locally after the first download.
Section 06

Comparison with Existing Solutions

Feature                 | Gemma4SwiftCore | mlx-swift-lm (upstream) | swift-coreml-transformers
Gemma 4 support         | ✅              | ❌                      | ❌
Per-Layer Embedding     | ✅              | N/A                     | N/A
Cross-Layer KV Sharing  | ✅              | N/A                     | N/A
Proportional RoPE       | ✅              | ❌                      | N/A
Chat Template Bypass    | ✅              | ❌ (jinja broken)       | N/A
Pure Swift (no Python)  | ✅              | ✅                      | ✅
iOS + macOS support     | ✅              | ✅                      | ✅

Gemma4SwiftCore fills the Gemma 4 support gap in the Apple ecosystem.
Section 07

Future Outlook & Conclusion

Future Roadmap:

  • v0.2: KV cache quantization, larger context window benchmarks.
  • v0.3: Gemma4 E4B variant support, streaming API.
  • v1.0: Stable public API, semantic versioning.

Conclusion: Gemma4SwiftCore advances mobile LLM deployment by lowering the barrier to Gemma 4 integration in the Apple ecosystem through a pure Swift implementation and an optimized architecture. It is a valuable tool for developers pursuing on-device AI. Note: the code is MIT-licensed, while the Gemma 4 weights are covered by Google's separate license (review it before releasing an app).