# mobile-model-SDK: On-Device Multimodal Large Model Inference Framework for iOS and macOS

> mobile-model-SDK is an on-device multimodal large model inference SDK for iOS and macOS. It runs models such as MiniCPM-V and Gemma 4 fully offline and provides APIs compatible with OpenAI and Anthropic.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-07T01:31:51.000Z
- Last activity: 2026-06-07T01:53:35.715Z
- Popularity: 163.6
- Keywords: on-device AI, multimodal large models, iOS, macOS, llama.cpp, MiniCPM-V, Gemma 4, offline inference, Swift, Metal
- Page link: https://www.zingnex.cn/en/forum/thread/mobile-model-sdk-ios-macos
- Canonical: https://www.zingnex.cn/forum/thread/mobile-model-sdk-ios-macos
- Markdown source: floors_fallback

---

## Original Author and Source

- Original Author/Maintainer: Shiyao-Huang
- Source Platform: GitHub
- Original Title: mobile-model-SDK
- Original Link: https://github.com/Shiyao-Huang/mobile-model-SDK
- Source Publication/Update Time: 2026-06-07T01:31:51Z

## Introduction: The Rise of On-Device AI

With the rapid development of Large Language Model (LLM) technology, more and more applications are moving AI capabilities from the cloud to local devices. On-device AI has several advantages: it needs no network connection, keeps data private, lowers response latency, and has no API call limits. Running multimodal large models on mobile devices remains a technical challenge, however: how can an app achieve high-quality text, image, and even audio understanding with limited computing resources?

mobile-model-SDK is an open-source project built to address this challenge. It is an on-device multimodal large model inference SDK designed specifically for iOS and macOS. It lets developers run small vision-language and audio-language models completely offline on Apple devices, and it provides APIs compatible with OpenAI and Anthropic.

## Technical Foundation: Metal Backend Based on llama.cpp

The core technology stack of mobile-model-SDK is built on llama.cpp, a high-performance large model inference library developed by Georgi Gerganov, known for its excellent quantization support and cross-platform capabilities. The SDK specifically uses llama.cpp's `mtmd` multimodal stack, supporting joint processing of text, images, and audio.

In the Apple ecosystem, the SDK uses llama.cpp's Metal backend for GPU acceleration. Metal is Apple's proprietary graphics and compute API, and it gives inference code efficient access to the GPU in Apple Silicon chips on iPhone, iPad, and Mac devices. Note that Metal targets the GPU rather than the Apple Neural Engine. With this backend, even resource-constrained mobile devices can run small multimodal models.
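As a rough illustration of how an app might decide whether GPU offload is worthwhile, the sketch below probes for a Metal device and maps the result to a layer-offload count. `MTLCreateSystemDefaultDevice()` is Apple's real Metal entry point and `n_gpu_layers` is a real llama.cpp model parameter, but the function names here are hypothetical, the value 99 is only a common "offload everything" convention, and the source does not say whether or how the SDK exposes this setting.

```swift
#if canImport(Metal)
import Metal
#endif

/// Returns the name of the default Metal device, or nil when Metal is
/// unavailable (for example on non-Apple platforms).
func defaultMetalDeviceName() -> String? {
    #if canImport(Metal)
    return MTLCreateSystemDefaultDevice()?.name
    #else
    return nil
    #endif
}

/// Hypothetical helper: picks a value that could be passed to llama.cpp's
/// `n_gpu_layers`. 0 keeps all layers on the CPU; a large value such as 99
/// is an assumed convention for offloading every layer to the GPU.
func gpuLayerCount(metalDeviceName: String?) -> Int32 {
    return metalDeviceName == nil ? 0 : 99
}

let layers = gpuLayerCount(metalDeviceName: defaultMetalDeviceName())
print("GPU layers to offload: \(layers)")
```

Keeping the probe separate from the decision makes the fallback path easy to exercise on machines without a GPU.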

## Supported Models and Capability Matrix

mobile-model-SDK currently supports the following models:

**MiniCPM-V 4.6 (1.3B)**: An efficient multimodal model developed by OpenBMB (ModelBest). It has only 1.3B parameters but performs well on visual understanding tasks. It is particularly good at OCR (Optical Character Recognition) and UI understanding, and can accurately recognize text and interface elements in screenshots. This model supports text and image input but not audio.

**Gemma 4 E2B / E4B**: Models from Google's Gemma 4 series that support three modalities: text, image, and audio. E2B and E4B are two variants at different parameter scales. Gemma 4's native audio support lets it process voice input directly on the device, enabling speech-to-text conversion and voice-based Q&A.

Notably, the SDK has a model-agnostic design. Developers can load any supported model in GGUF format, and the SDK automatically detects the model's capabilities (vision and audio support) and applies the correct conversation template. Adding a new model usually requires no code changes: supply the model's GGUF file and the corresponding mmproj file.
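The file-pairing idea can be sketched as follows. This is a minimal illustration, not the SDK's actual loader: `ModelCapabilities`, `ModelBundle`, `inspectModelDirectory`, and `resolveCapabilities` are hypothetical names, and the sketch assumes one model per directory with a projector file whose name contains "mmproj". Which modalities a projector actually supports is something the runtime reports at load time, so it is passed in as a parameter here.

```swift
import Foundation

/// Hypothetical capability flags; the SDK's real types may differ.
struct ModelCapabilities: OptionSet {
    let rawValue: Int
    static let text   = ModelCapabilities(rawValue: 1 << 0)
    static let vision = ModelCapabilities(rawValue: 1 << 1)
    static let audio  = ModelCapabilities(rawValue: 1 << 2)
}

/// Hypothetical pairing of a weights file with its optional projector file.
struct ModelBundle: Equatable {
    let weights: String
    let projector: String?
}

/// Finds the weights file and the optional mmproj file in a directory listing.
/// Assumption: the projector's file name contains "mmproj".
func inspectModelDirectory(_ fileNames: [String]) -> ModelBundle? {
    let ggufs = fileNames.filter { $0.lowercased().hasSuffix(".gguf") }.sorted()
    guard let weights = ggufs.first(where: { !$0.lowercased().contains("mmproj") }) else {
        return nil
    }
    let projector = ggufs.first(where: { $0.lowercased().contains("mmproj") })
    return ModelBundle(weights: weights, projector: projector)
}

/// Text is always available; other modalities require a projector file and
/// are limited to what the loaded projector reports.
func resolveCapabilities(bundle: ModelBundle,
                         projectorReports: ModelCapabilities) -> ModelCapabilities {
    guard bundle.projector != nil else { return .text }
    return projectorReports.union(.text)
}
```

With a listing such as `["mmproj-model-f16.gguf", "model-Q4_K_M.gguf"]` (made-up file names), the sketch pairs the two files; a directory with only a weights file resolves to text-only regardless of what the caller passes in.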

## Fully On-Device Operation

The most prominent feature of the SDK is that all inference runs locally on the device, with no network connection and no reliance on cloud services. This means:

- **Privacy Protection**: Users' image, audio, and text data never leave the device, which is especially important for applications that handle sensitive information (such as medical or financial data).
- **Offline Availability**: The SDK works normally without a network connection (e.g., in airplane mode or in remote areas).
- **Zero API Cost**: There are no cloud API fees; once the model is downloaded, it can be used without per-call charges.

## Multimodal Capabilities

The SDK supports combinations of three input modalities:

**Text**: As the basic modality, all models support text input and generation.

**Visual**: Supports single or multiple image inputs, as well as video frame sequences. Images are encoded into visual tokens and processed together with text tokens. Image tokens are placed before text, in line with Gemma 4's multimodal conventions.
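The ordering convention can be sketched as a small stable sort. The sketch also covers audio, which the next paragraph places after text. `ContentPart` and `orderedForPrompt` are hypothetical names used only for illustration; the SDK's own message types are not shown in the source.

```swift
/// Hypothetical content part; file names stand in for real media payloads.
enum ContentPart: Equatable {
    case text(String)
    case image(name: String)
    case audio(name: String)

    /// Sort rank: images first, then text, then audio.
    var rank: Int {
        switch self {
        case .image: return 0
        case .text:  return 1
        case .audio: return 2
        }
    }
}

/// Reorders parts to image, text, audio while keeping the original
/// relative order within each modality.
func orderedForPrompt(_ parts: [ContentPart]) -> [ContentPart] {
    // Sorting on (rank, original index) makes the result stable
    // without relying on the sort algorithm itself being stable.
    return parts.enumerated()
        .sorted { ($0.element.rank, $0.offset) < ($1.element.rank, $1.offset) }
        .map { $0.element }
}
```

Normalizing the order in one place means callers can attach media in any order and still produce a prompt that follows the convention.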

**Audio**: Gemma 4 series models support native voice input. Developers can record 16 kHz mono WAV audio and pass it as part of the input. Audio tokens are placed after text, in line with Gemma 4's modality order conventions.
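To make the audio format concrete, the sketch below builds the standard 44-byte RIFF/WAVE header for uncompressed PCM audio. The source specifies only 16 kHz mono; the 16-bit sample depth is an assumption, and `wavHeader` is a hypothetical helper rather than part of the SDK.

```swift
import Foundation

/// Builds a 44-byte WAV header for PCM audio.
/// Defaults: 16 kHz, mono, and (as an assumption) 16-bit samples.
func wavHeader(frameCount: Int,
               sampleRate: UInt32 = 16_000,
               channels: UInt16 = 1,
               bitsPerSample: UInt16 = 16) -> Data {
    let bytesPerFrame = UInt32(channels) * UInt32(bitsPerSample / 8)
    let dataSize = UInt32(frameCount) * bytesPerFrame
    var header = Data()

    // WAV stores all integer fields in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { header.append(contentsOf: $0) }
    }

    header.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)        // bytes remaining after this field
    header.append(contentsOf: Array("WAVE".utf8))
    header.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                   // size of the PCM fmt chunk
    append(UInt16(1))                    // audio format 1 = PCM
    append(channels)
    append(sampleRate)
    append(sampleRate * bytesPerFrame)   // byte rate
    append(UInt16(bytesPerFrame))        // block align
    append(bitsPerSample)
    header.append(contentsOf: Array("data".utf8))
    append(dataSize)
    return header
}
```

Prepending this header to the raw sample bytes yields a complete WAV file; one second of audio under these defaults is 16,000 frames, or 32,000 bytes of sample data.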

## API Compatibility

To lower the integration barrier for developers, the SDK provides interfaces compatible with mainstream cloud APIs:

**OpenAI Compatible Mode**: Provides `ChatCompletionRequest` and streaming chunks that follow the format of OpenAI's Chat Completions API. Developers familiar with the OpenAI SDK can reuse the request and response shapes they already know.
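For readers unfamiliar with that wire format, the sketch below mirrors a small subset of the OpenAI Chat Completions request body and one streamed `chat.completion.chunk`. The `...Sketch` types and `assembleText` are hypothetical illustrations; the SDK's own `ChatCompletionRequest` may declare different or additional fields.

```swift
import Foundation

// Hypothetical mirrors of OpenAI-style payloads, not the SDK's definitions.
struct ChatMessageSketch: Codable, Equatable {
    let role: String
    let content: String
}

struct ChatCompletionRequestSketch: Codable, Equatable {
    let model: String
    let messages: [ChatMessageSketch]
    let stream: Bool
}

// Subset of one streamed `chat.completion.chunk` object.
struct ChatCompletionChunkSketch: Decodable {
    struct Choice: Decodable {
        struct Delta: Decodable {
            let role: String?
            let content: String?
        }
        let index: Int
        let delta: Delta
        let finishReason: String?

        enum CodingKeys: String, CodingKey {
            case index, delta
            case finishReason = "finish_reason"
        }
    }
    let id: String
    let object: String
    let choices: [Choice]
}

/// Concatenates the text deltas of the first choice across streamed chunks.
func assembleText(fromChunks jsonChunks: [String]) throws -> String {
    let decoder = JSONDecoder()
    var text = ""
    for json in jsonChunks {
        let chunk = try decoder.decode(ChatCompletionChunkSketch.self,
                                       from: Data(json.utf8))
        text += chunk.choices.first?.delta.content ?? ""
    }
    return text
}
```

Each chunk carries only an incremental `delta`, so a client rebuilds the full reply by appending the `content` fields in arrival order.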

**Anthropic Compatible Mode**: Provides Messages API types and streaming events that follow the format of Anthropic's Claude API, giving developers who already use Claude a familiar interface.
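The Anthropic streaming format is event-based: each event carries a `type` field, and text arrives in `content_block_delta` events whose `delta` has type `text_delta`. The sketch below decodes just enough of that format to rebuild the reply text. `MessageStreamEventSketch` and `collectText` are hypothetical names, and the SDK's real event types are not shown in the source.

```swift
import Foundation

/// Hypothetical reduced view of Anthropic-style streaming events.
enum MessageStreamEventSketch: Decodable, Equatable {
    case textDelta(index: Int, text: String)
    case messageStop
    case other(type: String)

    private enum CodingKeys: String, CodingKey { case type, index, delta }

    private struct Delta: Decodable {
        let type: String
        let text: String?
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let type = try container.decode(String.self, forKey: .type)
        switch type {
        case "content_block_delta":
            let delta = try container.decode(Delta.self, forKey: .delta)
            let index = try container.decode(Int.self, forKey: .index)
            if delta.type == "text_delta", let text = delta.text {
                self = .textDelta(index: index, text: text)
            } else {
                self = .other(type: type)   // non-text deltas are ignored here
            }
        case "message_stop":
            self = .messageStop
        default:
            self = .other(type: type)       // message_start, content_block_start, ...
        }
    }
}

/// Appends text deltas in order and stops at the `message_stop` event.
func collectText(fromEvents jsonEvents: [String]) throws -> String {
    let decoder = JSONDecoder()
    var text = ""
    for json in jsonEvents {
        switch try decoder.decode(MessageStreamEventSketch.self, from: Data(json.utf8)) {
        case .textDelta(_, let chunk): text += chunk
        case .messageStop: return text
        case .other: continue
        }
    }
    return text
}
```

Folding unrecognized event types into `.other` keeps the decoder tolerant of events it does not need, which is a design choice of this sketch rather than a documented SDK behavior.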