Spindll: Rust-native High-Performance LLM Inference Engine, a Lightweight Alternative to Ollama

A single-binary LLM inference engine written in Rust, supporting gRPC and HTTP streaming inference, multi-model concurrency, GPU acceleration, and OpenAI API compatibility.

Tags: Rust, LLM inference, Ollama alternative, gRPC, MLX, GGUF, OpenAI API, local deployment, GPU acceleration
Published 2026-05-09 14:13 · Recent activity 2026-05-09 14:22 · Estimated read: 7 min

Section 01

Introduction

Spindll is a single-binary LLM inference engine written in Rust. It offers gRPC and HTTP streaming inference, multi-model concurrency, GPU acceleration, and an OpenAI-compatible API.


Section 02

Background: Evolution of Local LLM Inference Requirements

As large language models become more widespread, a growing number of developers and enterprises want to deploy and run LLM inference services locally. Ollama, a pioneer in this space, has won a broad user base with its simple user experience. As usage deepens, however, users are demanding more from inference engines in terms of performance, resource management, and scalability.

Rust, with its excellent performance and memory safety, has become an ideal choice for building high-performance system-level applications. The Spindll project is built on this technology stack to create a production-ready LLM inference engine.


Section 03

Project Overview

Spindll is an inference engine for GGUF- and MLX-format models, written natively in Rust and open-sourced by developer Iito. The name combines "Spindle" and "LL(ama)", reflecting its positioning as a model management and serving engine.

As a single-binary solution, Spindll can pull models from the Ollama registry or HuggingFace, manage local storage, and provide streaming inference services via gRPC and HTTP protocols. It supports concurrent loading of multiple models, memory-aware scheduling, GPU hardware acceleration, and provides an OpenAI-compatible API interface.

Notably, on Apple Silicon, Spindll runs MLX-format models natively via a Swift bridge while serving GGUF models through llama.cpp, so the two backends integrate seamlessly.


Section 04

Multi-source Model Pulling and Management

Spindll supports fetching models from multiple sources:

  • Ollama Registry: Compatible with Ollama's model naming conventions (e.g., llama3.1:8b, qwen2:0.5b)
  • HuggingFace Repository: Accepts HuggingFace repository references such as TheBloke/Llama-3-8B-GGUF
  • Intelligent Quantization Selection: When no quantization level is specified, the highest-priority available variant is chosen from the list q4_k_m > q5_k_m > q4_0 > ... > fp16, so q4_k_m is preferred by default

On Apple Silicon devices, the system automatically detects and prioritizes MLX format models, falling back to GGUF format when MLX is unavailable.
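A minimal sketch of this selection logic in Rust: the priority order comes from the list above, while the type and function names (ModelFormat, pick_artifact) are purely illustrative and not Spindll's actual API.

    // Quantization priority used when the user does not pin a level
    // (order taken from the section above; everything else is illustrative).
    const QUANT_PRIORITY: &[&str] = &["q4_k_m", "q5_k_m", "q4_0", "fp16"];

    #[derive(Debug, Clone, PartialEq)]
    enum ModelFormat {
        Mlx,
        Gguf,
    }

    /// Pick which artifact of a repository to download, given the
    /// (format, quantization) variants the source actually offers.
    fn pick_artifact(
        available: &[(ModelFormat, String)],
        on_apple_silicon: bool,
    ) -> Option<(ModelFormat, String)> {
        // On Apple Silicon, prefer an MLX variant when one exists.
        if on_apple_silicon {
            if let Some(hit) = available.iter().find(|(f, _)| *f == ModelFormat::Mlx) {
                return Some(hit.clone());
            }
        }
        // Otherwise (or as a fallback) walk the GGUF quantization priority list.
        for wanted in QUANT_PRIORITY {
            if let Some(hit) = available
                .iter()
                .find(|(f, q)| *f == ModelFormat::Gguf && q == wanted)
            {
                return Some(hit.clone());
            }
        }
        None
    }

    fn main() {
        let available = vec![
            (ModelFormat::Gguf, "q4_0".to_string()),
            (ModelFormat::Gguf, "q4_k_m".to_string()),
        ];
        // No MLX build is offered here, so the q4_k_m GGUF variant wins.
        println!("{:?}", pick_artifact(&available, true));
    }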


Section 05

Flexible Inference Backend Architecture

The project adopts a pluggable inference backend design, with unified scheduling via the InferenceBackend trait:

  • llama.cpp Backend: Supports GGUF format, runs cross-platform, and supports GPU offloading
  • mlx-swift-lm Backend: Integrated via Swift FFI, optimized for Apple Silicon, natively supports MLX format

This design leaves extension points for integrating additional inference engines in the future.
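The post names the InferenceBackend trait without showing it; below is a rough guess at what such a pluggable interface could look like. Method names and signatures are assumptions, not Spindll's actual definitions.

    use std::error::Error;
    use std::path::Path;

    /// Hypothetical unified interface over the llama.cpp and mlx-swift-lm backends.
    trait InferenceBackend: Send + Sync {
        /// Human-readable backend name, e.g. "llama.cpp" or "mlx-swift-lm".
        fn name(&self) -> &str;

        /// Whether this backend can serve the given model file format ("gguf", "mlx", ...).
        fn supports(&self, format: &str) -> bool;

        /// Load a model from disk and return an opaque handle (here just an id).
        fn load(&self, path: &Path) -> Result<u64, Box<dyn Error>>;

        /// Stream completion tokens for a prompt, invoking the callback once per token.
        fn generate(
            &self,
            model_id: u64,
            prompt: &str,
            on_token: &mut dyn FnMut(&str),
        ) -> Result<(), Box<dyn Error>>;

        /// Unload the model and free its resources (weights, KV cache, GPU buffers).
        fn unload(&self, model_id: u64);
    }

    /// The scheduler can then hold backends behind trait objects and route by format.
    fn pick_backend<'a>(
        backends: &'a [Box<dyn InferenceBackend>],
        format: &str,
    ) -> Option<&'a dyn InferenceBackend> {
        backends.iter().map(|b| b.as_ref()).find(|b| b.supports(format))
    }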


Section 06

Multi-model Concurrency and Intelligent Memory Management

Spindll supports loading multiple models into memory simultaneously and implements an LRU (Least Recently Used) eviction strategy. When the memory budget is exceeded, the system automatically unloads the least recently used models to keep higher-priority models available.
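A simplified sketch of that eviction pass: assume each resident model carries an estimated footprint and a last-use timestamp (the struct and field names here are made up for illustration).

    use std::collections::HashMap;
    use std::time::Instant;

    /// Bookkeeping for one resident model (fields are illustrative).
    struct LoadedModel {
        bytes: u64,         // estimated memory footprint
        last_used: Instant, // refreshed on every request
    }

    struct ModelPool {
        budget_bytes: u64,
        resident: HashMap<String, LoadedModel>,
    }

    impl ModelPool {
        /// Evict least-recently-used models until the pool fits its budget again.
        fn evict_until_within_budget(&mut self) {
            let mut used: u64 = self.resident.values().map(|m| m.bytes).sum();
            while used > self.budget_bytes {
                // Pick the model that has gone unused the longest.
                let victim = self
                    .resident
                    .iter()
                    .min_by_key(|(_, m)| m.last_used)
                    .map(|(name, _)| name.clone());
                match victim {
                    Some(name) => {
                        if let Some(model) = self.resident.remove(&name) {
                            used -= model.bytes;
                            // Real code would also release backend resources here.
                        }
                    }
                    None => break, // nothing left to evict
                }
            }
        }
    }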

The memory budget is configurable; by default, the system automatically detects available memory (on macOS, this includes the sum of free memory, inactive memory, purgeable memory, and speculative memory), and users can also manually specify the budget size (e.g., "8G").
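The "8G" budget string implies a small parser along the following lines; which suffixes are accepted beyond "G" is an assumption on my part.

    /// Parse a human-readable memory budget such as "8G" or "512M" into bytes.
    /// Plain numbers are treated as raw byte counts.
    fn parse_budget(s: &str) -> Option<u64> {
        let s = s.trim();
        let (digits, multiplier): (&str, u64) = match s.chars().last()? {
            'g' | 'G' => (&s[..s.len() - 1], 1024 * 1024 * 1024),
            'm' | 'M' => (&s[..s.len() - 1], 1024 * 1024),
            _ => (s, 1),
        };
        digits.parse::<u64>().ok().map(|n| n * multiplier)
    }

    fn main() {
        // "8G" -> Some(8589934592) bytes
        println!("{:?}", parse_budget("8G"));
    }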


Section 07

Continuous Batching and KV Caching

The system implements a continuous batching mechanism; concurrent requests for the same model share a single context via sequence IDs, significantly improving throughput.
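Sketched in Rust, the sequence-ID idea might look roughly like this; the types, the batch layout, and the decode_one hook are assumptions, not Spindll's actual internals.

    use std::collections::HashMap;

    /// One in-flight request multiplexed onto the shared model context.
    struct Sequence {
        prompt_tokens: Vec<u32>,
        generated: Vec<u32>,
        finished: bool,
    }

    /// A shared context that batches one decode step across all live sequences.
    struct BatchedContext {
        next_seq_id: u64,
        sequences: HashMap<u64, Sequence>,
    }

    impl BatchedContext {
        /// Admit a new request into the running batch and return its sequence id.
        fn add_request(&mut self, prompt_tokens: Vec<u32>) -> u64 {
            let id = self.next_seq_id;
            self.next_seq_id += 1;
            self.sequences.insert(
                id,
                Sequence { prompt_tokens, generated: Vec::new(), finished: false },
            );
            id
        }

        /// One decode step: every unfinished sequence gets a token slot, so new
        /// requests join the batch without waiting for earlier ones to finish.
        fn step(&mut self, decode_one: impl Fn(&Sequence) -> Option<u32>) {
            for seq in self.sequences.values_mut().filter(|s| !s.finished) {
                match decode_one(seq) {
                    Some(token) => seq.generated.push(token),
                    None => seq.finished = true, // hit EOS or the token limit
                }
            }
        }
    }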

The KV cache supports disk-backed prompt prefix caching, with optional ChaCha20-Poly1305 encryption of the cached data. This speeds up inference for repeated prompts while keeping sensitive cached content protected.
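Encrypting a cache entry with ChaCha20-Poly1305 could look like the snippet below, using the chacha20poly1305 crate. The on-disk layout (a 12-byte nonce prepended to the ciphertext) and the key handling are assumptions; only the cipher itself is taken from the post.

    use chacha20poly1305::{
        aead::{Aead, AeadCore, Error, KeyInit, OsRng},
        ChaCha20Poly1305, Key, Nonce,
    };

    /// Encrypt a serialized KV-cache blob before writing it to disk.
    /// Layout assumption: the 12-byte nonce is prepended to the ciphertext.
    fn seal_cache_entry(key: &Key, kv_blob: &[u8]) -> Result<Vec<u8>, Error> {
        let cipher = ChaCha20Poly1305::new(key);
        let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng); // unique per entry
        let mut out = nonce.to_vec();
        out.extend(cipher.encrypt(&nonce, kv_blob)?);
        Ok(out)
    }

    /// Decrypt an entry read back from disk; fails if the data was tampered with.
    fn open_cache_entry(key: &Key, stored: &[u8]) -> Result<Vec<u8>, Error> {
        let (nonce_bytes, ciphertext) = stored.split_at(12);
        let cipher = ChaCha20Poly1305::new(key);
        cipher.decrypt(Nonce::from_slice(nonce_bytes), ciphertext)
    }

    fn main() -> Result<(), Error> {
        let key = ChaCha20Poly1305::generate_key(&mut OsRng);
        let sealed = seal_cache_entry(&key, b"serialized kv prefix")?;
        let opened = open_cache_entry(&key, &sealed)?;
        assert_eq!(&opened[..], &b"serialized kv prefix"[..]);
        Ok(())
    }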


Section 08

OpenAI-compatible API

Spindll provides interfaces compatible with the OpenAI API, including:

  • /v1/chat/completions: Chat completion interface
  • /v1/completions: Text completion interface
  • Tool/Function Calling: Supported, enabling native integration with applications such as AnythingLLM and Open WebUI

This compatibility allows existing OpenAI clients to seamlessly migrate to Spindll, lowering the adoption barrier.
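For illustration, here is a client request against the chat completions endpoint written with reqwest; the base URL and port are placeholders rather than documented Spindll defaults.

    use serde_json::json;

    #[tokio::main]
    async fn main() -> Result<(), reqwest::Error> {
        // A standard OpenAI-style chat completion body; only the base URL differs
        // from talking to the real OpenAI API.
        let request = json!({
            "model": "llama3.1:8b",
            "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
            "stream": false
        });

        // Placeholder address: point this at wherever the Spindll HTTP server listens.
        let response = reqwest::Client::new()
            .post("http://localhost:8080/v1/chat/completions")
            .json(&request)
            .send()
            .await?
            .text()
            .await?;

        println!("{response}");
        Ok(())
    }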