Spindll: Rust-native High-Performance LLM Inference Engine, a Lightweight Alternative to Ollama

A single-binary LLM inference engine written in Rust, supporting gRPC and HTTP streaming inference, multi-model concurrency, GPU acceleration, and OpenAI API compatibility.

Tags: Rust, LLM inference, Ollama alternative, gRPC, MLX, GGUF, OpenAI API, local deployment, GPU acceleration
Published 2026-05-09 14:13 · Recent activity 2026-05-09 14:22 · Estimated read: 7 min

Section 01

Introduction

Spindll is a single-binary LLM inference engine written in Rust. It offers gRPC and HTTP streaming inference, multi-model concurrency, GPU acceleration, and an OpenAI-compatible API.


Section 02

Background: Evolution of Local LLM Inference Requirements

As large language models become more widespread, a growing number of developers and enterprises want to deploy and run LLM inference services locally. Ollama, a pioneer in this space, has won a broad user base with its simple user experience. As usage deepens, however, users are demanding more from inference engines in terms of performance, resource management, and scalability.

Rust, with its excellent performance and memory safety, has become an ideal choice for building high-performance system-level applications. The Spindll project is built on this technology stack to create a production-ready LLM inference engine.


Section 03

Project Overview

Spindll is an inference engine for GGUF- and MLX-format models, written natively in Rust and open-sourced by developer Iito. The name combines "Spindle" and "LL(ama)", reflecting its positioning as a model management and serving engine.

As a single-binary solution, Spindll can pull models from the Ollama registry or HuggingFace, manage local storage, and provide streaming inference services via gRPC and HTTP protocols. It supports concurrent loading of multiple models, memory-aware scheduling, GPU hardware acceleration, and provides an OpenAI-compatible API interface.

Notably, on Apple Silicon, Spindll runs MLX-format models natively via a Swift bridge while serving GGUF models through llama.cpp, so the two backends integrate seamlessly.


Section 04

Multi-source Model Pulling and Management

Spindll supports fetching models from multiple sources:

  • Ollama Registry: Compatible with Ollama's model naming conventions (e.g., llama3.1:8b, qwen2:0.5b)
  • HuggingFace Repository: Accepts HuggingFace repository references such as TheBloke/Llama-3-8B-GGUF
  • Intelligent Quantization Selection: When no quantization level is specified, the highest-priority available variant is chosen from the list q4_k_m > q5_k_m > q4_0 > ... > fp16, so q4_k_m is preferred by default

On Apple Silicon devices, the system automatically detects and prioritizes MLX format models, falling back to GGUF format when MLX is unavailable.
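A minimal sketch of this selection logic in Rust: the priority order comes from the list above, while the type and function names (ModelFormat, pick_artifact) are purely illustrative and not Spindll's actual API.

    // Quantization priority used when the user does not pin a level
    // (order taken from the section above; everything else is illustrative).
    const QUANT_PRIORITY: &[&str] = &["q4_k_m", "q5_k_m", "q4_0", "fp16"];

    #[derive(Debug, Clone, PartialEq)]
    enum ModelFormat {
        Mlx,
        Gguf,
    }

    /// Pick which artifact of a repository to download, given the
    /// (format, quantization) variants the source actually offers.
    fn pick_artifact(
        available: &[(ModelFormat, String)],
        on_apple_silicon: bool,
    ) -> Option<(ModelFormat, String)> {
        // On Apple Silicon, prefer an MLX variant when one exists.
        if on_apple_silicon {
            if let Some(hit) = available.iter().find(|(f, _)| *f == ModelFormat::Mlx) {
                return Some(hit.clone());
            }
        }
        // Otherwise (or as a fallback) walk the GGUF quantization priority list.
        for wanted in QUANT_PRIORITY {
            if let Some(hit) = available
                .iter()
                .find(|(f, q)| *f == ModelFormat::Gguf && q == wanted)
            {
                return Some(hit.clone());
            }
        }
        None
    }

    fn main() {
        let available = vec![
            (ModelFormat::Gguf, "q4_0".to_string()),
            (ModelFormat::Gguf, "q4_k_m".to_string()),
        ];
        // No MLX build is offered here, so the q4_k_m GGUF variant wins.
        println!("{:?}", pick_artifact(&available, true));
    }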


Section 05

Flexible Inference Backend Architecture

The project adopts a pluggable inference backend design, with unified scheduling via the InferenceBackend trait:

  • llama.cpp Backend: Supports GGUF format, runs cross-platform, and supports GPU offloading
  • mlx-swift-lm Backend: Integrated via Swift FFI, optimized for Apple Silicon, natively supports MLX format

This design leaves extension points for integrating additional inference engines in the future.
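The post names the InferenceBackend trait without showing it; below is a rough guess at what such a pluggable interface could look like. Method names and signatures are assumptions, not Spindll's actual definitions.

    use std::error::Error;
    use std::path::Path;

    /// Hypothetical unified interface over the llama.cpp and mlx-swift-lm backends.
    trait InferenceBackend: Send + Sync {
        /// Human-readable backend name, e.g. "llama.cpp" or "mlx-swift-lm".
        fn name(&self) -> &str;

        /// Whether this backend can serve the given model file format ("gguf", "mlx", ...).
        fn supports(&self, format: &str) -> bool;

        /// Load a model from disk and return an opaque handle (here just an id).
        fn load(&self, path: &Path) -> Result<u64, Box<dyn Error>>;

        /// Stream completion tokens for a prompt, invoking the callback once per token.
        fn generate(
            &self,
            model_id: u64,
            prompt: &str,
            on_token: &mut dyn FnMut(&str),
        ) -> Result<(), Box<dyn Error>>;

        /// Unload the model and free its resources (weights, KV cache, GPU buffers).
        fn unload(&self, model_id: u64);
    }

    /// The scheduler can then hold backends behind trait objects and route by format.
    fn pick_backend<'a>(
        backends: &'a [Box<dyn InferenceBackend>],
        format: &str,
    ) -> Option<&'a dyn InferenceBackend> {
        backends.iter().map(|b| b.as_ref()).find(|b| b.supports(format))
    }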


Section 06

Multi-model Concurrency and Intelligent Memory Management

Spindll supports loading multiple models into memory simultaneously and implements an LRU (Least Recently Used) eviction strategy. When the memory budget is exceeded, the system automatically unloads the least recently used models to keep higher-priority models available.
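A simplified sketch of that eviction pass: assume each resident model carries an estimated footprint and a last-use timestamp (the struct and field names here are made up for illustration).

    use std::collections::HashMap;
    use std::time::Instant;

    /// Bookkeeping for one resident model (fields are illustrative).
    struct LoadedModel {
        bytes: u64,         // estimated memory footprint
        last_used: Instant, // refreshed on every request
    }

    struct ModelPool {
        budget_bytes: u64,
        resident: HashMap<String, LoadedModel>,
    }

    impl ModelPool {
        /// Evict least-recently-used models until the pool fits its budget again.
        fn evict_until_within_budget(&mut self) {
            let mut used: u64 = self.resident.values().map(|m| m.bytes).sum();
            while used > self.budget_bytes {
                // Pick the model that has gone unused the longest.
                let victim = self
                    .resident
                    .iter()
                    .min_by_key(|(_, m)| m.last_used)
                    .map(|(name, _)| name.clone());
                match victim {
                    Some(name) => {
                        if let Some(model) = self.resident.remove(&name) {
                            used -= model.bytes;
                            // Real code would also release backend resources here.
                        }
                    }
                    None => break, // nothing left to evict
                }
            }
        }
    }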

The memory budget is configurable; by default, the system automatically detects available memory (on macOS, this includes the sum of free memory, inactive memory, purgeable memory, and speculative memory), and users can also manually specify the budget size (e.g., "8G").
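The "8G" budget string implies a small parser along the following lines; which suffixes are accepted beyond "G" is an assumption on my part.

    /// Parse a human-readable memory budget such as "8G" or "512M" into bytes.
    /// Plain numbers are treated as raw byte counts.
    fn parse_budget(s: &str) -> Option<u64> {
        let s = s.trim();
        let (digits, multiplier): (&str, u64) = match s.chars().last()? {
            'g' | 'G' => (&s[..s.len() - 1], 1024 * 1024 * 1024),
            'm' | 'M' => (&s[..s.len() - 1], 1024 * 1024),
            _ => (s, 1),
        };
        digits.parse::<u64>().ok().map(|n| n * multiplier)
    }

    fn main() {
        // "8G" -> Some(8589934592) bytes
        println!("{:?}", parse_budget("8G"));
    }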


Section 07

Continuous Batching and KV Caching

The system implements a continuous batching mechanism; concurrent requests for the same model share a single context via sequence IDs, significantly improving throughput.
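Sketched in Rust, the sequence-ID idea might look roughly like this; the types, the batch layout, and the decode_one hook are assumptions, not Spindll's actual internals.

    use std::collections::HashMap;

    /// One in-flight request multiplexed onto the shared model context.
    struct Sequence {
        prompt_tokens: Vec<u32>,
        generated: Vec<u32>,
        finished: bool,
    }

    /// A shared context that batches one decode step across all live sequences.
    struct BatchedContext {
        next_seq_id: u64,
        sequences: HashMap<u64, Sequence>,
    }

    impl BatchedContext {
        /// Admit a new request into the running batch and return its sequence id.
        fn add_request(&mut self, prompt_tokens: Vec<u32>) -> u64 {
            let id = self.next_seq_id;
            self.next_seq_id += 1;
            self.sequences.insert(
                id,
                Sequence { prompt_tokens, generated: Vec::new(), finished: false },
            );
            id
        }

        /// One decode step: every unfinished sequence gets a token slot, so new
        /// requests join the batch without waiting for earlier ones to finish.
        fn step(&mut self, decode_one: impl Fn(&Sequence) -> Option<u32>) {
            for seq in self.sequences.values_mut().filter(|s| !s.finished) {
                match decode_one(seq) {
                    Some(token) => seq.generated.push(token),
                    None => seq.finished = true, // hit EOS or the token limit
                }
            }
        }
    }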

The KV cache supports disk-backed prompt prefix caching, with optional ChaCha20-Poly1305 encryption of the cached data. This speeds up inference for repeated prompts while keeping sensitive cached content protected.
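Encrypting a cache entry with ChaCha20-Poly1305 could look like the snippet below, using the chacha20poly1305 crate. The on-disk layout (a 12-byte nonce prepended to the ciphertext) and the key handling are assumptions; only the cipher itself is taken from the post.

    use chacha20poly1305::{
        aead::{Aead, AeadCore, Error, KeyInit, OsRng},
        ChaCha20Poly1305, Key, Nonce,
    };

    /// Encrypt a serialized KV-cache blob before writing it to disk.
    /// Layout assumption: the 12-byte nonce is prepended to the ciphertext.
    fn seal_cache_entry(key: &Key, kv_blob: &[u8]) -> Result<Vec<u8>, Error> {
        let cipher = ChaCha20Poly1305::new(key);
        let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng); // unique per entry
        let mut out = nonce.to_vec();
        out.extend(cipher.encrypt(&nonce, kv_blob)?);
        Ok(out)
    }

    /// Decrypt an entry read back from disk; fails if the data was tampered with.
    fn open_cache_entry(key: &Key, stored: &[u8]) -> Result<Vec<u8>, Error> {
        let (nonce_bytes, ciphertext) = stored.split_at(12);
        let cipher = ChaCha20Poly1305::new(key);
        cipher.decrypt(Nonce::from_slice(nonce_bytes), ciphertext)
    }

    fn main() -> Result<(), Error> {
        let key = ChaCha20Poly1305::generate_key(&mut OsRng);
        let sealed = seal_cache_entry(&key, b"serialized kv prefix")?;
        let opened = open_cache_entry(&key, &sealed)?;
        assert_eq!(&opened[..], &b"serialized kv prefix"[..]);
        Ok(())
    }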


Section 08

OpenAI-compatible API

Spindll provides interfaces compatible with the OpenAI API, including:

  • /v1/chat/completions: Chat completion interface
  • /v1/completions: Text completion interface
  • Tool/Function Calling: Supported, enabling native integration with applications such as AnythingLLM and Open WebUI

This compatibility allows existing OpenAI clients to seamlessly migrate to Spindll, lowering the adoption barrier.
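For illustration, here is a client request against the chat completions endpoint written with reqwest; the base URL and port are placeholders rather than documented Spindll defaults.

    use serde_json::json;

    #[tokio::main]
    async fn main() -> Result<(), reqwest::Error> {
        // A standard OpenAI-style chat completion body; only the base URL differs
        // from talking to the real OpenAI API.
        let request = json!({
            "model": "llama3.1:8b",
            "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
            "stream": false
        });

        // Placeholder address: point this at wherever the Spindll HTTP server listens.
        let response = reqwest::Client::new()
            .post("http://localhost:8080/v1/chat/completions")
            .json(&request)
            .send()
            .await?
            .text()
            .await?;

        println!("{response}");
        Ok(())
    }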