Zing Forum

Server Nexe: A Complete Solution for Localized AI Servers

Server Nexe is a fully locally-run AI server with persistent memory, RAG retrieval, and multi-backend inference capabilities, ensuring users' conversations, documents, and model weights remain entirely on local devices.

Tags: Local AI, Privacy Protection, RAG, MLX, Ollama, Vector Database, Open Source Project
Published 2026-04-17 08:39 · Recent activity 2026-04-17 08:50 · Estimated read: 6 min

Section 01

Introduction / Main Post

Section 02

Project Origin and Philosophy

Server Nexe started with a simple yet profound question: "What does it take to have a local AI with persistent memory?" Since the author didn't plan to build an LLM from scratch, they began collecting various components to assemble a tool useful for their daily work.

What makes this project unusual is its development approach: the entire project (code, testing, auditing, documentation) is built by one person orchestrating different AI models, both local (MLX, Ollama) and cloud (Claude, GPT, Gemini, DeepSeek, Qwen, Grok). The human decides what to build, designs the architecture, reviews code, and runs tests, while the AI writes, audits, and stress-tests under human guidance.

From an initial experimental prototype, the project gradually evolved into a genuinely useful product: 4842 tests (about 85% coverage), security audits, encryption at rest, a macOS installer with hardware detection, and a plugin system.

Section 03

1. Zero Data Leakage

This is the most prominent feature of Server Nexe. All conversations, documents, embedding vectors, and model weights remain on the user's machine. There is no telemetry, no external calls, no cloud dependencies, and not even a monitoring server.

Section 04

2. Persistent Memory System

Server Nexe uses Qdrant vector search, combined with 768-dimensional embedding vectors, to store memory in three dedicated collections. The system can:

  • Automatically extract facts from conversations (names, jobs, preferences, projects)
  • Store information into memory within the same LLM call, with zero additional latency
  • Detect intent in three languages (Catalan/Spanish/English)
  • Perform semantic deduplication and voice-based deletion ("Forget that...")
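
The semantic-deduplication step can be sketched with plain cosine similarity (a minimal illustration only: the real system uses Qdrant collections with 768-dimensional embeddings, and the function names and threshold here are assumptions):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def add_memory(store, vector, fact, threshold=0.92):
    """Store a fact unless a semantically near-identical one exists.

    `store` is a plain list of (vector, fact) pairs standing in for
    what a Qdrant collection would do in Server Nexe (illustrative).
    """
    for vec, _ in store:
        if cosine(vec, vector) >= threshold:
            return False  # near-duplicate found: skip storing
    store.append((vector, fact))
    return True

store = []
add_memory(store, [1.0, 0.0], "User's name is Anna")
# A paraphrase with a very similar embedding is rejected as a duplicate.
added = add_memory(store, [0.99, 0.05], "The user is called Anna")
print(added, len(store))  # → False 1
```

A vector database performs the same nearest-neighbor check at scale; the linear scan here is only for readability.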
Section 05

3. Multi-Backend Inference Support

Users can freely switch between three inference backends by simply modifying the configuration file:

Backend | Platform | Best Use Case
MLX | macOS (Apple Silicon) | Recommended for Mac: native Metal GPU acceleration, fastest on M-series chips
llama.cpp | macOS / Linux | General purpose: GGUF format, supports Metal on Mac, CPU/CUDA on Linux
Ollama | macOS / Linux | Bridges an existing Ollama installation; simplest model management
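
Switching backends as described above might look like this in the configuration file (a hypothetical fragment: the post does not show the actual file format or key names):

```yaml
# Hypothetical Server Nexe config; keys and structure are illustrative.
inference:
  backend: mlx        # one of: mlx, llama_cpp, ollama
  model: gemma-3-12b  # a catalog model, an Ollama model name, or a Hugging Face GGUF
```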
Section 06

4. Intelligent Model Recommendation

The installer automatically organizes 16 catalog models into 4 tiers based on the machine's available RAM:

  • 8 GB Tier: Gemma 3 4B, Qwen3.5 4B, Qwen3 4B
  • 16 GB Tier: Gemma 4 E4B, Salamandra 7B, Qwen3.5 9B, Gemma 3 12B
  • 24 GB Tier: Gemma 4 31B, Qwen3 14B, GPT-OSS 20B
  • 32 GB Tier: Qwen3.5 27B, Gemma 3 27B, DeepSeek R1 32B, Qwen3.5 35B-A3B, ALIA-40B

Additionally, users can use any Ollama model by name, or any GGUF model from Hugging Face.
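
The RAM-based tier selection described above can be sketched as follows (the function name and exact boundary logic are assumptions inferred from the tier list, not the installer's real code):

```python
def recommend_tier(ram_gb: int) -> str:
    """Map available RAM to the highest catalog tier it can run.

    Tiers follow the 8/16/24/32 GB breakdown of the installer's
    model catalog (illustrative stand-in for the real logic).
    """
    for tier in (32, 24, 16, 8):
        if ram_gb >= tier:
            return f"{tier} GB Tier"
    return "below minimum (8 GB required)"

print(recommend_tier(18))  # → 16 GB Tier (an 18 GB machine fits the 16 GB tier)
```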

Section 07

5. Modular Plugin System

Server Nexe uses an auto-discovery plugin architecture: security, the Web UI, RAG, and the inference backends are all plugins. Through the NexeModule protocol and duck typing (no inheritance required), users can add features without touching the core code.
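
A duck-typed plugin contract of this kind might look like the sketch below (only the NexeModule name comes from the post; the member names are assumptions):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class NexeModule(Protocol):
    """Structural plugin contract: any class providing these members
    qualifies, with no inheritance required (duck typing)."""
    name: str

    def setup(self) -> None: ...

class RagPlugin:
    # Note: no subclassing of NexeModule, yet it satisfies the protocol.
    name = "rag"

    def setup(self) -> None:
        pass

plugin = RagPlugin()
print(isinstance(plugin, NexeModule))  # → True (structural check passes)
```

An auto-discovery loader can then scan a plugins directory and register every object that passes this `isinstance` check, which is what lets new features ship without core changes.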

Section 08

6. RAG Document Processing

Users can upload .txt, .md, or .pdf files, and the system will automatically index them for RAG. Each document is visible only in the session it was uploaded to, with no cross-contamination between sessions.
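
The per-session isolation can be illustrated with a session-keyed index (a toy stand-in: Server Nexe presumably scopes its vector-store queries by session, and all names here are assumptions):

```python
from collections import defaultdict

class SessionScopedIndex:
    """Toy RAG index where each document is visible only in the
    session that uploaded it (illustrative, not the real storage)."""

    def __init__(self):
        self._docs = defaultdict(list)  # session_id -> list of document texts

    def upload(self, session_id: str, text: str) -> None:
        self._docs[session_id].append(text)

    def search(self, session_id: str, query: str) -> list[str]:
        # Naive keyword match, restricted to the caller's own session.
        return [d for d in self._docs[session_id] if query.lower() in d.lower()]

index = SessionScopedIndex()
index.upload("session-a", "Qdrant stores embedding vectors")
index.upload("session-b", "MLX runs on Apple Silicon")
print(index.search("session-a", "qdrant"))  # → ['Qdrant stores embedding vectors']
print(index.search("session-b", "qdrant"))  # → [] (no cross-contamination)
```

In a vector database the same effect is typically achieved by attaching a session identifier to each stored point and filtering on it at query time.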