Zing Forum

Server Nexe: A Complete Solution for Localized AI Servers

Server Nexe is a fully locally-run AI server with persistent memory, RAG retrieval, and multi-backend inference capabilities, ensuring users' conversations, documents, and model weights remain entirely on local devices.

Tags: Local AI, Privacy Protection, RAG, MLX, Ollama, Vector Database, Open Source Project
Published 2026-04-17 08:39 · Recent activity 2026-04-17 08:50 · Estimated read: 6 min

Section 01

Introduction / Main Post

Section 02

Project Origin and Philosophy

Server Nexe started with a simple yet profound question: "What does it take to have a local AI with persistent memory?" Since the author didn't plan to build an LLM from scratch, they began collecting various components to assemble a tool useful for their daily work.

What makes this project unusual is its development approach: the entire project (code, testing, auditing, documentation) is built by one person orchestrating different AI models, both local (MLX, Ollama) and cloud (Claude, GPT, Gemini, DeepSeek, Qwen, Grok). The human decides what to build, designs the architecture, reviews code, and runs tests, while the AI writes, audits, and stress-tests under human guidance.

From an initial experimental prototype, the project gradually evolved into a genuinely useful product: 4842 tests (about 85% coverage), security audits, encryption at rest, a macOS installer with hardware detection, and a plugin system.

Section 03

1. Zero Data Leakage

This is the most prominent feature of Server Nexe. All conversations, documents, embedding vectors, and model weights remain on the user's machine. There is no telemetry, no external calls, no cloud dependencies, and not even a monitoring server.

Section 04

2. Persistent Memory System

Server Nexe uses Qdrant vector search, combined with 768-dimensional embedding vectors, to store memory in three dedicated collections. The system can:

  • Automatically extract facts from conversations (names, jobs, preferences, projects)
  • Store information into memory within the same LLM call, with zero additional latency
  • Detect intent in three languages (Catalan/Spanish/English)
  • Perform semantic deduplication and voice-based deletion ("Forget that...")
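
The semantic-deduplication step can be sketched with plain cosine similarity (a minimal illustration only: the real system uses Qdrant collections with 768-dimensional embeddings, and the function names and threshold here are assumptions):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def add_memory(store, vector, fact, threshold=0.92):
    """Store a fact unless a semantically near-identical one exists.

    `store` is a plain list of (vector, fact) pairs standing in for
    what a Qdrant collection would do in Server Nexe (illustrative).
    """
    for vec, _ in store:
        if cosine(vec, vector) >= threshold:
            return False  # near-duplicate found: skip storing
    store.append((vector, fact))
    return True

store = []
add_memory(store, [1.0, 0.0], "User's name is Anna")
# A paraphrase with a very similar embedding is rejected as a duplicate.
added = add_memory(store, [0.99, 0.05], "The user is called Anna")
print(added, len(store))  # → False 1
```

A vector database performs the same nearest-neighbor check at scale; the linear scan here is only for readability.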
Section 05

3. Multi-Backend Inference Support

Users can freely switch between three inference backends by simply modifying the configuration file:

Backend | Platform | Best Use Case
MLX | macOS (Apple Silicon) | Recommended for Mac: native Metal GPU acceleration, fastest on M-series chips
llama.cpp | macOS / Linux | General purpose: GGUF format, supports Metal on Mac, CPU/CUDA on Linux
Ollama | macOS / Linux | Bridges an existing Ollama installation; simplest model management
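
Switching backends as described above might look like this in the configuration file (a hypothetical fragment: the post does not show the actual file format or key names):

```yaml
# Hypothetical Server Nexe config; keys and structure are illustrative.
inference:
  backend: mlx        # one of: mlx, llama_cpp, ollama
  model: gemma-3-12b  # a catalog model, an Ollama model name, or a Hugging Face GGUF
```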
Section 06

4. Intelligent Model Recommendation

The installer automatically organizes 16 catalog models into 4 tiers based on the machine's available RAM:

  • 8 GB Tier: Gemma 3 4B, Qwen3.5 4B, Qwen3 4B
  • 16 GB Tier: Gemma 4 E4B, Salamandra 7B, Qwen3.5 9B, Gemma 3 12B
  • 24 GB Tier: Gemma 4 31B, Qwen3 14B, GPT-OSS 20B
  • 32 GB Tier: Qwen3.5 27B, Gemma 3 27B, DeepSeek R1 32B, Qwen3.5 35B-A3B, ALIA-40B

Additionally, users can use any Ollama model by name, or any GGUF model from Hugging Face.
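
The RAM-based tier selection described above can be sketched as follows (the function name and exact boundary logic are assumptions inferred from the tier list, not the installer's real code):

```python
def recommend_tier(ram_gb: int) -> str:
    """Map available RAM to the highest catalog tier it can run.

    Tiers follow the 8/16/24/32 GB breakdown of the installer's
    model catalog (illustrative stand-in for the real logic).
    """
    for tier in (32, 24, 16, 8):
        if ram_gb >= tier:
            return f"{tier} GB Tier"
    return "below minimum (8 GB required)"

print(recommend_tier(18))  # → 16 GB Tier (an 18 GB machine fits the 16 GB tier)
```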

Section 07

5. Modular Plugin System

Server Nexe uses an auto-discovery plugin architecture: security, the Web UI, RAG, and the inference backends are all plugins. Through the NexeModule protocol and duck typing (no inheritance required), users can add features without touching the core code.
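
A duck-typed plugin contract of this kind might look like the sketch below (only the NexeModule name comes from the post; the member names are assumptions):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class NexeModule(Protocol):
    """Structural plugin contract: any class providing these members
    qualifies, with no inheritance required (duck typing)."""
    name: str

    def setup(self) -> None: ...

class RagPlugin:
    # Note: no subclassing of NexeModule, yet it satisfies the protocol.
    name = "rag"

    def setup(self) -> None:
        pass

plugin = RagPlugin()
print(isinstance(plugin, NexeModule))  # → True (structural check passes)
```

An auto-discovery loader can then scan a plugins directory and register every object that passes this `isinstance` check, which is what lets new features ship without core changes.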

Section 08

6. RAG Document Processing

Users can upload .txt, .md, or .pdf files, and the system will automatically index them for RAG. Each document is visible only in the session it was uploaded to, with no cross-contamination between sessions.
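
The per-session isolation can be illustrated with a session-keyed index (a toy stand-in: Server Nexe presumably scopes its vector-store queries by session, and all names here are assumptions):

```python
from collections import defaultdict

class SessionScopedIndex:
    """Toy RAG index where each document is visible only in the
    session that uploaded it (illustrative, not the real storage)."""

    def __init__(self):
        self._docs = defaultdict(list)  # session_id -> list of document texts

    def upload(self, session_id: str, text: str) -> None:
        self._docs[session_id].append(text)

    def search(self, session_id: str, query: str) -> list[str]:
        # Naive keyword match, restricted to the caller's own session.
        return [d for d in self._docs[session_id] if query.lower() in d.lower()]

index = SessionScopedIndex()
index.upload("session-a", "Qdrant stores embedding vectors")
index.upload("session-b", "MLX runs on Apple Silicon")
print(index.search("session-a", "qdrant"))  # → ['Qdrant stores embedding vectors']
print(index.search("session-b", "qdrant"))  # → [] (no cross-contamination)
```

In a vector database the same effect is typically achieved by attaching a session identifier to each stored point and filtering on it at query time.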