Zing Forum


Tri-Tier Private AI Architecture: Enabling Secure Integration of Local and Cloud Intelligence with Zero Public Network Exposure

tri-tier-private-ai is a self-hosted, privacy-first AI stack that uses keyword routing to direct sensitive prompts to local models and complex reasoning tasks to the cloud, with zero public network exposure. The project gives individuals and small teams enterprise-grade privacy protection at a cost of approximately $8-12 per month.

Tags: Privacy Protection · Local AI · Cloud Routing · Keyword Filtering · Zero Data Retention · Tailscale · Ollama · LiteLLM · Self-Hosted · Tiered Architecture
Published 2026-04-18 12:08 · Recent activity 2026-04-18 12:23 · Estimated read: 8 min

Section 01

Tri-Tier Private AI Architecture: Enabling Secure Integration of Local and Cloud Intelligence with Zero Public Network Exposure

tri-tier-private-ai is a self-hosted, privacy-first AI stack that uses keyword routing to direct sensitive prompts to local models and complex reasoning tasks to the cloud, with zero public network exposure. At approximately $8-12 per month, it gives individuals and small teams enterprise-grade privacy protection, resolving the dilemma between the privacy of local models and the intelligence of cloud models.


Section 02

Background: The Dilemma Between Privacy and Intelligence

In large language model applications, users face a fundamental dilemma: local models preserve privacy but sacrifice intelligence, while cloud APIs provide powerful reasoning but require entrusting them with sensitive data. tri-tier-private-ai proposes a tri-tier architecture with intelligent routing that lets users balance local privacy and cloud intelligence in the same workflow, with self-hosting costs held to approximately $8-12 per month.


Section 03

Methodology: Tri-Tier Architecture and Core Components

The project's core insight is that different prompts require different processing levels: sensitive content is handled locally, while complex tasks are routed to the cloud. The architecture consists of four layers:

  • Control Layer (OpenClaw): orchestrator and UI responsible for task distribution; reached only over the Tailscale private network, with zero public network exposure.
  • Routing Layer (LiteLLM): an open-source model-routing proxy that decides each prompt's processing path from keyword rules; a zero-cost key component.
  • Private Layer (Ollama + Gemma4 E4B): runs a local model of roughly 4 billion parameters (4-bit quantization occupies about 3.8 GB); handles daily conversations and sensitive data, which never leave the VPS.
  • Intelligence Layer (Together AI Qwen-2.5-72B): a 72-billion-parameter model with 128K context that supports Zero Data Retention (ZDR) and handles non-sensitive complex tasks.
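The wiring of the layers can be sketched as a LiteLLM-style model list in Python. This is an illustrative approximation of LiteLLM's `model_list` configuration shape, not the project's actual config; the model names, aliases, and port are placeholders.

```python
# Illustrative wiring of the Private and Intelligence layers as a
# LiteLLM-style model list (shape approximates LiteLLM's config;
# names, aliases, and ports are placeholders, not the project's values).
MODEL_LIST = [
    {   # Private Layer: Ollama bound to loopback, reachable only inside the VPS
        "model_name": "local-private",
        "litellm_params": {
            "model": "ollama/gemma-e4b",
            "api_base": "http://127.0.0.1:11434",
        },
    },
    {   # Intelligence Layer: Together AI cloud model with account-level ZDR
        "model_name": "cloud-smart",
        "litellm_params": {
            "model": "together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo",
        },
    },
]

def endpoint_for(model_name: str) -> dict:
    """Resolve the routing target a request from OpenClaw would hit."""
    return next(m["litellm_params"] for m in MODEL_LIST
                if m["model_name"] == model_name)

print(endpoint_for("local-private")["api_base"])   # → http://127.0.0.1:11434
```

OpenClaw only ever talks to the LiteLLM endpoint; which of these two entries serves a request is decided by the routing hook described next.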

Section 04

Methodology: Keyword Interception Logic for Sensitive Content

The core of the system's privacy protection is the keyword interception logic defined in router_hook.py. The default keywords cover finance/tax, identity/PII, documents, credentials, medical, legal, and privacy-marker categories (e.g., tax, ssn, password, medical). When a prompt is submitted, LiteLLM scans its content: if it contains a sensitive keyword, the request is redirected to the local Ollama; otherwise it is sent to Together AI. This hard-blocks sensitive data from ever reaching the cloud.
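The interception logic attributed to router_hook.py can be sketched as follows. The keyword list and request shape here are illustrative, not the project's exact code; matching is naive substring search, which mirrors the simplicity (and the bypass risk) the article later notes.

```python
# Sketch of the keyword interception logic described for router_hook.py.
# Keyword list and request shape are illustrative, not the project's code.
PRIVATE_KEYWORDS = {
    "tax", "salary",          # finance / tax
    "ssn", "passport",        # identity / PII
    "password", "api key",    # credentials
    "medical", "diagnosis",   # medical
    "confidential",           # privacy markers
}

LOCAL_MODEL = "ollama/gemma-local"          # Private Layer
CLOUD_MODEL = "together_ai/qwen-2.5-72b"    # Intelligence Layer

def route_request(request: dict) -> dict:
    """Redirect a chat request to the local model if any sensitive keyword appears."""
    text = " ".join(m["content"].lower() for m in request["messages"])
    # Naive substring scan: a hit means the prompt never leaves the VPS.
    if any(kw in text for kw in PRIVATE_KEYWORDS):
        request["model"] = LOCAL_MODEL
    else:
        request["model"] = CLOUD_MODEL
    return request

req = {"messages": [{"role": "user", "content": "My tax file is private"}]}
print(route_request(req)["model"])   # → ollama/gemma-local
```

Because the check runs in the proxy, before any upstream call is made, a keyword hit prevents the prompt from being transmitted at all rather than merely flagging it afterwards.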


Section 05

Security Measures: Multi-Layer Network Isolation and Zero Data Retention

The project adopts multi-layer security strategies:

  • Firewall: UFW denies inbound traffic by default, allowing only SSH and Tailscale.
  • Container Isolation: Ollama and LiteLLM are bound to 127.0.0.1, listening only on the local loopback.
  • Tailscale Private Network: All access is via an encrypted mesh network, with the internal IP as the only entry point.
  • Zero Data Retention: Together AI account-level ZDR settings disable prompt storage and training; the system reinforces this protection via the X-Together-No-Store request header.
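The last point can be sketched as a helper that builds the outbound request headers. The X-Together-No-Store header name comes from the article; the header value and the helper itself are illustrative assumptions, not Together AI's documented API.

```python
# Sketch: per-request reinforcement of Together AI's account-level ZDR.
# The header name is taken from the article; the value "true" and this
# helper are illustrative assumptions, not a documented Together AI API.
def zdr_headers(api_key: str) -> dict:
    """Headers for a Together AI call with the no-store reinforcement attached."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-Together-No-Store": "true",   # belt-and-braces on top of account ZDR
    }

print(zdr_headers("sk-demo")["X-Together-No-Store"])   # → true
```

Sending the header on every request means a misconfigured account setting still leaves a second line of defense.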

Section 06

Deployment Process and Cost Analysis

Deployment Process: provision an Ubuntu 22.04 VPS (at least 4 GB RAM); install Docker and Tailscale; configure the .env file (LiteLLM master key, Together AI API key); start the services and pull the Gemma4 E4B model; then point OpenClaw at the LiteLLM endpoint and enable Together AI ZDR. Cost: a Hetzner CX21 VPS is approximately $10/month; Together AI charges $0.9 per million tokens for both input and output; the open-source components are free. At moderate usage (500,000 tokens/month), the total is approximately $10-12/month.
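The quoted total can be checked with back-of-envelope arithmetic using the article's own prices:

```python
# Back-of-envelope check of the article's cost figures.
def monthly_total(cloud_tokens: int, vps: float = 10.0,
                  price_per_m: float = 0.9) -> float:
    """VPS rent plus Together AI token charges at the quoted per-million rate."""
    return vps + cloud_tokens / 1_000_000 * price_per_m

# Moderate usage from the article: 500,000 cloud tokens/month.
print(f"${monthly_total(500_000):.2f}/month")   # → $10.45/month
```

That lands comfortably inside the quoted $10-12/month range, with the spread left for VAT, heavier token months, or a larger VPS.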


Section 07

Testing & Validation and Extension Customization

Testing & Validation:

  • Sensitive content test: a curl request containing "my tax file is private" should show sensitive-keyword detection in the logs.
  • Non-sensitive test: a request to explain the transformer mechanism should be routed to Together AI.

Extension Customization:

  • Custom keywords: edit PRIVATE_KEYWORDS in router_hook.py and restart LiteLLM.
  • Model replacement: Ollama supports many local models; LiteLLM supports over 100 cloud providers.
  • Custom routing logic: modify router_hook.py to implement more complex strategies (e.g., by user identity or request frequency).
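The custom-keyword step can be sketched as follows. PRIVATE_KEYWORDS is the name the article gives; the default and added keyword values here are illustrative.

```python
# Sketch of extending the PRIVATE_KEYWORDS set consulted by router_hook.py.
# The variable name is from the article; the keyword values are illustrative.
PRIVATE_KEYWORDS = {"tax", "ssn", "password", "medical"}     # defaults (subset)
PRIVATE_KEYWORDS |= {"payroll", "invoice", "patient id"}     # team-specific additions

def is_sensitive(prompt: str) -> bool:
    """True if the prompt should be pinned to the local model."""
    text = prompt.lower()
    return any(kw in text for kw in PRIVATE_KEYWORDS)

print(is_sensitive("Q3 payroll summary"))    # → True  (custom keyword intercepted)
print(is_sensitive("explain transformers"))  # → False (routed to the cloud)
```

After editing the set, restart LiteLLM so the proxy reloads router_hook.py and picks up the new list.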

Section 08

Limitations and Future Improvement Directions

Limitations:

  • Keyword routing is imperfect: it may miss detections or be bypassed, so high-security scenarios need more sophisticated detection.
  • Local models have limited capability; on complex tasks they still lag behind large cloud models.

Future Directions: introduce an intelligent content-classification model, automatic selection among multiple local models, audit logs and compliance reports, and a friendlier management interface.