Zing Forum


Tri-Tier Private AI Architecture: Enabling Secure Integration of Local and Cloud Intelligence with Zero Public Network Exposure

tri-tier-private-ai is a self-hosted, privacy-first AI stack that uses keyword routing to direct sensitive prompts to local models and complex reasoning tasks to the cloud, with zero public network exposure. The project gives individuals and small teams enterprise-grade privacy protection at a cost of approximately $8-12 per month.

Tags: Privacy Protection · Local AI · Cloud Routing · Keyword Filtering · Zero Data Retention · Tailscale · Ollama · LiteLLM · Self-Hosted · Tiered Architecture
Published 2026-04-18 12:08 · Recent activity 2026-04-18 12:23 · Estimated read: 8 min

Section 01

Tri-Tier Private AI Architecture: Enabling Secure Integration of Local and Cloud Intelligence with Zero Public Network Exposure

tri-tier-private-ai is a self-hosted, privacy-first AI stack that uses keyword routing to direct sensitive prompts to local models and complex reasoning tasks to the cloud, with zero public network exposure. At approximately $8-12 per month, it gives individuals and small teams enterprise-grade privacy protection, resolving the dilemma between the privacy of local models and the intelligence of cloud models.


Section 02

Background: The Dilemma Between Privacy and Intelligence

In large language model applications, users face a fundamental dilemma: local models preserve privacy but sacrifice intelligence, while cloud APIs provide powerful reasoning but require entrusting them with sensitive data. tri-tier-private-ai proposes a tri-tier architecture with intelligent routing that lets users balance local privacy and cloud intelligence in the same workflow, with self-hosting costs held to approximately $8-12 per month.


Section 03

Methodology: Tri-Tier Architecture and Core Components

The project's core insight is that different prompts require different processing levels: sensitive content is handled locally, while complex tasks are routed to the cloud. The architecture consists of four layers:

  • Control Layer (OpenClaw): orchestrator and UI responsible for task distribution; reached only over the Tailscale private network, with zero public network exposure.
  • Routing Layer (LiteLLM): an open-source model-routing proxy that decides each prompt's processing path from keyword rules; a zero-cost key component.
  • Private Layer (Ollama + Gemma4 E4B): runs a local model of roughly 4 billion parameters (4-bit quantization occupies about 3.8 GB); handles daily conversations and sensitive data, which never leave the VPS.
  • Intelligence Layer (Together AI Qwen-2.5-72B): a 72-billion-parameter model with 128K context that supports Zero Data Retention (ZDR) and handles non-sensitive complex tasks.
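The wiring of the layers can be sketched as a LiteLLM-style model list in Python. This is an illustrative approximation of LiteLLM's `model_list` configuration shape, not the project's actual config; the model names, aliases, and port are placeholders.

```python
# Illustrative wiring of the Private and Intelligence layers as a
# LiteLLM-style model list (shape approximates LiteLLM's config;
# names, aliases, and ports are placeholders, not the project's values).
MODEL_LIST = [
    {   # Private Layer: Ollama bound to loopback, reachable only inside the VPS
        "model_name": "local-private",
        "litellm_params": {
            "model": "ollama/gemma-e4b",
            "api_base": "http://127.0.0.1:11434",
        },
    },
    {   # Intelligence Layer: Together AI cloud model with account-level ZDR
        "model_name": "cloud-smart",
        "litellm_params": {
            "model": "together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo",
        },
    },
]

def endpoint_for(model_name: str) -> dict:
    """Resolve the routing target a request from OpenClaw would hit."""
    return next(m["litellm_params"] for m in MODEL_LIST
                if m["model_name"] == model_name)

print(endpoint_for("local-private")["api_base"])   # → http://127.0.0.1:11434
```

OpenClaw only ever talks to the LiteLLM endpoint; which of these two entries serves a request is decided by the routing hook described next.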

Section 04

Methodology: Keyword Interception Logic for Sensitive Content

The core of the system's privacy protection is the keyword interception logic defined in router_hook.py. The default keywords cover finance/tax, identity/PII, documents, credentials, medical, legal, and privacy-marker categories (e.g., tax, ssn, password, medical). When a prompt is submitted, LiteLLM scans its content: if it contains a sensitive keyword, the request is redirected to the local Ollama; otherwise it is sent to Together AI. This hard-blocks sensitive data from ever reaching the cloud.
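The interception logic attributed to router_hook.py can be sketched as follows. The keyword list and request shape here are illustrative, not the project's exact code; matching is naive substring search, which mirrors the simplicity (and the bypass risk) the article later notes.

```python
# Sketch of the keyword interception logic described for router_hook.py.
# Keyword list and request shape are illustrative, not the project's code.
PRIVATE_KEYWORDS = {
    "tax", "salary",          # finance / tax
    "ssn", "passport",        # identity / PII
    "password", "api key",    # credentials
    "medical", "diagnosis",   # medical
    "confidential",           # privacy markers
}

LOCAL_MODEL = "ollama/gemma-local"          # Private Layer
CLOUD_MODEL = "together_ai/qwen-2.5-72b"    # Intelligence Layer

def route_request(request: dict) -> dict:
    """Redirect a chat request to the local model if any sensitive keyword appears."""
    text = " ".join(m["content"].lower() for m in request["messages"])
    # Naive substring scan: a hit means the prompt never leaves the VPS.
    if any(kw in text for kw in PRIVATE_KEYWORDS):
        request["model"] = LOCAL_MODEL
    else:
        request["model"] = CLOUD_MODEL
    return request

req = {"messages": [{"role": "user", "content": "My tax file is private"}]}
print(route_request(req)["model"])   # → ollama/gemma-local
```

Because the check runs in the proxy, before any upstream call is made, a keyword hit prevents the prompt from being transmitted at all rather than merely flagging it afterwards.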


Section 05

Security Measures: Multi-Layer Network Isolation and Zero Data Retention

The project adopts multi-layer security strategies:

  • Firewall: UFW denies inbound traffic by default, allowing only SSH and Tailscale.
  • Container Isolation: Ollama and LiteLLM are bound to 127.0.0.1, listening only on the local loopback.
  • Tailscale Private Network: All access is via an encrypted mesh network, with the internal IP as the only entry point.
  • Zero Data Retention: Together AI account-level ZDR settings disable prompt storage and training; the system reinforces this protection via the X-Together-No-Store request header.
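The last point can be sketched as a helper that builds the outbound request headers. The X-Together-No-Store header name comes from the article; the header value and the helper itself are illustrative assumptions, not Together AI's documented API.

```python
# Sketch: per-request reinforcement of Together AI's account-level ZDR.
# The header name is taken from the article; the value "true" and this
# helper are illustrative assumptions, not a documented Together AI API.
def zdr_headers(api_key: str) -> dict:
    """Headers for a Together AI call with the no-store reinforcement attached."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-Together-No-Store": "true",   # belt-and-braces on top of account ZDR
    }

print(zdr_headers("sk-demo")["X-Together-No-Store"])   # → true
```

Sending the header on every request means a misconfigured account setting still leaves a second line of defense.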

Section 06

Deployment Process and Cost Analysis

Deployment Process: provision an Ubuntu 22.04 VPS (at least 4 GB RAM); install Docker and Tailscale; configure the .env file (LiteLLM master key, Together AI API key); start the services and pull the Gemma4 E4B model; then point OpenClaw at the LiteLLM endpoint and enable Together AI ZDR. Cost: a Hetzner CX21 VPS is approximately $10/month; Together AI charges $0.9 per million tokens for both input and output; the open-source components are free. At moderate usage (500,000 tokens/month), the total is approximately $10-12/month.
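The quoted total can be checked with back-of-envelope arithmetic using the article's own prices:

```python
# Back-of-envelope check of the article's cost figures.
def monthly_total(cloud_tokens: int, vps: float = 10.0,
                  price_per_m: float = 0.9) -> float:
    """VPS rent plus Together AI token charges at the quoted per-million rate."""
    return vps + cloud_tokens / 1_000_000 * price_per_m

# Moderate usage from the article: 500,000 cloud tokens/month.
print(f"${monthly_total(500_000):.2f}/month")   # → $10.45/month
```

That lands comfortably inside the quoted $10-12/month range, with the spread left for VAT, heavier token months, or a larger VPS.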


Section 07

Testing & Validation and Extension Customization

Testing & Validation:

  • Sensitive content test: a curl request containing "my tax file is private" should show sensitive-keyword detection in the logs.
  • Non-sensitive test: a request to explain the transformer mechanism should be routed to Together AI.

Extension Customization:

  • Custom keywords: edit PRIVATE_KEYWORDS in router_hook.py and restart LiteLLM.
  • Model replacement: Ollama supports many local models; LiteLLM supports over 100 cloud providers.
  • Custom routing logic: modify router_hook.py to implement more complex strategies (e.g., by user identity or request frequency).
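The custom-keyword step can be sketched as follows. PRIVATE_KEYWORDS is the name the article gives; the default and added keyword values here are illustrative.

```python
# Sketch of extending the PRIVATE_KEYWORDS set consulted by router_hook.py.
# The variable name is from the article; the keyword values are illustrative.
PRIVATE_KEYWORDS = {"tax", "ssn", "password", "medical"}     # defaults (subset)
PRIVATE_KEYWORDS |= {"payroll", "invoice", "patient id"}     # team-specific additions

def is_sensitive(prompt: str) -> bool:
    """True if the prompt should be pinned to the local model."""
    text = prompt.lower()
    return any(kw in text for kw in PRIVATE_KEYWORDS)

print(is_sensitive("Q3 payroll summary"))    # → True  (custom keyword intercepted)
print(is_sensitive("explain transformers"))  # → False (routed to the cloud)
```

After editing the set, restart LiteLLM so the proxy reloads router_hook.py and picks up the new list.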

Section 08

Limitations and Future Improvement Directions

Limitations:

  • Keyword routing is imperfect: it may miss detections or be bypassed, so high-security scenarios need more sophisticated detection.
  • Local models have limited capability; on complex tasks they still lag behind large cloud models.

Future Directions: introduce an intelligent content-classification model, automatic selection among multiple local models, audit logs and compliance reports, and a friendlier management interface.