Tenchi-MCP: A Hybrid Cloud-Edge LLM Inference Orchestrator Based on the MCP Protocol

Tenchi-MCP is an open-source hybrid inference orchestration tool that seamlessly integrates cloud-based large language models (LLMs) with local Ollama models via the Model Context Protocol (MCP), enabling intelligent task distribution and balancing cost optimization, data privacy protection, and inference efficiency.

Tags: MCP, LLM, Ollama, hybrid inference, local models, cloud models, Rust, privacy protection, cost optimization, Claude Code

Published 2026-05-19 13:40 | Recent activity 2026-05-19 13:48 | Estimated read 6 min

Section 01

Tenchi-MCP Guide: An Open-Source Solution for Hybrid Cloud-Edge LLM Inference Orchestration

Tenchi-MCP is an open-source hybrid inference orchestration tool based on the MCP protocol. By integrating cloud-based large models (such as Gemini and Claude) with local Ollama models, it enables intelligent task distribution and balances cost optimization, data privacy protection, and inference efficiency. Key advantages include zero-intrusion integration with mainstream AI development tools, flexible multi-model role configuration, and offline support.

Section 02

Project Background and Core Trade-offs

As LLMs become part of everyday development workflows, developers face a dilemma. Cloud-based models are powerful, but they come with high token costs and data privacy risks. Local Ollama models are free and keep data on the machine, but their inference speed is limited by hardware and they lack a standardized integration interface. Tenchi-MCP (Tian Di-MCP), built with Rust, aims to resolve this tension by orchestrating cloud and edge models together through the MCP protocol.

Section 03

Technical Architecture and Core Mechanisms

MCP Protocol and Zero-Intrusion Integration

MCP is an open protocol launched by Anthropic that standardizes interactions between AI models and external tools. As an MCP server, Tenchi-MCP supports mainstream tools like Claude Code and Gemini CLI, allowing developers to connect to local models without modifying their existing workflows.
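To make the server role concrete, here is a minimal sketch of the kind of JSON-RPC 2.0 message an MCP client sends when it invokes a tool on a server. The `tools/call` method and its `name` and `arguments` parameters come from the MCP specification. The tool name `ask_local_model` and its argument fields are hypothetical, since the source does not list the tools Tenchi-MCP actually exposes.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    1,
    "ask_local_model",
    {"role": "coder", "prompt": "Review this function for bugs."},
)
print(request)
```

In practice the client (Claude Code, Gemini CLI) builds and sends such messages itself; the developer only registers the server.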

Intelligent Task Distribution Strategy

The roles and task descriptions of local models are declared in models_config.toml. Based on these descriptions, the cloud-side agent can decide on its own where to route each task: for example, sensitive code reviews go to a local Qwen Coder model, while general Q&A stays with the cloud model. This balances security and performance.
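The routing idea can be sketched as a small decision function. This is a deliberately simplified stand-in: in Tenchi-MCP the decision is described as being made by the cloud side from the role descriptions, not by a hard-coded rule. The `Task` type, its fields, and the `route` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str        # e.g. "code_review" or "general_qa" (illustrative labels)
    sensitive: bool  # True if the input must not leave the machine


def route(task: Task) -> str:
    """Return "local" or "cloud" for a task (simplified stand-in rule)."""
    if task.sensitive:
        return "local"  # private material stays on the local Ollama model
    return "cloud"      # everything else goes to the cloud model


print(route(Task(kind="code_review", sensitive=True)))
print(route(Task(kind="general_qa", sensitive=False)))
```

A real policy would likely weigh more signals (task size, model availability, latency), but the sensitivity check captures the privacy side of the trade-off.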

Multi-Model Role-Based Configuration

Tenchi-MCP supports defining roles such as Coder (low temperature for more deterministic output), Expert (moderate temperature to balance creativity and accuracy), and Lite (small context window for resource-constrained environments). Each role can set its own system prompt, sampling parameters, and hardware resource allocation.
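One way to picture these roles is as a small table of per-role settings. The sketch below assumes a `Role` record with a system prompt, a temperature, and a context window; the field names, prompts, and every numeric value are illustrative assumptions, not project defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    system_prompt: str
    temperature: float   # sampling temperature
    context_window: int  # maximum context length in tokens


# All values below are illustrative assumptions, not project defaults.
ROLES = {
    "coder": Role("coder", "You are a careful code reviewer.", 0.1, 8192),
    "expert": Role("expert", "You are a domain expert.", 0.6, 8192),
    "lite": Role("lite", "Answer briefly.", 0.3, 2048),
}


def pick_role(name: str) -> Role:
    """Look up a role, falling back to the small Lite role if unknown."""
    return ROLES.get(name, ROLES["lite"])


print(pick_role("coder"))
```

Keeping the roles as data rather than code mirrors the configuration-driven approach the article describes.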

Section 04

Practical Application Scenarios and Value

Privacy-Sensitive Development Scenarios

When handling proprietary enterprise code or sensitive data, inference runs entirely on the local machine, so the data is never sent to a cloud provider and cannot leak through that channel.

Cost Optimization

Delegating simple tasks (such as code formatting and syntax checking) to local models can cut cloud costs by roughly 30%-60%.
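The arithmetic behind such a saving can be sketched as follows. The model assumes a single flat price per million tokens and treats local inference as free (hardware and electricity are ignored); the token counts and the price are hypothetical numbers chosen only to show the calculation.

```python
def cloud_cost(tokens: int, price_per_million: float) -> float:
    """Cost of sending `tokens` tokens to a cloud model."""
    return tokens / 1_000_000 * price_per_million


def savings_fraction(total_tokens: int, delegated_tokens: int,
                     price_per_million: float) -> float:
    """Fraction of the cloud bill avoided by running delegated tokens locally."""
    before = cloud_cost(total_tokens, price_per_million)
    after = cloud_cost(total_tokens - delegated_tokens, price_per_million)
    return (before - after) / before


# Hypothetical month: 10M tokens in total, 4M of them simple tasks moved to Ollama.
print(savings_fraction(10_000_000, 4_000_000, price_per_million=3.0))
```

Under these assumptions the saving equals the share of tokens moved off the cloud, so reaching the upper end of the quoted range would require delegating more than half of the token volume.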

Offline Support

Tenchi-MCP automatically switches to local models when the network is unstable, so development can continue without interruption.
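A fallback of this kind can be sketched with a plain try/except. This is not the project's implementation (which is written in Rust); the function names and the two stand-in backends are hypothetical, and a real version would call a cloud API and an Ollama endpoint instead.

```python
from typing import Callable


def infer_with_fallback(prompt: str,
                        cloud: Callable[[str], str],
                        local: Callable[[str], str]) -> str:
    """Try the cloud backend first; fall back to the local one on network errors."""
    try:
        return cloud(prompt)
    except (ConnectionError, TimeoutError):
        return local(prompt)


# Stand-in backends for illustration only.
def offline_cloud(prompt: str) -> str:
    raise ConnectionError("network unreachable")


def local_model(prompt: str) -> str:
    return f"[local] {prompt}"


print(infer_with_fallback("format this file", offline_cloud, local_model))
```

Catching only network-related exceptions keeps genuine bugs in the cloud path visible instead of silently masking them with local answers.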

Section 05

Installation and Configuration Practice

Installation Methods

Gemini CLI: gemini extensions install https://github.com/DovahkiinYuzuko/Tenchi-MCP --ref v0.1.2
Claude Code/Codex CLI: Clone the repository and compile: git clone https://github.com/DovahkiinYuzuko/Tenchi-MCP && cd Tenchi-MCP && cargo build --release

Configuration File Structure

models_config.toml includes: global configuration (Ollama address, timeout), model definitions (roles, priorities), inference parameters (temperature, etc.), and resource control (GPU layers, CPU threads).

Section 06

Limitations and Notes

Local model inference speed depends on hardware: running a 70B-parameter model on a consumer-grade CPU may take tens of seconds per response.
Cross-platform verification: the current version has mainly been verified on Windows 11; macOS and Linux support has not yet been tested on real machines.

Section 07

Summary and Outlook

Tenchi-MCP enables cloud-edge collaboration through intelligent orchestration, providing a practical tool for cost control, privacy protection, and offline availability. As local models (such as Llama 3 and Qwen2.5) improve in capability and the MCP ecosystem matures, hybrid inference may well become mainstream in AI-assisted development. For developers who care about data sovereignty and cost control, Tenchi-MCP is worth exploring.
