Zing Forum


ai-demo1: Complete Local AI Production Stack Reproduction, End-to-End Practice from OAuth to Observability

A local development lab that fully reproduces a production-grade AI chat stack: OAuth authentication, an LLM inference proxy, MCP tool calling, and OpenTelemetry tracing, all running locally.

Pydantic AI · OAuth · AI Gateway · MCP · OpenTelemetry · Microservices · LLM Inference · Local Development · Observability
Published 2026-04-05 13:43 · Recent activity 2026-04-05 13:50 · Estimated read 8 min

Section 01

ai-demo1: Guide to Complete Reproduction of Local AI Production Stack

This article introduces the ai-demo1 project, a local development lab that fully reproduces a production-grade AI chat stack: OAuth authentication, an LLM inference proxy, MCP tool calling, and OpenTelemetry tracing. All services run locally, resolving the dilemma developers face: cloud-hosted services are fast to adopt but offer little system visibility, while building everything from scratch carries very high engineering complexity.


Section 02

Project Background: Why Do We Need a Local AI Production Stack

When developing AI applications, developers often face a choice: cloud-hosted services are fast to adopt but offer little system visibility, while building from scratch carries very high engineering complexity. Production AI systems involve components such as authentication and authorization, a model inference proxy, tool calling, and tracing; integrating and debugging these typically requires repeated deployment to remote environments. ai-demo1 provides a local alternative where all services run on localhost (no external dependencies except an xAI API key), allowing developers to understand, debug, and extend each component in a controlled environment.


Section 03

Architecture and Technical Approach

ai-demo1 uses a microservices architecture with 4 core services:

  1. oauth-idp (port 9000): A custom OAuth2 identity provider that implements the Authorization Code + PKCE flow and issues JWTs using RS256.
  2. chat-back (port 8100): An AI inference proxy that provides an OpenAI-compatible API and routes requests to upstream LLMs like xAI/Copilot.
  3. mcp-gw (port 8200): An MCP tool gateway that provides mock tools (e.g., weather query) to test tool calling flows.
  4. chat-front (port 8300): A chat agent built on Pydantic AI, responsible for OAuth authentication, calling chat-back, and executing MCP tool calls.

Technology stack: Python 3.12 + uv (fast package management), FastAPI + uvicorn (HTTP services), Pydantic AI (agent building), the MCP protocol (tool interaction), and OpenTelemetry (tracing).
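The Authorization Code + PKCE flow implemented by oauth-idp starts with the client generating a code verifier and its S256 challenge. A minimal sketch of that step, following RFC 7636 (the function name and structure here are illustrative, not taken from ai-demo1's code):

```python
# Sketch of PKCE code_verifier / code_challenge generation (RFC 7636, S256).
# In the full flow, the challenge goes in the authorization request and the
# verifier is sent later when exchanging the code for a token.
import base64
import hashlib
import secrets


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) per RFC 7636."""
    # 32 random bytes -> 43-character URL-safe verifier (padding stripped)
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    # challenge = BASE64URL(SHA256(verifier)), also without padding
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


verifier, challenge = make_pkce_pair()
```

Because the server only ever sees the challenge up front, an attacker who intercepts the authorization code still cannot redeem it without the verifier, which is why PKCE suits local clients like chat-front that cannot keep a client secret.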

Section 04

Operation Flow and Test Evidence

Complete Request Lifecycle:

  1. User Authentication: chat-front initiates the PKCE flow to obtain an access token.
  2. Chat Request: The user inputs a question, and chat-front constructs a request and sends it to chat-back.
  3. Model Routing: chat-back forwards the request to the corresponding LLM provider based on the model prefix.
  4. Tool Calling Decision: The LLM determines whether to call a tool (e.g., weather query), outputs a tool_call request specifying the get_weather tool, and provides parameters like {"location": "Beijing"}.
  5. Tool Execution: chat-front interacts with mcp-gw via MCP to get results such as {"temperature":22, "condition":"sunny"}.
  6. Final Response: chat-front appends the tool result to the conversation, and the LLM generates a natural language reply.
  7. Tracing: Each step emits OTel data, which is visualized in Grafana Tempo.

Local Development Workflow: Use launch.sh to start and stop services and to check status and logs.

Testing System: Unit tests (pytest) cover all services; integration tests (21 scenarios) verify end-to-end flows (OAuth, inference, tool calling, etc.).
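Steps 4 through 6 of the lifecycle can be sketched as a small tool-dispatch loop: the LLM emits a tool_call, the client executes the named tool, and the JSON result is appended to the conversation as a tool message. The registry and message shapes below are illustrative, not ai-demo1's actual wire format:

```python
# Illustrative sketch of tool-call handling (lifecycle steps 4-6).
# get_weather stands in for the mock weather tool served by mcp-gw.
import json


def get_weather(location: str) -> dict:
    # Mock implementation; a real call would go over MCP to mcp-gw.
    return {"temperature": 22, "condition": "sunny"}


TOOLS = {"get_weather": get_weather}


def handle_tool_call(tool_call: dict) -> dict:
    """Execute a tool_call message and return a tool-result message
    suitable for appending to the conversation history."""
    fn = TOOLS[tool_call["name"]]
    result = fn(**json.loads(tool_call["arguments"]))
    return {
        "role": "tool",
        "name": tool_call["name"],
        "content": json.dumps(result),
    }


# Step 4: the LLM's tool_call request; step 5: execute and wrap the result.
msg = handle_tool_call(
    {"name": "get_weather", "arguments": '{"location": "Beijing"}'}
)
```

In step 6, chat-front would append `msg` to the message list and send the conversation back through chat-back so the LLM can produce the final natural-language reply.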

Section 05

Engineering Value and Insights

The value of ai-demo1:

  • Production Preview: Provides a production-equivalent local environment to validate architecture, test failures, optimize performance, and experiment with technology choices.
  • Educational Significance: Demonstrates AI system component decomposition, interaction protocols, Python async best practices, and observability implementation.
  • Extension Foundation: Modular design supports adding new tools, LLM providers, authentication schemes, or front-end interfaces.

Section 06

Limitations and Future Directions

Current Limitations:

  • Lack of Persistent Storage: User data/session state is stored in memory and lost on restart.
  • Insufficient Horizontal Scaling: Runs as a single process, no consideration for multi-instance deployment.
  • Security Configuration to Be Optimized: Self-signed certificates and hard-coded keys need to be replaced with production-grade solutions.
  • Missing Monitoring and Alerts: Only tracing is available; metric collection and alerting mechanisms are absent.

Future Directions: Address the above production gaps to help developers move from demo to production.

Section 07

Conclusion: Local-First AI Development Paradigm

ai-demo1 represents the "local-first" AI development paradigm: before deploying to the cloud, build, test, and understand the system in a controlled local environment. This paradigm matters more as AI engineering complexity grows: it lets developers become masters of their systems rather than being constrained by black-box services. Developers who want to dive deep into AI system architecture, or who are planning AI infrastructure, are encouraged to study this reference implementation.