Zing Forum


ai-demo1: Complete Local AI Production Stack Reproduction, End-to-End Practice from OAuth to Observability

A local development lab that fully reproduces a production-grade AI chat stack: OAuth authentication, an LLM inference proxy, MCP tool calling, and OpenTelemetry tracing, all running locally.

Pydantic AI · OAuth · AI Gateway · MCP · OpenTelemetry · Microservices · LLM Inference · Local Development · Observability
Published 2026-04-05 13:43 · Recent activity 2026-04-05 13:50 · Estimated read 8 min

Section 01

ai-demo1: Guide to Complete Reproduction of Local AI Production Stack

This article introduces the ai-demo1 project, a local development lab that fully reproduces a production-grade AI chat stack: OAuth authentication, an LLM inference proxy, MCP tool calling, and OpenTelemetry tracing. All services run locally, resolving the dilemma developers face: cloud-hosted services are fast to adopt but offer little system visibility, while building everything from scratch carries very high engineering complexity.


Section 02

Project Background: Why Do We Need a Local AI Production Stack

When developing AI applications, developers often face a choice: cloud-hosted services are fast to adopt but offer little system visibility, while building from scratch carries very high engineering complexity. Production AI systems involve components such as authentication and authorization, a model inference proxy, tool calling, and tracing; integrating and debugging these typically requires repeated deployment to remote environments. ai-demo1 provides a local alternative where all services run on localhost (no external dependencies except an xAI API key), allowing developers to understand, debug, and extend each component in a controlled environment.


Section 03

Architecture and Technical Approach

ai-demo1 uses a microservices architecture with 4 core services:

  1. oauth-idp (port 9000): A custom OAuth2 identity provider that implements the Authorization Code + PKCE flow and issues JWTs using RS256.
  2. chat-back (port 8100): An AI inference proxy that provides an OpenAI-compatible API and routes requests to upstream LLMs like xAI/Copilot.
  3. mcp-gw (port 8200): An MCP tool gateway that provides mock tools (e.g., weather query) to test tool calling flows.
  4. chat-front (port 8300): A chat agent built on Pydantic AI, responsible for OAuth authentication, calling chat-back, and executing MCP tool calls.

Technology stack: Python 3.12 + uv (fast package management), FastAPI + uvicorn (HTTP services), Pydantic AI (agent building), the MCP protocol (tool interaction), and OpenTelemetry (tracing).
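The Authorization Code + PKCE flow implemented by oauth-idp starts with the client generating a code verifier and its S256 challenge. A minimal sketch of that step, following RFC 7636 (the function name and structure here are illustrative, not taken from ai-demo1's code):

```python
# Sketch of PKCE code_verifier / code_challenge generation (RFC 7636, S256).
# In the full flow, the challenge goes in the authorization request and the
# verifier is sent later when exchanging the code for a token.
import base64
import hashlib
import secrets


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) per RFC 7636."""
    # 32 random bytes -> 43-character URL-safe verifier (padding stripped)
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    # challenge = BASE64URL(SHA256(verifier)), also without padding
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


verifier, challenge = make_pkce_pair()
```

Because the server only ever sees the challenge up front, an attacker who intercepts the authorization code still cannot redeem it without the verifier, which is why PKCE suits local clients like chat-front that cannot keep a client secret.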

Section 04

Operation Flow and Test Evidence

Complete Request Lifecycle:

  1. User Authentication: chat-front initiates the PKCE flow to obtain an access token.
  2. Chat Request: The user inputs a question, and chat-front constructs a request and sends it to chat-back.
  3. Model Routing: chat-back forwards the request to the corresponding LLM provider based on the model prefix.
  4. Tool Calling Decision: The LLM determines whether to call a tool (e.g., weather query), outputs a tool_call request specifying the get_weather tool, and provides parameters like {"location": "Beijing"}.
  5. Tool Execution: chat-front interacts with mcp-gw via MCP to get results such as {"temperature":22, "condition":"sunny"}.
  6. Final Response: chat-front appends the tool result to the conversation, and the LLM generates a natural language reply.
  7. Tracing: Each step emits OTel data, which is visualized in Grafana Tempo.

Local Development Workflow: Use launch.sh to start and stop services and to check status and logs.

Testing System: Unit tests (pytest) cover all services; integration tests (21 scenarios) verify end-to-end flows (OAuth, inference, tool calling, etc.).
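Steps 4 through 6 of the lifecycle can be sketched as a small tool-dispatch loop: the LLM emits a tool_call, the client executes the named tool, and the JSON result is appended to the conversation as a tool message. The registry and message shapes below are illustrative, not ai-demo1's actual wire format:

```python
# Illustrative sketch of tool-call handling (lifecycle steps 4-6).
# get_weather stands in for the mock weather tool served by mcp-gw.
import json


def get_weather(location: str) -> dict:
    # Mock implementation; a real call would go over MCP to mcp-gw.
    return {"temperature": 22, "condition": "sunny"}


TOOLS = {"get_weather": get_weather}


def handle_tool_call(tool_call: dict) -> dict:
    """Execute a tool_call message and return a tool-result message
    suitable for appending to the conversation history."""
    fn = TOOLS[tool_call["name"]]
    result = fn(**json.loads(tool_call["arguments"]))
    return {
        "role": "tool",
        "name": tool_call["name"],
        "content": json.dumps(result),
    }


# Step 4: the LLM's tool_call request; step 5: execute and wrap the result.
msg = handle_tool_call(
    {"name": "get_weather", "arguments": '{"location": "Beijing"}'}
)
```

In step 6, chat-front would append `msg` to the message list and send the conversation back through chat-back so the LLM can produce the final natural-language reply.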

Section 05

Engineering Value and Insights

The value of ai-demo1:

  • Production Preview: Provides a production-equivalent local environment to validate architecture, test failures, optimize performance, and experiment with technology choices.
  • Educational Significance: Demonstrates AI system component decomposition, interaction protocols, Python async best practices, and observability implementation.
  • Extension Foundation: Modular design supports adding new tools, LLM providers, authentication schemes, or front-end interfaces.

Section 06

Limitations and Future Directions

Current Limitations:

  • Lack of Persistent Storage: User data/session state is stored in memory and lost on restart.
  • Insufficient Horizontal Scaling: Runs as a single process, no consideration for multi-instance deployment.
  • Security Configuration to Be Optimized: Self-signed certificates and hard-coded keys need to be replaced with production-grade solutions.
  • Missing Monitoring and Alerts: Only tracing is available; metric collection and alerting mechanisms are absent.

Future Directions: Address the above production gaps to help developers move from demo to production.

Section 07

Conclusion: Local-First AI Development Paradigm

ai-demo1 represents the "local-first" AI development paradigm: before deploying to the cloud, build, test, and understand the system in a controlled local environment. This paradigm matters more as AI engineering complexity grows: it lets developers become masters of their systems rather than being constrained by black-box services. Developers who want to dive deep into AI system architecture, or who are planning AI infrastructure, are encouraged to study this reference implementation.