Core Positioning and Architectural Philosophy
Lumen is designed as an LLM inference control plane, not an inference engine itself. Built on FastAPI, it exposes an OpenAI-compatible HTTP API while routing actual requests to self-hosted backend inference services. This layered architecture has two advantages: front-end applications can switch from OpenAI to a private deployment without modification, and the backend inference engine can be selected or replaced as needed. The control-plane design also centralizes and standardizes model governance, traffic management, and monitoring.
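The routing step at the heart of this design can be sketched in a few lines. This is a minimal illustration, not Lumen's actual code: the model-to-backend table, backend URLs, and helper names below are all hypothetical.

```python
# Hypothetical sketch of control-plane routing: the front door accepts an
# OpenAI-style request and decides which self-hosted backend should serve it.

BACKENDS = {
    # model name -> base URL of the inference engine serving it (illustrative)
    "llama-3-8b": "http://vllm-node-1:8000/v1",
    "qwen2-7b": "http://tgi-node-2:8080/v1",
}

def resolve_backend(model: str) -> str:
    """Map the requested model name to the backend that serves it."""
    try:
        return BACKENDS[model]
    except KeyError:
        # Unknown models are a client error, mirroring OpenAI's behavior.
        raise ValueError(f"model '{model}' is not served by any backend")

def route(request: dict) -> dict:
    """Build a forwarding plan for a chat-completion request."""
    backend = resolve_backend(request["model"])
    return {"url": f"{backend}/chat/completions", "payload": request}

if __name__ == "__main__":
    plan = route({"model": "llama-3-8b",
                  "messages": [{"role": "user", "content": "hi"}]})
    print(plan["url"])
```

Because the request and response shapes on both sides are OpenAI-compatible, the control plane can forward payloads largely unchanged, adding policy (auth, quotas, logging) without translating formats.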
OpenAI-Compatible API Design
Lumen implements the core endpoints of the OpenAI API specification, including chat completion, text completion, and embedding generation. This compatibility means existing OpenAI client libraries, SDKs, and third-party tools can interact with Lumen directly, without any code modifications. The API supports streaming responses, delivering token-by-token output over the SSE protocol, which is critical for interactive applications. Lumen also implements the model-list and metadata endpoints, allowing clients to discover available models dynamically.
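The streaming format mentioned above follows the OpenAI convention: each SSE event carries a `data:` line with a JSON chunk containing a token delta, and a literal `[DONE]` sentinel ends the stream. The sketch below shows client-side consumption of such a stream; the sample payloads are illustrative, not captured from Lumen.

```python
import json

def parse_sse_stream(lines):
    """Yield content deltas from the 'data: ...' lines of an SSE response."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data: "):].strip()
        if data == "[DONE]":
            break  # sentinel marking the end of the stream
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Illustrative stream, shaped like OpenAI chat-completion chunks.
stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_stream(stream)))
```

Because this wire format is what OpenAI clients already parse, an application can point its existing SDK at Lumen's base URL and receive streamed tokens with no changes to its consumption logic.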