Zing Forum

Duplex: A Local-First Multi-Model Parallel Inference Engine

A privacy-first client application that supports simultaneous connections to local Ollama and multiple cloud-based large model APIs, enabling true parallel inference and real-time comparison.

Tags: LLM, multi-model inference, Ollama, privacy-first, React, TypeScript, open-source tools, AI development
Published 2026-06-07 23:27 | Recent activity 2026-06-07 23:52 | Estimated read 10 min

Section 01

Duplex: Introduction to the Local-First Multi-Model Parallel Inference Engine

Duplex is a local-first engine for running several large language models in parallel. It connects to a local Ollama instance and to multiple cloud LLM APIs at the same time, sends them the same prompt, and shows the responses side by side as they stream in. The project is developed and maintained by Ryuk1811 and is open source on GitHub (https://github.com/Ryuk1811/Duplex) under the MIT License. Its core philosophy is privacy-first: all application state is persisted in the browser's localStorage, there is no external database or telemetry, and conversation data stays on the device unless a prompt is explicitly sent to a cloud provider. Duplex targets two problems: the trade-off between the privacy of local models and the capability of cloud models, and the slow process of testing models one at a time. That makes it useful for tasks such as model selection and prompt engineering.

Section 02

Background: Why Do We Need Multi-Model Parallel Inference?

Developers working with large language models often face a dilemma: use local models to protect privacy, or use cloud APIs for stronger performance. Models also differ by task (code generation, logical reasoning, creative writing, and so on). The traditional workflow tests each model in turn, which is slow and makes side-by-side comparison difficult. Duplex was created to solve this problem: it sends the same prompt to multiple models simultaneously and shows how their responses differ in real time in a single interface.

Section 03

Project Overview: What Is Duplex?

Duplex is an offline-first, multiplexed LLM inference engine that lets engineers and researchers run the same prompt against several models in parallel and in real time. It supports both locally hosted models (e.g., Ollama, LM Studio, vLLM) and cloud models (e.g., OpenAI, Anthropic, Gemini, Groq). Its core philosophy is privacy-first: all configuration (model selection, theme, layout) is stored in the browser's localStorage, and there is no backend service. The application can run offline, and only requests explicitly sent to a cloud provider leave the device.

Section 04

Core Features and Technical Highlights

True Multiplexed Inference

Streams inference results from up to three models at once and displays the outputs side by side, which helps with model selection, prompt engineering, and performance benchmarking.
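
The fan-out idea can be sketched in TypeScript as follows. This is a minimal illustration, not code from the Duplex repository: `ModelRunner`, `PaneResult`, `MAX_PANES`, and `runInParallel` are hypothetical names, and each runner stands in for whatever function actually calls a model. Streaming is left out to keep the sketch short.

```typescript
// Hypothetical types: none of these names come from the Duplex codebase.
type ModelRunner = (prompt: string) => Promise<string>;

interface PaneResult {
  model: string;
  ok: boolean;
  output: string; // response text, or the error message when ok is false
}

const MAX_PANES = 3; // the article states a limit of three concurrent models

// Send one prompt to every runner at once. Each promise is wrapped so that
// a failure in one model does not reject the whole batch.
async function runInParallel(
  prompt: string,
  runners: { model: string; run: ModelRunner }[],
): Promise<PaneResult[]> {
  if (runners.length > MAX_PANES) {
    throw new Error(`at most ${MAX_PANES} models can run side by side`);
  }
  const pending = runners.map(({ model, run }) =>
    Promise.resolve()
      .then(() => run(prompt))
      .then(
        (output): PaneResult => ({ model, ok: true, output }),
        (err: unknown): PaneResult => ({
          model,
          ok: false,
          output: err instanceof Error ? err.message : String(err),
        }),
      ),
  );
  // All requests are already in flight here; Promise.all only waits for them.
  return Promise.all(pending);
}
```

Catching errors per runner means a slow or failing cloud endpoint does not hide the results from the other panes, which seems a reasonable default for a comparison view.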

Fully Private Local State

No dependency on backend services; all configurations are stored in localStorage, protecting privacy and supporting offline operation.
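
A minimal sketch of localStorage-backed settings is shown below. The `Settings` shape, the `duplex:settings` key, and the default values are assumptions for illustration only; the project's real schema may differ. The storage is passed in through a small interface that `window.localStorage` satisfies, so the same functions can be exercised outside a browser.

```typescript
// The subset of the Web Storage API used here; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical settings shape, loosely based on the items the article lists
// (model selection, theme, layout).
interface Settings {
  theme: "light" | "dark";
  layout: "side-by-side" | "stacked";
  models: string[];
}

const STORAGE_KEY = "duplex:settings"; // assumed key name
const DEFAULTS: Settings = { theme: "dark", layout: "side-by-side", models: [] };

// Read settings, falling back to defaults when nothing is stored or the
// stored JSON is corrupted.
function loadSettings(store: KeyValueStore): Settings {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return { ...DEFAULTS };
  try {
    return { ...DEFAULTS, ...(JSON.parse(raw) as Partial<Settings>) };
  } catch {
    return { ...DEFAULTS };
  }
}

// Serialize the whole settings object under a single key.
function saveSettings(store: KeyValueStore, settings: Settings): void {
  store.setItem(STORAGE_KEY, JSON.stringify(settings));
}
```

In a browser, the calls would be `loadSettings(localStorage)` and `saveSettings(localStorage, next)`.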

Cross-Platform Compatibility

Connects to local instances (e.g., Ollama) or to cloud providers via API keys, and supports custom OpenAI-compatible endpoints (e.g., Perplexity).
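
To illustrate what an OpenAI-compatible endpoint means in practice, the sketch below builds a chat completion request that can target either a local server or a cloud provider by changing only the base URL and key. The `Endpoint` and `ChatRequest` types and the `buildChatRequest` helper are hypothetical, not Duplex's API. The `/chat/completions` path, the Bearer header, and the `stream` flag follow the OpenAI chat completions format, which Ollama also exposes under `/v1`.

```typescript
// Hypothetical description of one configured endpoint.
interface Endpoint {
  baseUrl: string; // e.g. "http://localhost:11434/v1" for a local Ollama
  apiKey?: string; // omitted for local servers that need no key
  model: string;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build a streaming chat completion request in the OpenAI wire format.
function buildChatRequest(endpoint: Endpoint, prompt: string): ChatRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (endpoint.apiKey) {
    headers["Authorization"] = `Bearer ${endpoint.apiKey}`;
  }
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${endpoint.baseUrl.replace(/\/+$/, "")}/chat/completions`,
    headers,
    body: JSON.stringify({
      model: endpoint.model,
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  };
}
```

Providers that use their own native request format would need a separate adapter; this sketch covers only the OpenAI-compatible case.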

Real-Time Diagnostic Engine

Renders performance metrics in real time, including Time to First Token (TTFT) and throughput in tokens per second (TPS), so that model response speed can be evaluated quantitatively.
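
The two metrics can be computed from a handful of timestamps, as in the sketch below. The article does not say how Duplex defines them, so the definitions here are assumptions: TTFT is the delay between sending the request and receiving the first token, and TPS is the token count divided by the time between the first and last token. `StreamTiming`, `StreamMetrics`, and `computeMetrics` are hypothetical names.

```typescript
// Timestamps in milliseconds, e.g. taken with performance.now() in a browser.
interface StreamTiming {
  requestSentMs: number;
  firstTokenMs: number;
  lastTokenMs: number;
  tokenCount: number;
}

interface StreamMetrics {
  ttftMs: number; // Time to First Token
  tokensPerSecond: number; // throughput over the generation window
}

function computeMetrics(t: StreamTiming): StreamMetrics {
  const ttftMs = t.firstTokenMs - t.requestSentMs;
  const generationSeconds = (t.lastTokenMs - t.firstTokenMs) / 1000;
  // Guard against a zero-length window (a single token or an instant reply).
  const tokensPerSecond = generationSeconds > 0 ? t.tokenCount / generationSeconds : 0;
  return { ttftMs, tokensPerSecond };
}

// 100 tokens, first token after 250 ms, last token 2 s later:
// TTFT is 250 ms and throughput is 50 tokens per second.
const example = computeMetrics({
  requestSentMs: 0,
  firstTokenMs: 250,
  lastTokenMs: 2250,
  tokenCount: 100,
});
```

Measuring throughput from the first token rather than from the request keeps network and queueing latency inside TTFT, so the two numbers describe different things.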

Modular Rendering Layout

Provides several view options, including side-by-side comparison, responsive scaling, and a toggle between Markdown and simplified rendering, so the layout can adapt to different needs.

Section 05

Technical Architecture Analysis

Duplex is built on a modern front-end stack:

| Component | Technology | Purpose |
| --- | --- | --- |
| Framework | React 18 + Vite | Core execution environment |
| Language | TypeScript | Strongly typed logic layer |
| Styling | Tailwind CSS | Responsive UI |
| Routing | React Router DOM | Client-side routing |
| Animation | Motion (Framer Motion) | Smooth visual transitions |
| Storage | localStorage | Client-side persistence |

The technology selection reflects a focus on performance and user experience: Vite enables fast development, TypeScript ensures code quality, Tailwind CSS allows flexible styling, and Framer Motion adds animation effects.

Section 06

Use Cases and Practical Value

Model Selection and Evaluation

When selecting a model for a specific scenario, you can run a set of prompts against the candidate models at the same time and compare output quality, response speed, and cost to support the decision.

Prompt Engineering Optimization

See immediately how the same prompt performs across different models, then adjust the prompt's structure accordingly to get more consistent, higher-quality outputs.

Hybrid Local and Cloud Deployment

Compare the performance of local and cloud models to determine which tasks can be handled locally and which need to call cloud APIs, balancing privacy and capability.

Teaching and Demonstration

The side-by-side comparison view suits teaching, where it helps students understand how models differ. It also works as a demonstration tool for showing non-technical audiences how differently models behave.

Section 07

Deployment and Usage Guide

Duplex is optimized for Netlify edge delivery. The steps to run it locally and deploy it are as follows:

  1. Clone the repository and install dependencies
  2. Run npm run dev to start the development server
  3. To use local Ollama, start it with OLLAMA_ORIGINS="*" ollama serve so that the browser is allowed to make cross-origin requests to it
  4. Push to GitHub and import the repository into Netlify for automatic deployment

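The project documentation emphasizes configuring cross-origin requests correctly, because a browser-based client cannot reach Ollama without it. Note that "*" allows requests from any origin; OLLAMA_ORIGINS also accepts specific origins, which is the narrower setting when only one known page needs access.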
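
One way to confirm that step 3 worked is to probe Ollama from the page. The sketch below is a hypothetical helper, not part of Duplex: it calls Ollama's `/api/tags` endpoint, which lists locally installed models, on the default port 11434. The fetch function is passed in as a parameter so the helper does not depend on a browser global.

```typescript
// Minimal shape of the fetch function this sketch needs; the browser's
// global fetch satisfies it.
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Hypothetical helper: returns the names of installed models, or null when
// Ollama cannot be reached (not running, wrong port, or blocked by CORS).
async function probeOllama(
  fetchFn: FetchLike,
  baseUrl: string = "http://localhost:11434",
): Promise<string[] | null> {
  try {
    const res = await fetchFn(`${baseUrl}/api/tags`);
    if (!res.ok) return null;
    const body = (await res.json()) as { models?: { name: string }[] };
    return (body.models ?? []).map((m) => m.name);
  } catch {
    // A CORS rejection surfaces in the browser as a failed fetch.
    return null;
  }
}
```

In a browser, `probeOllama(fetch)` resolving to null while Ollama is running is a hint that OLLAMA_ORIGINS does not include the page's origin.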

Section 08

Summary and Outlook

Duplex represents an important direction for AI tooling: using the capabilities of large models while keeping full control over your data. Its parallel multi-model inference speeds up development, and its built-in TTFT and TPS metrics give model evaluation a quantitative basis.

As local models (e.g., Llama, Mistral) become more capable and more cloud APIs become available, a tool like Duplex becomes more useful: developers can combine local and cloud models flexibly instead of committing to one or the other.

For developers interested in AI application development, prompt engineering, or model evaluation, Duplex is an open-source project worth exploring and contributing to.