# Duplex: A Local-First Multi-Model Parallel Inference Engine

> A privacy-first client application that supports simultaneous connections to local Ollama and multiple cloud-based large model APIs, enabling true parallel inference and real-time comparison.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-07T15:27:34.000Z
- Last activity: 2026-06-07T15:52:14.751Z
- Popularity: 159.6
- Keywords: LLM, multi-model inference, Ollama, privacy-first, React, TypeScript, open-source tools, AI development
- Page link: https://www.zingnex.cn/en/forum/thread/duplex
- Canonical: https://www.zingnex.cn/forum/thread/duplex

---

## Duplex: Introduction to the Local-First Multi-Model Parallel Inference Engine

Duplex is a local-first engine for multi-model parallel inference. It connects to a local Ollama instance and to several cloud LLM APIs at the same time, so one prompt can run against all of them in parallel and the outputs can be compared in real time. The project is developed and maintained by Ryuk1811 and is open source on GitHub (https://github.com/Ryuk1811/Duplex) under the MIT License. Its core philosophy is privacy-first: all application state is persisted in the browser's localStorage, there is no external database or telemetry, and conversation data stays on the device unless the user sends a prompt to a cloud provider. Duplex targets two problems developers face: the trade-off between the privacy of local models and the performance of cloud models, and the slow process of testing models one at a time. This makes it an efficient tool for model selection and prompt engineering.

## Background: Why Do We Need Multi-Model Parallel Inference?

Developers working with large language models often face a dilemma: use local models to protect privacy, or use cloud APIs for stronger performance. Models also differ by task (code generation, logical reasoning, creative writing, and so on). The traditional workflow tests models one at a time, which is slow and makes side-by-side comparison difficult. Duplex was created to solve this pain point: it sends the same prompt to multiple models simultaneously and shows the differences between their responses in real time in a single interface.

## Project Overview: What Is Duplex?

Duplex is an offline-first multiplexed LLM inference engine that lets engineers and researchers run real-time prompt tests in parallel. It supports locally hosted model servers (e.g., Ollama, LM Studio, vLLM) and cloud providers (e.g., OpenAI, Anthropic, Gemini, Groq). Its core philosophy is "privacy-first": all configuration (model selection, theme, layout) is stored in the browser's localStorage, and there is no backend service. The application can run offline, and only requests explicitly sent to a cloud provider leave the device.

## Core Features and Technical Highlights

### True Multiplexed Inference
Supports simultaneous streaming of inference results from up to three AI models, with side-by-side output viewing, facilitating model selection, prompt engineering, and performance benchmarking.
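
The fan-out idea can be illustrated with a minimal sketch. It is not taken from the Duplex source: the `ModelClient` interface, the `PanelResult` shape, and the `fanOut` function are hypothetical names, and the sketch waits for complete responses instead of streaming tokens. Each client's error is caught individually so that one failing provider does not blank out the other panels.

```typescript
// Hypothetical shape of one model backend; not taken from the Duplex source.
interface ModelClient {
  name: string;
  complete(prompt: string): Promise<string>;
}

// One result per comparison panel.
interface PanelResult {
  model: string;
  ok: boolean;
  text: string; // model output, or the error message when ok is false
}

// Send the same prompt to every client at once and wait for all of them.
// Every request is started inside map(), before any of them is awaited,
// so the calls run concurrently rather than one after another.
async function fanOut(prompt: string, clients: ModelClient[]): Promise<PanelResult[]> {
  const pending = clients.map(async (client): Promise<PanelResult> => {
    try {
      const text = await client.complete(prompt);
      return { model: client.name, ok: true, text };
    } catch (err) {
      // A failing provider yields an error panel instead of rejecting the batch.
      return { model: client.name, ok: false, text: String(err) };
    }
  });
  return Promise.all(pending);
}
```

Results come back in the same order as the `clients` array, which keeps each response aligned with its panel regardless of which model finishes first.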

### Fully Private Local State
No dependency on backend services; all configurations are stored in localStorage, protecting privacy and supporting offline operation.
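
As a rough illustration of localStorage-only persistence, the sketch below saves and restores a settings object. The `Settings` fields, the `duplex.settings` key, and the default values are assumptions made for the example, not Duplex's actual schema. The store is passed in as a small interface that `window.localStorage` satisfies, which also makes the logic testable outside a browser.

```typescript
// Minimal subset of the Web Storage API; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical settings shape; the real Duplex schema may differ.
interface Settings {
  theme: "light" | "dark";
  layout: "side-by-side" | "stacked";
  models: string[];
}

const STORAGE_KEY = "duplex.settings"; // assumed key name
const DEFAULTS: Settings = { theme: "dark", layout: "side-by-side", models: [] };

// Serialize the settings to JSON and write them under a single key.
function saveSettings(store: KeyValueStore, settings: Settings): void {
  store.setItem(STORAGE_KEY, JSON.stringify(settings));
}

// Read the settings back, falling back to defaults when nothing is stored
// or when the stored value is not valid JSON.
function loadSettings(store: KeyValueStore): Settings {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return { ...DEFAULTS, models: [] };
  try {
    return { ...DEFAULTS, ...(JSON.parse(raw) as Partial<Settings>) };
  } catch {
    return { ...DEFAULTS, models: [] };
  }
}
```

In a browser, calling `loadSettings(window.localStorage)` at startup and `saveSettings(window.localStorage, next)` after each change would be enough; no server round trip is involved.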

### Cross-Platform Compatibility
Connects to local instances (e.g., Ollama) or to cloud providers via API keys, and supports custom endpoints that follow the OpenAI API format (e.g., Perplexity).
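
To show why an OpenAI-format endpoint makes providers interchangeable, here is a hedged sketch that builds a chat-completions request from a base URL, an optional API key, and a model name. The `Endpoint` and `ChatRequest` types and the `buildChatRequest` function are hypothetical and not part of Duplex. The example assumes the provider exposes `/chat/completions` under its base URL, as OpenAI does and as Ollama does under `http://localhost:11434/v1`.

```typescript
// Hypothetical description of one provider; names are illustrative only.
interface Endpoint {
  baseUrl: string; // e.g. "http://localhost:11434/v1" for Ollama's OpenAI-compatible API
  apiKey?: string; // omitted for local servers that need no key
  model: string;
}

// URL plus an options object that can be passed straight to fetch().
interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build an OpenAI-format chat completion request for any compatible endpoint.
function buildChatRequest(endpoint: Endpoint, prompt: string): ChatRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (endpoint.apiKey) {
    headers["Authorization"] = `Bearer ${endpoint.apiKey}`;
  }
  return {
    // Strip trailing slashes so the path is joined with exactly one "/".
    url: `${endpoint.baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model: endpoint.model,
        messages: [{ role: "user", content: prompt }],
        stream: true, // ask the server to stream tokens as they are generated
      }),
    },
  };
}
```

A caller would then run `fetch(request.url, request.init)`. Switching from a local server to a cloud provider changes only the `Endpoint` value, not the request code.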

### Real-Time Diagnostic Engine
Renders performance metrics in real time, including Time to First Token (TTFT) and throughput in tokens per second (TPS), so that model response speed can be compared quantitatively.
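
The two metrics can be derived from timestamps recorded while a response streams in. The sketch below is one plausible definition, not Duplex's implementation: the `StreamTiming` and `StreamMetrics` types and the `computeMetrics` function are hypothetical. It assumes throughput means total tokens divided by the time from sending the request to receiving the last token; other tools measure from the first token instead.

```typescript
// Timestamps in milliseconds, e.g. collected with performance.now().
interface StreamTiming {
  requestSentMs: number;    // when the request was sent
  tokenArrivalMs: number[]; // arrival time of each token, in order
}

interface StreamMetrics {
  ttftMs: number | null;          // Time to First Token
  tokensPerSecond: number | null; // throughput (TPS)
}

function computeMetrics(timing: StreamTiming): StreamMetrics {
  const arrivals = timing.tokenArrivalMs;
  if (arrivals.length === 0) {
    // Nothing has arrived yet, so neither metric is defined.
    return { ttftMs: null, tokensPerSecond: null };
  }
  const ttftMs = arrivals[0] - timing.requestSentMs;
  const elapsedMs = arrivals[arrivals.length - 1] - timing.requestSentMs;
  // Guard against division by zero when all timestamps coincide.
  const tokensPerSecond = elapsedMs > 0 ? arrivals.length / (elapsedMs / 1000) : null;
  return { ttftMs, tokensPerSecond };
}
```

For example, a request sent at 1000 ms whose four tokens arrive at 1200, 1400, 1600, and 2000 ms has a TTFT of 200 ms and a throughput of 4 tokens per second under this definition.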

### Modular Rendering Layout
Provides several view modes, including side-by-side comparison, responsive scaling, and a toggle between Markdown and simplified rendering, so the layout can adapt to the task at hand.

## Technical Architecture Analysis

Duplex is built on a modern front-end technology stack:

| Component | Technology | Purpose |
|-----------|------------|---------|
| Framework | React 18 + Vite | Core execution environment |
| Language | TypeScript | Strongly typed logic layer |
| Styling | Tailwind CSS | Responsive UI |
| Routing | React Router DOM | Client-side routing |
| Animation | Motion (Framer Motion) | Smooth visual transitions |
| Storage | localStorage | Client-side persistence |

The technology selection reflects a focus on performance and user experience: Vite enables fast development, TypeScript ensures code quality, Tailwind CSS allows flexible styling, and Framer Motion adds animation effects.

## Use Cases and Practical Value

### Model Selection and Evaluation
When selecting a model for a specific scenario, you can test candidate models simultaneously with a set of prompts, compare output quality, response speed, and cost to assist decision-making.

### Prompt Engineering Optimization
See at once how the same prompt performs across different models, then adjust the prompt structure in a targeted way to get more consistent, higher-quality outputs.

### Hybrid Local and Cloud Deployment
Compare the performance of local and cloud models to determine which tasks can be handled locally and which need to call cloud APIs, balancing privacy and capability.

### Teaching and Demonstration
The side-by-side comparison view suits teaching, where it helps students see how models differ. It also works as a demonstration tool for showing non-technical audiences how varied model behavior can be.

## Deployment and Usage Guide

Duplex is optimized for Netlify edge delivery. To run and deploy it:
1. Clone the repository and install dependencies.
2. Run `npm run dev` to start the development server.
3. To use a local Ollama instance, start it with `OLLAMA_ORIGINS="*" ollama serve` so that the browser is allowed to make cross-origin requests to it.
4. Push to GitHub and import the repository into Netlify for automatic deployment.

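The project documentation emphasizes the importance of configuring cross-origin requests. Note that `OLLAMA_ORIGINS="*"` accepts requests from any origin; listing only the origin that serves Duplex is the more restrictive setting.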

## Summary and Outlook

Duplex represents an important direction for AI tooling: using the capabilities of large models while keeping full control over data. Its multi-model parallel inference improves development efficiency and gives model evaluation a quantitative basis.

As local models (e.g., Llama, Mistral) become more capable and cloud APIs become more abundant, the value of Duplex becomes increasingly prominent, allowing developers to flexibly combine local and cloud models instead of choosing one over the other.

For developers interested in AI application development, prompt engineering, or model evaluation, Duplex is an open-source project worth exploring and contributing to.