Duplex: A Local-First Multi-Model Parallel Inference Engine

A privacy-first client application that supports simultaneous connections to local Ollama and multiple cloud-based large model APIs, enabling true parallel inference and real-time comparison.

Tags: LLM, multi-model inference, Ollama, privacy-first, React, TypeScript, open-source tools, AI development

Published 2026-06-07 23:27 | Recent activity 2026-06-07 23:52 | Estimated read 10 min

Section 01

Duplex: Introduction to the Local-First Multi-Model Parallel Inference Engine

Duplex is a local-first multi-model parallel inference engine. It connects to a local Ollama instance and to multiple cloud-based large model APIs at the same time, so the same prompt can be run in parallel and the responses compared in real time. The project is developed and maintained by Ryuk1811 and is open-sourced on GitHub (https://github.com/Ryuk1811/Duplex) under the MIT License. Its core philosophy is privacy-first: all application state is persisted locally via localStorage, there is no external database or telemetry, and conversation data stays on the device unless a request is explicitly sent to a cloud provider. Duplex addresses two problems developers face: the trade-off between the privacy of local models and the performance of cloud models, and the time cost of testing models one at a time. This makes it an efficient tool for scenarios such as model selection and prompt engineering.

Section 02

Background: Why Do We Need Multi-Model Parallel Inference?

When using large language models, developers often face a dilemma: choose local models to protect privacy, or use cloud APIs to get stronger performance? Different models also perform differently on specific tasks (code generation, logical reasoning, creative writing, etc.). The traditional process requires testing models one at a time, which is time-consuming and makes side-by-side comparison difficult. Duplex was created to solve this pain point: developers send the same prompt to multiple models simultaneously and observe the differences between responses in real time in a unified interface.

Section 03

Project Overview: What Is Duplex?

Duplex is an offline-first, multiplexed large language model inference engine that lets engineers and researchers run real-time prompt tests in parallel against both locally hosted models (e.g., Ollama, LM Studio, vLLM) and cloud models (e.g., OpenAI, Anthropic, Gemini, Groq). Its core philosophy is privacy-first: all configuration (model selection, theme, layout) is stored in the browser's localStorage, and there is no backend service. The application can run offline, and only requests explicitly sent to a cloud provider leave the device.

Section 04

Core Features and Technical Highlights

True Multiplexed Inference

Supports simultaneous streaming of inference results from up to three AI models, with side-by-side output viewing, facilitating model selection, prompt engineering, and performance benchmarking.
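
The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not Duplex's actual code: the names ModelClient, ModelResult, and runInParallel are hypothetical, and each client here resolves a whole completion instead of streaming tokens, to keep the example short. The point it shows is that all requests are started before any is awaited, and that one failing provider does not hide the other outputs.

```typescript
// Hypothetical types: a "client" is anything that turns a prompt into text.
type ModelClient = { name: string; complete: (prompt: string) => Promise<string> };

type ModelResult =
  | { name: string; ok: true; output: string; elapsedMs: number }
  | { name: string; ok: false; error: string; elapsedMs: number };

const MAX_PARALLEL_MODELS = 3; // the article states a limit of three models

async function runInParallel(prompt: string, clients: ModelClient[]): Promise<ModelResult[]> {
  if (clients.length > MAX_PARALLEL_MODELS) {
    throw new Error(`at most ${MAX_PARALLEL_MODELS} models can run side by side`);
  }
  // Start every request before awaiting any of them, so they overlap in time.
  const pending = clients.map(async (client): Promise<ModelResult> => {
    const started = Date.now();
    try {
      const output = await client.complete(prompt);
      return { name: client.name, ok: true, output, elapsedMs: Date.now() - started };
    } catch (err) {
      // Convert a failure into a result so the other columns still render.
      const message = err instanceof Error ? err.message : String(err);
      return { name: client.name, ok: false, error: message, elapsedMs: Date.now() - started };
    }
  });
  // Results come back in the same order as the clients array.
  return Promise.all(pending);
}
```

Because each per-model promise catches its own error, the combined promise never rejects on a provider failure, which suits a comparison view where every column needs some state to display.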

Fully Private Local State

No dependency on backend services; all configurations are stored in localStorage, protecting privacy and supporting offline operation.
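
As a rough sketch of what localStorage-only persistence can look like: the settings shape, the key name duplex.settings, and the function names below are assumptions made for illustration, not taken from the Duplex source. The KeyValueStore interface is a small subset of the Web Storage API, so the browser's localStorage object can be passed in directly.

```typescript
// Minimal subset of the Web Storage API; the browser's localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical settings shape, based on the items the article lists.
interface DuplexSettings {
  models: string[];
  theme: "light" | "dark";
  layout: "side-by-side" | "stacked";
}

const SETTINGS_KEY = "duplex.settings"; // assumed key name

function defaultSettings(): DuplexSettings {
  return { models: [], theme: "dark", layout: "side-by-side" };
}

function saveSettings(store: KeyValueStore, settings: DuplexSettings): void {
  store.setItem(SETTINGS_KEY, JSON.stringify(settings));
}

function loadSettings(store: KeyValueStore): DuplexSettings {
  const raw = store.getItem(SETTINGS_KEY);
  if (raw === null) return defaultSettings();
  try {
    // Merge over defaults so fields missing from older saves still get a value.
    return { ...defaultSettings(), ...(JSON.parse(raw) as Partial<DuplexSettings>) };
  } catch {
    // Corrupted JSON: fall back to defaults rather than crash the UI.
    return defaultSettings();
  }
}
```

In a browser this would be called as loadSettings(localStorage). Passing the store as a parameter also makes the logic testable outside the browser.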

Cross-Platform Compatibility

Can connect to local instances (e.g., Ollama) or cloud providers (via API keys), and supports custom endpoints in OpenAI standard format (e.g., Perplexity).
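
To illustrate what an endpoint in OpenAI standard format means in practice, here is a hedged sketch of building a chat completion request. The Endpoint and ChatRequest types and the buildChatRequest function are hypothetical names, and the model names in the usage note are placeholders. The sketch covers only OpenAI-compatible endpoints (a base URL plus /chat/completions, an optional bearer key, and a messages array). Providers with their own native request formats would need separate handling.

```typescript
interface Endpoint {
  baseUrl: string; // e.g. "http://localhost:11434/v1" for Ollama's OpenAI-compatible API
  apiKey?: string; // omitted for local servers that need no key
  model: string;
}

interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildChatRequest(endpoint: Endpoint, prompt: string): ChatRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (endpoint.apiKey) {
    headers["Authorization"] = `Bearer ${endpoint.apiKey}`;
  }
  return {
    // Strip trailing slashes so the joined path has exactly one separator.
    url: `${endpoint.baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model: endpoint.model,
        messages: [{ role: "user", content: prompt }],
        stream: true, // ask the server to stream tokens as they are generated
      }),
    },
  };
}
```

The result can be handed to the browser as fetch(request.url, request.init). Keeping the local and cloud cases behind one builder means that switching providers only changes the Endpoint value, for example { baseUrl: "http://localhost:11434/v1", model: "llama3" } for a local Ollama instance.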

Real-Time Diagnostic Engine

Built-in real-time rendering of performance metrics, including Time to First Token (TTFT) and Throughput (TPS), to quantitatively evaluate model response speed.
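
The two metrics can be derived from a handful of timestamps. The sketch below is one plausible definition rather than the formula Duplex itself uses, which the article does not specify: TTFT is the delay between sending the request and receiving the first token, and TPS is the token count divided by the time from the first to the last token. The StreamTiming fields are assumed names.

```typescript
interface StreamTiming {
  requestSentMs: number; // timestamp when the request was sent
  firstTokenMs: number;  // timestamp when the first token arrived
  lastTokenMs: number;   // timestamp when the last token arrived
  tokenCount: number;    // number of tokens received
}

interface StreamMetrics {
  ttftMs: number;
  tokensPerSecond: number;
}

function computeMetrics(timing: StreamTiming): StreamMetrics {
  const ttftMs = timing.firstTokenMs - timing.requestSentMs;
  // Assumption: throughput is measured over the generation window only,
  // so a slow first token does not drag down the tokens-per-second figure.
  const generationSeconds = (timing.lastTokenMs - timing.firstTokenMs) / 1000;
  const tokensPerSecond = generationSeconds > 0 ? timing.tokenCount / generationSeconds : 0;
  return { ttftMs, tokensPerSecond };
}
```

For example, a request sent at 0 ms whose first token arrives at 250 ms and whose 100th and last token arrives at 2250 ms gives a TTFT of 250 ms and a throughput of 50 tokens per second under this definition.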

Modular Rendering Layout

Provides view modes such as side-by-side comparison and responsive scaling, plus a toggle between Markdown and simplified rendering, so the layout can adapt to different needs.

Section 05

Technical Architecture Analysis

Duplex is built on a modern front-end technology stack:

Component	Technology	Purpose
Framework	React 18 + Vite	Core execution environment
Language	TypeScript	Strongly typed logic layer
Styling	Tailwind CSS	Responsive UI
Routing	React Router DOM	Client-side routing
Animation	Motion (Framer Motion)	Smooth visual transitions
Storage	localStorage	Client-side persistence

The technology selection reflects a focus on performance and user experience: Vite enables fast development, TypeScript ensures code quality, Tailwind CSS allows flexible styling, and Framer Motion adds animation effects.

Section 06

Use Cases and Practical Value

Model Selection and Evaluation

When selecting a model for a specific scenario, you can test candidate models simultaneously with a set of prompts, compare output quality, response speed, and cost to assist decision-making.

Prompt Engineering Optimization

See at a glance how the same prompt performs across different models, then adjust the prompt structure accordingly to obtain more consistent, higher-quality outputs.

Hybrid Local and Cloud Deployment

Compare the performance of local and cloud models to determine which tasks can be handled locally and which need to call cloud APIs, balancing privacy and capability.

Teaching and Demonstration

The side-by-side comparison view is well suited to teaching, helping students understand the characteristics of different models. It can also serve as a technical demonstration tool that shows non-technical audiences how differently models respond to the same prompt.

Section 07

Deployment and Usage Guide

Duplex is optimized for Netlify edge delivery. The local setup and deployment steps are as follows:

Clone the repository and install dependencies
Run npm run dev to start the development server
To use a local Ollama instance, start it with OLLAMA_ORIGINS="*" ollama serve so that the browser's cross-origin requests are accepted (the wildcard allows requests from any origin)
Push to GitHub and import the repository into Netlify for automatic deployment

The project documentation emphasizes the importance of configuring cross-origin requests, reflecting a focus on security.
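
A quick way to confirm that the cross-origin setup works is to probe the local Ollama server from the browser. The sketch below is an assumption-laden illustration rather than part of Duplex: checkOllama, FetchLike, and OllamaStatus are hypothetical names, and the default address http://localhost:11434 is Ollama's usual port. It calls Ollama's /api/tags endpoint, which lists installed models. A request blocked by the cross-origin policy makes fetch reject, so it is reported the same way as an unreachable server.

```typescript
// Narrow fetch-like signature so the check can run with the browser's fetch
// or with a fake in tests.
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

type OllamaStatus =
  | { reachable: true; models: string[] }
  | { reachable: false; reason: string };

async function checkOllama(
  fetchFn: FetchLike,
  baseUrl: string = "http://localhost:11434" // assumed default Ollama address
): Promise<OllamaStatus> {
  try {
    const response = await fetchFn(`${baseUrl}/api/tags`);
    if (!response.ok) {
      return { reachable: false, reason: "server responded with an error status" };
    }
    const data = (await response.json()) as { models?: { name: string }[] };
    return { reachable: true, models: (data.models || []).map((m) => m.name) };
  } catch {
    // Network failure or a cross-origin block both surface as a rejected fetch.
    return { reachable: false, reason: "not running, or cross-origin requests are blocked" };
  }
}
```

In the browser this would be called as checkOllama(fetch). An unreachable result after starting Ollama is a hint to re-check the OLLAMA_ORIGINS setting.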

Section 08

Summary and Outlook

Duplex represents an important direction in the development of AI tools: using the capabilities of large models while keeping full control over data. Its multi-model parallel inference improves development efficiency, and its built-in TTFT and TPS metrics give model evaluation a quantitative basis.

As local models (e.g., Llama, Mistral) become more capable and cloud APIs become more abundant, the value of Duplex becomes increasingly prominent, allowing developers to flexibly combine local and cloud models instead of choosing one over the other.

For technical personnel interested in AI application development, prompt engineering, or model evaluation, Duplex is an open-source project worth exploring and contributing to.
