
Orthrus: An LLM Inference Acceleration Framework Enabling Lossless Parallel Generation via Dual-View Diffusion

Orthrus is an innovative dual-architecture framework that combines the precise generation fidelity of autoregressive large language models (LLMs) with the high-speed parallel generation capability of diffusion models, achieving up to 7.8x inference acceleration while maintaining strictly lossless output quality.

LLM inference acceleration, diffusion models, parallel generation, Qwen3, speculative decoding, KV cache optimization, MLX, Apple Silicon

Published 2026-06-06 20:14 | Recent activity 2026-06-06 20:49 | Estimated read 5 min


Section 01

Introduction: Orthrus, a Lossless Parallel Inference Acceleration Framework for LLMs via Dual-View Diffusion

This article introduces the Orthrus framework, which combines the precise generation of autoregressive LLMs with the parallel capability of diffusion models to achieve up to 7.8x inference acceleration while maintaining strictly lossless output quality. Its core is a dual-view diffusion architecture based on the Qwen3 backbone network, supporting the MLX framework and Apple Silicon with zero redundant memory overhead.

Section 02

Current Status and Challenges of LLM Inference

Autoregressive LLMs produce high-quality outputs but face a sequential bottleneck: each token must wait for the previous one to be generated, and the cost is most pronounced in long-text scenarios. Diffusion language models attempt parallel decoding but are prone to conditional drift and accuracy degradation. The key challenge is combining autoregressive quality with parallel speed.

Section 03

Design of Orthrus' Dual-View Diffusion Architecture

Orthrus adopts a dual-view diffusion architecture:

Autoregressive View: Maintains sequential decoding to ensure quality
Diffusion View: Supports parallel token generation to break through the sequential bottleneck

Both views share the KV cache, avoiding the redundant memory that traditional speculative decoding incurs. An in-model consensus mechanism ensures that the parallel outputs are fully consistent with the original model's prediction distribution, which is what makes the method strictly lossless.
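The consensus idea can be illustrated with a generic draft-and-verify loop under greedy decoding. This is a minimal sketch, not Orthrus's actual mechanism: `ar_next` and `draft_block` are hypothetical toy stand-ins for the autoregressive and diffusion views, and verification is written as a sequential loop for readability, whereas a real system would check a whole block in one parallel forward pass. The point it shows is that accepting only tokens the autoregressive view agrees with yields exactly the sequential output in fewer steps.

```python
from typing import List, Tuple

Token = int


def ar_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the autoregressive view: a deterministic next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix: List[Token], k: int) -> List[Token]:
    """Toy stand-in for the diffusion view: proposes k tokens for one step.

    Errors are injected on purpose so that verification has work to do.
    """
    ctx = list(prefix)
    block = []
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % 50  # deliberately wrong proposal
        block.append(tok)
        ctx.append(tok)
    return block


def accept(prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Keep the longest draft prefix the AR view agrees with.

    The first mismatching token is replaced by the AR token, so every call
    emits at least one token and never emits a token the AR view rejects.
    """
    ctx = list(prefix)
    accepted = []
    for tok in draft:
        target = ar_next(ctx)
        if tok != target:
            accepted.append(target)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def generate_parallel(prompt: List[Token], n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_new tokens block by block; returns (sequence, steps used)."""
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < n_new:
        seq.extend(accept(seq, draft_block(seq, k)))
        steps += 1
    return seq[: len(prompt) + n_new], steps


def generate_sequential(prompt: List[Token], n_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(ar_next(seq))
    return seq


if __name__ == "__main__":
    parallel, steps = generate_parallel([1, 2, 3], 20)
    print(parallel == generate_sequential([1, 2, 3], 20), steps)
```

Because a token is only kept when it equals what the autoregressive view would have produced for the same prefix, the output cannot diverge from sequential decoding; the speedup comes entirely from how many drafted tokens are accepted per step.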

Section 04

Performance Test Data and Comparative Analysis

Orthrus models built on Qwen3 achieve the following average speedups over their base models:

Model	Base Model	Average Speedup
Orthrus-Qwen3-1.7B	Qwen3-1.7B	4.25×
Orthrus-Qwen3-4B	Qwen3-4.0B	5.20×
Orthrus-Qwen3-8B	Qwen3-8.0B	5.36×

The maximum speedup reaches 7.8x on specific tasks.

Compared to speculative decoding methods (e.g., EAGLE-3, DFlash), Orthrus maintains stable throughput at long context lengths (40K). Compared to diffusion models (e.g., Fast-dLLM-v2), it achieves about a 6x speedup on the MATH-500 benchmark while maintaining lossless accuracy.
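To make the speedup figures concrete, the short sketch below converts them into the fraction of baseline generation time that remains. It assumes the reported speedup is a ratio of end-to-end generation time, which the article does not state explicitly, and the 60-second baseline is a hypothetical value chosen only for illustration.

```python
# Average speedups as reported in the table above.
AVERAGE_SPEEDUP = {
    "Orthrus-Qwen3-1.7B": 4.25,
    "Orthrus-Qwen3-4B": 5.20,
    "Orthrus-Qwen3-8B": 5.36,
}


def remaining_time_fraction(speedup: float) -> float:
    """Fraction of the baseline time still needed after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Generation time after acceleration, for a given baseline time."""
    return baseline_seconds * remaining_time_fraction(speedup)


if __name__ == "__main__":
    baseline = 60.0  # hypothetical baseline generation time in seconds
    for name, speedup in AVERAGE_SPEEDUP.items():
        fraction = remaining_time_fraction(speedup)
        seconds = accelerated_seconds(baseline, speedup)
        print(f"{name}: {fraction:.1%} of baseline time, {seconds:.1f} s")
```

Under that assumption, a 4.25x average speedup leaves roughly 23.5% of the original generation time, and the 5.36x figure leaves roughly 18.7%.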

Section 05

Memory Efficiency and Parameter Optimization Features

Orthrus' two views share the same KV cache, so the parallel view adds only O(1) memory overhead and no redundant cache. Only 16% of the total model parameters need to be fine-tuned to inject the parallel capability, while the base LLM remains frozen, which reduces adaptation costs.
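A rough estimate shows why a shared KV cache matters at long context lengths. The sketch below uses the standard size formula for a KV cache (keys plus values, per layer, per KV head). Both model shapes are hypothetical illustrations rather than the actual Qwen3 or Orthrus configurations, 16-bit cache entries are assumed, and "40K" is taken to mean 40,000 tokens.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a KV cache: keys and values for every layer, head and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical shapes, for illustration only.
TARGET = dict(layers=32, kv_heads=8, head_dim=128)
DRAFT = dict(layers=8, kv_heads=8, head_dim=128)
SEQ_LEN = 40_000  # assumed reading of the "40K" context length


def separate_draft_total(seq_len: int) -> int:
    """Speculative decoding with a separate draft model: two caches."""
    return kv_cache_bytes(seq_len=seq_len, **TARGET) + kv_cache_bytes(seq_len=seq_len, **DRAFT)


def shared_cache_total(seq_len: int) -> int:
    """Both views read one cache, so only the target cache is stored."""
    return kv_cache_bytes(seq_len=seq_len, **TARGET)


if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"separate draft model: {separate_draft_total(SEQ_LEN) / gib:.2f} GiB")
    print(f"shared cache:         {shared_cache_total(SEQ_LEN) / gib:.2f} GiB")
```

With these assumed shapes the separate draft model costs about 1.3 GB of extra cache at 40,000 tokens, and that extra cost grows linearly with context length, whereas the shared design has no second cache to grow.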

Section 06

Platform Support and Model Availability

The official team has released three Qwen3-based model versions on Hugging Face:

chiennv/Orthrus-Qwen3-1.7B
chiennv/Orthrus-Qwen3-4B
chiennv/Orthrus-Qwen3-8B

Orthrus natively supports inference on Apple Silicon via the MLX framework and is compatible with mlx==0.31.2 and mlx-lm==0.31.3.
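Since compatibility is stated for specific package versions, it can be useful to check the installed versions before running inference. The sketch below uses only the standard library's `importlib.metadata`. The helper name `check_pins` and the `PINNED` table are assumptions introduced for illustration and are not part of Orthrus or MLX.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions named in the article as compatible.
PINNED = {"mlx": "0.31.2", "mlx-lm": "0.31.3"}


def check_pins(pins, get_version=version):
    """Return {package: installed_version_or_None} for every mismatch.

    An empty dict means every pinned package is installed at the stated version.
    `get_version` is injectable so the check can be exercised without MLX installed.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    mismatches = check_pins(PINNED)
    if mismatches:
        print("Version mismatches:", mismatches)
    else:
        print("Installed versions match the stated ones.")
```

Newer or older releases may still work; the check only flags that the environment differs from the one the article names.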

Section 07

Technical Significance and Application Prospects

Orthrus demonstrates that parallel generation and lossless quality can coexist, which is an important step for LLM inference optimization. Its practical benefits include lower inference costs, a better user experience through reduced latency, and broader application scenarios on edge devices.

Section 08

Summary

Orthrus breaks the sequential bottleneck of autoregressive models through its dual-view diffusion architecture, achieving a severalfold speedup while remaining strictly lossless. Its zero redundant memory overhead and parameter-efficient training make it a strong inference optimization option for deploying LLMs in production environments.
