Zing Forum


Orthrus: An LLM Inference Acceleration Framework Enabling Lossless Parallel Generation via Dual-View Diffusion

Orthrus is an innovative dual-architecture framework that combines the precise generation fidelity of autoregressive large language models (LLMs) with the high-speed parallel generation capability of diffusion models, achieving up to 7.8x inference acceleration while maintaining strictly lossless output quality.

LLM inference acceleration, diffusion models, parallel generation, Qwen3, speculative decoding, KV cache optimization, MLX, Apple Silicon
Published 2026-06-06 20:14 | Recent activity 2026-06-06 20:49 | Estimated read 5 min

Section 01

Orthrus: Introduction to the LLM Lossless Parallel Inference Acceleration Framework via Dual-View Diffusion

This article introduces the Orthrus framework, which combines the precise generation of autoregressive LLMs with the parallel capability of diffusion models to achieve up to 7.8x inference acceleration while maintaining strictly lossless output quality. Its core is a dual-view diffusion architecture built on the Qwen3 backbone network. The two views share a single KV cache, so there is no redundant memory overhead, and inference runs on Apple Silicon through the MLX framework.


Section 02

Current Status and Challenges of LLM Inference

Autoregressive LLMs produce high-quality outputs but face a sequential bottleneck: each token must wait for the previous one to be generated, and the cost of this grows in long-text scenarios. Diffusion language models attempt parallel decoding but are prone to conditional drift and accuracy degradation. The key challenge is combining autoregressive quality with parallel speed.


Section 03

Design of Orthrus' Dual-View Diffusion Architecture

Orthrus adopts a dual-view diffusion architecture:

  • Autoregressive View: Maintains sequential decoding to ensure quality
  • Diffusion View: Supports parallel token generation to break through the sequential bottleneck

Both views share the KV cache, avoiding the redundant memory of traditional speculative decoding. An in-model consensus mechanism ensures that parallel outputs are completely consistent with the original model's prediction distribution, which makes generation strictly lossless.
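
The article does not spell out how the consensus mechanism works, so the following is only a minimal sketch of one common way to get lossless parallel decoding: a draft-and-verify loop under greedy decoding. The function names (`verify_draft`, `generate`, `toy_ar_next`, `toy_propose`) and the toy models are hypothetical stand-ins, not Orthrus code. A real system would check all drafted positions in one batched forward pass rather than token by token.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]
ProposeFn = Callable[[List[Token]], List[Token]]


def verify_draft(prefix: List[Token], draft: List[Token],
                 ar_next: NextTokenFn) -> List[Token]:
    """Keep the longest draft prefix that greedy autoregressive decoding
    would also emit, then add one autoregressive token so that every
    step makes progress."""
    accepted: List[Token] = []
    context = list(prefix)
    for proposed in draft:
        expected = ar_next(context)
        if proposed != expected:
            # First disagreement: take the autoregressive token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    # The whole draft matched: one bonus token from the autoregressive view.
    accepted.append(ar_next(context))
    return accepted


def generate(prefix: List[Token], propose: ProposeFn, ar_next: NextTokenFn,
             max_new_tokens: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verify steps used."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(verify_draft(out, propose(out), ar_next))
        steps += 1
    return out[: len(prefix) + max_new_tokens], steps


def toy_ar_next(context: List[Token]) -> Token:
    # Hypothetical stand-in for the autoregressive view (greedy decoding).
    return (3 * context[-1] + 1) % 7


def toy_propose(context: List[Token], width: int = 4) -> List[Token]:
    # Hypothetical stand-in for the diffusion view: drafts `width` tokens
    # at once and deliberately gets the last one wrong.
    draft: List[Token] = []
    ctx = list(context)
    for i in range(width):
        token = toy_ar_next(ctx)
        if i == width - 1:
            token = (token + 1) % 7
        draft.append(token)
        ctx.append(token)
    return draft
```

In this toy setup the output is identical to plain sequential decoding by construction, because a drafted token is kept only when it equals the autoregressive choice. The speedup comes from accepting several tokens per verification step.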

Section 04

Performance Test Data and Comparative Analysis

Orthrus models based on Qwen3 show significant acceleration effects:

Model                 Base Model    Average Speedup
Orthrus-Qwen3-1.7B    Qwen3-1.7B    4.25×
Orthrus-Qwen3-4B      Qwen3-4B      5.20×
Orthrus-Qwen3-8B      Qwen3-8B      5.36×

The maximum speedup reaches 7.8x on specific tasks.

Compared with speculative decoding methods (e.g., EAGLE-3, DFlash), Orthrus maintains stable throughput under long contexts (40K); compared with diffusion models (e.g., Fast-dLLM-v2), it achieves about 6x acceleration on the MATH-500 benchmark while keeping accuracy lossless.
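
To make the speedup figures concrete, here is a small sketch that converts a speedup factor into latency and time saved. The 60-second baseline is a made-up illustrative number, not a measurement from the article; only the speedup factors come from the table above.

```python
# Average speedups as reported in the article's table.
REPORTED_SPEEDUPS = {
    "Orthrus-Qwen3-1.7B": 4.25,
    "Orthrus-Qwen3-4B": 5.20,
    "Orthrus-Qwen3-8B": 5.36,
}


def accelerated_latency(baseline_seconds: float, speedup: float) -> float:
    """Latency after acceleration, given the baseline latency."""
    return baseline_seconds / speedup


def time_saved_fraction(speedup: float) -> float:
    """Fraction of the baseline time that the speedup removes."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    baseline = 60.0  # hypothetical baseline generation time in seconds
    for model, speedup in REPORTED_SPEEDUPS.items():
        latency = accelerated_latency(baseline, speedup)
        saved = time_saved_fraction(speedup)
        print(f"{model}: {latency:.1f} s, {saved:.0%} less time")
```

Note the diminishing returns: going from 4.25× to 5.36× removes only a few more percentage points of wall-clock time, since the saved fraction is 1 - 1/speedup.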

Section 05

Memory Efficiency and Parameter Optimization Features

Orthrus' dual views share the same KV cache, so the additional memory overhead is O(1) and no cache is duplicated. Only 16% of the total model parameters need to be fine-tuned to inject the parallel capability, while the base LLM remains frozen, which reduces adaptation costs.
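
A rough back-of-the-envelope sketch shows why a shared cache matters at long context. It uses the standard KV cache size estimate (keys plus values, per layer, per KV head). The layer counts, KV-head counts, and head dimensions below are illustrative assumptions, not figures from the article, and "40K" is taken to mean 40,000 tokens; read the real values from each model's configuration before relying on the numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one sequence.

    The leading 2 counts one key tensor and one value tensor per layer;
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shapes for illustration only (hypothetical, check the model config).
TARGET = dict(num_layers=36, num_kv_heads=8, head_dim=128)
SEPARATE_DRAFTER = dict(num_layers=28, num_kv_heads=8, head_dim=128)
SEQ_LEN = 40_000

target_bytes = kv_cache_bytes(seq_len=SEQ_LEN, **TARGET)
drafter_bytes = kv_cache_bytes(seq_len=SEQ_LEN, **SEPARATE_DRAFTER)

GIB = 1024 ** 3
if __name__ == "__main__":
    print(f"target cache:            {target_bytes / GIB:.2f} GiB")
    print(f"separate drafter cache:  {drafter_bytes / GIB:.2f} GiB extra")
    print("shared cache (two views): 0.00 GiB extra")
```

Under these assumptions, a separate draft model with its own cache adds memory that grows linearly with context length, while two views reading the same cache add none.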


Section 06

Platform Support and Model Availability

The official team has released three Qwen3 model versions on HuggingFace:

  • chiennv/Orthrus-Qwen3-1.7B
  • chiennv/Orthrus-Qwen3-4B
  • chiennv/Orthrus-Qwen3-8B

Orthrus natively supports inference on Apple Silicon via the MLX framework and is compatible with mlx==0.31.2 and mlx-lm==0.31.3.
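
Since the article names exact package versions, a quick environment check can save debugging time. The sketch below uses only the Python standard library; the helper name `check_versions` is hypothetical, and it verifies installed versions only, it does not load or run an Orthrus model.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions the article lists as compatible.
PINNED = {"mlx": "0.31.2", "mlx-lm": "0.31.3"}


def check_versions(
    pinned: Dict[str, str] = PINNED,
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Optional[str]]:
    """Return packages whose installed version differs from the pin.

    The value is the installed version, or None if the package is missing.
    """
    problems: Dict[str, Optional[str]] = {}
    for package, wanted in pinned.items():
        try:
            installed: Optional[str] = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[package] = installed
    return problems


if __name__ == "__main__":
    mismatches = check_versions()
    if not mismatches:
        print("mlx and mlx-lm match the versions listed in the article")
    for package, installed in mismatches.items():
        found = installed or "not installed"
        print(f"{package}: expected {PINNED[package]}, found {found}")
```

The `get_version` parameter exists only so the check can be exercised without MLX installed, for example on a non-Apple machine.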

Section 07

Technical Significance and Application Prospects

Orthrus shows that parallel generation and lossless quality can coexist, an important step forward for LLM inference optimization. In practice, this means lower inference costs, a better user experience through reduced latency, and more application scenarios on edge devices.


Section 08

Summary

Orthrus breaks the sequential bottleneck of autoregressive models through its dual-view diffusion architecture, achieving a severalfold speedup (4.25× to 5.36× on average, up to 7.8x on specific tasks) while remaining strictly lossless. Its zero redundant memory overhead and parameter-efficient training make it a strong candidate for inference optimization when deploying LLMs in production environments.