FastDeploy v2.4: PaddlePaddle Large Model Inference Deployment Toolkit and PD Disaggregation Architecture Practice

FastDeploy is a large language model (LLM) and vision-language model (VLM) inference deployment toolkit based on PaddlePaddle. The v2.4 version adds PD disaggregation deployment for DeepSeek V3 and Qwen3-MoE, enhances MTP speculative decoding capabilities, and fully optimizes MoE inference and multimodal prefix caching performance across multiple hardware platforms.

Tags: PaddlePaddle · FastDeploy · LLM Inference · VLM · PD Disaggregation · Speculative Decoding · Quantization · ERNIE · DeepSeek · Qwen
Published 2026-03-31 16:14 · Recent activity 2026-03-31 16:31 · Estimated read 7 min


Section 02

Project Overview

FastDeploy is an LLM and VLM inference deployment toolkit in Baidu PaddlePaddle's ecosystem, dedicated to providing out-of-the-box production-grade deployment solutions. The project has been deeply optimized for enterprise application scenarios, supporting multiple hardware platforms and rich acceleration technologies.

The v2.4 version, released in January 2026, brings several important updates, including support for PD disaggregation deployment of DeepSeek V3 and Qwen3-MoE models, enhanced MTP (Multi-Token Prediction) speculative decoding capabilities, and full optimization of MoE inference and multimodal prefix caching across multiple hardware platforms.


Section 03

Load-Balanced PD Disaggregation

PD disaggregation (Prefill-Decode disaggregation) is a key technique for improving LLM inference efficiency. FastDeploy implements an industrial-grade PD disaggregation solution:

  • Context Caching: KV Cache computed during the Prefill phase can be reused
  • Dynamic Instance Role Switching: Dynamically adjust the Prefill/Decode role of instances based on load
  • SLO Guarantee: Ensure Service Level Objectives are met while optimizing resource utilization
  • Throughput Optimization: Improve overall throughput by separating compute-intensive and memory-intensive phases
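
As a rough illustration of dynamic instance role switching, the sketch below splits instances between Prefill and Decode roles in proportion to queue load. All names and the policy itself are hypothetical; FastDeploy's actual scheduler presumably also accounts for SLOs and KV-cache locality.

```python
# Toy load-based Prefill/Decode role switching (hypothetical names and policy;
# a production scheduler would also weigh SLOs and KV-cache placement).

def assign_roles(instances, prefill_queue_len, decode_queue_len):
    """Split instances between prefill and decode in proportion to queue load."""
    total = prefill_queue_len + decode_queue_len
    if total == 0:
        n_prefill = len(instances) // 2                          # idle: balanced default
    else:
        n_prefill = round(len(instances) * prefill_queue_len / total)
        n_prefill = max(1, min(len(instances) - 1, n_prefill))   # keep both roles alive
    return {inst: ("prefill" if i < n_prefill else "decode")
            for i, inst in enumerate(instances)}

roles = assign_roles(["gpu0", "gpu1", "gpu2", "gpu3"],
                     prefill_queue_len=30, decode_queue_len=10)
print(roles)  # heavy prefill backlog -> 3 prefill instances, 1 decode instance
```

The key property is that the split is recomputed as load shifts, so a burst of long prompts temporarily converts decode instances into prefill instances and vice versa.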

Section 04

Unified KV Cache Transmission

FastDeploy provides a lightweight and high-performance KV cache transmission library:

  • Intelligent Transmission Protocol Selection: Automatically select NVLink or RDMA for optimal performance
  • Low-Latency Transmission: Optimize serialization and transmission overhead
  • Cross-Node Sharing: Support KV Cache sharing in distributed deployments
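
The intelligent protocol selection described above can be sketched as a simple topology check: prefer NVLink for GPUs on the same node, and fall back to RDMA across nodes. The function and tuple layout are hypothetical, not FastDeploy's actual API.

```python
# Sketch of transport selection for KV-cache transfer (hypothetical logic):
# intra-node transfers ride NVLink; cross-node transfers use RDMA.

def select_transport(src, dst):
    """src/dst are (node_id, gpu_id) tuples identifying KV-cache endpoints."""
    if src[0] == dst[0]:
        return "nvlink"   # same node: highest bandwidth, lowest latency
    return "rdma"         # different nodes: zero-copy network transfer

# Usage: a prefill instance pushing KV cache to a decode instance.
assert select_transport(("node0", 0), ("node0", 1)) == "nvlink"
assert select_transport(("node0", 0), ("node1", 0)) == "rdma"
```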

Section 05

OpenAI API Compatibility and vLLM Compatibility

FastDeploy provides interfaces compatible with industry standards:

  • One-Command Deployment: Simplify the deployment process
  • OpenAI API Compatibility: Existing applications can migrate seamlessly
  • vLLM Interface Compatibility: Maintain compatibility with the vLLM ecosystem
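
Because the server speaks the standard OpenAI API, an existing client only needs its base URL pointed at the FastDeploy endpoint. The sketch below builds the standard chat-completions request body; the URL and model name are placeholders, not values prescribed by FastDeploy.

```python
import json

# Standard OpenAI /v1/chat/completions request body; the endpoint URL and
# served model name below are placeholders for illustration only.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "deepseek-v3",  # whatever model name the server registers
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 128,
    "stream": False,
}
body = json.dumps(payload)
# e.g. send with:
# requests.post(BASE_URL, data=body, headers={"Content-Type": "application/json"})
```

Since the request shape is unchanged, the official OpenAI SDKs also work by setting their `base_url` to the local server.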

Section 06

Full Quantization Format Support

To reduce deployment costs, FastDeploy supports multiple quantization schemes:

  • W8A16: 8-bit weights, 16-bit activations
  • W8A8: 8-bit weights and activations
  • W4A16: 4-bit weights, 16-bit activations
  • W4A8: 4-bit weights, 8-bit activations
  • W2A16: 2-bit weights, 16-bit activations
  • FP8: 8-bit floating-point quantization
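
To see why these formats matter for deployment cost, here is a back-of-the-envelope calculation of weight memory for a 7B-parameter model under each weight bit-width (weights only; activations and KV cache excluded):

```python
# Approximate weight memory footprint by quantization bit-width.
# Weights only; activation memory and KV cache are not included.

def weight_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 2**30

n = 7e9  # a 7B-parameter model
for name, bits in [("BF16 baseline", 16), ("W8A16 / W8A8 / FP8", 8),
                   ("W4A16 / W4A8", 4), ("W2A16", 2)]:
    print(f"{name:>20}: {weight_gib(n, bits):6.2f} GiB")
```

Halving the weight bit-width halves the weight footprint, which is what makes W4 and W2 formats attractive for fitting large models onto fewer accelerators.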

Section 07

Advanced Acceleration Technologies

Speculative Decoding: Generate drafts with a small model and verify them in parallel with the large model, significantly accelerating generation. The v2.4 release enhances MTP (Multi-Token Prediction), allowing multiple tokens to be predicted per step.

Multi-Token Prediction (MTP): Building on speculative decoding, predict multiple subsequent tokens at a time to further improve decoding efficiency.
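
The draft-and-verify loop can be illustrated with a toy greedy version: a cheap draft model proposes k tokens, the target model checks them, and the longest agreeing prefix is kept plus one corrected token. Both "models" here are stub functions over token lists; real systems score all draft positions in a single batched forward pass.

```python
# Toy greedy speculative decoding. In a real engine, the per-position
# target_next calls below would be one parallel forward pass.

def speculative_step(ctx, draft_next, target_next, k=4):
    # 1) draft k tokens autoregressively with the cheap model
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))
    # 2) verify against the target model; keep the agreeing prefix
    accepted = []
    for i, tok in enumerate(draft):
        expected = target_next(ctx + draft[:i])
        if tok == expected:
            accepted.append(tok)        # draft agreed with target: accept
        else:
            accepted.append(expected)   # mismatch: take target's token, stop
            break
    else:
        accepted.append(target_next(ctx + draft))  # bonus token: all accepted
    return accepted

# Stub models: the target counts upward; the draft agrees but stalls after
# the context reaches 5 tokens, forcing a rejection.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) < 5 else ctx[-1]
print(speculative_step([1, 2, 3], draft, target, k=4))  # -> [4, 5, 6]
```

One verification pass here yields three tokens instead of one, which is the source of the speedup when the draft model's acceptance rate is high.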

Chunked Prefill: Process the prefill phase of long sequences in chunks to balance resource utilization between the prefill and decode phases and reduce latency spikes.
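
The chunking itself is simple: the long prompt is split into fixed-size pieces so the scheduler can interleave decode steps of other requests between chunks. The chunk size below is an illustrative value, not a FastDeploy default.

```python
# Sketch of chunked prefill: split a long prompt into fixed-size chunks so
# decode work for other requests can run between chunks, smoothing latency.

def chunk_prompt(token_ids, chunk_size=512):
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

prompt = list(range(1300))            # a 1300-token prompt
chunks = chunk_prompt(prompt)
print([len(c) for c in chunks])       # -> [512, 512, 276]
```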

Prefix Caching: Cache the KV values of common prefixes, significantly reducing first-token latency in multi-turn conversation and system-prompt reuse scenarios. The v2.4 release adds dedicated optimizations for multimodal scenarios.
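
A common way to implement prefix caching is to key KV-cache blocks by a rolling hash of the token prefix, so two requests sharing a system prompt reuse the leading blocks and only prefill the tail. The block size and helper names below are illustrative, not FastDeploy's internals.

```python
# Toy prefix cache: KV blocks are keyed by a hash of the entire token prefix
# up to that block, so a block key matches only when the whole prefix matches.

import hashlib

BLOCK = 4  # tokens per KV-cache block (illustrative)

def block_keys(tokens):
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(str(tokens[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())    # key covers the whole prefix so far
    return keys

def cached_prefix_len(tokens, cache):
    n = 0
    for k in block_keys(tokens):
        if k not in cache:
            break                     # first divergent block: prefill from here
        n += BLOCK
    return n

cache = set(block_keys([1, 2, 3, 4, 5, 6, 7, 8]))          # first request fills cache
print(cached_prefix_len([1, 2, 3, 4, 5, 6, 9, 9], cache))  # -> 4 (first block reused)
```

Only the uncached tail needs prefill, which is why shared system prompts cut first-token latency so sharply.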


Section 08

Multi-Hardware Platform Support

FastDeploy supports NVIDIA GPUs as well as a variety of domestic Chinese AI accelerators:

| Hardware Platform | Support Status | Description |
| --- | --- | --- |
| NVIDIA GPU | Fully supported | CUDA ecosystem |
| Kunlun XPU | Fully supported | Baidu self-developed |
| Hygon DCU | Fully supported | Domestic GPU |
| Iluvatar CoreX GPU | Fully supported | - |
| Enflame GCU | Fully supported | Models such as the S60 |
| Muxi GPU | Fully supported | - |
| Intel Gaudi | Fully supported | - |

This extensive hardware support allows enterprises to flexibly choose computing platforms based on factors such as cost, performance, and supply chain.