# FastDeploy v2.4: PaddlePaddle Large Model Inference Deployment Toolkit and PD Disaggregation Architecture Practice

> FastDeploy is a large language model (LLM) and vision-language model (VLM) inference deployment toolkit based on PaddlePaddle. The v2.4 version adds PD disaggregation deployment for DeepSeek V3 and Qwen3-MoE, enhances MTP speculative decoding capabilities, and fully optimizes MoE inference and multimodal prefix caching performance across multiple hardware platforms.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-03-31T08:14:14.000Z
- 最近活动: 2026-03-31T08:31:32.359Z
- 热度: 165.7
- 关键词: PaddlePaddle, FastDeploy, LLM Inference, VLM, PD Disaggregation, Speculative Decoding, Quantization, ERNIE, DeepSeek, Qwen, 国产 AI 芯片
- 页面链接: https://www.zingnex.cn/en/forum/thread/fastdeploy-v2-4-pd
- Canonical: https://www.zingnex.cn/forum/thread/fastdeploy-v2-4-pd
- Markdown 来源: floors_fallback

---

## Introduction / Main Floor: FastDeploy v2.4: PaddlePaddle Large Model Inference Deployment Toolkit and PD Disaggregation Architecture Practice

FastDeploy is a large language model (LLM) and vision-language model (VLM) inference deployment toolkit based on PaddlePaddle. The v2.4 version adds PD disaggregation deployment for DeepSeek V3 and Qwen3-MoE, enhances MTP speculative decoding capabilities, and fully optimizes MoE inference and multimodal prefix caching performance across multiple hardware platforms.

## Project Overview

FastDeploy is an LLM and VLM inference deployment toolkit in Baidu PaddlePaddle's ecosystem, dedicated to providing out-of-the-box production-grade deployment solutions. The project has been deeply optimized for enterprise application scenarios, supporting multiple hardware platforms and rich acceleration technologies.

The v2.4 version, released in January 2026, brings several important updates, including support for PD disaggregation deployment of DeepSeek V3 and Qwen3-MoE models, enhanced MTP (Multi-Token Prediction) speculative decoding capabilities, and full optimization of MoE inference and multimodal prefix caching across multiple hardware platforms.

## Load-Balanced PD Disaggregation

PD Disaggregation (Prefill-Decode Disaggregation) is a key technology to improve LLM inference efficiency. FastDeploy implements an industrial-grade PD disaggregation solution:

- **Context Caching**: KV Cache computed during the Prefill phase can be reused
- **Dynamic Instance Role Switching**: Dynamically adjust the Prefill/Decode role of instances based on load
- **SLO Guarantee**: Ensure Service Level Objectives are met while optimizing resource utilization
- **Throughput Optimization**: Improve overall throughput by separating compute-intensive and memory-intensive phases

## Unified KV Cache Transmission

FastDeploy provides a lightweight and high-performance KV cache transmission library:
- **Intelligent Transmission Protocol Selection**: Automatically select NVLink or RDMA for optimal performance
- **Low-Latency Transmission**: Optimize serialization and transmission overhead
- **Cross-Node Sharing**: Support KV Cache sharing in distributed deployments

## OpenAI API Compatibility and vLLM Compatibility

FastDeploy provides interfaces compatible with industry standards:
- **One-Command Deployment**: Simplify the deployment process
- **OpenAI API Compatibility**: Existing applications can migrate seamlessly
- **vLLM Interface Compatibility**: Maintain compatibility with the vLLM ecosystem

## Full Quantization Format Support

To reduce deployment costs, FastDeploy supports multiple quantization schemes:
- **W8A16**: 8-bit weights, 16-bit activations
- **W8A8**: 8-bit weights and activations
- **W4A16**: 4-bit weights, 16-bit activations
- **W4A8**: 4-bit weights, 8-bit activations
- **W2A16**: 2-bit weights, 16-bit activations
- **FP8**: 8-bit floating-point quantization

## Advanced Acceleration Technologies

**Speculative Decoding**
Generate drafts with small models and verify in parallel with large models, significantly accelerating the generation process. The v2.4 version enhances MTP (Multi-Token Prediction) capabilities, allowing prediction of multiple tokens at a time.

**Multi-Token Prediction (MTP)**
Based on speculative decoding, predict multiple subsequent tokens at a time to further improve decoding efficiency.

**Chunked Prefill**
Process the prefill phase of long sequences in chunks to balance resource utilization between prefill and decode phases and reduce latency spikes.

**Prefix Caching**
Cache KV values of common prefixes, which can significantly reduce first-token latency for multi-turn conversations and system prompt reuse scenarios. The v2.4 version has been specially optimized for multimodal scenarios.

## Multi-Hardware Platform Support

FastDeploy supports a variety of domestic AI accelerators:

| Hardware Platform | Support Status | Description |
|-------------------|----------------|-------------|
| NVIDIA GPU | Fully Supported | CUDA Ecosystem |
| Kunlun XPU | Fully Supported | Baidu Self-developed |
| Hygon DCU | Fully Supported | Domestic GPU |
| Iluvatar CoreX GPU | Fully Supported | - |
| Enflame GCU | Fully Supported | Models like S60 |
| Muxi GPU | Fully Supported | - |
| Intel Gaudi | Fully Supported | - |

This extensive hardware support allows enterprises to flexibly choose computing platforms based on factors such as cost, performance, and supply chain.
