vLLM Speculators: A Unified Framework for Production-Grade Large Model Inference Acceleration

Red Hat's open-source Speculators project provides a complete speculative decoding solution for vLLM, supporting the full workflow from training data generation to model deployment, and has been adapted to various mainstream architectures such as Llama, Qwen3, and GPT-OSS.

Published 2026-04-01 22:44 · Last activity 2026-04-01 22:48 · Estimated read: 6 min

Section 01

【Introduction】vLLM Speculators: A Unified Framework for Production-Grade Large Model Inference Acceleration

Red Hat's open-source Speculators project provides a complete speculative decoding solution for vLLM, supporting the full workflow from training data generation to model deployment, and has been adapted to various mainstream architectures such as Llama, Qwen3, and GPT-OSS. This project aims to solve the inference latency problem of large models, achieve lossless acceleration, and help developers improve inference speed without sacrificing output quality.


Section 02

【Background】Latency Dilemma of Large Model Inference and the Value of Speculative Decoding

As the parameter scale of large language models grows, inference latency has become a key bottleneck in practical deployment (e.g., real-time dialogue and code-completion scenarios). Traditional optimization methods such as quantization and pruning incur accuracy loss, whereas speculative decoding, a lossless acceleration technique, improves efficiency through a "draft-verify" mechanism and has become a focus of the industry.


Section 03

【Methodology】Speculative Decoding Principles and Speculators Framework Architecture

Core Principle: Speculative decoding uses a "draft-verify" mechanism: a small, fast draft model predicts multiple tokens ahead of time, and the main model verifies them in parallel. Tokens that pass verification are adopted directly, guaranteeing output identical to what the main model would have generated on its own (lossless).
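The draft-verify loop can be illustrated with a toy, greedy-decoding sketch. The two "models" below are stand-ins (deterministic functions over integer tokens, not real LLMs), but the control flow mirrors the mechanism: draft k tokens cheaply, verify them against the main model, keep the agreeing prefix, and fall back to the main model's token at the first mismatch.

```python
import random

# Toy "models": each maps a context (tuple of ints) to the next token.
# Purely illustrative stand-ins for a target LLM and its draft model.
def target_model(context):
    # Hypothetical target: next token is the sum of the last two, mod 10.
    return (context[-1] + context[-2]) % 10

def draft_model(context):
    # Hypothetical draft: agrees with the target most of the time,
    # but occasionally guesses wrong (simulated disagreement).
    guess = target_model(context)
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_decode(context, num_new_tokens, k=4):
    """Greedy draft-verify loop: draft k tokens, verify against the
    target, keep the longest agreeing prefix plus one target token."""
    out = list(context)
    while len(out) - len(context) < num_new_tokens:
        # 1) Draft phase: propose k tokens cheaply.
        drafted, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(tuple(ctx))
            drafted.append(t)
            ctx.append(t)
        # 2) Verify phase: the target checks each drafted token
        #    (a real engine does this in one parallel forward pass);
        #    accept until the first mismatch.
        for t in drafted:
            expected = target_model(tuple(out))
            if t == expected:
                out.append(t)          # accepted draft token
            else:
                out.append(expected)   # rejected: emit target's token
                break
        else:
            # All k drafts accepted; target contributes one bonus token.
            out.append(target_model(tuple(out)))
    return out[len(context):][:num_new_tokens]  # trim any overshoot

random.seed(0)
spec = speculative_decode((3, 5), 10)

# Losslessness check: plain greedy decoding with the target alone.
plain, ctx = [], [3, 5]
for _ in range(10):
    t = target_model(tuple(ctx))
    plain.append(t)
    ctx.append(t)
assert spec == plain
```

The final assertion demonstrates the "lossless" property: no matter how often the draft is wrong, the output matches what the main model would produce by itself; a bad draft only costs speed, never quality.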

Framework Architecture:

  1. Offline training data generation: run the main model to extract hidden states, which serve as supervision signals for draft model training;
  2. Draft model training: supports a range of architectures, including single-layer/multi-layer and dense/MoE/VLM;
  3. Standardized format: compatible with Hugging Face, with conversion tools provided to lower the barrier to adoption;
  4. Seamless vLLM integration: the trained model can be deployed directly via vllm serve, which reads the speculative decoding configuration automatically.
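The core idea of steps 1 and 2, distilling the main model's hidden states into a small draft head, can be sketched with a toy stand-in. Everything here is illustrative (random linear maps instead of transformers, plain gradient descent instead of the project's actual training recipe); it only shows the supervision structure: record the teacher's hidden states offline, then fit the draft to reproduce them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (offline data generation): run the "main model" over inputs and
# record its hidden states. A fixed random linear map stands in for a
# transformer layer (purely illustrative).
W_main = rng.normal(size=(16, 16))
inputs = rng.normal(size=(256, 16))
hidden_states = inputs @ W_main          # supervision signal

# Step 2 (draft model training): fit a draft model to reproduce the main
# model's hidden states. A single linear layer trained with gradient
# descent on an MSE loss stands in for an EAGLE-3-style draft head.
W_draft = np.zeros((16, 16))
lr = 0.05
for _ in range(300):
    pred = inputs @ W_draft
    grad = 2 * inputs.T @ (pred - hidden_states) / len(inputs)
    W_draft -= lr * grad

# After training, the draft closely matches the teacher's hidden states.
mse = float(np.mean((inputs @ W_draft - hidden_states) ** 2))
```

The point of the exercise: the draft never sees ground-truth text labels, only the main model's internal states, which is why the resulting draft is specialized to predict what that particular main model will do next.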

Section 04

【Evidence】Supported Model Matrix and Performance Evaluation

Supported Model Matrix:

| Main Model Architecture | Model Scale | Training Scheme | vLLM Deployment Support |
| --- | --- | --- | --- |
| Llama 3.x | 8B / 70B | EAGLE-3 | ✓ |
| Qwen3 | 8B / 14B / 32B | EAGLE-3 | ✓ |
| Qwen3 MoE | 30B / 235B | EAGLE-3 | ✓ |
| Qwen3-VL | 235B-A22B | EAGLE-3 | ✓ |
| GPT-OSS | 20B / 120B | EAGLE-3 | ✓ |
| Mistral3 Large | 675B | EAGLE-3 | ✓ |

Performance Evaluation: Integrates the GuideLLM benchmark framework to measure latency gains accurately; also supports combining FP8 dynamic quantization with speculative decoding to further reduce memory usage and compute overhead.
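To build intuition for what such benchmarks measure, here is the standard back-of-envelope model for speculative decoding throughput (not a GuideLLM formula, just the common i.i.d.-acceptance estimate): if the draft proposes k tokens and each is accepted with probability alpha, the expected number of tokens emitted per main-model forward pass is a truncated geometric series.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model forward pass when the
    draft proposes k tokens, each accepted independently with
    probability alpha (standard speculative-decoding estimate)."""
    # Sum of the geometric series 1 + alpha + ... + alpha**k:
    # j accepted drafts contribute j tokens plus one main-model token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Without speculation, one forward pass yields exactly one token, so this
# value is also the ideal speedup when draft-model cost is negligible.
baseline = expected_tokens_per_verify(0.0, 4)   # drafts never accepted
good_draft = expected_tokens_per_verify(0.8, 4)  # well-trained draft
```

With alpha = 0.8 and k = 4, the estimate is about 3.36 tokens per verify step, which is why acceptance rate, not raw draft speed, dominates real-world gains; actual speedups also depend on draft cost and batching, which is exactly what an end-to-end benchmark like GuideLLM captures.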


Section 05

【Practice】Quick Start and Community Ecosystem

Quick Start:

  • Installation: pip install speculators
  • Enable data generation (editable install from a source checkout): pip install -e ".[datagen]"
  • Deployment: vllm serve RedHatAI/Qwen3-8B-speculator.eagle3 (the speculator configuration is read automatically)

Community Ecosystem:

  • Open source license: Apache 2.0;
  • Community support: Slack channels #speculators and #feat-spec-decode;
  • Rich example code: Covers the full workflow of data generation, training, evaluation, and deployment.

Section 06

【Conclusion】Significance and Future Outlook of the Speculators Project

The Speculators project marks an important milestone in the transition of speculative decoding technology from academic research to production applications. By standardizing the training framework, supporting a wide range of models, and providing a seamless deployment experience, it lowers the threshold for developers to adopt speculative decoding. For teams deploying large models in latency-sensitive scenarios, this project is worth in-depth exploration. With community contributions and algorithm iterations, we look forward to speculative decoding becoming one of the standard configurations for LLM inference optimization.