Zing Forum

VeRL-Omni: A Reinforcement Learning Training Framework for Diffusion Models and Omni-Modal Generative Models

VeRL-Omni is a reinforcement learning training framework designed for multi-modal generative models. It supports RL post-training for diffusion models (e.g., Qwen-Image, Wan2.2) and omni-modal models (e.g., Qwen3-Omni), uses vLLM-Omni for efficient inference, and provides implementations of several RL algorithms along with an asynchronous reward calculation mechanism.

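VeRL-Omni, reinforcement learning, diffusion models, multi-modal generation, RL training framework, Qwen-Image, vLLM-Omni, FlowGRPO, video generation, Ascend NPU
Published 2026-06-12 17:16 | Recent activity 2026-06-12 17:21 | Estimated read 6 min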

Section 01

Introduction

VeRL-Omni is a reinforcement learning training framework designed for multi-modal generative models. It supports RL post-training for diffusion models (e.g., Qwen-Image, Wan2.2) and omni-modal models (e.g., Qwen3-Omni). It uses vLLM-Omni for efficient inference and provides several RL algorithms along with an asynchronous reward calculation mechanism. The project is maintained by verl-project, is open source on GitHub, and was released on June 12, 2026.

Section 02

Background: Unique Challenges in RL Training for Multi-Modal Generative Models

RLHF and DPO have proven effective at improving alignment in LLMs. Multi-modal generative models (image, video, and audio generation; omni-modal understanding) differ substantially in architecture: diffusion models sample through multi-step iteration, while flow matching and autoregressive models follow different generation strategies. Existing RL frameworks are therefore difficult to adapt. Inference is complex, reward calculation has high latency, and preprocessing workflows differ widely across modalities, which creates demand for a specialized framework.

Section 03

Core Architecture and Technical Features

  1. Optimized Inference Backend: Uses vLLM-Omni (a multi-modal extension of vLLM) for high-throughput sample generation;
  2. Asynchronous Reward Service: Supports an HTTP scorer interface and overlaps reward calculation with rollout to reduce waiting time;
  3. Modular Training Backend: Supports VeOmni and FSDP2, with combinable parallel strategies (USP/TP/DP);
  4. Stability Enhancement: Introduces mechanisms such as rollout correction and deterministic rollout to address instability in RL training of diffusion models.
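
The idea behind the asynchronous reward service (item 2) can be illustrated with a small, self-contained sketch. It is not VeRL-Omni's API: `generate_batch` and `score_batch` are hypothetical stand-ins that simulate a rollout call and a remote scorer request with `asyncio.sleep`, and the 50 ms delays are arbitrary. The point is only that scoring for batch N can run while batch N+1 is being generated.

```python
import asyncio
import time


async def generate_batch(step: int) -> list[str]:
    """Hypothetical stand-in for a rollout call (not the VeRL-Omni API)."""
    await asyncio.sleep(0.05)  # simulated sampling latency
    return [f"sample-{step}-{i}" for i in range(4)]


async def score_batch(samples: list[str]) -> list[float]:
    """Hypothetical stand-in for a request to a remote reward scorer."""
    await asyncio.sleep(0.05)  # simulated scoring latency
    return [float(len(s)) for s in samples]  # toy reward: string length


async def run_sequential(num_steps: int) -> list[list[float]]:
    """Baseline: wait for each batch's rewards before the next rollout."""
    rewards = []
    for step in range(num_steps):
        samples = await generate_batch(step)
        rewards.append(await score_batch(samples))
    return rewards


async def run_overlapped(num_steps: int) -> list[list[float]]:
    """Start scoring in the background and continue with the next rollout."""
    pending = []
    for step in range(num_steps):
        samples = await generate_batch(step)
        pending.append(asyncio.create_task(score_batch(samples)))
    # Collect all rewards, in submission order, once rollouts are done.
    return await asyncio.gather(*pending)


def timed(coro_fn, num_steps: int):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(num_steps))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, t_seq = timed(run_sequential, 4)
    _, t_ovl = timed(run_overlapped, 4)
    print(f"sequential: {t_seq:.2f}s, overlapped: {t_ovl:.2f}s")
```

Both variants return the same rewards; the overlapped one finishes sooner because only the last batch's scoring is left on the critical path. A real trainer would also need to bound the number of in-flight requests and handle scorer failures, which this sketch omits.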

Section 04

Supported Models and Algorithm Matrix

  • Qwen-Image (Text-to-Image): FlowGRPO (CPS/SDE), MixGRPO, GRPO-Guard, DiffusionNFT, DPO (all verified);
  • Wan2.2 (Text-to-Video): DanceGRPO (verified);
  • SD3.5 (Text-to-Image): DPO (verified);
  • LTX2.3 (Text-to-Video+Audio): FlowGRPO (in development);
  • BAGEL (Unified Understanding + Generation): FlowGRPO (in development);
  • HunyuanImage-3.0: MixGRPO, SRPO (planned);
  • Qwen3-Omni-Thinker (Omni-Modal): GSPO (in development).
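
Most of the algorithms listed above belong to the GRPO family, which scores each sample relative to the other samples generated for the same prompt instead of using a learned value function. The sketch below shows the standard group-relative advantage normalization. It is an illustration under stated assumptions, not VeRL-Omni's code: the function name, the `eps` constant, and the use of the population standard deviation are choices made here, and individual variants (FlowGRPO, MixGRPO, DanceGRPO, and so on) may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of samples.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumed value) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four images sampled for one prompt, scored by some reward model.
    group = [0.2, 0.5, 0.9, 0.4]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

Samples that score above the group mean get a positive advantage and are reinforced; samples below it get a negative one. When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.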

Section 05

Performance Advantages and Domestic Hardware Support

  • Performance Improvement: In Qwen-Image FlowGRPO tests, end-to-end throughput is 25% higher than the diffusers implementation, thanks to optimizations such as vLLM-Omni inference, FSDP2 training, and asynchronous reward calculation;
  • Domestic Hardware Support: Natively supports Ascend NPUs and provides quick start guides, lowering the barrier to multi-modal RL training on domestic chips.
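
A note on reading the 25% figure: a throughput gain does not translate one-to-one into time saved. Assuming the gain holds uniformly over a run (an assumption made here, not a claim from the project), the short calculation below converts it into relative wall-clock time for a fixed number of samples. The function name is illustrative.

```python
def relative_wall_clock(throughput_gain: float) -> float:
    """Fraction of the baseline time needed to process the same workload.

    throughput_gain is expressed as a fraction, e.g. 0.25 for "25% higher".
    """
    return 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    ratio = relative_wall_clock(0.25)
    print(f"time vs. baseline: {ratio:.2f}")    # 0.80
    print(f"time saved: {(1 - ratio) * 100:.0f}%")  # 20%
```

In other words, 25% higher throughput corresponds to roughly 20% less training time for the same workload.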

Section 06

Application Scenarios and Practical Significance

  • Researchers: Stable and efficient baselines that lower the barrier to reproducing results;
  • Developers: A modular architecture that makes it easy to integrate new models and reward functions, with rich documentation and examples;
  • Enterprise Users: Performance optimizations and Ascend support reduce training costs, and asynchronous reward calculation accommodates external evaluation services.

Section 07

Summary and Future Outlook

VeRL-Omni addresses the challenges specific to RL training of multi-modal generative models. Its broad model-algorithm matrix, performance advantages, and domestic hardware compatibility make it an important tool in this field. The project integrates with the verl and vLLM-Omni ecosystems and is continuously updated, for example with the addition of DiffusionNFT and DPO, so it is positioned to play a key role in multi-modal AI applications.