# VeRL-Omni: A Reinforcement Learning Training Framework for Diffusion Models and Omni-Modal Generative Models

> VeRL-Omni is a reinforcement learning training framework specifically designed for multi-modal generative models. It supports RL post-training for diffusion models (e.g., Qwen-Image, Wan2.2) and omni-modal models (e.g., Qwen3-Omni), enables efficient inference based on vLLM-Omni, and provides implementations of various RL algorithms and an asynchronous reward calculation mechanism.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-12T09:16:40.000Z
- Last activity: 2026-06-12T09:21:00.257Z
- Popularity: 154.9
- Keywords: VeRL-Omni, reinforcement learning, diffusion models, multi-modal generation, RL training framework, Qwen-Image, vLLM-Omni, FlowGRPO, video generation, Ascend NPU
- Page link: https://www.zingnex.cn/en/forum/thread/verl-omni
- Canonical: https://www.zingnex.cn/forum/thread/verl-omni

---

## Introduction

VeRL-Omni is a reinforcement learning (RL) training framework built for multi-modal generative models. It supports RL post-training for diffusion models (e.g., Qwen-Image, Wan2.2) and omni-modal models (e.g., Qwen3-Omni), runs inference on vLLM-Omni for efficient sample generation, and provides implementations of several RL algorithms together with an asynchronous reward calculation mechanism. The project is maintained by the verl-project, is open-sourced on GitHub, and was released on June 12, 2026.

## Background: Unique Challenges in RL Training for Multi-Modal Generative Models

RLHF and DPO have proven effective for improving the alignment of LLMs. Multi-modal generative models (image, video, and audio generation, as well as omni-modal understanding) differ substantially in architecture: diffusion models sample through multi-step iteration, while flow-matching and autoregressive models follow different generation strategies. These differences make existing RL frameworks hard to adapt, because inference pipelines are complex, reward calculation has high latency, and preprocessing workflows vary widely across modalities. The result is a demand for specialized frameworks.

## Core Architecture and Technical Features

1. **Optimized Inference Backend**: Uses vLLM-Omni (a multi-modal extension of vLLM) for high-throughput sample generation.
2. **Asynchronous Reward Service**: Supports an HTTP scorer interface and overlaps reward calculation with rollout, so the trainer spends less time waiting for scores.
3. **Modular Training Backend**: Supports VeOmni/FSDP2 and allows parallel strategies (USP/TP/DP) to be combined.
4. **Stability Enhancement**: Introduces mechanisms such as rollout correction and deterministic rollout to address the instability of RL training for diffusion models.
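
The overlap described in item 2 can be illustrated with a minimal `asyncio` sketch. It is not VeRL-Omni's implementation: `generate_batch`, `score_batch`, and `train_loop` are hypothetical names, and both the rollout and the HTTP scorer are replaced by short sleeps so that the example runs without a model or a server.

```python
import asyncio


async def generate_batch(step: int, batch_size: int = 4) -> list[str]:
    """Hypothetical stand-in for a rollout: returns sample identifiers."""
    await asyncio.sleep(0.05)  # simulated generation latency
    return [f"sample-{step}-{i}" for i in range(batch_size)]


async def score_batch(samples: list[str]) -> list[float]:
    """Hypothetical stand-in for a call to an external HTTP scorer."""
    await asyncio.sleep(0.05)  # simulated network and scoring latency
    return [float(len(s)) for s in samples]  # dummy reward: string length


async def train_loop(num_steps: int) -> list[list[float]]:
    pending: list[asyncio.Task] = []
    for step in range(num_steps):
        samples = await generate_batch(step)
        # Start scoring in the background and move on immediately, so the
        # next rollout runs while this batch is still being scored.
        pending.append(asyncio.create_task(score_batch(samples)))
    # Collect all rewards, in step order, once the rollouts are done.
    return await asyncio.gather(*pending)


if __name__ == "__main__":
    for step, rewards in enumerate(asyncio.run(train_loop(3))):
        print(step, rewards)
```

Because scoring for step `t` runs while step `t + 1` is being generated, the total wall-clock time in this toy setup approaches the rollout time alone plus one scoring call, rather than the sum of both for every step. A real trainer would also need to bound how many scoring requests are in flight and decide how stale a reward may be before it is used.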

## Supported Models and Algorithm Matrix

- **Qwen-Image (Text-to-Image)**: FlowGRPO (CPS/SDE), MixGRPO, GRPO-Guard, DiffusionNFT, DPO (all verified);
- **Wan2.2 (Text-to-Video)**: DanceGRPO (verified);
- **SD3.5 (Text-to-Image)**: DPO (verified);
- **LTX2.3 (Text-to-Video+Audio)**: FlowGRPO (in development);
- **BAGEL (Unified Understanding + Generation)**: FlowGRPO (in development);
- **HunyuanImage-3.0**: MixGRPO, SRPO (planned);
- **Qwen3-Omni-Thinker (Omni-Modal)**: GSPO (in development).
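
Several entries in this matrix belong to the GRPO family. As background, the group-relative advantage at the core of GRPO can be sketched as follows: the rewards of samples generated from the same prompt are normalized against that group's mean and standard deviation. This is a generic illustration, not code from VeRL-Omni; the function name, the use of the population standard deviation, and the `eps` value are assumptions, and the listed variants differ in other details that the sketch does not attempt to capture.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sample group (GRPO-style).

    Assumption: population standard deviation plus a small eps that guards
    against division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four images sampled from the same prompt, scored by a reward model.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Samples that score above their group's mean receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.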

## Performance Advantages and Domestic Hardware Support

- **Performance Improvement**: In Qwen-Image FlowGRPO tests, end-to-end throughput is 25% higher than the diffusers implementation, a gain attributed to optimizations such as vLLM-Omni inference, FSDP2 training, and asynchronous reward calculation;
- **Domestic Hardware Support**: Natively supports Ascend NPUs and provides quick start guides, lowering the barrier to multi-modal RL training on domestic chips.
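
To put the throughput figure in terms of training time: assuming the 25% gain holds uniformly over a run (an assumption, since the source reports a single test figure), processing a fixed number of samples $N$ takes

$$
\frac{T_{\text{VeRL-Omni}}}{T_{\text{diffusers}}}
= \frac{N / (1.25\,\rho)}{N / \rho}
= \frac{1}{1.25}
= 0.8,
$$

where $\rho$ denotes the throughput of the diffusers baseline. A 25% throughput increase therefore corresponds to roughly 20% less wall-clock time for the same workload, not 25% less.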

## Application Scenarios and Practical Significance

- **Researchers**: Stable and efficient baselines lower the barrier to reproducing results;
- **Developers**: A modular architecture makes it easy to integrate new models and reward functions, backed by rich documentation and examples;
- **Enterprise Users**: Performance optimizations and Ascend support reduce training costs, and asynchronous reward calculation accommodates external evaluation services.

## Summary and Future Outlook

VeRL-Omni addresses the challenges specific to RL training for multi-modal generative models: complex inference, slow reward calculation, and modality-specific preprocessing. Its model-algorithm matrix, performance gains, and compatibility with domestic hardware make it a useful tool in this field. The project integrates with the verl and vLLM-Omni ecosystems and is continuously updated (e.g., with the addition of DiffusionNFT and DPO), which positions it to support a growing range of multi-modal AI applications.
