# Tencent Hunyuan Open-Sources UniRL: A Unified Reinforcement Learning Framework for Multimodal Models

> The Tencent Hunyuan team has open-sourced UniRL, a general-purpose reinforcement learning (RL) training framework that supports diffusion models, autoregressive models, and unified models, enabling a unified paradigm for cross-modal RL post-training.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-09T07:59:25.000Z
- Last activity: 2026-06-09T08:19:06.282Z
- Popularity: 145.7
- Keywords: UniRL, Tencent Hunyuan, multimodal models, reinforcement learning, diffusion models, large language models, RLHF, FlowDPPO, DRPO, open-source framework
- Page link: https://www.zingnex.cn/en/forum/thread/unirl
- Canonical: https://www.zingnex.cn/forum/thread/unirl
- Markdown source: floors_fallback

---

## Introduction: Tencent Hunyuan Open-Sources UniRL — A Unified RL Training Framework for Multimodal Models

UniRL is a general-purpose reinforcement learning training framework from the Tencent Hunyuan team that supports diffusion models, autoregressive models, and unified models. It targets a fragmentation problem in the multimodal field, where each model architecture has required its own RL training solution, and aims to provide a unified paradigm for cross-modal RL post-training. The project is open-sourced on GitHub and offers researchers and engineers efficient training infrastructure.

## Project Background: Fragmentation Pain Points in the Multimodal AI Ecosystem

The current multimodal AI ecosystem is highly fragmented: diffusion models are used for image/video generation, autoregressive models handle text/visual understanding, and unified models integrate the capabilities of both. However, each model type requires a specialized RL training framework (e.g., diffusion models need policy optimization in a continuous noise space, while autoregressive models rely on token-level reward calculation). This fragmentation leads to duplicated development effort and wasted resources, and it hinders the transfer and reuse of techniques across modalities.

## Core Design: Layered Composable Architecture and Innovative Algorithms

The core design idea of UniRL is to abstract the general RL loop (generate samples → evaluate rewards → compute advantages → update policy → sync weights) and implement it through a layered, composable architecture:

1. Entry layer: training entry points for the different model domains (e.g., `train_diffusion`, `train_ar`);
2. Trainer layer: trainers corresponding to the different model types (e.g., `DiffusionTrainer`, `ARTrainer`);
3. Plugin component layer: rollout engine, algorithm implementations, etc.;
4. Distributed runtime layer: built on Ray, FSDP, etc.

Supported models include Stable Diffusion 3, Qwen-VL, and HunyuanImage3. The project also proposes new algorithms such as FlowDPPO (PPO optimization for flow matching models) and DRPO (which alleviates mode collapse in LLM RLHF).
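
To make the five-stage loop and the plugin idea concrete, here is a minimal, self-contained sketch on a toy three-action problem. It is not UniRL code: every class and function name (`SoftmaxPolicy`, `RolloutEngine`, `Trainer`, `mean_baseline_advantages`, `run_demo`) is hypothetical, the "model" is just a softmax over three logits, and the update rule is plain REINFORCE with a mean baseline, chosen only because it is short. It illustrates how a trainer can own the loop while the rollout engine, reward function, and advantage estimator are passed in as swappable components.

```python
"""Toy sketch of a generic RL post-training loop with pluggable components.

Every name here (SoftmaxPolicy, RolloutEngine, Trainer, ...) is hypothetical
and is NOT taken from UniRL's codebase.
"""
import math
import random
from typing import Callable, List


class SoftmaxPolicy:
    """Categorical policy over a few actions; stands in for a real model."""

    def __init__(self, n_actions: int) -> None:
        self.logits: List[float] = [0.0] * n_actions

    def probs(self) -> List[float]:
        peak = max(self.logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in self.logits]
        total = sum(exps)
        return [e / total for e in exps]


class RolloutEngine:
    """Plugin: samples from its own copy of the policy weights."""

    def __init__(self, n_actions: int, seed: int = 0) -> None:
        self.policy = SoftmaxPolicy(n_actions)
        self.rng = random.Random(seed)

    def generate(self, n: int) -> List[int]:
        actions = list(range(len(self.policy.logits)))
        return self.rng.choices(actions, weights=self.policy.probs(), k=n)

    def sync_weights(self, logits: List[float]) -> None:
        self.policy.logits = list(logits)


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Plugin: advantage = reward minus the batch mean (one simple choice)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


class Trainer:
    """Owns the five-stage loop; each stage is a swappable component."""

    def __init__(
        self,
        policy: SoftmaxPolicy,
        rollout: RolloutEngine,
        reward_fn: Callable[[int], float],
        advantage_fn: Callable[[List[float]], List[float]],
        lr: float = 0.5,
    ) -> None:
        self.policy = policy
        self.rollout = rollout
        self.reward_fn = reward_fn
        self.advantage_fn = advantage_fn
        self.lr = lr

    def step(self, batch_size: int) -> float:
        samples = self.rollout.generate(batch_size)       # 1. generate samples
        rewards = [self.reward_fn(a) for a in samples]    # 2. evaluate rewards
        advantages = self.advantage_fn(rewards)           # 3. compute advantages
        probs = self.policy.probs()
        grad = [0.0] * len(probs)
        for action, adv in zip(samples, advantages):      # 4. update policy
            for i in range(len(probs)):
                # d log pi(action) / d logit_i for a softmax policy
                grad[i] += adv * ((1.0 if i == action else 0.0) - probs[i])
        for i in range(len(probs)):
            self.policy.logits[i] += self.lr * grad[i] / batch_size
        self.rollout.sync_weights(self.policy.logits)     # 5. sync weights
        return sum(rewards) / len(rewards)


def run_demo(steps: int = 200, batch_size: int = 32, seed: int = 0) -> List[float]:
    """Train on a toy task where only action 2 earns a reward."""
    policy = SoftmaxPolicy(3)
    trainer = Trainer(
        policy,
        RolloutEngine(3, seed),
        reward_fn=lambda action: 1.0 if action == 2 else 0.0,
        advantage_fn=mean_baseline_advantages,
    )
    for _ in range(steps):
        trainer.step(batch_size)
    return policy.probs()


if __name__ == "__main__":
    print(run_demo())
```

In this sketch, swapping `mean_baseline_advantages` for another estimator, or replacing `RolloutEngine` with a wrapper around a real inference backend, leaves `Trainer.step` unchanged. That separation is what the layered design describes, though the actual interfaces in UniRL may look quite different.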

## Technical Implementation Highlights and Training Modes

Technical highlights of UniRL:

- Unified RL loop abstraction: applies to all supported model types;
- Flexible rollout engine: supports inference backends such as vLLM and SGLang;
- Distributed training: built on Ray, with support for data parallelism, model parallelism, etc.;
- Decoupled reward service: a standalone reward service supports multiple backends (learning-based, rule-based, external APIs).

UniRL provides four training entry points (diffusion/ar/pe/unified_model), configured through the Hydra configuration system. Training can be started with a single command (e.g., `python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside`).
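
The decoupled reward service mentioned in the highlights can be pictured as a thin routing layer in front of interchangeable backends. The sketch below is an illustration under assumptions, not UniRL's implementation: `RewardBackend`, `KeywordRuleBackend`, `CallableModelBackend`, `RewardService`, and `build_demo_service` are hypothetical names, the "learning-based" backend is a stand-in that wraps an arbitrary scoring function, and an external-API backend is omitted because it would only add a network call behind the same `score` method.

```python
"""Sketch of a reward layer decoupled from the trainer via a common interface.

All names are hypothetical and are NOT taken from UniRL's codebase.
"""
from typing import Callable, Dict, List, Protocol, Tuple


class RewardBackend(Protocol):
    """Anything that can score a batch of (prompt, output) pairs."""

    def score(self, prompts: List[str], outputs: List[str]) -> List[float]: ...


class KeywordRuleBackend:
    """Rule-based backend: 1.0 if the output contains the keyword, else 0.0."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword.lower()

    def score(self, prompts: List[str], outputs: List[str]) -> List[float]:
        return [1.0 if self.keyword in out.lower() else 0.0 for out in outputs]


class CallableModelBackend:
    """Stand-in for a learning-based backend: wraps any per-sample scorer."""

    def __init__(self, scorer: Callable[[str, str], float]) -> None:
        self.scorer = scorer

    def score(self, prompts: List[str], outputs: List[str]) -> List[float]:
        return [self.scorer(p, o) for p, o in zip(prompts, outputs)]


class RewardService:
    """Routes requests to registered backends and sums their weighted scores."""

    def __init__(self) -> None:
        self._backends: Dict[str, Tuple[RewardBackend, float]] = {}

    def register(self, name: str, backend: RewardBackend, weight: float = 1.0) -> None:
        self._backends[name] = (backend, weight)

    def score(self, prompts: List[str], outputs: List[str]) -> List[float]:
        if len(prompts) != len(outputs):
            raise ValueError("prompts and outputs must have the same length")
        totals = [0.0] * len(outputs)
        for backend, weight in self._backends.values():
            for i, value in enumerate(backend.score(prompts, outputs)):
                totals[i] += weight * value
        return totals


def build_demo_service() -> RewardService:
    """Combine a rule-based check with a toy length-based scorer."""
    service = RewardService()
    service.register("mentions_cat", KeywordRuleBackend("cat"), weight=1.0)
    service.register(
        "length",
        CallableModelBackend(lambda prompt, out: min(len(out) / 100, 1.0)),
        weight=0.5,
    )
    return service


if __name__ == "__main__":
    demo = build_demo_service()
    print(demo.score(["draw a cat", "draw a cat"], ["a cat on a mat", "a dog"]))
```

Because the trainer only sees `RewardService.score`, backends can be added, reweighted, or moved to a separate process without touching the training loop. In a real deployment the service would more likely sit behind an RPC or HTTP boundary than an in-process call.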

## Application Value: Lowering Thresholds, Promoting Transfer, and Accelerating Deployment

What open-sourcing UniRL offers:

- Lower barrier to research: researchers do not need to rebuild infrastructure and can focus on algorithm innovation;
- Easier technology transfer: RL techniques developed for LLMs can be carried over to diffusion models, and vice versa;
- Faster industrial deployment: a unified framework reduces maintenance costs and suits enterprise scenarios that involve multiple models;
- Support for unified model development: the framework supports training unified models such as HunyuanImage3.

## Summary and Future Roadmap

UniRL achieves the goal of "one codebase, multiple models" and marks an important step forward in RL training frameworks for multimodal models. The future roadmap includes:

1. Expand algorithm coverage (support for new models such as FLUX.2-Klein and HunyuanVideo);
2. Cross-domain transfer algorithms (extend FlowDPPO and DRPO to more models);
3. Enrich reward backends;
4. Optimize rollout engine efficiency.

Project GitHub repository: https://github.com/Tencent-Hunyuan/UniRL. Official documentation and example configurations are available via the related links.
