Zing Forum


Flow-OPD: Introducing Policy Distillation Technology from Large Language Models to Image Generation Models

Researchers applied On-Policy Distillation (OPD), a technique proven in the LLM field, to Flow Matching image generation models, proposing the Flow-OPD framework, which achieves significant performance improvements on Stable Diffusion 3.5.

Tags: Flow Matching, On-Policy Distillation, Image Generation, Stable Diffusion, Policy Distillation, Multi-Task Alignment, Reinforcement Learning, Text-to-Image
Published 2026-05-09 01:50 · Recent activity 2026-05-11 13:18 · Estimated read 8 min

Section 01

Introduction: Flow-OPD, Empowering Image Generation Models with LLM Policy Distillation

Researchers applied On-Policy Distillation (OPD), a technique proven in the Large Language Model (LLM) field, to Flow Matching image generation models, proposing the Flow-OPD framework. This framework addresses two core issues in the fine-tuning alignment phase of Flow Matching models, namely sparse rewards and gradient interference, and achieves significant performance improvements on Stable Diffusion 3.5, providing a new paradigm for multi-task alignment of image generation models.


Section 02

Background: Flow Matching Technology and Existing Bottlenecks

Flow Matching and Image Generation

Flow Matching is a significant technological breakthrough in the field of image generation, providing a more direct and efficient training method for diffusion models. By learning deterministic transformation paths between probability distributions, it simplifies the generation process and improves training stability and quality. Mainstream models such as Stable Diffusion 3.5 have adopted this technology.
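The linear-path training objective behind Flow Matching can be sketched in a few lines. This is a generic conditional flow matching loss, not SD 3.5's actual training code; `model` is a hypothetical velocity predictor:

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Conditional Flow Matching loss (sketch): learn a velocity field
    that transports Gaussian noise x0 to data x1 along a straight path.
    `model(x_t, t)` is an assumed velocity predictor; names are illustrative."""
    x0 = rng.standard_normal(x1.shape)   # source distribution: Gaussian noise
    t = rng.random((x1.shape[0], 1))     # random time in [0, 1] per sample
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                   # constant target velocity along the path
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)
```

Because the target velocity is defined deterministically from the endpoints, training reduces to simple regression, which is the source of the method's stability.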

Existing Bottlenecks

Sparse Reward Problem

Traditional reinforcement learning optimizes the model with a scalar reward signal, but such sparse feedback struggles to guide fine-grained improvements in complex image generation tasks, resulting in low learning efficiency.

Gradient Interference and the 'Seesaw Effect'

When optimizing multiple heterogeneous objectives (image quality, text alignment, etc.), the gradients interfere with one another, producing a 'seesaw effect' in which improving one metric causes another to decline; it can also encourage reward hacking.
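A toy numeric illustration of the seesaw effect (the vectors are made up for illustration, not taken from the paper): when two objectives' gradients have negative cosine similarity, a naive summed update can advance one objective while regressing the other:

```python
import numpy as np

def cosine(g1, g2):
    """Cosine similarity between two gradient vectors."""
    return g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

# Hypothetical 2-D parameter with two conflicting reward gradients:
g_quality = np.array([1.0, -0.2])   # gradient of an "image quality" reward
g_text    = np.array([-0.9, 1.0])   # gradient of a "text alignment" reward

print(cosine(g_quality, g_text))    # negative => the objectives conflict

step = 0.1 * (g_quality + g_text)   # naive summed ascent step
# Progress of each objective along the combined step (first-order):
print(g_quality @ step)             # negative: quality regresses
print(g_text @ step)                # positive: text alignment improves
```

The seesaw shows up as opposite signs in the two projections: the combined step helps one reward while hurting the other, which is exactly what separating expert training is meant to avoid.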


Section 03

Solution: Detailed Explanation of the Flow-OPD Framework

Flow-OPD is the first unified post-training framework that integrates policy distillation into Flow Matching models, with core components including:

Two-Stage Alignment Strategy

Stage 1: Cultivate Domain Experts

Use single-reward GRPO fine-tuning to train a specialized teacher model for each domain (e.g., text rendering, aesthetic quality), avoiding multi-objective conflicts.
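The single-reward signal each expert trains on can be sketched with GRPO's standard group-relative advantage computation; the reward values below are purely illustrative:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO (sketch): for a group of
    samples generated from the same prompt, normalize each scalar reward by
    the group mean and std, which removes the need for a learned value model."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled images scored by a single reward model
# (e.g., an OCR-accuracy reward when training the text-rendering teacher):
adv = grpo_advantages([0.2, 0.9, 0.5, 0.4])
```

Because each teacher sees only one reward, its advantages are never pulled in conflicting directions by other objectives.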

Stage 2: Knowledge Distillation and Integration

Establish an initial policy via a Flow-based cold start, then integrate the heterogeneous expert knowledge in three steps:

  1. On-policy sampling: generate samples from the current policy
  2. Task routing annotation: assign the best-suited teacher's guidance according to task type
  3. Dense trajectory-level supervision: use complete generation trajectories for fine-grained learning
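The three steps above can be sketched as a single distillation step. `student`, `teachers`, and `router` are hypothetical stand-ins for the paper's components, not its actual API:

```python
import numpy as np

def opd_step(student, teachers, router, prompts, rng):
    """One on-policy distillation step (illustrative sketch):
    1. on-policy sampling: roll out full denoising trajectories from the student
    2. task routing: pick the expert teacher matching each prompt's task
    3. dense supervision: match the teacher's velocity at every trajectory step
    """
    total, count = 0.0, 0
    for prompt in prompts:
        traj = student.sample_trajectory(prompt, rng)   # list of (x_t, t) pairs
        teacher = teachers[router(prompt)]              # e.g. "ocr" or "aesthetic"
        for x_t, t in traj:
            v_s = student.velocity(x_t, t, prompt)
            v_t = teacher.velocity(x_t, t, prompt)
            total += np.mean((v_s - v_t) ** 2)          # dense per-step loss
            count += 1
    return total / count
```

Supervising every step of the trajectory is what makes the reward signal dense: unlike a single scalar reward per image, the student receives a gradient at each point along the generation path.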

Manifold Anchoring Regularization (MAR)

Use a task-agnostic teacher model to provide full-data supervision, anchoring the generated distribution to a high-quality manifold. This preserves image fidelity and alignment with human preferences, mitigating the aesthetic degradation seen in pure reinforcement-learning alignment.
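A minimal sketch of how such an anchoring term could combine with the distillation loss, assuming a simple MSE penalty toward a task-agnostic anchor teacher and an illustrative weight `lam` (not a value from the paper):

```python
import numpy as np

def mar_loss(v_student, v_anchor, distill_loss, lam=0.1):
    """Manifold Anchoring Regularization (sketch): add a penalty keeping the
    student's velocity field close to a task-agnostic anchor teacher's, so
    that distillation gains do not drift off the high-quality image manifold.
    `lam` is an assumed weighting hyperparameter."""
    anchor_term = np.mean((v_student - v_anchor) ** 2)
    return distill_loss + lam * anchor_term
```

The anchor term acts like a trust region around the base model's distribution: the expert-distillation loss pulls the student toward task-specific behavior, while the regularizer penalizes moves that leave the high-quality manifold.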


Section 04

Experimental Results: Significant Improvements on Stable Diffusion 3.5

The experimental results on Stable Diffusion 3.5 Medium are as follows:

Metric          Baseline   Flow-OPD   Improvement
GenEval Score   63         92         +46%
OCR Accuracy    59%        94%        +59%

Flow-OPD also outperforms vanilla GRPO by 10 points.

In addition, these gains come without sacrificing image fidelity or alignment with human preferences. A 'Teacher Surpassing Effect' was also observed: the student model exceeds its specialized teacher models in some aspects.


Section 05

Technical Insights and Significance

Cross-Domain Technology Transfer Value

Flow-OPD demonstrates that techniques from the LLM field (such as OPD) can be transferred effectively to image generation, offering a cross-modal reference point for AI research.

New Paradigm for Multi-Task Alignment

By separating expert training and knowledge integration, it provides a general framework for solving the seesaw effect in multi-objective optimization, which can be applied to other AI systems that balance multiple objectives.

Scalable Alignment Paradigm

Flow-OPD is positioned as a 'scalable alignment paradigm for building general text-to-image models'. As image generation models develop, this systematic alignment method will become more important.


Section 06

Conclusion: Technical Value and Future Potential of Flow-OPD

Flow-OPD represents important progress in post-training technology for image generation models. By combining LLM-style policy distillation with Flow Matching, it addresses core problems such as sparse rewards and gradient interference and achieves significant performance improvements. This work lays a technical foundation for next-generation general-purpose image generation models and demonstrates the potential of cross-domain technology integration.