SRPO Method Overview
SRPO (Score-based Reward Preference Optimization) is an alignment framework designed around the characteristics of diffusion models. Its core idea is to introduce human preference feedback at each time step of the diffusion process and to guide the model toward outputs that better match human expectations through a combination of reward modeling and preference optimization.
Unlike traditional RLHF (Reinforcement Learning from Human Feedback) for large language models, SRPO explicitly accounts for the multi-step denoising process that is unique to diffusion models. Generation in a diffusion model is a progressive refinement, and every intermediate step influences the quality of the final output. By injecting preference signals at multiple time steps, SRPO achieves fine-grained control over the generation process.
Technical Principles in Detail
Combination of Score Function and Reward Model
The core of a diffusion model is learning the score function of the data distribution, i.e. the gradient of the log probability density. Building on this, SRPO introduces a reward model to quantify human preferences. The reward model takes an intermediate state of the diffusion process as input and outputs a scalar value indicating how well that state conforms to human preferences.
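As an illustration only (the architecture used in SRPO is not specified here), a minimal reward model over intermediate diffusion states might take a flattened noisy sample x_t together with its timestep t and return a scalar preference score:

```python
import torch
import torch.nn as nn

class TimestepRewardModel(nn.Module):
    """Illustrative reward model: maps a noisy intermediate state x_t and its
    timestep t to a scalar preference score. The architecture is a placeholder."""

    def __init__(self, state_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden_dim), nn.SiLU())
        self.net = nn.Sequential(
            nn.Linear(state_dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 1),  # scalar preference score
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, state_dim) flattened intermediate state
        # t:   (batch,) diffusion timestep, normalized to [0, 1]
        t_emb = self.time_embed(t.unsqueeze(-1))
        return self.net(torch.cat([x_t, t_emb], dim=-1)).squeeze(-1)
```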
During training, SRPO adopts a joint optimization strategy: on one hand, it preserves the diffusion model's ability to fit the data distribution; on the other hand, it trains the reward model on preference data and uses the resulting reward signals to adjust the diffusion model's score function. This design lets the model move toward human preferences while retaining generation diversity.
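A rough sketch of such a joint objective, under the assumption that the denoising term is the standard noise-prediction MSE and that the reward term simply encourages intermediate states the reward model scores highly (the exact weighting and coupling used by SRPO may differ; `add_noise` and `reverse_step` are hypothetical helpers):

```python
import torch
import torch.nn.functional as F

def joint_loss(diffusion_model, reward_model, x_0, t, noise, lambda_reward=0.1):
    """Illustrative joint objective: keep the denoising loss while nudging the
    model toward states that the reward model rates highly."""
    # Standard noise-prediction loss on the forward-diffused sample.
    x_t = add_noise(x_0, noise, t)              # assumed forward-diffusion helper
    pred_noise = diffusion_model(x_t, t)        # model is assumed to predict noise
    denoise_loss = F.mse_loss(pred_noise, noise)

    # Reward term: take one reverse step and score it with the reward model.
    x_prev = reverse_step(x_t, pred_noise, t)   # assumed reverse-step helper
    reward_loss = -reward_model(x_prev.flatten(1), t.float()).mean()

    return denoise_loss + lambda_reward * reward_loss
```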
Multi-Time-Step Preference Learning
A key innovation of SRPO is its support for multi-time-step preference learning. In the denoising process of a diffusion model, different time steps correspond to intermediate states with different noise levels. Studies have shown that human preferences for generated content can manifest differently at different stages: at early, high-noise steps they concern overall structure, while at later, low-noise steps they concern detail quality.
SRPO allows this cross-time-step preference data to be collected and modeled. Through contrastive learning, the model learns from paired preference comparisons and gradually builds a preference representation that spans the entire diffusion process. This enables the model to account for human aesthetic and practical needs at every denoising step.
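A minimal sketch of how paired comparisons at a given timestep could train the reward model, using a Bradley-Terry-style pairwise loss (SRPO's exact contrastive formulation may differ):

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, x_t_win, x_t_lose, t):
    """Bradley-Terry-style loss: the preferred ("winning") intermediate state
    should receive a higher reward than the rejected one at the same timestep."""
    r_win = reward_model(x_t_win, t)
    r_lose = reward_model(x_t_lose, t)
    # -log sigmoid(r_win - r_lose), averaged over the batch
    return -F.logsigmoid(r_win - r_lose).mean()
```

Summing this loss over pairs drawn from many different timesteps yields a preference signal that covers the whole denoising trajectory rather than only the final image.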
Stable Training Strategy
Training diffusion models already requires careful parameter tuning, and once preference alignment is introduced, training stability becomes even more important. SRPO adopts several key techniques to keep training stable:
First, gradient clipping and adaptive learning rate adjustment. Because reward signals can be noisy, applying them directly to the diffusion model can cause training to diverge. SRPO keeps parameter updates smooth by limiting the magnitude of reward gradients and dynamically adjusting the learning rate.
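In PyTorch terms, these two safeguards might look like the following sketch; the clip norm and the choice of scheduler are placeholders, not values taken from SRPO:

```python
import torch

def clipped_update(model, optimizer, scheduler, loss, max_norm=1.0):
    """One optimizer step with gradient clipping; the scheduler then adapts
    the learning rate based on the observed loss."""
    optimizer.zero_grad()
    loss.backward()
    # Limit the size of the (possibly noisy) reward-driven gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    scheduler.step(loss.item())  # e.g. ReduceLROnPlateau keyed on the loss
```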
Second, a regularization mechanism. To prevent the model from overfitting to limited preference data and losing its generalization ability, SRPO introduces a regularization term based on the original data distribution. In effect, the model keeps its grounding in the real data distribution while being optimized toward human preferences.
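One common way to express such a regularizer, sketched here under the assumption that a frozen copy of the pretrained model serves as the reference (SRPO's exact formulation may differ), is to penalize drift of the fine-tuned model's predictions away from the reference:

```python
import torch
import torch.nn.functional as F

def distribution_regularizer(model, reference_model, x_t, t):
    """Penalize drift of the fine-tuned model away from a frozen copy of the
    pretrained model, preserving its fit to the original data distribution."""
    with torch.no_grad():
        ref_pred = reference_model(x_t, t)   # frozen pretrained predictions
    cur_pred = model(x_t, t)
    return F.mse_loss(cur_pred, ref_pred)
```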