The architecture of Trinity-RFT can be divided into three layers:
Data Layer: Responsible for data loading, preprocessing, and batch management. It supports multiple data formats, including conversational data, preference-pair data, and trajectory data annotated with reward signals. Built-in data validation and cleaning mechanisms help ensure the quality of input data; a sketch of these formats follows.
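To make the formats concrete, here is a minimal sketch of what the three sample types and a validation check might look like. The field names and the `validate_sample` helper are illustrative assumptions, not Trinity-RFT's actual schema or API.

```python
from typing import Any

# Hypothetical examples of the three supported data formats
# (field names are illustrative, not Trinity-RFT's actual schema).
conversational_sample = {
    "messages": [
        {"role": "user", "content": "What is RFT?"},
        {"role": "assistant", "content": "Reinforcement fine-tuning ..."},
    ]
}

preference_pair_sample = {
    "prompt": "Explain PPO in one sentence.",
    "chosen": "PPO is a policy-gradient method with a clipped objective.",
    "rejected": "PPO is a kind of database.",
}

trajectory_sample = {
    "prompt": "Compute 2 + 2.",
    "response": "4",
    "reward": 1.0,  # scalar reward signal attached to the trajectory
}

def validate_sample(sample: dict[str, Any]) -> bool:
    """Toy validation check of the kind a data-cleaning stage might run:
    reject samples with empty or missing fields."""
    return all(value not in (None, "", []) for value in sample.values())

assert all(map(validate_sample,
               [conversational_sample, preference_pair_sample, trajectory_sample]))
```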
Training Layer: This is the core of the framework, implementing multiple reinforcement learning algorithms. In addition to standard PPO, it supports:
- DPO (Direct Preference Optimization): Optimizes the policy directly on preference-pair data, without explicitly training a reward model (see the loss sketch after this list).
- KTO (Kahneman-Tversky Optimization): An alignment method grounded in prospect theory's model of human decision-making, which captures humans' asymmetric perception of gains and losses.
- Online/Offline Hybrid Training: Supports flexible switching between pre-collected (offline) data and data newly generated by the current policy (online).
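As a concrete instance of the first item, the following is a minimal PyTorch sketch of the standard DPO loss. The function and tensor names, and the beta value, are illustrative assumptions; only the loss formula itself comes from the DPO objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), per sequence
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # illustrative temperature
) -> torch.Tensor:
    """DPO objective: prefer the chosen response over the rejected one,
    measured relative to a frozen reference model, with no reward model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (margin difference)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with fake per-sequence log-probabilities
fake = torch.randn(4, 8)
loss = dpo_loss(fake[0], fake[1], fake[2], fake[3])
```

In real training the policy log-probabilities carry gradients while the reference model's are detached, so minimizing this loss widens the policy's preference margin without drifting far from the reference.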
Inference Layer: Responsible for model inference and rollout sampling. It integrates with high-performance inference engines such as vLLM and Text Generation Inference, which substantially accelerates experience generation during training; a sketch of such an integration appears below.
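As an illustration of the rollout side, the sketch below batch-generates samples with vLLM's public Python API. The model checkpoint and sampling settings are placeholders, and how Trinity-RFT actually wires the engine into its training loop is an assumption here.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; any vLLM-compatible model id or local path works.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

sampling = SamplingParams(
    temperature=0.8,  # illustrative exploration temperature for rollouts
    top_p=0.95,
    max_tokens=256,
    n=4,              # several completions per prompt, as RL rollouts need
)

prompts = ["Explain PPO in one sentence.", "What is a reward model?"]
outputs = llm.generate(prompts, sampling)

# Flatten into (prompt, completion) pairs to hand back to the trainer.
rollouts = [
    (out.prompt, completion.text)
    for out in outputs
    for completion in out.outputs
]
```

Batched serving of this kind keeps the sampling stage fast enough to feed the trainer, which is where the efficiency gain over naive per-sequence generation comes from.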