Section 01
[Introduction] FIPO: A New Pure Reinforcement Learning Method That Breaks the Reasoning-Length Bottleneck of Large Models
FIPO (Future-KL Influenced Policy Optimization), open-sourced by Alibaba Tongyi Lab, is a value-free reinforcement learning method. Using a fine-grained, token-level credit assignment mechanism, it extends chain-of-thought length beyond 10,000 tokens and reaches 58% accuracy on AIME 2024, surpassing both DAPO and o1-mini. This opens a new path for training the reasoning capabilities of large models with pure RL.