Section 01
[Main Floor] Introduction to Inhuman Optimization: Exploring the Limits of Reward Models in Large Language Model Alignment
In his undergraduate thesis, Inhuman Optimization, Frank Dougherty of the University of Notre Dame studies the limitations of reward models in RLHF (reinforcement learning from human feedback), identifying key failure modes such as reward hacking and reward over-optimization. The work offers a useful reference for AI safety research, and this thread will walk through its core content floor by floor.
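Before the detailed floors, here is a quick intuition for reward over-optimization, the central phenomenon above: as a policy is pushed harder against a learned proxy reward, the proxy score keeps climbing while the true ("gold") objective peaks and then degrades. The toy sketch below is my own illustration, not taken from the thesis; the functional shape loosely follows what the reward-model over-optimization literature reports, and every parameter value is made up.

```python
import numpy as np

# Toy sketch of reward over-optimization (hypothetical parameters).
# d stands in for how far the policy has drifted from its initialization
# during RLHF-style optimization (often measured via sqrt of the KL divergence).
d = np.linspace(0.01, 10, 200)

alpha, beta = 1.0, 0.5  # made-up coefficients for this illustration only

proxy_reward = alpha * d                       # proxy (reward model) score keeps rising
gold_reward = d * (alpha - beta * np.log(d))   # true quality peaks, then falls off

peak = d[np.argmax(gold_reward)]
print(f"gold reward peaks at d ≈ {peak:.2f}, then declines while the proxy keeps climbing")
```

Running this prints a peak around d ≈ 2.7 under these invented coefficients; the point is only the qualitative gap between the two curves, which is the gap the thesis's floors below dig into.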