Salesforce Paper: Online RLHF Enhances LLMs

A team at Salesforce has released a paper demonstrating the benefits of online iterative Reinforcement Learning from Human Feedback (RLHF) for improving large language models (LLMs). The study shows that online RLHF methods, such as online iterative Direct Preference Optimization (DPO), significantly outperform traditional offline methods.

Implementation Steps:
1. Train or select a reward model to rate responses.
2. Choose a supervised fine-tuned (SFT) LLM as the policy and gather a set of training prompts (e.g., 60k), divided into iterations (e.g., 3 sets of 20k).
3. Use the policy to generate multiple samples for each prompt.
4. Rate these samples and create preference pairs from the best and worst responses.
5. Apply DPO to train the policy on these preference pairs.
6. Repeat the process, using the DPO-trained LLM as the policy in subsequent iterations (a minimal sketch of this loop appears after the Key Insights list).

Key Insights:
- An iteration size of 20k prompts with 8 responses per prompt was used.
- Diverse prompts and filtering out low-quality data are crucial for effectiveness.
- Iterative DPO outperformed offline DPO on 7 out of 9 benchmarks.
- The iterative DPO Llama 3 8B model surpassed Meta Llama 3 8B Instruct on MT-Bench and Chat-Arena-Hard.
- Including a length penalty in the reward calculation reduced length bias.
- The code, models, dataset, and training details are open-source.

This paper provides a reproducible recipe for online iterative RLHF, offering a significant advancement in the fine-tuning and performance of LLMs.
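To make the recipe concrete, here is a minimal sketch of the online iterative DPO loop in Python. The helper functions `generate_samples`, `reward_score`, and `dpo_train` are hypothetical placeholders standing in for a real policy LLM, reward model, and DPO trainer (they are not the paper's actual implementation); only the loop structure and the length-penalized reward follow the recipe described above.

```python
# Minimal sketch of online iterative DPO, assuming hypothetical helpers
# for generation, reward scoring, and DPO training.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring response
    rejected: str  # lowest-scoring response


def reward_score(prompt: str, response: str, length_penalty: float = 0.001) -> float:
    """Hypothetical reward: a reward-model score minus a length penalty
    to reduce the bias toward longer answers."""
    base_score = 0.0  # placeholder for a real reward-model call
    return base_score - length_penalty * len(response)


def generate_samples(policy, prompt: str, n: int = 8) -> list[str]:
    """Hypothetical sampling call: n completions from the current policy."""
    return [f"response {i} to: {prompt}" for i in range(n)]  # placeholder


def dpo_train(policy, pairs: list[PreferencePair]):
    """Hypothetical DPO update on the collected preference pairs."""
    return policy  # placeholder: would return the updated policy


def iterative_dpo(policy, prompts: list[str], iterations: int = 3,
                  samples_per_prompt: int = 8):
    """Split prompts into `iterations` chunks (e.g., 3 x 20k); in each round,
    sample, score, build best-vs-worst pairs, and run DPO on them."""
    chunk = len(prompts) // iterations
    for it in range(iterations):
        batch = prompts[it * chunk:(it + 1) * chunk]
        pairs = []
        for prompt in batch:
            responses = generate_samples(policy, prompt, samples_per_prompt)
            ranked = sorted(responses, key=lambda r: reward_score(prompt, r))
            pairs.append(PreferencePair(prompt, chosen=ranked[-1], rejected=ranked[0]))
        # The next iteration starts from the freshly DPO-trained policy.
        policy = dpo_train(policy, pairs)
    return policy
```

The key design point this illustrates is that each iteration generates fresh samples from the most recently trained policy, rather than reusing a fixed offline preference dataset.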