ORPO: Redefining LLM Training for RLHF

Are you tired of the resource-intensive, multi-stage process of training state-of-the-art Large Language Models (LLMs) with Reinforcement Learning from Human Feedback (RLHF)? Look no further. Odds Ratio Preference Optimization (ORPO) is a new approach that redefines how we align and train LLMs, bypassing the traditional Base Model → Supervised Fine-tuning → RLHF (PPO/DPO) pipeline. ORPO combines Supervised Fine-tuning (SFT) and preference alignment into a single objective, achieving strong results with simplicity and efficiency.

Here's how it works (sketches for each step follow below):

1. Create a pairwise preference dataset (chosen/rejected responses), such as Argilla UltraFeedback.
2. Filter out instances where the chosen and rejected responses are identical or where one of them is empty.
3. Select a pre-trained LLM, such as Llama-2 or Mistral.
4. Train the base model with the ORPO objective directly on the preference dataset, eliminating the need for a separate SFT step.

Key Insights:

- ORPO is reference-model-free and memory-friendly, providing a seamless training experience.
- By replacing the complex pipeline of SFT followed by DPO/PPO with a single method, ORPO simplifies the process while achieving superior performance.
- ORPO outperforms SFT and SFT+DPO on Phi-2, Llama-2, and Mistral, showcasing its effectiveness across various benchmarks.
- Mistral fine-tuned with ORPO achieves remarkable scores: 12.20% on AlpacaEval 2.0, 66.19% on IFEval, and 7.32 on MT-Bench, surpassing Hugging Face's Zephyr Beta.
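To ground the dataset step, here is a minimal sketch of loading a pairwise preference dataset and dropping degenerate pairs. It assumes the Hugging Face `datasets` library and uses `argilla/ultrafeedback-binarized-preferences-cleaned` as an example dataset id; substitute whichever pairwise dataset you actually use, and note that the exact layout of the `chosen`/`rejected` fields may differ.

```python
# Sketch: prepare a pairwise preference dataset for ORPO training.
# The dataset id and field names are assumptions for illustration.
from datasets import load_dataset

dataset = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned", split="train"
)

def is_valid_pair(example):
    chosen, rejected = example["chosen"], example["rejected"]
    # Drop pairs where either response is empty or both responses are identical,
    # since such pairs carry no preference signal for the odds-ratio term.
    return bool(chosen) and bool(rejected) and chosen != rejected

dataset = dataset.filter(is_valid_pair)
```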
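The training step comes down to the ORPO objective itself. The sketch below reconstructs it from the published formulation: the usual SFT negative log-likelihood on the chosen response, plus a weighted odds-ratio term that pushes the model to assign higher likelihood to chosen than to rejected completions. The function name `orpo_loss` and the tensor inputs are illustrative assumptions, not an official API.

```python
# Sketch: the ORPO objective, L_ORPO = L_SFT + lambda * L_OR, where
#   L_OR = -log sigmoid(log(odds(chosen) / odds(rejected)))
#   odds(y|x) = P(y|x) / (1 - P(y|x)), with P the length-normalized likelihood.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,    # mean per-token log-prob of chosen responses
              rejected_logps: torch.Tensor,  # mean per-token log-prob of rejected responses
              chosen_nll: torch.Tensor,      # standard SFT cross-entropy on chosen responses
              lam: float = 0.1) -> torch.Tensor:
    # log odds(y|x) = log P - log(1 - P); log1p(-exp(logp)) gives log(1 - P) stably
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio penalty: large when the rejected response is about as likely as the chosen one
    ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Single combined objective: SFT term plus the weighted preference term
    return (chosen_nll + lam * ratio_loss).mean()
```

Because the odds ratio is computed from the policy model's own likelihoods, no frozen reference model has to be kept in memory, which is where the memory savings over DPO-style training come from.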
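In practice you rarely implement the loss by hand. Below is a sketch of wiring the filtered preference dataset and a pre-trained base model into TRL's `ORPOTrainer`, assuming a `trl` version that ships `ORPOTrainer`/`ORPOConfig` (in newer releases the `tokenizer` argument is named `processing_class`) and a dataset with `prompt`, `chosen`, and `rejected` columns. Hyperparameters are placeholders, not recommendations.

```python
# Sketch: train a base model directly on preference data with TRL's ORPOTrainer.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"  # any pre-trained base model, e.g. Llama-2 or Mistral
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default

config = ORPOConfig(
    output_dir="mistral-orpo",
    beta=0.1,                      # weight of the odds-ratio term (lambda in the ORPO paper)
    max_length=1024,
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,  # the filtered pairwise preference dataset from the sketch above
    tokenizer=tokenizer,
)
trainer.train()
```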