DNO: Reinventing Reinforcement Learning
Recent advancements in reinforcement learning have brought forth Direct Nash Optimization (DNO), a groundbreaking method developed by Microsoft Research. DNO stands out for its efficacy in Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) tasks. Unlike its predecessors, DNO employs a batched on-policy algorithm that uses a contrastive loss for iterative self-improvement.

The implementation of DNO follows a meticulous process:

1. Initialization: Start with a supervised fine-tuned language model (SFT LLM), such as Zephyr SFT, and gather a pairwise preference dataset, like UltraFeedback, or generate a custom one using a strong LLM such as GPT-4.
2. Data preparation: Split the preference dataset into non-overlapping mixes to prevent overfitting.
3. Iterations: For each mix, generate new samples with the current model. Score these samples using a judge (e.g., GPT-4) and form preference pairs with the existing responses.
4. Filtering: Filter the new pairs against a predefined scoring criterion so that only high-quality data is used for training (a minimal sketch of this step appears at the end of this article).
5. Algorithm application: Apply the DNO algorithm, with its contrastive loss, to the SFT LLM using the new preference pairs together with a portion of the old pairs (a sketch of such a loss is also given at the end).
6. Repeat: Iterate the process until one full epoch has been completed over all splits.

Insights and implications of DNO:

- DNO delivers a significant improvement, raising the win rate of the 7B Orca 2.5 model against GPT-4-Turbo on AlpacaEval from 7% to 33%.
- DNO's filtering mechanism lets it outperform approaches such as SPIN, largely because of its emphasis on high-quality preference pairs.
- The depth of iterations influences performance more than their width (the number of samples generated per iteration).
- A smaller number of high-quality preference pairs proves more beneficial than a larger number of noisy pairs.
- DNO adopts an additive scoring framework on a 6-point scale for judging, allowing only highly scored samples to be considered positive.
- Notably, DNO resembles iterative DPO (online DPO), yet it continues to improve with more data without compromising performance on standard benchmarks like MMLU.

In essence, DNO emerges as a promising avenue in reinforcement learning, offering enhanced performance and adaptability across various scenarios.
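To make the filtering step concrete, here is a minimal sketch of how such a quality gate might look. It assumes the judge returns a score on the 6-point additive scale for each response; the `PreferencePair` type, the `filter_pairs` helper, and the threshold values are illustrative assumptions, not part of the published DNO implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PreferencePair:
    # Hypothetical container for one judged preference pair.
    prompt: str
    chosen: str            # response preferred by the judge
    rejected: str          # response it is compared against
    chosen_score: float    # judge score on the 6-point additive scale
    rejected_score: float

def filter_pairs(pairs: List[PreferencePair],
                 min_chosen_score: float = 5.0,
                 min_margin: float = 1.0) -> List[PreferencePair]:
    """Keep only pairs whose chosen response scores near the top of the
    scale and clearly beats the rejected one (thresholds are assumptions)."""
    return [
        p for p in pairs
        if p.chosen_score >= min_chosen_score
        and p.chosen_score - p.rejected_score >= min_margin
    ]
```

In practice, the surviving pairs from each iteration would be mixed with a portion of older pairs before training, as described in the steps above.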
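The contrastive loss applied to those filtered pairs is closely related to the DPO objective, which compares the policy's log-probabilities for the chosen and rejected responses against a frozen reference model (e.g., the SFT checkpoint). Below is a minimal PyTorch sketch of such a pairwise loss; the function name, the `beta` value, and the way the sequence log-probabilities are obtained are assumptions for illustration, not the exact DNO implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(policy_chosen_logps: torch.Tensor,
                              policy_rejected_logps: torch.Tensor,
                              ref_chosen_logps: torch.Tensor,
                              ref_rejected_logps: torch.Tensor,
                              beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss over a batch of preference pairs.

    Each tensor holds per-example sequence log-probabilities of the chosen
    or rejected response under the policy being trained or the frozen
    reference model.
    """
    # Implicit rewards: how far the policy has moved away from the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In an iterative, batched on-policy setup like the one outlined above, this loss would be minimized once per iteration on the newly filtered pairs plus a portion of older pairs, which is what gives DNO its resemblance to online DPO.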