Meta's RPO Boosts LLM Reasoning
Meta introduces Iterative Reasoning Preference Optimization (RPO), a method aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through iterative preference tuning over Chain-of-Thought (CoT) generations. Building on the idea of Self-Rewarding Language Models, described in a paper by largely the same group of researchers, RPO iteratively optimizes over competing CoT reasoning steps and answers using Direct Preference Optimization (DPO). The methodology starts from an SFT LLM and a set of training inputs: the model generates multiple CoT reasoning chains with final answers for each input, and preference pairs are constructed in which a completion with the correct final answer is chosen over one with an incorrect answer. These pairs are used to train the model with DPO, and the process is repeated, with each iteration's model generating the data for the next iteration.

Results show significant accuracy improvements across benchmarks: from 55.6% to 81.6% on GSM8K, from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge. Iterative Reasoning DPO also outperforms standard DPO, with gains of approximately 3% on ARC and 20% on GSM8K. Notably, adding a negative log-likelihood (NLL) term to the DPO loss significantly improves learning effectiveness.

Challenges remain, such as the absence of any evaluation or reward for intermediate reasoning steps, and the use of a fixed set of prompts at every iteration. Despite this, the binary reward, based on exact match of the final answer, proves effective. The main hurdle lies in computing rewards for tasks, such as dialogue, where relying solely on a final answer is not feasible. Nonetheless, Meta's RPO presents a promising avenue for advancing the reasoning capabilities of LLMs through iterative preference tuning and CoT optimization.
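To make the pair-construction step concrete, here is a minimal sketch of how preference pairs could be built from sampled CoT completions using the binary exact-match reward described above. The `generate_cot` method, `extract_answer` helper, sampling counts, and the "Final answer:" format are illustrative assumptions, not details from the paper's implementation.

```python
import random


def extract_answer(completion: str) -> str:
    # Assumption: the completion ends with a line like "Final answer: 42".
    return completion.rsplit("Final answer:", 1)[-1].strip()


def build_preference_pairs(model, prompts, gold_answers,
                           num_samples=8, max_pairs_per_prompt=4):
    """Construct chosen/rejected pairs from sampled CoT completions.

    `model.generate_cot(prompt)` is a hypothetical helper that samples a
    full completion (CoT reasoning steps plus final answer) from the
    current iteration's model.
    """
    pairs = []
    for prompt, gold in zip(prompts, gold_answers):
        completions = [model.generate_cot(prompt) for _ in range(num_samples)]

        # Binary reward: exact match of the extracted final answer.
        correct = [c for c in completions if extract_answer(c) == gold]
        incorrect = [c for c in completions if extract_answer(c) != gold]

        # Pair correct completions (chosen) with incorrect ones (rejected).
        random.shuffle(correct)
        random.shuffle(incorrect)
        for chosen, rejected in list(zip(correct, incorrect))[:max_pairs_per_prompt]:
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Prompts that yield only correct or only incorrect samples contribute no pairs, which is one reason the fixed prompt set per iteration is noted as a limitation.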
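The modified training objective, DPO with an added NLL term on the chosen (correct) sequence, can also be sketched in PyTorch. This is a simplified version assuming per-sequence summed log-probabilities are already computed; `beta` and `nll_weight` are stand-in hyperparameter names rather than the paper's exact notation.

```python
import torch
import torch.nn.functional as F


def dpo_nll_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_token_counts, beta=0.1, nll_weight=1.0):
    """DPO loss plus an NLL term on the chosen sequences.

    Each *_logps tensor holds the summed log-probability of the full
    CoT + answer sequence under the policy or the frozen reference model,
    one entry per preference pair.
    """
    # Standard DPO term: increase the policy's chosen-vs-rejected margin
    # relative to the reference model's margin.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_term = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # NLL term: additionally maximize the likelihood of the chosen
    # sequence, normalized by its token length.
    nll_term = -policy_chosen_logps / chosen_token_counts

    return (dpo_term + nll_weight * nll_term).mean()
```

The NLL term acts like a length-normalized supervised loss on the winning completions, which is the ingredient the summary credits with a significant boost over plain DPO.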