Enhancing Synthetic Data Quality with Repeated Ranking


Lightblue has optimized its synthetic preference dataset pipeline by adding a repeated ranking step to improve data quality. The method involves several key steps:

1. Collect a diverse set of prompts from public datasets in multiple languages.
2. Generate multiple responses per prompt using a range of open and closed Large Language Models (LLMs).
3. Use a high-quality evaluator (e.g., GPT-4) to rank the responses for each prompt.
4. Repeat the ranking process multiple times (e.g., 5x) for each set of responses.
5. Calculate Kendall's W across the repeated rankings and filter out prompts whose rankings are inconsistent.
6. Create a preference-pair dataset from the remaining prompts and tune LLMs with methods such as ORPO, DPO, or SimPO (code sketches of steps 3-6 follow the lists below).

Key insights from this approach include:

- 2,714 prompts in 62 languages were used, with responses generated from 7 LLMs.
- Repeated ranking led to superior performance on MT-Bench in six languages.
- Models trained on only the 25%, 50%, or 75% most consistently ranked data outperform models trained on all of the data.
- Training time drops by roughly 2-4x because less, but higher-quality, data is used.
- In Reinforcement Learning from AI Feedback (RLAIF) dataset generation, quality matters more than quantity.

This method is a meaningful advance in creating high-quality synthetic preference datasets: repeated ranking yields better-performing models while reducing training time.
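
To illustrate the repeated ranking step (steps 3-4), here is a minimal Python sketch that asks an evaluator model to rank a prompt's candidate responses several times. The OpenAI client calls are standard, but the prompt wording, the `rank_once` and `rank_repeatedly` helper names, and the expected comma-separated output format are illustrative assumptions, not Lightblue's actual setup.

```python
# Sketch: ask an evaluator LLM (e.g., GPT-4) to rank the same set of
# responses several times, yielding one ranking per repetition.
# Assumes OPENAI_API_KEY is set; helper names and prompt format are hypothetical.
from openai import OpenAI

client = OpenAI()


def rank_once(prompt: str, responses: list[str], model: str = "gpt-4") -> list[int]:
    """Return a rank (1 = best) for each response, in response order."""
    numbered = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(responses))
    instruction = (
        "Rank the following responses to the prompt from best to worst.\n"
        f"Prompt:\n{prompt}\n\n{numbered}\n\n"
        "Answer with only the ranks of responses 1..N in order, "
        "comma-separated (e.g. '2,1,3' means response 1 is ranked 2nd)."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instruction}],
        temperature=1.0,  # non-zero so repeated runs can disagree
    )
    return [int(x) for x in reply.choices[0].message.content.strip().split(",")]


def rank_repeatedly(prompt: str, responses: list[str], repeats: int = 5) -> list[list[int]]:
    """Repeat the ranking (e.g., 5x) to measure how consistent the evaluator is."""
    return [rank_once(prompt, responses) for _ in range(repeats)]
```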
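
The consistency filter (steps 5-6) can be made concrete with a second sketch: compute Kendall's coefficient of concordance W over each prompt's repeated rankings, drop prompts whose W falls below a cutoff, and pair the best and worst remaining responses as (chosen, rejected). The per-prompt data layout, the `w_threshold` value, and the best-versus-worst pairing rule are assumptions for illustration; the source keeps the most consistently ranked 25-75% of the data, which corresponds to a percentile cut on W rather than a fixed threshold.

```python
# Sketch of the repeated-ranking consistency filter.
# Hypothetical data layout: each example holds one prompt, its n candidate
# responses, and an m x n `rankings` matrix where rankings[j][i] is the rank
# (1 = best) that ranking run j assigned to response i.
from statistics import mean


def kendalls_w(rankings: list[list[int]]) -> float:
    """Kendall's coefficient of concordance W for m rankings of n items.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank sums from their mean. W is in [0, 1]; 1 means the
    m repeated rankings agree perfectly.
    """
    m = len(rankings)      # number of ranking repetitions
    n = len(rankings[0])   # number of responses being ranked
    rank_sums = [sum(run[i] for run in rankings) for i in range(n)]
    mean_sum = mean(rank_sums)
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def build_preference_pairs(dataset: list[dict], w_threshold: float) -> list[dict]:
    """Keep prompts whose repeated rankings are consistent, then pair the
    most- and least-preferred responses as (chosen, rejected)."""
    pairs = []
    for example in dataset:  # example: {"prompt", "responses", "rankings"}
        if kendalls_w(example["rankings"]) < w_threshold:
            continue  # drop inconsistently ranked prompts
        # Average rank across the repeated runs decides the final ordering.
        n = len(example["responses"])
        avg_rank = [mean(run[i] for run in example["rankings"]) for i in range(n)]
        best = min(range(n), key=lambda i: avg_rank[i])
        worst = max(range(n), key=lambda i: avg_rank[i])
        pairs.append({
            "prompt": example["prompt"],
            "chosen": example["responses"][best],
            "rejected": example["responses"][worst],
        })
    return pairs


# Minimal usage example with toy data: 3 responses ranked 5 times.
example = {
    "prompt": "Explain Kendall's W in one sentence.",
    "responses": ["resp_a", "resp_b", "resp_c"],
    "rankings": [
        [1, 2, 3],
        [1, 3, 2],
        [1, 2, 3],
        [2, 1, 3],
        [1, 2, 3],
    ],
}
print(kendalls_w(example["rankings"]))                 # -> 0.64 for this toy data
print(build_preference_pairs([example], w_threshold=0.5))
```

The resulting (prompt, chosen, rejected) records are exactly the format consumed by preference-tuning methods such as DPO, ORPO, or SimPO in step 6.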