Generate High-Quality Synthetic Datasets at Home

2024 June, 17

Source Link

Discover how you can generate high-quality synthetic datasets at home, comparable to those created by GPT-4. The new self-synthesis method utilizes Llama 3 70B to produce extensive instruction datasets from "empty" user messages. The process involves fine-tuning a model like Llama 3 70B, creating templates for user messages, and generating synthetic dialogues. Generated samples are then filtered to ensure quality and diversity. This method has produced 4 million instruction pairs, filtered to 300,000 high-quality pairs, outperforming or matching other open LLMs fine-tuned on GPT-4 data. Additionally, it matches or outperforms Llama-3-8B-Instruct on certain benchmarks. Llama 3 8B aids in sample categorization and classification, with sentence transformers used for similarity matching. This technique supports multi-turn data generation, and all prompts used are detailed in the accompanying paper. Creating 1,000 high-quality samples costs approximately $1.1, and the dataset is released under CC BY-NC.