Scaling Synthetic Data

Discover how Cosmopedia marks a milestone in open synthetic data: at 25B tokens spanning formats from textbooks to blog posts, it is the largest open synthetic dataset released to date. Generated with Mixtral-8x7B-Instruct-v0.1 over roughly 16,000 H100 GPU hours, the pipeline collects unsupervised seed data, builds diverse prompts on top of it, and runs generation at scale with llm-swarm and Mixtral. Deduplication keeps duplicate content below 1%. Cosmopedia represents a leap in synthetic data generation, reshaping the landscape of AI training datasets.
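
To make the prompt-then-generate step concrete, here is a minimal sketch. It assumes a couple of seed web extracts and calls Mixtral through `huggingface_hub`'s `InferenceClient`; the real pipeline fans this out across many inference workers with llm-swarm, and the prompt template, seed texts, and parameter values below are illustrative assumptions, not the project's exact setup.

```python
# Sketch of seed-conditioned generation (illustrative; the actual pipeline
# orchestrates this at scale with llm-swarm rather than a single client).
from huggingface_hub import InferenceClient

client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Hypothetical extracts standing in for the unsupervised seed data.
seed_extracts = [
    "Photosynthesis converts light energy into chemical energy...",
    "The French Revolution began in 1789...",
]

# One of many possible prompt "styles"; varying the target audience and
# format (textbook, blog post, ...) is what keeps the outputs diverse.
PROMPT_TEMPLATE = (
    "Here is an extract from a webpage: {extract}\n\n"
    "Write an extensive and detailed textbook unit suitable for college "
    "students, related to the extract above."
)

for extract in seed_extracts:
    prompt = PROMPT_TEMPLATE.format(extract=extract)
    # Sampling settings are placeholders, not the project's documented values.
    completion = client.text_generation(
        prompt, max_new_tokens=512, temperature=0.8
    )
    print(completion[:200], "...")
```

Conditioning each prompt on a different seed extract is what lets a single instruction template scale to billions of non-repetitive tokens.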
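The sub-1% duplicate figure implies a near-deduplication pass over the generated files. A common way to do this is MinHash LSH; the sketch below uses the `datasketch` library as a stand-in, and the shingle size and similarity threshold are assumptions on my part rather than Cosmopedia's documented configuration.

```python
# Near-duplicate filtering with MinHash LSH (a generic technique sketch;
# library choice and threshold are assumptions, not the project's setup).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over word 3-grams of the text."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

# Jaccard threshold above which two documents count as near-duplicates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)

documents = {
    "doc-0": "Photosynthesis converts light energy into chemical energy.",
    "doc-1": "Photosynthesis  converts light energy into chemical energy.",
    "doc-2": "The French Revolution began in 1789.",
}

kept = []
for key, text in documents.items():
    sig = minhash_of(text)
    if lsh.query(sig):  # matches a document we already kept: drop it
        continue
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # doc-1 is dropped: identical after whitespace normalization
```

Keeping only the first document in each near-duplicate cluster is what drives the duplicate rate of the released corpus below 1%.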