WEBINSTRUCT: High-Quality Instruction Dataset from Web Data
Introducing WEBINSTRUCT, a dataset of 10M high-quality instruction-response pairs sourced directly from web data, without human annotation or reliance on GPT-4. The pipeline has three stages. First, a custom-trained FastText model retrieves relevant documents from a pre-training web corpus. Second, pattern matching and sequence labeling extract question-answer or instruction-response pairs from those documents. Third, open Large Language Models (LLMs) refine the extracted pairs, fixing formatting issues and adding intermediate steps where necessary.

On the results side, MAmmoTH2-8x7B-Plus scores 32.6 on Arena Hard and 81.5 on MixEval, surpassing models such as Mixtral 8x7B and Qwen-1.5-110B. Models with the -Plus suffix are further trained on additional public instruction data, which contributes to their stronger performance.

The models and dataset are released on Hugging Face under the Apache 2.0 license. The work highlights that diversity and quality of instruction data are crucial for improving the reasoning capabilities of LLMs, and that open LLMs are effective at refining extracted instruction-response pairs.
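
To make the retrieve, extract, refine flow described above concrete, here is a toy sketch in Python. It is not the WEBINSTRUCT implementation: the FastText classifier is replaced by a crude marker check, the sequence-labeling extractor by a single regular expression, and the LLM refinement step by whitespace normalization. All function names and the "Q:/A:" marker pattern are hypothetical, chosen only for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: pairs introduced by "Q:"/"Question:" and "A:"/"Answer:"
# at the start of a line. The source does not list its actual patterns.
QA_PATTERN = re.compile(
    r"^(?:Question|Q):\s*(?P<q>.+?)\s*"
    r"^(?:Answer|A):\s*(?P<a>.+?)\s*"
    r"(?=^(?:Question|Q):|\Z)",
    re.MULTILINE | re.DOTALL,
)


def looks_instructional(doc: str) -> bool:
    """Stage 1 stand-in: a marker check in place of the FastText recall model."""
    return bool(re.search(r"^(?:Question|Q):", doc, re.MULTILINE))


def extract_pairs(doc: str) -> List[Tuple[str, str]]:
    """Stage 2 stand-in: regex pattern matching only (no sequence labeling)."""
    return [(m.group("q"), m.group("a")) for m in QA_PATTERN.finditer(doc)]


def refine(pair: Tuple[str, str]) -> Tuple[str, str]:
    """Stage 3 stand-in: collapse whitespace where the real pipeline calls an LLM."""
    question, answer = pair
    return " ".join(question.split()), " ".join(answer.split())


def build_pairs(corpus: List[str]) -> List[Tuple[str, str]]:
    """Run the three toy stages over a list of raw documents."""
    pairs: List[Tuple[str, str]] = []
    for doc in corpus:
        if not looks_instructional(doc):
            continue  # dropped at the recall stage
        pairs.extend(refine(p) for p in extract_pairs(doc))
    return pairs


if __name__ == "__main__":
    corpus = [
        "Weather report: sunny all week.",
        "Intro.\nQ: What is 2 + 2?\nA: 4\n"
        "Question: Capital of\n  France?\nAnswer: Paris\n",
    ]
    for question, answer in build_pairs(corpus):
        print(f"{question!r} -> {answer!r}")
```

In a real pipeline each stand-in would be swapped for its learned counterpart; the point of the sketch is only the staging, where cheap filtering runs before extraction and the costly refinement step sees only extracted pairs.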