Stack V2: Massive Code Dataset

BigCode and Software Heritage have unveiled Stack V2, a major step forward in code pre-training datasets for code LLMs. The dataset comprises over 3 billion files spanning 600+ programming and markup languages and, at 67.5 TB, is roughly four times larger than its predecessor, Stack V1. Alongside source code it includes GitHub issues and pull requests, Project Jupyter and Kaggle notebooks, and programming documentation and textbooks, giving it unusual diversity. Derived from the Software Heritage archive, the data undergoes extensive filtering and processing to ensure high training-data quality. Stack V2 was used to train StarCoder v2 and is now accessible on Hugging Face.
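The announcement does not detail the filtering pipeline, but as an illustration only, a minimal file-level heuristic of the kind commonly used when curating code datasets might look like the sketch below. The function name, thresholds, and rules here are hypothetical assumptions for demonstration, not Stack V2's actual criteria:

```python
def passes_basic_filters(text: str,
                         max_line_len: int = 1000,
                         min_alnum_frac: float = 0.25) -> bool:
    """Hypothetical file-level heuristics (NOT Stack V2's real pipeline):
    reject empty files, files with extremely long lines (often minified or
    auto-generated code), and files that are mostly non-alphanumeric noise."""
    if not text.strip():
        return False
    # Very long lines suggest minified or machine-generated content.
    if max(len(line) for line in text.splitlines()) > max_line_len:
        return False
    # Mostly-symbolic files (e.g. binary dumps) carry little training signal.
    alnum_frac = sum(ch.isalnum() for ch in text) / len(text)
    return alnum_frac >= min_alnum_frac

print(passes_basic_filters("def add(a, b):\n    return a + b\n"))  # True
print(passes_basic_filters(";" * 5000))  # False: one huge, symbol-only line
```

In practice, large-scale curation efforts layer many such heuristics with deduplication and license filtering; this sketch only conveys the general shape of a per-file check.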