MAP-Neo's Comprehensive Guide to Open-Source LLM Training
MAP-Neo has released an in-depth paper detailing the development and training of its open-source Large Language Model (LLM). This comprehensive guide covers the tokenizer, data preprocessing (filtering, deduplication, quality control), model architecture, training, and fine-tuning. A must-read for anyone interested in training LLMs!

Highlights of the paper include:

- Architecture: a decoder-only model with multi-query attention, RoPE embeddings, RMSNorm, and SwiGLU activation (a minimal sketch of such a block follows at the end of this post).
- Scale and results: the MAP-Neo 7B model is trained on 4.5 trillion tokens with an 8k context length, scoring 58.14% on MMLU, 53.68% on GSM8K, and 23.8% on HumanEval.
- Deduplication: MinHash LSH plus exact substring deduplication (see the deduplication sketch below).
- Filtering: heuristic rules to drop low-quality data, applied at both the document and sentence level (see the filtering sketch below).
- OCR: a pipeline that converts PDFs containing text, formulas, and tables into Markdown.
- Pre-training data composition: 52.55% from Common Crawl and 22.29% from programming code, with the rest sourced from academic papers, books, and other printed materials.
- Pre-training schedule: two stages, with a decay phase that shifts toward high-quality data and an increased proportion of code (see the mixture sketch below).
- Fine-tuning in two stages: single-turn instruction tuning on 2M samples (OpenHermes and Code-Feedback), followed by chat tuning on 100k multi-turn dialogues.
- Preference tuning: iterative DPO, using Nectar as the prompt dataset and Starling-RM-34B as the reward model (see the DPO loss sketch below).

Dive into this detailed paper to learn about the intricacies of training and fine-tuning an open-source LLM, and gain insights that can enhance your own LLM projects.
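To make the architecture highlight concrete, here is a minimal PyTorch sketch of a pre-norm decoder block that combines multi-query attention, rotary position embeddings (RoPE), RMSNorm, and a SwiGLU feed-forward layer. The dimensions, layer names, and hyperparameters are illustrative assumptions, not taken from the MAP-Neo codebase.

```python
# Illustrative sketch only: the layer names and hyperparameters below are
# assumptions, not taken from the MAP-Neo codebase. It shows how multi-query
# attention, RoPE, RMSNorm and SwiGLU compose into a pre-norm decoder block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        return x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt() * self.weight


def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles.
    *_, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device).float() / half)
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()                # (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class MultiQueryAttention(nn.Module):
    """All query heads share a single key/value head (MQA)."""

    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * self.head_dim, bias=False)  # one K and one V head
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        k = k.view(b, t, 1, self.head_dim).transpose(1, 2)  # single KV head
        v = v.view(b, t, 1, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Causal self-attention; the shared K/V head is broadcast to every query head.
        out = F.scaled_dot_product_attention(
            q,
            k.expand(-1, self.n_heads, -1, -1),
            v.expand(-1, self.n_heads, -1, -1),
            is_causal=True,
        )
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim=4096, n_heads=32):
        super().__init__()
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.attn = MultiQueryAttention(dim, n_heads)
        self.ffn = SwiGLU(dim, hidden=int(8 * dim / 3))

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))    # pre-norm residual attention
        return x + self.ffn(self.ffn_norm(x))   # pre-norm residual feed-forward
```

A quick smoke test such as `DecoderBlock(dim=256, n_heads=8)(torch.randn(2, 16, 256))` returns a tensor of the same shape. The appeal of multi-query attention is that the single shared key/value head keeps the inference-time KV cache small.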
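Near-duplicate removal with MinHash LSH can be illustrated with the datasketch library. This is a toy sketch: the shingle size, number of permutations, and Jaccard threshold are placeholder values, not the settings used in the MAP-Neo pipeline.

```python
# Toy MinHash-LSH near-duplicate detection with the `datasketch` library.
# The shingle size, num_perm and threshold are illustrative placeholders,
# not the values used in the MAP-Neo pipeline.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128


def minhash_of(text, ngram=5):
    # Hash overlapping character n-grams (shingles) into a MinHash signature.
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf8"))
    return m


def deduplicate(docs, threshold=0.8):
    """Return the document ids kept after dropping near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):          # a similar document was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept


if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",  # near-duplicate of "a"
        "c": "completely different content about language models",
    }
    print(deduplicate(corpus))  # likely ["a", "c"]
```

Exact substring deduplication (the paper's second technique) works on a different principle, typically via suffix-array matching of long repeated byte sequences, and is not shown here.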
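For the heuristic filtering step, the sketch below shows document-level rules in the spirit of common web-corpus filters (word count, mean word length, symbol ratio, ellipsis-heavy lines). The rules and thresholds are placeholders, not MAP-Neo's actual filter set.

```python
# Illustrative document-level heuristic filter. The rules and thresholds are
# placeholders in the spirit of common web-corpus filters; they are NOT the
# exact rules or values used by MAP-Neo.
import re

MIN_WORDS = 50
MAX_WORDS = 100_000
MEAN_WORD_LEN_RANGE = (3, 10)
MAX_SYMBOL_TO_WORD_RATIO = 0.1
MAX_ELLIPSIS_LINE_FRACTION = 0.3


def keep_document(text):
    """Return True if the document passes all heuristic quality checks."""
    words = re.findall(r"\w+", text)
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False

    mean_len = sum(len(w) for w in words) / len(words)
    if not (MEAN_WORD_LEN_RANGE[0] <= mean_len <= MEAN_WORD_LEN_RANGE[1]):
        return False

    # Documents dominated by '#' or '...' symbols tend to be boilerplate.
    symbols = text.count("#") + text.count("...")
    if symbols / len(words) > MAX_SYMBOL_TO_WORD_RATIO:
        return False

    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines:
        ellipsis_lines = sum(ln.rstrip().endswith(("...", "…")) for ln in lines)
        if ellipsis_lines / len(lines) > MAX_ELLIPSIS_LINE_FRACTION:
            return False

    return True
```

Sentence-level filtering works the same way but applies its rules to individual sentences within a kept document rather than rejecting the document outright.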
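The two-stage pre-training schedule can be expressed as a per-phase sampling mixture, where the decay phase up-weights high-quality sources and code. The phase-specific weights below are invented placeholders; only the overall Common Crawl (52.55%) and code (22.29%) shares come from the paper.

```python
# Sketch of a two-stage pre-training data schedule. The per-phase weights are
# made-up placeholders that only illustrate the idea of a decay phase with
# more high-quality and code data; they are not MAP-Neo's actual mixtures.
import random

PHASE_MIXTURES = {
    # stage 1: broad, web-heavy mixture
    "stage_1": {"common_crawl": 0.60, "code": 0.15, "papers_books": 0.25},
    # stage 2 (decay phase): higher-quality sources and a larger code share
    "decay": {"common_crawl": 0.35, "code": 0.35, "papers_books": 0.30},
}


def sample_source(phase, rng):
    """Pick the source of the next training document according to the phase mixture."""
    mixture = PHASE_MIXTURES[phase]
    return rng.choices(list(mixture), weights=list(mixture.values()), k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {}
    for _ in range(10_000):
        src = sample_source("decay", rng)
        counts[src] = counts.get(src, 0) + 1
    print(counts)  # roughly matches the decay-phase weights
```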
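Finally, preference tuning used iterative DPO. The per-pair DPO loss itself is standard; here is a hedged PyTorch sketch computing it from sequence log-probabilities under the policy and a frozen reference model. The beta value and tensor shapes are illustrative, and the iterative part (regenerating preference pairs with the updated policy and the Starling-RM-34B reward model) is omitted.

```python
# Standard DPO loss computed from per-sequence log-probabilities. Beta and the
# shapes are illustrative; the iterative pair-regeneration loop is omitted.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_logp_chosen,    # log p_theta(y_chosen | x), shape (batch,)
    policy_logp_rejected,  # log p_theta(y_rejected | x), shape (batch,)
    ref_logp_chosen,       # same quantities under the frozen reference model
    ref_logp_rejected,
    beta=0.1,
):
    # The implicit reward of a response is beta * (policy logp - reference logp);
    # the loss pushes the chosen response's reward above the rejected one's.
    chosen_reward = policy_logp_chosen - ref_logp_chosen
    rejected_reward = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()


if __name__ == "__main__":
    batch = [torch.randn(4) for _ in range(4)]
    print(dpo_loss(*batch).item())
```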