Llama 3: Context Extended to 100k Tokens!
In a remarkable feat of community engineering, the Llama 3 model's context window has been extended to nearly 100,000 tokens. By combining PoSE (Positional Skip-wisE training) with continued pre-training of the Llama 3 8B base model on 300 million tokens, the effort, led by Wing Lian, grew the context window from 8,000 to 64,000 tokens. Applying RoPE scaling on top of that pushed the supported context to almost 100,000 tokens with perfect recall.

PoSE plays the pivotal role in this advance: it simulates long inputs while training inside a fixed-length context window. Each training example is split into chunks, and the position ids of the later chunks are skipped forward so the model sees the positional distances of a much longer document. This keeps memory and training time close to those of ordinary short-context training while preserving performance (a minimal sketch of this position-id trick follows below).

Key Insights:

- Avoid increasing rope_theta during the continued pre-training itself; leaving it alone gave better results.
- Rank-stabilized LoRA (rsLoRA) converges much faster than regular LoRA.
- Increasing RoPE theta after training stretches the usable context window to roughly 90,000 tokens.
- The resulting adapters can be applied to any Llama 3 model, making it straightforward to extend the context of other fine-tunes as well.

This extension of the Llama 3 context window opens up exciting possibilities for handling longer and more complex inputs, paving the way for better performance and versatility across a range of natural language processing tasks.
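For a concrete picture of the chunking trick described above, here is a minimal sketch of PoSE-style position ids for the two-chunk case. The function name and the uniform skip sampling are illustrative assumptions for this post, not the exact recipe used in the actual training run.

```python
import random

def pose_position_ids(window_len: int, target_len: int) -> list[int]:
    """Assign position ids to a window_len-token training example so that
    they span a much larger target_len context (two-chunk PoSE sketch)."""
    # Split the fixed training window into two contiguous chunks of tokens.
    split = random.randint(1, window_len - 1)

    # The second chunk keeps its tokens, but its position ids are shifted
    # forward by a random skip, so the model is exposed to relative
    # distances up to target_len even though only window_len tokens are fed.
    skip = random.randint(0, target_len - window_len)

    first = list(range(split))                              # positions 0 .. split-1
    second = list(range(split + skip, skip + window_len))   # shifted positions
    return first + second

# Example: an 8k-token batch whose position ids cover a 64k context.
ids = pose_position_ids(window_len=8192, target_len=65536)
assert len(ids) == 8192 and max(ids) < 65536
```

During training, these position ids are passed to the model alongside the input ids, so the rotary embeddings treat the two chunks as if they sat far apart in a long document.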
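The post-training RoPE adjustments mentioned in the key insights can likewise be illustrated with a Hugging Face Transformers config. The values below are placeholders to show where rope_theta and rope_scaling are set, not the numbers used in the actual run.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"  # base model; the extended checkpoint differs

config = AutoConfig.from_pretrained(model_id)
config.rope_theta = 500000.0                              # RoPE base frequency (Llama 3 default; raise to extend further)
config.rope_scaling = {"type": "linear", "factor": 2.0}   # stretch positions beyond the training length
config.max_position_embeddings = 16384                    # advertise the longer supported context

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```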