LLM Compute Allocation: MoD Approach
Google DeepMind's paper "Mixture-of-Depths (MoD)" explores dynamic compute allocation for Large Language Models (LLMs). Rather than spending the same amount of compute on every token, an MoD transformer learns to allocate compute to specific positions in a sequence, and that allocation can differ from layer to layer. The compute budget is enforced by capping the number of tokens, k, that may participate in the self-attention and MLP computations at a given layer; the remaining tokens skip the block through the residual connection. A learned top-k router decides which tokens are processed at each layer. Because k is fixed in advance, the total compute is known ahead of time even though the choice of tokens is dynamic. The authors report that MoD models can match baseline performance for an equivalent training compute budget while using fewer FLOPs per forward pass, which could make LLM training more efficient and scalable.
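The routing step can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the names `router_scores`, `top_k_indices`, and `mod_layer` are hypothetical, the router is assumed to be a single weight vector `w`, and `block` is a per-token stand-in for the real self-attention and MLP block (which, in the actual model, would attend across the selected tokens).

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def router_scores(tokens: Sequence[Vector], w: Vector) -> List[float]:
    """One scalar score per token: dot product with the router weights (assumed linear router)."""
    return [sum(t_j * w_j for t_j, w_j in zip(t, w)) for t in tokens]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, returned in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mod_layer(
    tokens: Sequence[Vector],
    w: Vector,
    block: Callable[[Vector], Vector],
    k: int,
) -> Tuple[List[Vector], List[int]]:
    """Apply `block` only to the top-k tokens; all others pass through unchanged."""
    scores = router_scores(tokens, w)
    selected = set(top_k_indices(scores, k))
    out: List[Vector] = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Routed token: residual plus the block output scaled by its router score,
            # so the router receives a gradient signal in a real training setup.
            update = block(x)
            out.append([x_j + scores[i] * u_j for x_j, u_j in zip(x, update)])
        else:
            # Skipped token: the residual path only, no block compute spent.
            out.append(list(x))
    return out, sorted(selected)


# Toy example: 4 tokens of width 2, capacity k = 2.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
w = [1.0, -1.0]
out, routed = mod_layer(tokens, w, block=lambda x: [2.0 * v for v in x], k=2)
print(routed)  # indices of the tokens that went through the block
print(out)
```

With k = 2 out of 4 tokens, the block runs on only half the sequence at this layer, which is where the compute saving comes from; lowering k trades more savings against fewer tokens being updated.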