Noisy Embedding Fine-Tuning (NEFTune) for Llama 2


Motivation:

NEFTune is a simple yet powerful augmentation technique that can substantially improve language model fine-tuning: it adds random noise to the embedding vectors during training. For example, fine-tuning LLaMA-2-7B on Alpaca with the standard recipe achieves 29.79% on AlpacaEval, while the same setup with NEFTune's noisy embeddings reaches 64.69%. The gains are not limited to Alpaca: NEFTune also surpasses strong baselines on modern instruction datasets, improving models trained on Evol-Instruct by about 10% and models trained on ShareGPT and OpenPlatypus by about 8%. Notably, even strong models already refined with RLHF, such as LLaMA-2-Chat, benefit from additional training with NEFTune.

This technique is closely related to the method in "Noisy Networks for Exploration" (Fortunato et al., 2017). In that work, the authors add parametric noise to the weights of a neural network to encourage exploration in reinforcement learning agents. The idea is that the added noise pushes the agent to explore a broader range of actions, potentially leading to better policy learning.

Imports:

from functools import wraps
from typing import Optional

import torch

from transformers import Trainer
from transformers.modeling_utils import unwrap_model

Noisy Embedding:

In this method, we add random noise to the output of the embedding layer, which sits before the transformer blocks; training on these perturbed embeddings makes the network more robust. Rather than rewriting the embedding layer's forward method, we can attach a forward hook that post-processes its output. The following hook implements the noisy forward pass:

def neftune_post_forward_hook(module, input, output):
    """
    Implements the NEFTune forward pass for the model using forward hooks. Note this works only for
    torch.nn.Embedding layers. This method is slightly adapted from the original source code
    that can be found here: https://github.com/neelsjain/NEFTune

    Simply add it to your model as follows:
    ```python
    model = ...
    model.embed_tokens.neftune_noise_alpha = 0.1
    model.embed_tokens.register_forward_hook(neftune_post_forward_hook)
    ```

    Args:
        module (`torch.nn.Module`):
            The embedding module where the hook is attached. Note that you need to set
            `module.neftune_noise_alpha` to the desired noise alpha value.
        input (`torch.Tensor`):
            The input tensor to the model.
        output (`torch.Tensor`):
            The output tensor of the model (i.e. the embeddings).
    """
    if module.training:
        # Scale the noise by alpha / sqrt(seq_len * hidden_dim), per the NEFTune paper.
        dims = torch.tensor(output.size(1) * output.size(2))
        mag_norm = module.neftune_noise_alpha / torch.sqrt(dims)
        # Add uniform noise in [-mag_norm, mag_norm] to the embedding output.
        output = output + torch.zeros_like(output).uniform_(-mag_norm, mag_norm)
    return output
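
Before wiring the hook into a full training run, it helps to see it on its own. Below is a minimal sanity check on a toy torch.nn.Embedding; the vocabulary size, embedding dimension, and alpha value are illustrative choices, not values from this post. With alpha = 5, sequence length 3, and embedding dimension 4, the noise is drawn uniformly from [-5/sqrt(12), 5/sqrt(12)], roughly [-1.44, 1.44], and only while the module is in training mode:

import torch
from torch import nn

# Toy embedding layer: vocabulary of 10 tokens, embedding dimension of 4.
embed = nn.Embedding(10, 4)
embed.neftune_noise_alpha = 5.0
handle = embed.register_forward_hook(neftune_post_forward_hook)

tokens = torch.tensor([[1, 2, 3]])  # batch of 1, sequence length 3

embed.train()
noisy = embed(tokens)   # hook fires and adds uniform noise
embed.eval()
clean = embed(tokens)   # hook returns the output unchanged

print(torch.allclose(noisy, clean))  # False: noise was added in training mode
handle.remove()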

Now let's integrate the hook into a training loop. Suppose we load a model such as Llama 2 from the transformers library and fine-tune it with QLoRA, training it with the Trainer class (from transformers import Trainer). To attach neftune_post_forward_hook during training and cleanly remove it afterwards, we subclass Trainer in a custom NoisyTrainer class, as illustrated below:

class NoisyTrainer(Trainer):
    def __init__(self, neftune_noise_alpha: Optional[float] = None, *args, **kwargs):
        self.neftune_noise_alpha = neftune_noise_alpha
        super().__init__(*args, **kwargs)

    @wraps(Trainer.train)
    def train(self, *args, **kwargs):
        # Activate NEFTune right before training.
        if self.neftune_noise_alpha is not None:
            self.model = self._activate_neftune(self.model)

        output = super().train(*args, **kwargs)

        # After training, restore the original forward pass of the embedding
        # layer by removing the forward post-hook and the alpha attribute.
        if self.neftune_noise_alpha is not None:
            unwrapped_model = unwrap_model(self.model)
            embeddings = unwrapped_model.base_model.model.get_input_embeddings()
            self.neftune_hook_handle.remove()
            del embeddings.neftune_noise_alpha

        return output

    def _activate_neftune(self, model):
        # `base_model.model` assumes a PEFT-wrapped model, as in QLoRA fine-tuning.
        unwrapped_model = unwrap_model(model)
        embeddings = unwrapped_model.base_model.model.get_input_embeddings()
        embeddings.neftune_noise_alpha = self.neftune_noise_alpha
        hook_handle = embeddings.register_forward_hook(neftune_post_forward_hook)
        self.neftune_hook_handle = hook_handle
        return model
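
The NoisyTrainer above assumes a PEFT-wrapped model, hence the base_model.model attribute access. For context, here is a minimal sketch of the QLoRA setup it expects; it assumes the peft and bitsandbytes packages are installed, and the model name and LoRA hyperparameters are illustrative choices rather than values from this post:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"

# Load the base model in 4-bit precision, as is standard for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; targeting the attention projections is a common choice
# for Llama-style models. get_peft_model wraps the model so that the input
# embeddings live under model.base_model.model, matching NoisyTrainer.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)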

We can then use the NoisyTrainer for training as shown below:

trainer = NoisyTrainer(
    neftune_noise_alpha=5.0,  # noise scale; NEFTune stays off if this is None
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

trainer.train()
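
If neftune_noise_alpha is left as None, NoisyTrainer behaves exactly like the stock Trainer. The NEFTune paper sweeps alpha over 5, 10, and 15, so the 5.0 above is a reasonable starting point rather than a tuned value.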

Conclusion

Adding noise to the inputs plays a central role in strengthening the language model fine-tuning described in this post. By injecting random noise into the output of the embedding layer, positioned just before the transformer blocks, NEFTune acts as a form of regularization: it discourages the model from overfitting the exact token embeddings of the instruction dataset and encourages it to generalize across slightly perturbed inputs. The result is a simple, cheap change to the training loop that measurably improves instruction-following performance.