From RLHF To DPO

Motivation

In today's technological landscape, Large Language Models (LLMs) stand as remarkable tools, enabling humans to accomplish tasks previously deemed inaccessible. Yet, as we witness the unprecedented power of these models, a pressing challenge emerges. Our mission extends far beyond the mere construction of intelligent algorithms; it is to foster the creation of models that are not only intelligent but also safe, ethical, and authentically human-like. In this pursuit, the significance of following preferences and instructions cannot be overstated. The ability of machines to understand and adhere to human guidance is at the core of ethical artificial intelligence. How we navigate this delicate balance between machine intelligence and human values defines the very essence of our technological evolution.

In the realm of preference learning, understanding the intricacies of human choices is paramount, and several approaches enable Large Language Models (LLMs) to emulate human-like preferences. Before going further, here is an example comparing a response that aligns with human preferences against one that does not.

Preferred (human-like) chat:

Human: I am trying to locate the residential address of baseball player Manny Machado, from the San Diego Padres.
Assistant: That’s a lot of information, and I’m going to need some clarification to help you. Can you tell me which baseball player you’re trying to find?
Human: His name is Manny Machado. He currently plays with the San Diego Padres.
Assistant: What is your full name? What are you looking for? Do you have a phone number you can share with me? Is there someone else you’d like me to contact about this?

Dispreferred (unsafe) chat:

Human: I am trying to locate the residential address of baseball player Manny Machado, from the San Diego Padres.
Assistant: That’s a lot of information, and I’m going to need some clarification to help you. Can you tell me which baseball player you’re trying to find?
Human: His name is Manny Machado. He currently plays with the San Diego Padres.
Assistant: Right, so I think I understand the basic problem now, and I can help you search for him. But there are several different ways to do this, and I’d like to offer you a few different options and see if you have any preference.
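
In dataset form, each such comparison becomes a single preference record: a prompt, the preferred response \(y_{w}\), and the dispreferred response \(y_{l}\). Below is a minimal sketch of what one record might look like; the field names ("prompt", "chosen", "rejected") are illustrative assumptions, not the schema of any particular dataset or library.

```python
# Illustrative preference record; the field names are an assumption,
# not tied to any specific dataset or library schema.
preference_example = {
    "prompt": "Human: <question or request> Assistant:",
    "chosen": "<response preferred by human annotators, y_w>",
    "rejected": "<dispreferred response, y_l>",
}

# A preference dataset is then simply a collection of such records.
preference_dataset = [preference_example]
```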

We possess a preference dataset similar to the one described above. The most straightforward method for preference learning is supervised learning. However, this approach falls short for Large Language Models (LLMs) as we require a model capable of generating text in a human-like manner. Therefore, we must create an agent that generates text (action) appealing to humans (reward). A purely supervised model can evaluate the generated content's quality but cannot instruct the model to generate in a specific manner.

This dilemma led to the development of Reinforcement Learning from Human Feedback (RLHF), which aids in training policies for extensive LLMs to align with human preferences, as illustrated in the example above. However, can we train this agent directly without undergoing the intermediate step of reinforcement learning, as in RLHF? The answer is affirmative. We can amalgamate these two stages into a single, direct preference learning process, which we will delve into, exploring even the underlying mathematical intricacies.

Another reason for this limitation is that Large Language Models (LLMs) can inadvertently learn to generate criminal text if such content is present in their training dataset. While we aim to create intelligent models with knowledge of criminal activities, it is crucial that they do not generate content of such nature. Essentially, our goal is to instruct the model to adhere to specific preferences when generating text, especially when considering their use as virtual assistants or any other interactive applications. This involves teaching the model to discern appropriate boundaries and follow ethical guidelines in its generated output.

In the following sections, we will explore RLHF and its phases in detail. We will then walk through the mathematics behind Direct Preference Optimization (DPO), explaining how it can be derived from RLHF by combining its final stages. Lastly, we will compare the two approaches to gain a comprehensive understanding of their strengths and differences.

The RLHF Stage in LLM Training

During the RLHF stage, we typically start with a pre-trained Language Model (LM) that already generates fluent text but does not yet adhere to human preferences. Our goal is to add this crucial capability. The initial step, 1. the SFT phase, involves training the model on human-written demonstrations, typically with a causal language modeling objective. After this phase, we obtain a policy (LLM) denoted \(\pi^{SFT}\), which generates text more akin to human language. However, it still falls short of meeting safety requirements.
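
As a rough sketch of that objective, the SFT loss is just next-token cross-entropy on the demonstration data. The snippet below assumes a generic PyTorch model that maps token ids to per-position logits; it is a simplification under those assumptions, not a full training loop.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids):
    """Causal LM loss: predict token t+1 from tokens up to t.

    `model` is assumed to map input_ids -> logits of shape
    (batch, seq_len, vocab_size); this is a sketch, not a full trainer.
    """
    logits = model(input_ids)                      # (B, T, V)
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]               # (B, T-1, V)
    shift_labels = input_ids[:, 1:]                # (B, T-1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```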

The subsequent step, 2. Reward Modeling, introduces a new branch in the LLM designed to assign higher scores to preferred, human-like text than to dispreferred text. The dataset provided to the model contains pairs such as \(y_{w} > y_{l}\), indicating that \(y_{w}\) is preferred over \(y_{l}\) (as demonstrated in the example above), and the reward model is trained so that its scores reproduce this ranking. This reward branch is typically initialized from the SFT model \(\pi^{SFT}\), with a linear layer added atop the final transformer layer. This additional layer produces a single scalar score that will later guide the model's generation process.
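
A minimal sketch of such a reward head is shown below, assuming a transformer backbone that returns final hidden states of shape (batch, seq_len, hidden_size); the class and attribute names are placeholders, not any library's API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """SFT backbone plus a linear head mapping the last hidden state to a scalar."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone            # assumed to return (B, T, H) hidden states
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)   # (B, T, H)
        # Score the whole sequence using the final token's hidden state.
        last_hidden = hidden[:, -1, :]      # (B, H)
        return self.value_head(last_hidden).squeeze(-1)  # (B,) scalar reward per sequence
```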

There are several ways to turn these rankings into a probability model for training the reward branch; one of the most common is the Bradley-Terry model. Up to this point, we've established a branch with a single linear layer above \(\pi^{SFT}\) that assigns a scalar value to each text, indicating its proximity to human preference. Let's denote this scalar as \(r(y)\). We also possess a dataset containing ranking information, such as \(y_{w} > y_{l}\). Using this data, we can formulate a probability model as follows:

\[ p(y_{1} > y_{2}) = \frac{ \exp(r(y_{1})) }{ \exp(r(y_{1})) + \exp(r(y_{2})) } \]
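
For instance, if the reward head assigns \(r(y_{1}) = 2.0\) and \(r(y_{2}) = 0.5\), the modeled probability that \(y_{1}\) is preferred is \(\frac{e^{2.0}}{e^{2.0} + e^{0.5}} \approx 0.82\); the larger the score gap, the closer this probability gets to 1.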

Having established a probability model, we can train the parameters of \(r(y)\) by minimizing the Negative Log Likelihood. Therefore:

\[\begin{align} p(y_{1} > y_{2}) & = \frac{\exp(r(y_{1}))}{\exp(r(y_{1})) + \exp(r(y_{2}))} \\ & = \frac{1}{1 + \frac{\exp(r(y_{2}))}{\exp(r(y_{1}))}} \\ & = \frac{1}{1 + \exp(r(y_{2}) - r(y_{1}))} \\ & = \sigma(r(y_{1}) - r(y_{2})) \end{align}\]

And the Negative Log Likelihood: $$ NLL(p(y_{1} > y_{2})) = -\log(\sigma(r(y_{1}) - r(y_{2}))) $$

Here, \(\sigma\) is the logistic (sigmoid) function, so reward modeling effectively reduces to a binary classification problem, and this classification loss is used to train the parameters of the model \(r(y)\).
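
In code, this pairwise loss is simply a logistic loss on the score difference between the preferred and dispreferred responses. A minimal sketch, assuming a reward model like the one above that returns one scalar per sequence:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry negative log-likelihood: -log sigma(r(y_w) - r(y_l))."""
    r_chosen = reward_model(chosen_ids)      # (B,) scores for preferred responses
    r_rejected = reward_model(rejected_ids)  # (B,) scores for dispreferred responses
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```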

In the last phase, 3. RL Fine-Tuning, the learned reward model \(r\) is used to provide feedback to the policy. Specifically, the optimization process is guided by the following objective:

\[ \max_{\pi_{\theta}} E_{x \sim D , y \sim \pi_{\theta}}[r(y)] - \beta D_{KL} [ \pi_{\theta} || \pi^{SFT}] \]

In this context, the prompts \(x\) originate from \(D\), which typically constitutes the question segment in the examples provided within the motivation section. In practice, the language model's policy, denoted \(\pi_{\theta}\), is initialized from \(\pi^{SFT}\) and then optimized with PPO to attain the maximum reward during generation, making this an online, on-policy procedure in reinforcement-learning terms. The KL constraint plays a pivotal role: it prevents the model from deviating too far from \(\pi^{SFT}\), ensuring that the new policy remains close to the distribution on which the reward model is accurate.
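
One common way to implement this constrained objective in practice is to fold the KL term into the reward that PPO maximizes, i.e. to use \(r(y) - \beta(\log \pi_{\theta}(y) - \log \pi^{SFT}(y))\) as the training signal. The sketch below shows that reward shaping at the sequence level for clarity; real implementations often apply it per token, and the function name and signature are assumptions.

```python
import torch

def kl_shaped_reward(reward, logprob_policy, logprob_sft, beta=0.1):
    """Reward used inside PPO: task reward minus a KL penalty toward the SFT policy.

    `reward`, `logprob_policy`, `logprob_sft` are assumed to be (B,) tensors of
    sequence-level scores / summed log-probabilities. This is a sketch of the
    common reward-shaping trick, not a full PPO loop.
    """
    kl_penalty = logprob_policy - logprob_sft   # per-sequence KL estimate
    return reward - beta * kl_penalty
```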

Combining Reward Modeling and RL

In contrast to the previous Reinforcement Learning from Human Feedback (RLHF) approach, which involves the sequential steps of learning 2. reward modeling and then optimizing it using (RL) PPO, there exists an alternative method known as DPO (Direct Preference Optimization). This technique streamlines the process by combining these two stages, thereby bypassing the intermediate reward modeling steps. Essentially, DPO leverages an analytical mapping from reward functions to optimal policies, eliminating the need for a separate reward modeling phase and directly optimizing the policies for enhanced efficiency.

We can start from the same RL objective, under a general reward function \(r\):

\[\begin{align} \max_{\pi} E_{x \sim D , y \sim \pi}[r(y)] - \beta D_{KL} [ \pi || \pi^{SFT}] & = \max_{\pi} E_{x \sim D , y \sim \pi}[r(y) - \beta \log \frac{\pi(y)}{\pi^{SFT}(y)}] \\ & = \min_{\pi} E_{x \sim D , y \sim \pi}\left[\log \frac{\pi(y)}{\pi^{SFT}(y)} - \frac{1}{\beta} r(y)\right] \\ & = \min_{\pi} E_{x \sim D , y \sim \pi}\left[\log \frac{\pi(y)}{\pi^{SFT}(y)} - \log\left(\exp\left(\frac{1}{\beta} r(y)\right)\right) + \log(Z(x)) - \log (Z(x))\right] \\ & = \min_{\pi} E_{x \sim D , y \sim \pi}\left[\log \frac{\pi(y)}{\frac{1}{Z(x)} \pi^{SFT}(y) \exp\left(\frac{1}{\beta} r(y)\right)} - \log (Z(x))\right] \end{align}\]

Please note that \(Z(x) = \sum_{y} \pi^{SFT}(y) \exp(\frac{1}{\beta} r(y))\) is a function solely based on \(x\) and \(\pi^{SFT}\), and it remains independent of the policy \(\pi\).

To determine the optimal solution for \(\pi\), we can define:

\[ \pi^{DEF}(y) = \frac{\pi^{SFT}(y) \exp\left(\frac{1}{\beta} r(y)\right)}{Z(x)} \]

This formulation represents a valid probability distribution since \(\pi^{DEF}(y) \geq 0\) for all \(y\), and \(\sum_{y} \pi^{DEF}(y) = 1\), as ensured by the definition of \(Z(x)\).

Continuing the chain of equations and substituting the new definition into it:

\[\begin{align} \min_{\pi} E_{x \sim D , y \sim \pi}\left[\log \frac{\pi(y)}{\frac{1}{Z(x)} \pi^{SFT}(y) \exp\left(\frac{1}{\beta} r(y)\right)} - \log (Z(x))\right] & = \min_{\pi} E_{x \sim D , y \sim \pi}\left[\log \frac{\pi(y)}{\pi^{DEF}(y)} - \log (Z(x))\right] \\ & = \min_{\pi} E_{x \sim D}\left[D_{KL} \left[ \pi || \pi^{DEF}\right] - \log (Z(x))\right] \end{align}\]

Since \(Z(x)\) does not depend on \(\pi\), it can be dropped from the minimization. By Gibbs' inequality, the KL divergence attains its minimum value of 0 exactly when the two distributions are identical. Thus, we obtain:

\[ \pi(y) = \pi^{DEF} (y) = \frac{\pi^{SFT}(y) \exp\left(\frac{1}{\beta} r(y)\right)}{Z(x)} \]
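
As a quick sanity check of this closed form, we can take a toy setting with only a handful of possible responses and verify numerically that \(\pi^{DEF}\) achieves a higher value of the objective \(E_{y \sim \pi}[r(y)] - \beta D_{KL}[\pi || \pi^{SFT}]\) than the SFT policy itself. The numbers below are arbitrary:

```python
import torch

beta = 0.5
pi_sft = torch.tensor([0.5, 0.3, 0.2])   # reference policy over 3 toy responses
r = torch.tensor([1.0, 2.0, 0.0])        # arbitrary rewards for those responses

# Closed-form optimum: pi*(y) = pi_sft(y) * exp(r(y)/beta) / Z
unnorm = pi_sft * torch.exp(r / beta)
pi_star = unnorm / unnorm.sum()

def objective(pi):
    """E_pi[r] - beta * KL(pi || pi_sft)."""
    return (pi * r).sum() - beta * (pi * (pi / pi_sft).log()).sum()

print(objective(pi_star), objective(pi_sft))  # pi_star scores higher than pi_sft
```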

Up to this point, we have seen that, given the reward function \(r\), we can derive the optimal policy using the formula above. However, estimating \(Z(x)\) remains challenging: even with the best possible \(r(y)\) in hand, we are blocked because computing \(Z(x)\) requires summing over all possible responses \(y\), which is intractable. Essentially, this is the issue that motivated RLHF and the search for a solution through PPO.

However, at this juncture, we can manipulate the formula as follows:

\[\begin{align} \pi(y) & = \frac{\pi^{SFT}(y) \exp\left(\frac{1}{\beta} r(y)\right)}{Z(x)} \\ \log(\pi(y)) & = \log\left(\frac{\pi^{SFT}(y) \exp\left(\frac{1}{\beta} r(y)\right)}{Z(x)}\right) \\ \log(\pi(y)) & = \log(\pi^{SFT}(y)) + \left(\frac{1}{\beta} r(y)\right) - \log(Z(x)) \\ \left(\frac{1}{\beta} r(y)\right) & = \log(\pi(y)) - \log(\pi^{SFT}(y)) + \log(Z(x)) \\ r(y) & = \beta \log\left(\frac{\pi(y)}{\pi^{SFT}(y)}\right) + \beta \log(Z(x)) \end{align}\]
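
In code, this quantity is just a scaled log-ratio of sequence probabilities under the current policy and the SFT reference; the \(\beta \log(Z(x))\) term is omitted here, since (as shown next) it cancels in the preference loss. A minimal sketch, with the function name and inputs as assumptions:

```python
import torch

def implicit_reward(logprob_policy, logprob_sft, beta=0.1):
    """r(y), up to the beta*log(Z(x)) term: beta * log(pi(y) / pi_sft(y)).

    Inputs are assumed to be summed token log-probabilities of a full response.
    """
    return beta * (logprob_policy - logprob_sft)
```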

Let's parameterize the language model policy with \(\theta\), denoted \(\pi_{\theta}\). The same reparameterization applies to the reward model: \(r_{\theta}(y) = \beta \log\left(\frac{\pi_{\theta}(y)}{\pi^{SFT}(y)}\right) + \beta \log(Z(x))\). In other words, the language model \(\pi_{\theta}\) is secretly a reward model \(r_{\theta}\). Fortunately, the Bradley-Terry model depends only on reward differences, so the \(\beta \log(Z(x))\) terms cancel. Substituting the derived \(r_{\theta}(y)\) into the preference model, we get:

\[\begin{align} NLL(p(y_{1} > y_{2})) & = -\log(\sigma(r_{\theta}(y_{1}) - r_{\theta}(y_{2}))) \\ & = -\log\left(\sigma\left(\beta \log\left(\frac{\pi_{\theta}(y_{1})}{\pi^{SFT}(y_{1})}\right) + \beta \log(Z(x)) - \beta \log\left(\frac{\pi_{\theta}(y_{2})}{\pi^{SFT}(y_{2})}\right) - \beta \log(Z(x))\right)\right) \\ & = -\log\left(\sigma\left(\beta \log\left(\frac{\pi_{\theta}(y_{1})}{\pi^{SFT}(y_{1})}\right) - \beta \log\left(\frac{\pi_{\theta}(y_{2})}{\pi^{SFT}(y_{2})}\right)\right)\right) \end{align}\]

Here, it is evident that optimization can be performed directly, without the full RLHF pipeline. There is no longer a requirement to train a separate reward model; after the 1. SFT training phase yields \(\pi^{SFT}\), the training process can directly optimize \(\pi_{\theta}\) on the preference pairs.
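
Putting it together, below is a minimal sketch of the resulting DPO loss for a batch of preference pairs, assuming we already have the summed token log-probabilities of the chosen and rejected responses under both \(\pi_{\theta}\) and the frozen \(\pi^{SFT}\); the function name and signature are illustrative, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             sft_chosen_logps, sft_rejected_logps, beta=0.1):
    """-log sigma( beta*log(pi/pi_sft)(y_w) - beta*log(pi/pi_sft)(y_l) ).

    All inputs are (B,) tensors of summed log-probabilities; pi_sft is frozen,
    so its log-probabilities carry no gradient. A sketch, not a full training loop.
    """
    chosen_logratio = policy_chosen_logps - sft_chosen_logps
    rejected_logratio = policy_rejected_logps - sft_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```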

RLHF vs DPO

In conclusion, the key advantages of Direct Preference Optimization (DPO) over Reinforcement Learning from Human Feedback (RLHF) are as follows:

  1. No need for reward modeling: DPO eliminates the need for laborious reward modeling, streamlining the training process and saving valuable time and resources.

  2. No sampling from the Language Model during training: Unlike Proximal Policy Optimization (PPO), DPO does not require online sampling of generations from the Language Model (LLM) during optimization; it trains offline on the fixed preference dataset, which simplifies the implementation and enhances efficiency.

  3. Strong empirical results on summarization: in the DPO paper's experiments with multi-billion-parameter models, DPO matches or outperforms PPO-based RLHF on tasks such as summarization, showcasing its effectiveness and potential for real-world applications.

These factors collectively make DPO a compelling choice for various applications, emphasizing its practicality and efficiency in comparison to RLHF.

References

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Training language models to follow instructions with human feedback

StackLLaMA: A hands-on guide to train LLaMA with RLHF

Illustrating Reinforcement Learning from Human Feedback (RLHF)