Transformers: The T in GPT Models


Motivation

We have heard a lot about GPT models, but less attention is paid to the design that makes them possible. In this blog, we aim to draw a picture of what is behind GPT models: Transformers. There is a wealth of documentation about Transformers, covering concepts like self-attention, encoders, and decoders, which can feel overwhelming. Here, we try to simplify everything and sketch the right picture. The design is not as complicated as that of other networks like LSTMs, but you might need a fresh start, which you can find in this blog post.

Steps to Understand Transformers

Below are the sections we will cover, which will help you understand the different parts of Transformers. We define the goals and explain, step by step, the solutions researchers found, leading to the invention of Transformers.

Goal 1: Design a method to create a single vector representation of a sequence. For example, given \((s_1, s_2, \ldots, s_n)\), how can we represent it in a way that captures all of its information? The sequence length may vary, since sentences typically have different lengths. We also need to track position, because a permutation of the sequence should convey a different meaning. These conditions are necessary for representing the sequence as a vector. To achieve this goal, we also need a training objective for learning these representations, which should be clearly defined along with the other conditions.

Solutions for Goal 1:

  • Self-Attention: Captures all the information from sequences of different lengths.
  • Position Embedding: Keeps track of positions.
  • Masked Language Model: Defines the objective.

Goal 2: Define a method to transform one sequence \((s_1, s_2, \ldots, s_n)\) into another sequence \((o_1, o_2, \ldots, o_m)\), under the same conditions as above but probably with a different objective. Here, we are directly looking to map from \(s\) to \(o\).

Solutions for Goal 2:

  • Encoder-Decoder Attention: Captures all the information from sequences of different lengths.
  • Position Embedding: Keeps track of positions.
  • Output Likelihood: Defines the objective.

Blog sections: Combining all the concepts, the sections in this blog are:

  • (Masked) Self-Attention
  • Position Embedding
  • Masked Language Model
  • (Cross) Encoder-Decoder Attention
  • Output Likelihood

(Masked) Self-Attention

Let's consider a sequence \((x_1, x_2, \ldots, x_{n}, x_{n+1})\). First, we have an input embedding layer that maps each \(x_i\) to a vector of dimension d. Self-attention is a mechanism to create rich embeddings for each element in the sequence by considering the other elements.

From each \(x_i\), we derive three vectors: a query (\(q\)), a key (\(k\)), and a value (\(v\)), by multiplying \(x_i\) with learnable parameter matrices. To create a rich embedding for each position, we take a weighted average of all the value vectors, where the weight of each position reflects its contribution, determined by the similarity between queries and keys.

Detailed Steps:

Embedding and Linear Transformation: Each \(x_i\) is mapped to a vector \(e_i\) of dimension \(d\) through the embedding layer. For each \(e_i\), we compute the query \(q_i\), key \(k_i\), and value \(v_i\) vectors using learned weight matrices \(W_q\), \(W_k\), and \(W_v\):

\[\begin{align} q_i = e_i W_q, \quad k_i = e_i W_k, \quad v_i = e_i W_v \end{align}\]

Attention Score Calculation: For each query \(q_i\), compute the attention scores with all keys \(k_j\) in the sequence. This is done using the dot product:

\[\begin{align} \text{AttentionScore}(i, j) = q_i \cdot k_j^T \end{align}\]

Softmax to Get Attention Weights: Scale the attention scores by the square root of the key dimension \(d_k\) and convert them to probabilities using the softmax function, taken over all positions \(j\):

\[\begin{align} \text{AttentionWeight}(i, j) = \text{softmax}\left(\frac{\text{AttentionScore}(i, j)}{\sqrt{d_k}}\right) \end{align}\]

Weighted Sum of Value Vectors: Compute the output for each position \(i\) by taking the weighted sum of all value vectors \(v_j\):

\[\begin{align} \text{Output}_i = \sum_{j=1}^{n+1} \text{AttentionWeight}(i, j) \cdot v_j \end{align}\]
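
To make these steps concrete, here is a minimal NumPy sketch of single-head self-attention; the function names, matrix shapes, and random toy inputs are illustrative assumptions, not a definitive implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of embeddings E of shape [n, d]."""
    Q = E @ W_q                             # queries, shape [n, d_k]
    K = E @ W_k                             # keys,    shape [n, d_k]
    V = E @ W_v                             # values,  shape [n, d_v]
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # AttentionScore(i, j), shape [n, n]
    weights = softmax(scores, axis=-1)      # AttentionWeight(i, j), each row sums to 1
    return weights @ V                      # Output_i = sum_j weights[i, j] * v_j

# Toy usage with random embeddings and weight matrices.
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 8
E = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(E, W_q, W_k, W_v)      # shape [n, d_k]
```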

Masked self-attention is a variation used primarily in autoregressive models, where the model should not peek at future tokens. In this case, the attention mechanism is masked to ensure that, when computing the attention for position \(i\), only positions up to \(i\) are considered.

To compute the output at position \(n+1\) (the most recent token), only the query vector \(Q_{n+1}\) is needed on the query side, while the keys and values of all positions up to and including \(n+1\) are used:

  • Query Vector: The query vector \(Q_{n+1}\) (derived from \(x_{n+1}\)) is used.
  • Key Matrix: The key vectors \(K_1, K_2, \ldots, K_{n+1}\) (derived from \(x_1, x_2, \ldots, x_{n+1}\)) are used.
  • Value Matrix: The value vectors \(V_1, V_2, \ldots, V_{n+1}\) (derived from \(x_1, x_2, \ldots, x_{n+1}\)) are used.

(Masked) Detailed Steps:

Attention Scores Calculation:

\[\begin{align} \text{AttentionScore}(n+1, j) = Q_{n+1} \cdot K_j^T \quad \text{for} \quad j = 1, 2, \ldots, n+1 \end{align}\]

Softmax to Get Attention Weights:

\[\begin{align} \text{AttentionWeight}(n+1, j) = \text{softmax}\left(\frac{Q_{n+1} \cdot K_j^T}{\sqrt{d_k}}\right) \end{align}\]

Weighted Sum of Value Vectors:

\[\begin{align} \text{Output}_{n+1} = \sum_{j=1}^{n+1} \text{AttentionWeight}(n+1, j) \cdot V_j \end{align}\]

By following these steps, we can compute the attention output for each position in the sequence, ensuring that the embedding of each element is enriched by considering its context within the sequence.
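
As a sketch of the masked variant (again with assumed names and shapes), the only change is that future positions receive a score of \(-\infty\) before the softmax, so their weights become zero:

```python
import numpy as np

def masked_self_attention(E, W_q, W_k, W_v):
    """Causal self-attention: position i attends only to positions 1..i."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # shape [n, n]
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    scores = np.where(future, -np.inf, scores)           # mask out future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# The last row of the result corresponds to position n+1: it uses only
# Q_{n+1} together with the keys and values of positions 1..n+1.
```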

In short, keep in mind that instead of naive, context-free embeddings, Transformers use all previous tokens to create better embeddings, which are usually called hidden states. The hidden state for the \((n+1)\)-th position, \(h_{n+1}\), is computed as follows:

\[\begin{align} h_{n+1} = \sum_{j=1}^{n+1} \text{AttentionWeight}(n+1, j) \cdot v_j \end{align}\]

where the attention weights are calculated using the softmax function, normalized over all positions \(j = 1, \ldots, n+1\):

\[\begin{align} \text{AttentionWeight}(n+1, j) = \text{softmax}\left(\frac{Q_{n+1} \cdot K_j^T}{\sqrt{d_k}}\right) \end{align}\]

Here, \(Q_{n+1}\) is the query vector for the \((n+1)\)-th position, and \(K_j\) and \(V_j\) are the key and value vectors for the \(j\)-th position, respectively.

Position Embedding

To add positional information, each position is mapped to an embedding of size \(d\), which is simply added to the input embedding of \(x_i\). The position embeddings can be either learnable or fixed. One common fixed choice is the sinusoidal embedding.

The sinusoidal embeddings are defined to allow the model to easily learn to attend by relative positions. The embedding for a position \(pos\) and dimension \(i\) is defined as:

\[\begin{align} PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \end{align}\]
\[\begin{align} PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \end{align}\]

Here:

  • \(pos\) is the position.
  • \(i\) is the dimension.
  • \(d\) is the total dimension of the embeddings.

This results in a unique vector for each position \(pos\), and the same position \(pos\) will always produce the same embedding. These position embeddings are added to the input embeddings to incorporate positional information into the model.
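
A minimal sketch of how these fixed embeddings could be generated, assuming an even embedding dimension \(d\) (the helper name is my own):

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d)).
    Assumes d is even."""
    pos = np.arange(seq_len)[:, None]              # shape [seq_len, 1]
    i = np.arange(d // 2)[None, :]                 # shape [1, d/2]
    angles = pos / np.power(10000.0, 2 * i / d)    # shape [seq_len, d/2]
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# Added to the input embeddings before the first attention layer:
# inputs = token_embeddings + sinusoidal_position_embeddings(seq_len, d)
```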

In the context of the attention matrix, which computes attention scores between query and key vectors, sinusoidal position embeddings enhance the model's ability to attend to tokens based on their relative positions: for any fixed offset \(k\), \(PE_{pos+k}\) is a linear function of \(PE_{pos}\), so the attention scores can capture the relative distances between tokens in addition to their semantic similarities. This relative information is crucial for tasks where understanding the sequential relationship between tokens is essential, such as language modeling or machine translation.

Strictly speaking, sinusoidal embeddings encode absolute positions, but because relative offsets correspond to simple linear transformations of the embeddings, the Transformer's attention mechanism can learn and exploit relative positional dependencies effectively.

Masked Language Model

So far, we've seen that each token's embedding becomes richer at each step based on the context of the other tokens. There are other components used to build Transformer blocks, but for now we assume that each token ends up with some embedding vector after several steps, which we call the output. On top of each output vector, there is a layer that produces a probability distribution over all tokens in the vocabulary.

The Masked Language Model (MLM) is an objective designed to predict masked tokens. In MLM, 15% of the tokens are masked at random, and the model tries to predict these masked tokens. The objective is to minimize the cross-entropy loss for the masked tokens, using the probability distribution described above.

MLM Objective and Cross-Entropy Loss

Given a sequence \(x = (x_1, x_2, \ldots, x_n)\), let \(\hat{x}\) be the sequence with some tokens masked (replaced by \([MASK]\)). The model generates output embeddings \(h = (h_1, h_2, \ldots, h_n)\) for the sequence. The probability distribution over the vocabulary for the original token at a masked position \(i\) is obtained by applying a softmax function to the output vector \(h_i\):

\[\begin{align} P(x_i = w \mid \hat{x}) = \frac{\exp(h_i \cdot W_w)}{\sum_{v \in V} \exp(h_i \cdot W_v)} \end{align}\]

where \(W_w\) is the embedding of token \(w\) in the vocabulary \(V\).

The MLM objective is to minimize the cross-entropy loss for predicting the masked tokens. The loss for a single masked token is:

\[\begin{align} \mathcal{L}_i = -\log P(x_i \mid \hat{x}) \end{align}\]

The total MLM loss for the sequence is the average loss over all masked tokens:

\[\begin{align} \mathcal{L}_{MLM} = \frac{1}{|\text{MASK}|} \sum_{i \in \text{MASK}} \mathcal{L}_i \end{align}\]

where \(\text{MASK}\) denotes the set of positions of masked tokens in the sequence.

In summary, the MLM objective trains the model to predict masked tokens by minimizing the cross-entropy loss, thereby improving the model's ability to understand and generate language.
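
A minimal NumPy sketch of this loss computation (the names and shapes are assumptions for illustration):

```python
import numpy as np

def mlm_loss(H, W_vocab, target_ids, masked_positions):
    """Average cross-entropy over the masked positions only.

    H:                output vectors h_1..h_n, shape [n, d]
    W_vocab:          token embedding matrix, shape [|V|, d]
    target_ids:       original (unmasked) token ids, shape [n]
    masked_positions: indices of the tokens replaced by [MASK]
    """
    logits = H @ W_vocab.T                                  # shape [n, |V|]
    logits -= logits.max(axis=-1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    losses = [-log_probs[i, target_ids[i]] for i in masked_positions]
    return np.mean(losses)                                  # L_MLM
```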

As you can see, there is no backpropagation through time in Transformers, so they can be parallelized much more easily than LSTMs. However, the complexity is \(O(n^2)\), because the attention matrix must be computed. As mentioned, the fraction of tokens randomly chosen to be MASK is 15%.

Note that, for our purpose, decreasing this percentage increases the computation required. Fewer masked tokens mean fewer terms in the cross-entropy loss, so each pass over the data provides less learning signal to the model. Consequently, to achieve the same amount of learning, we need to run more epochs, i.e., perform more computation.

(Cross) Encoder-Decoder Attention

In the Transformer architecture, the encoder applies a self-attention mechanism over the input data (with position embeddings) to create a vector for each token in the input sequence. The encoder uses full attention, meaning each token attends to all other tokens.

For example, in a translation task, we first encode every input token. To decode (translate), for each token in the output sequence, the decoder first uses masked self-attention to obtain the output embedding for the current position; this way, the model focuses only on what has been generated so far (i.e., the tokens behind the current position). The decoder then uses these embeddings as queries (\(q\)), while the keys (\(K\)) and values (\(V\)) come from the encoder. The similarities between \(q\) and each key are used as weights for averaging over the values. This second step is known as cross-attention, because the queries and the keys/values come from different sequences.

Cross-Attention in Decoder

After obtaining the output embeddings from masked self-attention, cross-attention is applied. Here, the query vectors \(q_i\) come from the decoder's previous layer, and the key and value vectors come from the encoder's output:

\[\begin{align} q_i = \text{DecoderOutput}_i W_q, \quad k_j = \text{EncoderOutput}_j W_k, \quad v_j = \text{EncoderOutput}_j W_v \end{align}\]

The cross-attention score is calculated as:

\[\begin{align} \text{CrossAttentionScore}(i, j) = \frac{q_i \cdot k_j^T}{\sqrt{d_k}} \end{align}\]

The cross-attention weights are:

\[\begin{align} \text{CrossAttentionWeight}(i, j) = \text{softmax}\left(\frac{q_i \cdot k_j^T}{\sqrt{d_k}}\right) \end{align}\]

Finally, the cross-attention output for each token \(i\) is:

\[\begin{align} h_i = \sum_{j} \text{CrossAttentionWeight}(i, j) \cdot v_j \end{align}\]

This cross-attention mechanism allows the decoder to utilize the encoded information from the encoder while generating the output sequence.
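
The same recipe in a short NumPy sketch (the decoder/encoder states and weight matrices are assumed inputs, not part of the original post):

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder output."""
    Q = decoder_states @ W_q        # shape [m, d_k], m = target length
    K = encoder_states @ W_k        # shape [n, d_k], n = source length
    V = encoder_states @ W_v        # shape [n, d_v]
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # CrossAttentionScore(i, j)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # CrossAttentionWeight(i, j)
    return weights @ V                                       # h_i for each target position
```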

Output Likelihood

In the Masked Language Model (MLM), we saw that the objective is the negative log likelihood of the masked tokens. Assigning probabilities works the same way for the output tokens, but in an encoder-decoder design, where we generally use the model as a generator, we try to predict the next word. The decoder is fed the target sequence shifted right, with a \([Start]\) token added at the beginning to start the generation process, and the original target tokens serve as labels.

Based on the probability of each token, which is obtained from the encoder's information and from cross-attention over what has been generated so far in the output, we decide which token should come next. The training objective is to minimize the negative log likelihood.

Formula for Encoder-Decoder Generation

Let's say the final decoder output for each position \(i\) is:

\[\begin{align} h_i^{\text{final}} = \sum_{j} \text{CrossAttentionWeight}(i, j) \cdot v_j \end{align}\]

Output Probability and Loss:

  • Apply a linear layer followed by a softmax to get the probability distribution over the vocabulary for each token: \[\begin{align} P(y_i \mid y_{<i}, x) = \text{softmax}(h_i^{\text{final}} W_o + b_o) \end{align}\]

  • The training objective is to minimize the negative log likelihood of the true tokens in the sequence: \[\begin{align} \mathcal{L} = -\sum_{i=1}^{m} \log P(y_i \mid y_{<i}, x) \end{align}\]

By minimizing this loss, the model learns to generate the next token in the sequence based on the previously generated tokens and the encoded input sequence.
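
As a hedged sketch, the per-sequence loss could be computed as follows; the projection \(W_o, b_o\) and the shifted-right convention follow the description above, while the function and variable names are illustrative:

```python
import numpy as np

def sequence_nll(H_final, W_o, b_o, target_ids):
    """Negative log likelihood of the target tokens y_1..y_m.

    H_final:    final decoder states, shape [m, d]
    W_o, b_o:   output projection to vocabulary logits
    target_ids: true next tokens; the decoder input is the same sequence
                shifted right, starting with the [Start] token
    """
    logits = H_final @ W_o + b_o                             # shape [m, |V|]
    logits -= logits.max(axis=-1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -sum(log_probs[i, t] for i, t in enumerate(target_ids))
```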