
Mistral vs Llama2: A Comparative Analysis of Large Language Models

In the ever-evolving landscape of natural language processing, large language models have become pivotal in shaping the way machines understand and generate human-like text. Two noteworthy contenders in this arena are Mistral-7B-v0.1 and Llama2, each boasting impressive capabilities and unique features. In this blog post, we'll delve into the details of these models, comparing their architectures, benchmarks, and applications.

Mistral-7B-v0.1: Unleashing the Power of 7 Billion Parameters

Overview: Mistral-7B-v0.1 stands as a testament to the prowess of generative text models, wielding a formidable 7 billion parameters. This large language model (LLM) outshines its counterpart, Llama2 13B, across various benchmarks, showcasing its superior performance in the realm of natural language understanding and generation.

Model Architecture: At its core, Mistral-7B-v0.1 is a transformer model, incorporating innovative architectural choices to enhance its capabilities:

  1. Grouped-Query Attention: Instead of giving every query head its own key/value head, Mistral shares each key/value head across a group of query heads. This shrinks the key/value cache and reduces memory bandwidth at inference time while retaining most of the quality of full multi-head attention.

  2. Sliding-Window Attention: Each token attends only to a fixed-size window of preceding tokens rather than to the entire sequence. Because the receptive field grows layer by layer, information can still flow across long contexts, so the model handles long inputs at a lower compute and memory cost than full attention. (Both choices are visible in the model configuration, as the sketch after this list shows.)

  3. Byte-Fallback BPE Tokenizer: Mistral uses a Byte Pair Encoding (BPE) tokenizer that falls back to raw bytes for characters outside its vocabulary, so any input text can be encoded without producing unknown tokens.
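
The snippet below is a small, optional sketch that reads these architectural choices straight out of the published Hugging Face configuration; the values in the comments are those of the v0.1 release and may differ in later checkpoints.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Grouped-query attention: fewer key/value heads than query heads.
print(config.num_attention_heads)   # 32 query heads
print(config.num_key_value_heads)   # 8 shared key/value heads

# Sliding-window attention: each token attends to at most this many preceding tokens per layer.
print(config.sliding_window)        # 4096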

Performance: Mistral-7B-v0.1 has demonstrated strong results across a broad set of benchmarks, surpassing Llama2 13B on the evaluations reported by its authors. For detailed numbers, readers are encouraged to refer to the accompanying paper and release blog post.

Llama2: Elevating Dialogue with Large Language Models

Overview: Llama2, developed by a team of researchers at Meta, introduces a collection of foundation language models ranging from 7 billion to 70 billion parameters. With a focus on chat applications, Llama2 sets out to redefine the landscape of dialogue-based language models.

Model Architecture: Llama2 encompasses a series of pretrained and fine-tuned large language models, with Llama2-Chat fine-tuned specifically for dialogue use cases through supervised fine-tuning and reinforcement learning from human feedback (RLHF).

Contributions to the Community: The Llama2 project goes beyond showcasing its models' prowess. In the spirit of fostering community collaboration and responsible development, the team provides detailed insights into their fine-tuning approach and safety improvements. This transparent approach encourages the wider community to build upon their work and contribute to the ethical evolution of large language models.

Performance and Safety: Llama2-Chat has exhibited superior performance in dialogue-based applications, surpassing open-source chat models in various benchmarks. Human evaluations for helpfulness and safety indicate that Llama2-Chat may serve as a viable alternative to closed-source models, further emphasizing its potential in real-world applications.

Calculating ROC AUC for Mistral-7B-v0.1 and Llama2: A Deep Dive into Model Comparison

from datasets import load_dataset, Dataset
from transformers import Pipeline, pipeline
from transformers.pipelines.pt_utils import KeyDataset

import numpy as np
import torch
import sklearn.metrics

mistral = "mistralai/Mistral-7B-v0.1"
llama2 = "meta-llama/Llama-2-7b-hf"

class PairSimilarityPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Llama/Mistral tokenizers do not define a pad token; reuse the EOS token for padding.
        self.tokenizer.pad_token = self.tokenizer.eos_token
        preprocess_kwargs = {}
        if "second_text" in kwargs:
            preprocess_kwargs["second_text"] = kwargs["second_text"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, example):
        result = {}

        first_text = example['first_text']
        second_text = example['second_text']

        first_inputs = self.tokenizer(first_text, padding="max_length", truncation=True, max_length=50, return_tensors='pt')
        for k, v in first_inputs.items():
            result[f"first_{k}"] = v

        second_inputs = self.tokenizer(second_text, padding="max_length", truncation=True, max_length=50, return_tensors='pt')
        for k, v in second_inputs.items():
            result[f"second_{k}"] = v
        return result

    def _forward(self, model_inputs):
        # Run the base transformer (self.model.model, i.e. the model without its LM head)
        # on each side of the pair separately.
        first_model_output = self.model.model(**{k.replace("first_", ""): v for k, v in model_inputs.items() if "first_" in k})
        second_model_output = self.model.model(**{k.replace("second_", ""): v for k, v in model_inputs.items() if "second_" in k})
        return {'first_model_embedding': first_model_output[0],
                'first_attention_mask': model_inputs['first_attention_mask'],
                'second_model_embedding': second_model_output[0],
                'second_attention_mask': model_inputs['second_attention_mask']}

    def mean_pooling(self, model_embedding, attention_mask):
        # Average the token embeddings over non-padded positions, then L2-normalize.
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(model_embedding.size()).float()
        mean_embedding = torch.sum(model_embedding * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return torch.nn.functional.normalize(mean_embedding, p=2, dim=1)

    def postprocess(self, model_outputs):
        normalized_first_vector = self.mean_pooling(**{k.replace("first_", ""): v for k, v in model_outputs.items() if "first_" in k})
        normalized_second_vector = self.mean_pooling(**{k.replace("second_", ""): v for k, v in model_outputs.items() if "second_" in k})

        # Dot product of L2-normalized vectors is the cosine similarity; clamp negatives to zero.
        similarity = torch.sum(normalized_first_vector * normalized_second_vector, dim=1).item()
        return np.max([similarity, 0])

As the field of natural language processing continues to evolve, the need for robust methods to evaluate and compare the performance of large language models becomes increasingly critical. In this segment, we'll explore the implementation of a Receiver Operating Characteristic (ROC) analysis to compare the effectiveness of Mistral-7B-v0.1 and Llama2 in a pair similarity task.

Understanding the Code: The provided Python code is a snippet of a Pair Similarity Pipeline, which is a crucial component for evaluating the similarity between pairs of texts. Let's break down the key components of the code:

  1. PairSimilarityPipeline Class: Inherits from the Hugging Face transformers Pipeline class. Tokenization is performed with the model's own tokenizer, with padding and truncation to a fixed maximum length.

  2. Preprocessing: The preprocess method tokenizes and pads both input texts, preparing them for similarity assessment. The first and second texts are processed separately and returned as tensors prefixed with first_ and second_.

  3. Forward Method: The _forward method passes the tokenized inputs through the model. Model outputs for the first and second texts are obtained.

  4. Mean Pooling: The mean_pooling method calculates the mean embedding for each input text, considering attention masks for weighted pooling.

  5. Postprocessing: The postprocess method mean-pools and L2-normalizes the token embeddings of each text, then computes the dot product of the normalized vectors, which is their cosine similarity; negative values are clamped to zero. A toy numeric illustration follows this list.
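
To make the pooling and similarity steps concrete, here is a minimal, self-contained toy example (with made-up numbers) that mirrors the logic of mean_pooling and postprocess above:

import torch
import torch.nn.functional as F

# Toy token embeddings for one text: two real tokens and one padding position.
embeddings = torch.tensor([[[1.0, 0.0],
                            [0.0, 1.0],
                            [9.9, 9.9]]])   # the padded position is ignored below
mask = torch.tensor([[1, 1, 0]])            # attention mask: last position is padding

expanded = mask.unsqueeze(-1).expand(embeddings.size()).float()
pooled = (embeddings * expanded).sum(1) / expanded.sum(1).clamp(min=1e-9)
# pooled == [[0.5, 0.5]]: the padded position contributes nothing.

first = F.normalize(pooled, p=2, dim=1)
second = F.normalize(torch.tensor([[1.0, 1.0]]), p=2, dim=1)
print((first * second).sum(dim=1))          # cosine similarity, here 1.0

With the pipeline class defined, the two pipelines are instantiated next.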

mistral_pipeline = pipeline(model=mistral,
                            pipeline_class=PairSimilarityPipeline,
                            device_map='auto',
                            model_kwargs={'load_in_4bit':True},
                            torch_dtype=torch.bfloat16)

llama2_pipeline = pipeline(model=llama2,
                           pipeline_class=PairSimilarityPipeline,
                           device_map='auto',
                           model_kwargs={'load_in_4bit':True},
                           torch_dtype=torch.bfloat16)

mistral_pipeline({"first_text":"I'm happy", "second_text":"negative"})
llama2_pipeline({"first_text":"I'm happy", "second_text":"negative"})

The provided code demonstrates the application of pair similarity pipelines for Mistral-7B-v0.1 and Llama2 models. These pipelines are designed to evaluate the similarity between pairs of texts. The configuration involves setting up the pipelines for each model and then using them to assess the similarity between a given pair of texts.

  • Model Selection: The Mistral-7B or Llama2-7B model is chosen for similarity assessment.
  • Pipeline Class: The PairSimilarityPipeline class, designed for text pair evaluation, is employed.
  • Device Assignment: The code automatically assigns the device for model execution.
  • Model Configuration: Additional model-specific arguments are set, such as loading the model in 4-bit precision, and using bfloat16 data type.

The code showcases a streamlined process for implementing text pair similarity assessments using Mistral-7B-v0.1 and Llama2 models. By leveraging pre-configured pipelines, users can efficiently evaluate the semantic similarity between pairs of texts without dealing with intricate model-specific details. This approach enhances accessibility and usability, making these powerful language models more practical for various applications requiring text similarity evaluations.
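
As a quick, hypothetical sanity check (the scores are not from a real run and will vary with the model, quantization, and hardware), the pipelines can be probed with an obviously related pair and an obviously unrelated one:

# Hypothetical sanity check; we expect the related pair to score higher.
related = mistral_pipeline({"first_text": "I'm happy", "second_text": "positive"})
unrelated = mistral_pipeline({"first_text": "I'm happy", "second_text": "negative"})
print(related, unrelated)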

dataset = load_dataset("smangrul/amazon_esci")

In the provided dataset, each entry is a dictionary with various fields, and we are interested in extracting specific information for calculating the Receiver Operating Characteristic (ROC). The relevant fields are 'query,' 'product_title,' and 'relevance_label.' Let's break down how to extract this information:

  1. 'query' Field: This field contains the query text, representing the user's search input.
  2. 'product_title' Field: Represents the title of the product being queried.
  3. 'relevance_label' Field: Indicates the relevance of the product to the user's query. In this case, it is a binary label (1 for relevant, 0 for irrelevant). A quick inspection of a single record, shown below, confirms these fields.
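
As a minimal sketch (the printed values depend on the dataset version and split), one validation record can be inspected like this:

example = dataset['validation'][0]
print(example['query'])
print(example['product_title'])
print(example['relevance_label'])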

The code below defines a PairKeyDataset class that transforms an existing dataset into a pair dataset suitable for calculating cosine similarity between embeddings of text pairs. Let's break down the code to understand how it achieves this:

class PairKeyDataset(torch.utils.data.Dataset):
    """Wraps a dataset so that each item is a {"first_text", "second_text"} pair built from two of its fields."""

    def __init__(self, dataset: Dataset, key1: str, key2: str):
        self.dataset = dataset
        self.key1 = key1
        self.key2 = key2

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        return {"first_text": self.dataset[i][self.key1], "second_text": self.dataset[i][self.key2]}

pair_key_dataset = PairKeyDataset(dataset['validation'], "query", "product_title")

The provided Python code defines a class called PairKeyDataset which serves the purpose of transforming an existing dataset into a pair dataset. This transformation is particularly useful for tasks that involve working with pairs of texts, such as training models for text similarity or retrieval.

In essence, the class takes in an original dataset along with two keys or field names (key1 and key2). These keys specify which fields in the dataset represent the first and second texts in the text pairs. The class provides methods to determine the length of the dataset and to retrieve a pair of texts for a given index. The output of the __getitem__ method is a dictionary with keys "first_text" and "second_text," representing the pair of texts.

Upon instantiation of this class, an instance named pair_key_dataset is created using a subset of the original dataset (specifically the validation subset). The keys "query" and "product_title" are specified as key1 and key2, respectively. This process effectively prepares the data for subsequent tasks, enabling the efficient creation and utilization of text pairs for model training or evaluation.

In summary, the PairKeyDataset class offers a structured and convenient approach to handling datasets for tasks that involve pairs of texts, contributing to the clarity and efficiency of subsequent operations on the data.
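
For example, indexing the wrapper returns a ready-made pair (the values in the comment are placeholders, not actual dataset entries):

pair = pair_key_dataset[0]
print(pair)  # e.g. {'first_text': '<query text>', 'second_text': '<product title>'}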

To calculate the ROC score we also need the true labels alongside the Mistral and Llama2 similarity results. Rather than defining a new dataset class, we use the transformers KeyDataset utility to pull the "relevance_label" field out of the validation split and iterate over it in lockstep with the two similarity pipelines. This lets us evaluate how well each model's similarity score predicts the relevance of a text pair.

true_labels = KeyDataset(dataset['validation'], "relevance_label")
mistral_similarities_iter = mistral_pipeline(pair_key_dataset, batch_size=32)
llama2_similarities_iter = llama2_pipeline(pair_key_dataset, batch_size=32)

mistral_list = []
llama2_list = []
label_list = []
# Walk the two similarity streams and the true labels in lockstep.
for mistral_score, llama2_score, label in zip(mistral_similarities_iter, llama2_similarities_iter, true_labels):
    mistral_list.append(mistral_score)
    llama2_list.append(llama2_score)
    label_list.append(label)

print("Llama2 ROC AUC: ", sklearn.metrics.roc_auc_score(label_list, llama2_list))
print("Mistral ROC AUC:", sklearn.metrics.roc_auc_score(label_list, mistral_list))

This segment of code is designed to calculate the Receiver Operating Characteristic Area Under the Curve (ROC AUC) scores for Mistral-7B-v0.1 and Llama2 models in the context of text similarity. Here's an overview:

True Labels Initialization: Creates a dataset containing only the true labels ("relevance_label") from the validation subset.

Pipeline Execution: Invokes Mistral and Llama2 pipelines to obtain similarity scores for text pairs from the pair_key_dataset with a specified batch size.

List Initialization for Results: Initializes empty lists (mistral_list, llama2_list, label_list) to store Mistral scores, Llama2 scores, and true labels during iteration.

Iterating Over Pairs: Iterates through the similarity score iterators (mistral_similarities_iter and llama2_similarities_iter) along with the iterator for true labels (true_labels). Collects Mistral scores, Llama2 scores, and true labels into corresponding lists.

Calculating ROC Scores: Utilizes the roc_auc_score function from the sklearn.metrics module to calculate ROC AUC scores. Calculates ROC AUC scores for both Llama2 (label_list, llama2_list) and Mistral (label_list, mistral_list).

This code is integral for assessing the models' performance in predicting the relevance of text pairs. The ROC AUC scores provide a comprehensive evaluation, considering the trade-off between true positive rate and false positive rate, offering insights into the models' ability to distinguish between relevant and irrelevant text pairs.
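
For intuition, a ROC AUC of 1.0 means the similarity scores perfectly separate relevant from irrelevant pairs, while 0.5 is no better than chance. The minimal, self-contained example below (with made-up labels and scores) shows the metric in isolation:

import sklearn.metrics

labels = [1, 0, 1, 0, 1]            # true relevance labels
scores = [0.9, 0.2, 0.8, 0.4, 0.3]  # hypothetical similarity scores
print(sklearn.metrics.roc_auc_score(labels, scores))  # 0.833...: 5 of the 6 relevant/irrelevant pairings are ranked correctly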

Conclusion:

In this blog journey, we delved into the Mistral-7B-v0.1 and Llama2 models, exploring their architectures and applications. As we transitioned to practical implementation, transforming the dataset to create pairs of texts for cosine similarity evaluation, we aimed to draw meaningful insights into their performance.

The blog's focal point was the calculation of ROC AUC scores, a quantitative measure indicating the models' efficacy in predicting text pair relevance. Following this evaluation, Mistral-7B-v0.1 emerged with a notable ROC AUC score of 0.86, showcasing its prowess in capturing semantic relationships between text pairs. On the other hand, Llama2, while achieving a respectable score of 0.6, demonstrated a comparatively lower discriminatory ability.

The structured code emphasized a systematic approach to model evaluation, highlighting the significance of established metrics like ROC AUC in assessing natural language processing models. In the Mistral vs Llama2 showdown, Mistral-7B-v0.1's innovative architecture and superior benchmark performance position it as a powerful contender. Meanwhile, Llama2's strength lies in its collaborative, dialogue-focused approach and community-driven development.

The decision between Mistral and Llama2 depends on the specific requirements of the task at hand. Mistral's benchmark dominance makes it a robust choice for various applications, while Llama2's dialogue optimization and collaborative development approach offer unique advantages. As we navigate the dynamic landscape of language models, these evaluations become integral for making informed decisions when deploying models for real-world applications.