Fine-Tune LLMs with Azure & MLflow: Data Science

Motivation

In this blog post, we will explore best practices for fine-tuning Large Language Models (LLMs) using Azure. We'll address how to get started, as the initial steps may not be clear for data scientists, and mishandling cloud resources could lead to suboptimal results. We'll discuss various approaches and ways to leverage the cloud effectively for LLM fine-tuning.

Our focus will be on the LoRA (Low-Rank Adaptation) approach, and we assume you're already familiar with this concept. We'll guide you through three different pipelines, introduce the controller concept, and demonstrate how to develop processes following data science best practices.

Throughout this post, we'll cover:

  • The latest developments in Azure
  • Three distinct pipelines for efficient fine-tuning
  • The controller concept and its importance

Development Process in Data Science

When it comes to data science development, many of us immediately think of Jupyter Notebooks. While Jupyter is great, there are cloud-based alternatives like Google Colab and Azure Machine Learning notebooks that can offer distinct advantages depending on your project needs.

For enterprise and production-grade machine learning (ML) tasks, Azure Machine Learning notebooks are often a better choice than Colab. Azure offers robust integration with other Azure services, scalability, and strong support for CI/CD pipelines, large datasets, and enterprise-grade security. It also lets you reuse pre-built environments or datasets you created earlier in the production process. The trade-off is that Azure Machine Learning notebooks require more setup and familiarity with the Azure ecosystem.

To get started, you'll first need to build your environment. Here's how you can create a dev_env.yml file to set up an environment for large language model (LLM) fine-tuning, including the necessary packages:

name: dev_env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip=24.0
  - pip:
    - bitsandbytes==0.43.1
    - transformers~=4.41
    - peft~=0.11
    - accelerate~=0.30
    - trl==0.9.4
    - einops==0.8.0
    - datasets==2.20.0
    - wandb==0.17.2
    - mlflow==2.15.0
    - azureml-mlflow==1.56.0
    - azureml-sdk==1.56.0
    - torchvision==0.18.1
    - ipykernel~=6.0
    - azure-ai-ml
    - debugpy~=1.6.3

Once your dev_env.yml file is ready, follow these steps to create and activate the environment in Azure:

  • Open the Notebooks panel in your Azure Machine Learning workspace.
  • Upload the dev_env.yml file.
  • In a terminal window, run the following commands to create the environment and link it to your notebooks:
conda env list
conda env create -f dev_env.yml
conda activate dev_env
conda env list
python -m ipykernel install --user --name dev_env --display-name "Dev Env"

Now, you've successfully created a kernel with all the necessary packages installed. You can start creating notebooks using this environment and work on your .ipynb files with ease.

In Azure Machine Learning, you may have datasets stored as assets, which you can easily download into your notebook container. Let's say you have a data asset named poem_assets. Here's how you can retrieve it within your notebook:

  • Set Up Azure Credentials: First, you'll need to provide your subscription ID, resource group, and workspace details. These are essential for authenticating and connecting to your Azure ML workspace.

  • Use the Azure ML SDK to Access and Download Data: The following code demonstrates how to authenticate, connect to the workspace, and download the poem_assets data asset into your notebook container.

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.utils import artifact_utils

# Define your Azure ML workspace details
SUBSCRIPTION_ID = "***"
RESOURCE_GROUP = "***"
WORKSPACE_NAME = "***"
LOCATION = "***"

# Authenticate using DefaultAzureCredential
credential = DefaultAzureCredential()

# Connect to the Azure ML workspace
workspace_ml_client = MLClient(
   credential,
   subscription_id=SUBSCRIPTION_ID,
   resource_group_name=RESOURCE_GROUP,
   workspace_name=WORKSPACE_NAME
)

# Fetch the data asset by its name and version
data_asset = workspace_ml_client.data.get("poem_assets", version="1")

# Download the asset to a local directory
artifact_utils.download_artifact_from_aml_uri(
   uri=data_asset.path,
   destination="./mlasset/",
   datastore_operation=workspace_ml_client.datastores
)
  • Authentication: The DefaultAzureCredential class simplifies the process of authentication. It can automatically pick the right credential type depending on the environment (e.g., environment variables, Azure CLI login, or managed identity).

  • Connect to the Workspace: The MLClient is used to connect to the Azure ML workspace. You’ll need to provide the workspace details, including your subscription ID, resource group, and workspace name.

  • Retrieve and Download the Data Asset: The get method fetches the data asset poem_assets, and you can specify the version of the asset (in this case, version 1). The download_artifact_from_aml_uri function downloads the dataset to the specified local directory (e.g., ./mlasset/).

Now, the data asset is available in your notebook container, and you can proceed with your analysis or model development.

When running your pipeline, creating infrastructure, or registering datasets and models, you need a starting point. This is typically done via the Azure CLI, the Python SDK within a notebook, an Azure DevOps pipeline, or another CI/CD pipeline. We refer to this concept as a controller, since it lets you command the creation of resources and the execution of pipelines. In our case, we create a run.ipynb file where we define our requirements and run cells to control the pipeline and the other components.

Creating Different Clients for Various Resources in the Controller

When working with the Python Azure SDK (v2), different tasks require different types of clients. For instance, to work with machine learning resources, you use the MLClient class, while for working with a registry, you would use another specialized client. This modular approach allows you to handle various Azure resources efficiently.

For example, there are pre-built models like Mistral, LLaMA, and others that you can use by creating an Azure ML Registry client. Here’s an example of how to create clients for different Azure resources, such as Azure Machine Learning, Azure Container Registry (ACR), and the Resource Management service.

import os
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.containerregistry import ContainerRegistryManagementClient
from azure.mgmt.containerregistry.models import Registry, Sku
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace

# Define environment variables and resource identifiers
SUBSCRIPTION_ID = "your-subscription-id"
RESOURCE_GROUP = "your-resource-group"
WORKSPACE_NAME = "llm-workspace"
LOCATION = "northeurope"
ACR_NAME = "ChatCompletionLLMRegistryACR"
HUGGINGFACE_TOKEN_READ = os.getenv("HUGGINGFACE_TOKEN_READ")

# Authentication: Attempt to use DefaultAzureCredential or fall back to InteractiveBrowserCredential
try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    print("Using interactive browser for authentication.")
    credential = InteractiveBrowserCredential()

# Create an ML Client for managing machine learning resources
ml_client = MLClient(
    credential,
    subscription_id=SUBSCRIPTION_ID,
    resource_group_name=RESOURCE_GROUP
)

# Create an ML Client for working with the Azure ML Registry
registry_ml_client = MLClient(credential, registry_name="azureml")

# Create a Resource Management Client
resource_client = ResourceManagementClient(credential, SUBSCRIPTION_ID)

# Create a Container Registry Management Client
acr_client = ContainerRegistryManagementClient(credential, SUBSCRIPTION_ID)

# Check if the Azure Container Registry exists, otherwise create it
try:
    registry = acr_client.registries.get(RESOURCE_GROUP, ACR_NAME)
    print(f"Registry {ACR_NAME} already exists")
except Exception as ex:
    print(f"Creating registry {ACR_NAME}")
    registry = acr_client.registries.begin_create(
        RESOURCE_GROUP,
        ACR_NAME,
        Registry(
            sku=Sku(name="Basic"),
            admin_user_enabled=True,
            location=LOCATION,
        ),
    ).result()

# Check if the workspace exists, otherwise create it with the container registry
try:
    ws = ml_client.workspaces.get(WORKSPACE_NAME)
    print(f"Workspace {WORKSPACE_NAME} already exists with container registry {ws.container_registry}")
except Exception as ex:
    print(f"Creating workspace {WORKSPACE_NAME} with container registry {ACR_NAME}")
    _ws = Workspace(
        name=WORKSPACE_NAME,
        location=LOCATION,
        container_registry=registry.id
    )
    ws = ml_client.workspaces.begin_create_or_update(_ws).result()
    print(f"Workspace {WORKSPACE_NAME} created with container registry {ws.container_registry}")

# Create a client for the workspace
workspace_ml_client = MLClient(
    credential,
    subscription_id=SUBSCRIPTION_ID,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=ws.name,
)
  • Authentication: The code attempts to authenticate using DefaultAzureCredential. If that fails (e.g., in local environments), it falls back to InteractiveBrowserCredential for interactive login.

  • ML Client: We create an MLClient for working with Azure Machine Learning, which helps manage tasks like model training, dataset registration, and more.

  • Azure ML Registry Client: Another MLClient is created for accessing the Azure ML registry, where you can manage pre-built models like Mistral and LLaMA.

  • Container Registry: The ContainerRegistryManagementClient is used to check if a container registry exists and create one if needed. This is essential for container-based ML model deployments.

  • Workspace Management: We ensure that the Azure ML workspace exists, creating it if necessary, and then associate the workspace with the Azure Container Registry.

This setup ensures that each resource has a dedicated client, allowing you to handle different aspects of Azure services within the controller.

In the above example, we create an Azure Container Registry (ACR) and link it to the Azure ML workspace. It’s important to clarify that Azure Container Registry and Azure ML Registry serve different purposes, and it’s crucial not to confuse them:

  • Azure Container Registry (ACR): This is where you store and manage Docker container images. In the context of machine learning, you often use ACR to store Docker images that are required to run machine learning models or manage compute environments for model training and deployment. In the example, we create and link this ACR to the Azure ML workspace to ensure it can be used for these purposes.

  • Azure ML Registry: This, on the other hand, is used to store and manage machine learning models, datasets, and components across different workspaces. Azure ML Registry is more about managing ML artifacts rather than container images. In this example, we’re not using the Azure ML Registry, but instead, focusing on Azure Container Registry.

When configuring your workspace, ensure that you are linking Azure Container Registry (ACR) to the workspace, as shown in the example. The ACR helps manage Docker images needed for running ML experiments and deploying models, whereas Azure ML Registry serves a different purpose related to artifact management.

By clearly distinguishing between these two services, you can ensure that your infrastructure is set up correctly for the intended purpose.

For fine-tuning, we use the pre-trained model mistralai-Mistral-7B-v01 from the Azure foundation models registry. Here's an explanation of the steps in the code snippet below:

model_name = "mistralai-Mistral-7B-v01"
foundation_model = registry_ml_client.models.get(model_name, label="latest")
print("\n\nUsing model name: {0}, version: {1}, id: {2} ".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    ))
  • Model Name: mistralai-Mistral-7B-v01, a large pre-trained foundation model from the Mistral family of models.

  • Retrieving the Model: the Azure Machine Learning registry client (registry_ml_client) fetches the details of the latest version of this model from the Azure foundation models registry. Specifying label="latest" ensures you always use the most up-to-date version of the model.

  • Printing Model Details: once retrieved, the code prints key information about the model, such as:
    • name: the name of the model (in this case, mistralai-Mistral-7B-v01).
    • version: the version number of the model.
    • id: the unique identifier for this model version in the registry.

Creating Environments in the Controller

In your controller, you are managing multiple environments for different use cases, specifically one for CPU-based tasks and another for GPU-based tasks. This allows you to optimize your workflows depending on the required computational resources. Below is a detailed explanation of how to define and use these environments in Azure Machine Learning (AML) via the Python SDK.

Creating a CPU Environment

The CPU environment is defined using a conda YAML file. It includes essential Python packages and dependencies needed for data processing and model training on CPU.

YAML File for CPU Environment (cpu-conda.yaml):

name: prep_env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas
  - scikit-learn
  - pip
  - pip:
    - datasets
    - debugpy~=1.6.3
    - azure-identity
    - azure-keyvault-secrets
    - azureml-sdk==1.56.0

This YAML file specifies the packages required for the environment, including Python libraries like pandas and scikit-learn for data processing, as well as Azure SDKs for handling Azure services.

In your controller, you create this environment programmatically using the Environment class and linking it to the conda YAML file.

from azure.ai.ml.entities import Environment

# Define CPU environment from the conda YAML file
cpu_env = Environment(
    name="cpu_env",
    description="Environment for CPU",
    conda_file="dependencies/cpu-conda.yaml",  # Path to the YAML file
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",  # Base image
)

# Create or update the environment in Azure ML workspace
cpu_env = workspace_ml_client.environments.create_or_update(cpu_env)
  • Conda File: The environment's dependencies are specified in a cpu-conda.yaml file.
  • Base Image: The mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest Docker image is used as the base environment for this setup.

Creating a GPU Environment

For tasks that require GPU support (e.g., training deep learning models), a separate GPU environment is defined. This environment is built using a Dockerfile, providing more control over the setup.

Dockerfile for GPU Environment:

# PTCA image
FROM mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:64

USER root

# Update and upgrade system packages
RUN apt-get update && apt-get -y upgrade

# Upgrade pip
RUN pip install --upgrade pip

# Install necessary Python packages
RUN pip install peft~=0.11 --upgrade \
    azure-identity \
    azure-keyvault-secrets \
    azureml-sdk==1.56.0

This Dockerfile specifies a custom image for GPU workloads, installing essential packages such as peft, azure-identity, and azureml-sdk.

In the controller, you use the BuildContext to define the build context and Dockerfile for creating the GPU environment.

from azure.ai.ml.entities import Environment, BuildContext

# Define GPU environment from a Dockerfile
gpu_env = Environment(
    name="gpu_env",
    description="Environment for GPU",
    build=BuildContext(path="dependencies/gpu-env-context", dockerfile_path="Dockerfile"),
)

# Create or update the GPU environment in the Azure ML workspace
gpu_env = workspace_ml_client.environments.create_or_update(gpu_env)
  • Build Context: The BuildContext is used to specify the location of the Dockerfile and the build context directory (dependencies/gpu-env-context).
  • Dockerfile: The custom Dockerfile contains the instructions to set up the environment with GPU support and necessary packages.

  • CPU Environment: Defined using a conda YAML file, useful for tasks that don't require a GPU, like data preparation and smaller-scale model training.

  • GPU Environment: Built from a Dockerfile, tailored for GPU-intensive workloads like training deep learning models.

These environments can be seamlessly integrated into your Azure Machine Learning pipeline, ensuring that the right resources are used for the right tasks.

Preparation Pipeline for Fine-Tuning a GPT Model with Persian Poetry

In this pipeline, you're preparing a dataset to fine-tune a GPT model by structuring Persian poetry in a specific format. The goal is to combine every two lines of a poem with a tab (\t) in between, maintaining the unique structure of Persian poetry. Additionally, you concatenate every four lines into chunks, adding the poet's style to help the model learn how to generate in that style. The following is a step-by-step breakdown of the code and its components.

Imports and Argument Parsing:
  • Several libraries are used, including argparse, pandas, datasets, and the Azure SDKs for interacting with Azure Key Vault to securely access tokens.
  • Two command-line arguments are passed:
    • --raw_data: the path to the raw data.
    • --prep_data: the path where the prepped data will be saved.

Key Vault Credential Setup:
  • The code securely accesses an Azure Key Vault secret (HF-READ-TOKEN) using ManagedIdentityCredential, which authenticates without storing sensitive credentials in the code.

Function to Process CSV Files (make_poem_from_csv):
  • Reads a CSV file containing poems, combines every two rows with a tab (\t) between the lines, and then groups four of these new combinations together, separated by double newlines (\n\n).
  • Appends the style (the poet's name) to each chunk, which is useful when fine-tuning the model to capture the stylistic differences between poets.

Function to Process Text Files (make_poem_from_txt):
  • Similar to the CSV function, but for text files: it removes empty lines, combines every two lines with a tab (\t), and groups every four such combinations with double newlines.
  • The style is also added to the output for fine-tuning purposes.

File Reading and Dataset Creation:
  • The script reads the files in the args.raw_data directory. Depending on the file type (CSV or TXT), it processes each file with the appropriate function (make_poem_from_csv or make_poem_from_txt).
  • It supports multiple files, appending each result to the dataset.

Dataset Splitting and Saving:
  • The combined data is loaded into a Hugging Face Dataset object.
  • The dataset is split into training and testing sets (80% training, 20% testing).
  • A DatasetDict object is created to hold both train and test datasets.
  • Finally, the datasets are saved as .jsonl (JSON Lines) files, ready to be used for training the GPT model.

Detailed Code Explanation for prep.py:

import argparse
import os
from pathlib import Path
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient

# Argument parsing for raw and prepped data paths
parser = argparse.ArgumentParser("prep")
parser.add_argument("--raw_data", type=str, help="Path to raw data")
parser.add_argument("--prep_data", type=str, help="Path of prepped data")

args = parser.parse_args()

# Print out the paths
print("hello training world...")
print(f"Raw data path: {args.raw_data}")
print(f"Data output path: {args.prep_data}")

# List files in the raw data directory
arr = os.listdir(args.raw_data)
print("mounted_path files: ", arr)

# Azure Key Vault credential setup to access secret (e.g., Hugging Face token)
client_id = os.environ.get('DEFAULT_IDENTITY_CLIENT_ID')
credential = ManagedIdentityCredential(client_id=client_id)
secret_client = SecretClient(vault_url=os.environ.get('KV_URI'), credential=credential)
secret = secret_client.get_secret("HF-READ-TOKEN")
print("Secret value: ", secret.value)

# Function to process CSV files and prepare poem chunks
def make_poem_from_csv(path_to_csv, colname='Text', style='Shahname') -> list:
    df = pd.read_csv(path_to_csv)
    df['Combined'] = df[colname].shift(-1) + '\t' + df[colname]  # Combine every two lines
    df_combined = df.iloc[::2]  # Keep only the combined lines

    # Combine every 4 lines into a document-style chunk
    docs = []
    for i in range(0, len(df_combined), 4):
        doc_content = '\n\n'.join(df_combined['Combined'].iloc[i:i+4].dropna().tolist())
        docs.append({'style': style, 'output': doc_content})
    return docs

# Function to process TXT files and prepare poem chunks
def make_poem_from_txt(path_to_txt, style='Saadi') -> list:
    with open(path_to_txt, 'r') as file:
        lines = file.readlines()
    non_empty_lines = [line.strip() for line in lines if line.strip()]  # Remove empty lines

    # Combine lines into pairs
    docs = []
    combined_pairs = [f"{non_empty_lines[i]}\t{non_empty_lines[i+1]}" for i in range(0, len(non_empty_lines)-1, 2)]
    for i in range(0, len(combined_pairs), 4):
        doc_content = '\n\n'.join(combined_pairs[i:i+4])
        docs.append({'style': style, 'output': doc_content})
    return docs

# Reading and processing the raw data files
df_list = []
for filename in arr:
    if filename.endswith('.csv'):
        df_list.extend(make_poem_from_csv(os.path.join(args.raw_data, filename)))
    elif filename.endswith('.txt'):
        df_list.extend(make_poem_from_txt(os.path.join(args.raw_data, filename)))
    else:
        raise ValueError(f"File type not supported: {filename}")

# Create a Hugging Face dataset from the list of processed poems
dataset = Dataset.from_list(df_list)

# Split the dataset into train (80%) and test (20%) sets
train_test_ratio = 0.8
train_indices, test_indices = train_test_split(list(range(len(dataset))), test_size=1-train_test_ratio, shuffle=True)

# Select the train and test datasets
train_dataset = dataset.select(train_indices)
test_dataset = dataset.select(test_indices)

# Create a DatasetDict to hold train and test datasets
dataset_dict = DatasetDict({'train': train_dataset, 'test': test_dataset})

# Save the datasets to JSON lines format
for split in ['train', 'test']:
    dataset_dict[split].to_json(Path(args.prep_data) / f"{split}.jsonl")

Combining Lines: The code transforms every two lines of poetry into one line, and then groups four such lines together into chunks. This is aimed at training the GPT model to recognize this structure and generate poems in a similar format.

Adding Style Information: The style of the poet (e.g., "Shahname", "Saadi") is included in each record, so the model can learn to condition its output on the poet's style.
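
To make the structure concrete, here is an illustrative record produced by make_poem_from_txt from a hypothetical eight-line poem (line_1 through line_8 are placeholders, not real data):

# Illustrative only: two lines joined by a tab, four pairs joined by blank lines,
# plus the poet's style
example_record = {
    "style": "Saadi",
    "output": "line_1\tline_2\n\nline_3\tline_4\n\nline_5\tline_6\n\nline_7\tline_8",
}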

Training Pipeline for Fine-Tuning GPT Model with Persian Poetry

This training pipeline fine-tunes the Mistral-7B model on Persian poetry, focusing on the style of famous Persian poets. A key detail is the tokenizer behavior: during training, sequences are padded with the <unk> (unknown) token rather than the EOS token, so the model does not pick up spurious stopping behavior from padding. For inference, left padding is required because the decoder predicts the next token from the last token it sees; padding on the right would put padding tokens in that position and degrade the output.

Tokenizer Settings:
  • During training, the tokenizer pads sequences with the <unk> token instead of the EOS (end-of-sequence) token or a dedicated padding token. This prevents the model from learning to stop prematurely because of padding.
  • For inference, left padding is necessary because the last token the decoder sees before generating the next one needs to be part of the actual input, not padding.

Prompt Template: The model is prompted using a template that asks it to generate poetry in a specific style (e.g., the style of a particular Persian poet). This prompt is combined with the data so the model learns to generate poetry with the right stylistic touch.

PROMPT_TEMPLATE = """As a poet in style of {style}, write a poem in Persian language.

### Response:
{output}"""

Model and Tokenizer Setup:
  • The pre-trained model mistralai/Mistral-7B-v0.1 is used.
  • The tokenizer is configured to:
    • pad with the <unk> token (tokenizer.pad_token = tokenizer.unk_token);
    • use left padding (tokenizer.padding_side = 'left').

tokenizer = AutoTokenizer.from_pretrained(model_and_tokenizer_path)
tokenizer.model_max_length = MAX_LENGTH
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = 'left'

Quantization Config: The model is fine-tuned using 4-bit quantization to reduce memory usage and enable efficient training of larger models.

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
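
The snippets above don't show how the base model itself is loaded; here is a minimal sketch, assuming the same model_and_tokenizer_path used for the tokenizer and the quantization config defined above:

from transformers import AutoModelForCausalLM

# Load the base Mistral model in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    model_and_tokenizer_path,
    quantization_config=quantization_config,
    device_map="auto",
)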

LoRA Configuration (Parameter-Efficient Fine-Tuning): The model uses LoRA (Low-Rank Adaptation), a technique for fine-tuning efficiently with minimal memory overhead. The LoRA configuration sets parameters such as the rank (r), dropout, and the target modules to adapt.

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules="all-linear",
    bias="none",
)
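
The peft_model used by the Trainer further down is obtained by wrapping the quantized base model with this LoRA config; a sketch of that step:

from peft import get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit model trainable, then attach the LoRA adapters
model = prepare_model_for_kbit_training(model)
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # only a small fraction of the weights are trainable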

Training Arguments:
  • Training uses mixed precision (fp16) and gradient checkpointing to reduce memory consumption and accelerate training.
  • Key arguments include the learning rate, batch size, gradient accumulation steps, and the maximum number of steps.

training_args = TrainingArguments(
    report_to="mlflow",
    run_name=f"Mistral-7B-v0.1-{datetime.now().strftime('%Y-%m-%d-%H-%M-%s')}",
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    fp16=True,
    learning_rate=5e-6,
    lr_scheduler_type="constant",
    max_steps=args.max_steps,
    num_train_epochs=args.epoch,
    save_steps=10,
    logging_steps=10,
    warmup_steps=5,
    ddp_find_unused_parameters=False,
)

Dataset Processing: The dataset is processed with the prompt template and tokenized. The length of each input sequence is analyzed to ensure proper padding and truncation.

processed_dataset = read_dataset.map(apply_prompt_template_and_tokenize,
                                    fn_kwargs={"tokenizer": tokenizer, "max_length": MAX_LENGTH},
                                    num_proc=10,
                                    batched=False)

lengths = [x["length"] for x in processed_dataset["train"]]
fig, ax = plt.subplots()
ax.hist(lengths, bins=20)
fig.savefig(f"{output_dir}/lengths.png")
mlflow.log_figure(fig, "lengths.png")
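
The helper apply_prompt_template_and_tokenize is referenced but not shown; a minimal sketch of what it might look like, given the prompt template and tokenizer settings above:

def apply_prompt_template_and_tokenize(example, tokenizer, max_length):
    # Fill the prompt template with the poem's style and text, then tokenize
    text = PROMPT_TEMPLATE.format(style=example["style"], output=example["output"])
    tokens = tokenizer(text, truncation=True, max_length=max_length)
    tokens["length"] = len(tokens["input_ids"])  # recorded for the length histogram
    return tokens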

Training Loop: The trainer uses the processed dataset (after removing unnecessary columns such as style and length) and runs the training loop with the LoRA-adapted model.
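
A sketch of how the Trainer inputs might be derived, assuming the column names produced by the prep and tokenization steps above:

# Drop the raw text columns so only tokenized fields are fed to the Trainer
train_dataset = processed_dataset["train"].remove_columns(["style", "output", "length"])
test_dataset = processed_dataset["test"].remove_columns(["style", "output", "length"])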

trainer = Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=training_args,
)

peft_model.config.use_cache = False
trainer.train()

Model Registration and Saving: The fine-tuned model is saved and registered with MLflow for version control, tracking, and future deployment.

inference_params = {
    "max_new_tokens": 512,
    "repetition_penalty": 1.15,
    "return_full_text": False
}

sample = processed_dataset['train'][1]

# MLflow infers schema from the provided sample input/output/params
signature = infer_signature(
    model_input=sample["style"],
    model_output=sample["style"],
    # Parameters are saved with default values if specified
    params=inference_params,
)

mlflow.transformers.log_model(
    transformers_model={"model": trainer.model, "tokenizer": tokenizer},
    prompt_template=PROMPT_TEMPLATE_NO_OUTPUT,
    signature=signature,
    registered_model_name=registered_model_name,
    artifact_path=registered_model_name,
)

mlflow.transformers.save_model(
    transformers_model={"model": trainer.model, "tokenizer": tokenizer},
    path=model_dir,
    prompt_template=PROMPT_TEMPLATE_NO_OUTPUT,
    signature=signature,
)

  • Padding during training: using <unk> as the padding token is critical during training, as it prevents the model from learning to stop early when it encounters padding.
  • Left padding for inference: left padding ensures the decoder conditions on actual input tokens, rather than padding, when predicting the next token.
  • Efficiency: techniques such as quantization, gradient checkpointing, and LoRA keep training efficient, especially for large models like Mistral-7B.
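
To make the left-padding point concrete, here is a small illustration; the token lists in the comments are schematic, not real Mistral token IDs:

# With right padding, the last tokens the decoder sees are pad tokens,
# so generation is conditioned on padding rather than the prompt.
tokenizer.padding_side = "right"
print(tokenizer(["a short prompt"], padding="max_length", max_length=8)["input_ids"])
# e.g. [[bos, prompt, tokens, ..., pad, pad, pad]]

# With left padding, the prompt tokens sit at the end, right where the
# next-token prediction happens.
tokenizer.padding_side = "left"
print(tokenizer(["a short prompt"], padding="max_length", max_length=8)["input_ids"])
# e.g. [[pad, pad, pad, bos, prompt, tokens, ...]]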

This pipeline is designed to fine-tune the GPT model on Persian poetry, capturing the style of famous poets and generating poetry with the learned style and structure.

Merging LoRA Adapter with Base Model for Deployment

This pipeline is designed to merge the LoRA (Low-Rank Adaptation) adapter, which was trained separately, back into the base model. The merged model is then saved and registered with MLflow, making it ready for deployment. Here's a detailed breakdown of the process and the code.

Key Steps in the Process of merge.py:

Load the LoRA Adapter and Base Model:
  • The LoRA adapter fine-tunes the base model on a specific task (here, generating Persian poetry). Once training is complete, the LoRA weights are merged with the base model to form a final model ready for inference and deployment.
  • The AutoPeftModelForCausalLM class from the peft library is used to load the model and merge the LoRA adapter with the base model.

Merge and Unload the LoRA Adapter:
  • The merge_and_unload() function merges the adapter's weights into the base model. This finalizes the model by incorporating the learned LoRA weights so it behaves as a single complete model.

MLflow Integration:
  • Once the model is merged, it is registered with MLflow for tracking and version control. This allows the model to be easily deployed or shared within a team or organization.

Save the Final Model:
  • The merged model is saved as a Hugging Face model along with its tokenizer, making it available for deployment, inference, or further fine-tuning as needed.

import os
import argparse
import logging
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline
from mlflow.models import infer_signature
import mlflow
from transformers import BitsAndBytesConfig

logger = logging.getLogger(__name__)

def main(args):
    model_input = args.model_input

    # List the files in the model input directory (for debugging/logging purposes)
    for root, dirs, files in os.walk(model_input):
        for file in files:
            print(os.path.join(root, file))

    # Fetch the model metadata from MLflow
    model_info = mlflow.models.get_model_info(model_input)

    # Retrieve the prompt template and signature from the model metadata
    PROMPT_TEMPLATE_NO_OUTPUT = model_info.metadata['prompt_template']
    print("PROMPT_TEMPLATE_NO_OUTPUT:", PROMPT_TEMPLATE_NO_OUTPUT)

    signature = model_info.signature
    print("Signature:", signature)

    # Load the model using 8-bit quantization for efficiency
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoPeftModelForCausalLM.from_pretrained(
        f"{model_input}/peft/",
        device_map="auto", 
        quantization_config=quantization_config
    )

    # Merge the LoRA adapter into the base model and unload the adapter weights
    model = model.merge_and_unload()

    # Load the tokenizer associated with the model
    tokenizer = AutoTokenizer.from_pretrained(model_info.flavors['transformers']['tokenizer_name'])
    tokenizer.pad_token = tokenizer.unk_token  # Set the padding token to <unk> to avoid incorrect generation during inference
    tokenizer.padding_side = 'left'

    # Register the merged model in MLflow
    registered_model_name = "poem-mistral-merged"
    mlflow.transformers.log_model(
        transformers_model=pipeline(task="text-generation", model=model, tokenizer=tokenizer),
        prompt_template=PROMPT_TEMPLATE_NO_OUTPUT,
        signature=signature,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name,
    )


def parse_args():
    # Setup argparse to receive command-line arguments
    parser = argparse.ArgumentParser()

    # Argument to pass the model input directory
    parser.add_argument("--model-input", type=str, help="Input model directory")

    # Parse and return arguments
    return parser.parse_args()


# Entry point for the script
if __name__ == "__main__":
    args = parse_args()
    main(args)
  • LoRA Adapter Merging: After training the LoRA adapter separately, it is merged with the base model using the merge_and_unload() method, resulting in a final model that is ready for deployment.
  • Model Quantization: The model is loaded with 8-bit quantization, making it more efficient in terms of memory usage while retaining performance.
  • MLflow Registration: The final merged model is logged and registered in MLflow for version control, tracking, and future deployment.

This pipeline prepares the model for deployment by integrating the learned LoRA parameters into the base model and saving the final model with Hugging Face transformers and MLflow integration.

Final Pipeline in the Controller

This section explains the controller setup for managing an end-to-end pipeline for fine-tuning a GPT model on Persian poetry. The pipeline includes three main stages:

  • Preparation Stage: prepares and processes the raw data.
  • Training Stage: fine-tunes the model using LoRA on GPU resources.
  • Merging Stage: merges the LoRA adapter with the base model to create a final deployable model.

The pipeline orchestrates these stages and manages resource allocation, environment variables, and job submission.
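
The job definitions below reference the compute targets and Key Vault URI (compute_cluster_cpu, compute_cluster_gpu, keyvault_uri), which are not shown in the original snippets. A minimal sketch of how they might be set up in the controller, with illustrative names and VM sizes:

from azure.ai.ml.entities import AmlCompute

# Illustrative names, sizes, and Key Vault URI; adjust to your quota and region
compute_cluster_cpu = "cpu-cluster"
compute_cluster_gpu = "gpu-cluster"
keyvault_uri = "https://<your-key-vault-name>.vault.azure.net/"

workspace_ml_client.compute.begin_create_or_update(
    AmlCompute(name=compute_cluster_cpu, size="Standard_DS3_v2", min_instances=0, max_instances=1)
).result()

workspace_ml_client.compute.begin_create_or_update(
    AmlCompute(name=compute_cluster_gpu, size="Standard_NC24ads_A100_v4", min_instances=0, max_instances=1)
).result()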

Preparation Job (prep_job):
  • This stage processes the raw data into a format the model can use for training.
  • It uses the CPU environment (cpu_env) for lightweight data preparation tasks.

from azure.ai.ml import command, Input, Output

prep_job = command(
    inputs=dict(
        raw_data=Input(
            type="uri_folder",
            path="./raw_data/"
        )
    ),
    outputs=dict(
        prep_data=Output(
            type="uri_folder",
            mode="rw_mount",
            name="poem_assets",
        )
    ),
    code="./prep_pipeline",
    command="python prep.py --raw_data ${{inputs.raw_data}} --prep_data ${{outputs.prep_data}}",
    environment=cpu_env,
    environment_variables={
        "KV_URI": keyvault_uri,
    },
    compute=compute_cluster_cpu,
)

Training Job (train_job):
  • This stage fine-tunes the base model using the prepared data.
  • It runs on GPU resources (gpu_env) with distributed training enabled via PyTorch, using the accelerate launcher for the distributed setup.
  • It accepts training parameters such as epoch and max_steps for fine-tuning the model.

train_job = command(
    inputs=dict(
        input_data=Input(
            type="uri_folder",
            mode="ro_mount"
        ),
        model_input=Input(
            type="mlflow_model",
            mode="ro_mount",
        ),
        epoch=1,
        max_steps=30,
    ),
    outputs=dict(
        output_dir=Output(
            type="uri_folder",
            mode="rw_mount"
        ),
        model_output=Output(
            type="mlflow_model",
        ),
    ),
    code="./train_pipeline",
    compute=compute_cluster_gpu,
    command="accelerate launch train.py --input-data ${{inputs.input_data}} --model-input ${{inputs.model_input}} --output-dir ${{outputs.output_dir}} --model-dir ${{outputs.model_output}} --epoch ${{inputs.epoch}} --max-steps ${{inputs.max_steps}}",
    experiment_name="mistral-lora-training-poems",
    environment=gpu_env,
    environment_variables={
        "KV_URI": keyvault_uri,
    },
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)

Merging Job (merge_job):
  • After fine-tuning, this stage merges the LoRA adapter with the base model to create a final deployable model.
  • It runs on GPU resources and does not require distributed computing.

merge_job = command(
    inputs=dict(
        model_input=Input(
            type="mlflow_model",
            mode="ro_mount",
        ),
    ),
    code="./merge_models_pipeline",
    compute=compute_cluster_gpu,
    command="python merge.py --model-input ${{inputs.model_input}}",
    environment=gpu_env,
    environment_variables={
        "KV_URI": keyvault_uri,
    },
)

Full Pipeline Definition

The pipeline is defined with three nodes:
  • Preparation Node: processes the raw data.
  • Training Node: fine-tunes the model.
  • Merging Node: merges the LoRA adapter with the base model.

from azure.ai.ml.dsl import pipeline

@pipeline(
    name="poem_finetune_pipeline",
    description="Pipeline for training and scoring poems",
)
def poem_finetune(pipeline_input_data, base_model_input):
    # Data preparation step
    prep_job_node = prep_job(raw_data=pipeline_input_data)

    # Training step, takes the output of the prep job
    train_job_node = train_job(input_data=prep_job_node.outputs.prep_data, model_input=base_model_input)

    # Merging step, takes the output of the training job
    merge_job_node = merge_job(model_input=train_job_node.outputs.model_output)

Submitting and Running the Pipeline

To execute the pipeline, a pipeline job is first instantiated by calling poem_finetune with the raw data and the foundation model, and the resulting poem_finetune_pipeline object is then submitted to the Azure ML workspace. The job output is streamed to track its progress in real time.

raw_data = Input(
    type="uri_folder",
    path="./raw_data/"
)
poem_finetune_pipeline = poem_finetune(pipeline_input_data=raw_data, base_model_input=foundation_model)

# Submit and run the pipeline job
pipeline_job_submit = workspace_ml_client.jobs.create_or_update(
    poem_finetune_pipeline, experiment_name="poem_finetune_pipeline"
)
workspace_ml_client.jobs.stream(pipeline_job_submit.name)

  • Data Preparation: the prep_job processes raw Persian poetry data into a format suitable for training.
  • Model Training: the train_job fine-tunes the GPT model (with LoRA adapters) on the prepared data.
  • Model Merging: the merge_job merges the fine-tuned LoRA adapter into the base model to form a single deployable model.
  • Execution: the pipeline is run and monitored via Azure Machine Learning, making it an efficient and scalable solution for large-model fine-tuning.

This pipeline provides a well-structured, end-to-end system for preparing data, training the model, and merging adapters, ensuring that the model is ready for deployment or further evaluation.

Inference in the Notebook for Fine-Tuned Model

After successfully fine-tuning your model (e.g., poem-mistral-merged), it's time to test the model to generate predictions (in this case, poetry) in a notebook. The inference process is separate from the controller and involves running a notebook that can load the model, execute predictions, and work with a GPU environment for high computational efficiency.

You can either deploy the model using an endpoint or run inference directly from the notebook. Below is an approach using a notebook (infer_dev.ipynb) for loading and testing the model.

In this notebook, you will load the model from the Azure ML model registry, apply the necessary inference parameters, and generate poetry based on the model's fine-tuning. Here's how to do it step-by-step.

# Import necessary libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
import mlflow

# Set up workspace information
SUBSCRIPTION_ID = "***"  # Replace with your Azure subscription ID
RESOURCE_GROUP = "***"   # Replace with your Azure resource group
WORKSPACE_NAME = "llm"   # Replace with your Azure ML workspace
LOCATION = "northeurope" # Set your region

# Authenticate using DefaultAzureCredential
credential = DefaultAzureCredential()

# Initialize MLClient to interact with Azure ML workspace
workspace_ml_client = MLClient(
    credential,
    subscription_id=SUBSCRIPTION_ID,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WORKSPACE_NAME,
)
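
# Assumption: point MLflow at the workspace tracking server so the "models:/"
# URI below resolves against this workspace's model registry.
azureml_tracking_uri = workspace_ml_client.workspaces.get(WORKSPACE_NAME).mlflow_tracking_uri
mlflow.set_tracking_uri(azureml_tracking_uri)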

# Load the fine-tuned model from the Azure ML model registry
model_name = "poem-mistral-merged"  # The name of your merged and fine-tuned model
mlflow_model = mlflow.pyfunc.load_model(f"models:/{model_name}/latest")

# Define inference parameters for text generation
inference_params = {
    "max_new_tokens": 512,         # Maximum tokens in the generated response
    "repetition_penalty": 1.15,    # Repetition penalty to improve text diversity
    "return_full_text": True,      # Return the full generated text
    "do_sample": True,             # Enable sampling for more creative outputs
}

# Test the model by generating poetry in the style of 'Shahname'
input_prompt = 'Shahname'
generated_text = mlflow_model.predict(input_prompt, params=inference_params)

# Print the generated poem
print(f"Generated poem:\n\n{generated_text}")

Authentication: DefaultAzureCredential() authenticates the current environment against Azure services. It works in different environments, such as development machines, Azure VMs, or managed identities.

ML Client Initialization: the MLClient object is used to interact with the Azure Machine Learning workspace, model registry, datasets, and more. You need to provide the subscription ID, resource group, and workspace name to connect to your workspace.

Model Loading: the mlflow.pyfunc.load_model function loads the fine-tuned model (poem-mistral-merged) from the model registry. This is the model that was fine-tuned on Persian poetry and prepared for inference.

Inference Parameters:
  • max_new_tokens: the maximum number of tokens to generate in the output.
  • repetition_penalty: penalizes repetitive tokens to encourage more diverse text generation.
  • return_full_text: returns the complete generated text rather than a truncated output.
  • do_sample: enables sampling so the model produces more creative, varied outputs.

Generating the Poem: the input prompt 'Shahname' is passed to the model to generate a Persian poem in the style of the Shahname. The output is printed, showcasing the model's ability to generate poetry based on the prompt and the fine-tuning it underwent during training.

  • GPU Requirement: The notebook should be run in an environment with access to GPU resources to handle the computational load of generating text using large models like Mistral-7B.
  • Environment Setup: The Dev Env environment created during the development process will be used for running this notebook. Ensure that the environment has all the necessary dependencies (e.g., transformers, peft, mlflow) installed.
  • Testing Flexibility: You can modify the input_prompt and inference_params to test the model under different conditions or generate poetry in different styles.

This notebook (infer_dev.ipynb) provides a flexible way to run inference on your fine-tuned model without deploying it as an endpoint. It allows you to test and explore the model's output interactively, adjusting parameters to fit the needs of your specific task. You can now test the model's ability to generate Persian poetry, fine-tuned to the unique style of poets like Ferdowsi (Shahname) or Saadi, depending on your input.