Motivation
It often happens that you want to use your own fine-tuned model, or a popular model such as Meta's LLaMA or Stable Diffusion, to perform inference as an offline job. In Microsoft Azure, these are commonly referred to as batch-endpoint jobs. In this blog post, we will cover how to use Stable Diffusion XL (SDXL) to generate images entirely within Azure as a batch job, including the customizations you may need, such as extra dependencies like the diffusers library or a custom infer.py script. The image above was generated using the output of this blog post.
Requirements
To follow along, you need an Azure account with a subscription that allows you to use Azure Machine Learning. After logging into Azure, create a workspace. For the purposes of this blog, we will use workspace_name="Generate-Image"
as the name for the workspace. Follow the steps here to create a workspace.
Since we plan to use a GPU, you may need to request an increase in your GPU quota if this is your first time. The required compute SKU is either Standard_NC6s_v3 or Standard_NC4as_T4_v3. You can request the increase through the Help + support panel in your Azure portal.
For this blog, ensure you have the following prepared:
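A minimal sketch of those values, with placeholder names you should replace with your own (subscription_id, resource_group, location, workspace_name, and registry_name are all referenced later in the script):

# Placeholder values -- replace them with your own subscription and resource names
subscription_id="<your-subscription-id>"
resource_group="<your-resource-group>"
location="<your-region>"        # e.g. eastus
workspace_name="Generate-Image"
registry_name="azureml"         # built-in registry that hosts SDXL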
Additionally, we need to use the Azure CLI, so you should install it by following the instructions here. You can verify that it is working correctly by running:
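The exact check is up to you; for example, the following commands confirm the CLI runs and that the ml extension (required for the az ml commands used throughout this post) is installed:

az version                  # confirm the CLI itself works
az extension add --name ml  # add the Azure ML CLI v2 extension if it is missing
az ml -h                    # confirm the ml subcommands are available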
Setup Prerequisites
Besides the variables mentioned above, we have other prerequisites that will be introduced later. But why use bash? Think of it as the control room where you manage resources, configure settings, and glue everything together to leverage Azure Machine Learning. There are many .sh files in the official Azure examples that you can draw inspiration from. Here are the variables we need for our job:
# Model from system registry that needs to be deployed
model_name="stabilityai-stable-diffusion-xl-base-1-0"
model_label="latest"
endpoint_name="text-to-image"
deployment_name="generate-image-batch-sdxl"
deployment_compute="gpu-cluster"
compute_sku="Standard_NC6s_v3"
base_endpoint_path="endpoint"
Now we configure our default settings, using az cli for the first time:
# Configure defaults
az configure --defaults group=$resource_group workspace=$workspace_name location=$location subscription=$subscription_id
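Many of the official example scripts also collect the workspace flags into a single variable; the $workspace_info used with the compute commands below can be defined the same way (an assumed definition based on those examples):

# Workspace context passed to az ml commands further down (assumed definition)
workspace_info="--resource-group $resource_group --workspace-name $workspace_name"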
Well-known models such as SDXL are already hosted in Azure under the registry_name="azureml" registry. We can access them directly; to use our own model instead, we would need to upload it into our own registry to make it accessible in Azure. To check the model information, use the command below:
# Check if the model exists in the registry
if ! az ml model show --name $model_name --label $model_label --registry-name $registry_name
then
  echo "Model $model_name:$model_label does not exist in registry $registry_name"
  exit 1
fi
Later, we need the model version, which we can get from the model information above:
model_version=$(az ml model show --name $model_name --label $model_label --registry-name $registry_name --query version --output tsv)
Great! Now we need to set up compute. We can create a cluster with the instance type Standard_NC6s_v3 manually through the Azure portal, Visual Studio Code, or az cli. If it does not already exist, use the command below (since we may run the script multiple times during development, it is good practice to guard each resource creation with an "if not exists" check to avoid errors):
# Check if compute $deployment_compute exists, else create it
if az ml compute show --name $deployment_compute $workspace_info
then
  echo "Compute cluster $deployment_compute already exists"
else
  echo "Creating compute cluster $deployment_compute"
  az ml compute create --name $deployment_compute --type amlcompute --min-instances 0 --max-instances 2 --size $compute_sku $workspace_info || {
    echo "Failed to create compute cluster $deployment_compute"
    exit 1
  }
fi
To run the job and generate output, we need some input data: in this case, a prompt from which SDXL generates an image. The input can be provided as a CSV file; for example, create one with the following content:
,prompt,negative_prompt,width,height
0,"photo of a rhino dressed in a suit and tie as a reporter telling news on TV, award-winning photography, photorealistic","ugly, deformed, cartoon, illustration, animation, face, male, female",720,720
Let's place the CSV file in a folder and keep that folder's path in the bash script under the variable processed_chunk_path, as shown below:
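A minimal sketch, assuming we save the CSV above as input/prompts.csv (both the folder and file names are arbitrary choices):

# Folder holding the input CSV file(s) for the batch job (example name)
processed_chunk_path="input"
mkdir -p $processed_chunk_path
cat > $processed_chunk_path/prompts.csv << 'EOF'
,prompt,negative_prompt,width,height
0,"photo of a rhino dressed in a suit and tie as a reporter telling news on TV, award-winning photography, photorealistic","ugly, deformed, cartoon, illustration, animation, face, male, female",720,720
EOF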
Deploy Model
What does "Model" refer to here? This is an important question to understand how cloud operations work. Essentially, "Model" encompasses everything from dependencies
, model artifacts (torch files)
, infer.py code
, and the docker image
. If any of these components change, the model needs to be redeployed. We can deploy the model to an endpoint and then invoke the endpoint to see how our deployment works. Often, resources are created using yml
files, but to keep things dynamic and be able to change variables within the yml file, here is a bash function that allows us to modify names inside the yml file:
# Helper function to change parameters in yaml files
change_vars() {
  for FILE in "$@"; do
    TMP="${FILE}_"
    cp $FILE $TMP
    readarray -t VARS < <(cat $TMP | grep -oP '{{.*?}}' | sed -e 's/[}{]//g');
    for VAR in "${VARS[@]}"; do
      sed -i "s/{{${VAR}}}/${!VAR}/g" $TMP
    done
  done
}
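For example, with endpoint_name="text-to-image" already set, running the helper against the endpoint template produces a substituted copy with a trailing underscore (a quick sketch using the paths from this post):

change_vars $base_endpoint_path/generate-endpoint.yml
cat $base_endpoint_path/generate-endpoint.yml_   # name: "text-to-image" after substitution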
Now, to create the endpoint, we can use:
if [ $(az ml batch-endpoint list --query "[?name=='$endpoint_name'].name" -o tsv) ]; then
  echo "Endpoint $endpoint_name already exists."
else
  echo "Endpoint $endpoint_name does not exist. Creating it..."
  change_vars $base_endpoint_path/generate-endpoint.yml
  az ml batch-endpoint create -f $base_endpoint_path/generate-endpoint.yml_
  # Remove the temporary substituted copy once the endpoint has been created
  rm $base_endpoint_path/generate-endpoint.yml_
fi
And here is the corresponding template, generate-endpoint.yml (the trailing-underscore copy is the temporary file produced by change_vars):
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: "{{endpoint_name}}"
The change_vars() function replaces {{endpoint_name}} with the value set in the bash script earlier. After creation, we can get the endpoint's scoring URL with:
# Get scoring url
echo "Getting scoring URL..."
scoring_url=$(az ml batch-endpoint show -n $endpoint_name --query scoring_uri -o tsv)
echo "Scoring url is $scoring_url"
Now, it's time to create the deployment. Here is the yml file:
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: generate-image
type: model
endpoint_name: "{{endpoint_name}}"
model: azureml://registries/azureml/models/{{model_name}}/versions/{{model_version}}
code_configuration:
  code: code
  scoring_script: infer.py
environment:
  image: mcr.microsoft.com/azureml/minimal-ubuntu22.04-py39-cuda11.8-gpu-inference:latest
  conda_file: environment/conda.yaml
resources:
  instance_count: 1
settings:
  mini_batch_size: 1
  retry_settings:
    max_retries: 2
    timeout: 9999
Another way to set variables in the yml file is through the az command's --set option:
az ml batch-deployment create -f $base_endpoint_path/generate-deployment.yml --set \
  endpoint_name=$endpoint_name \
  name=$deployment_name \
  compute=$deployment_compute \
  model=azureml://registries/$registry_name/models/$model_name/versions/$model_version || {
    echo "deployment create failed"; exit 1;
}
One of the variables set is model=azureml://registries/$registry_name/models/$model_name/versions/$model_version, which refers to SDXL. The other variables are the names and the compute cluster defined earlier. The base image is image: mcr.microsoft.com/azureml/minimal-ubuntu22.04-py39-cuda11.8-gpu-inference:latest, and Azure will install conda_file: environment/conda.yaml on top of this image, so we can handle our dependencies there. Here is the conda file we need for our purpose:
name: model-env
channels:
  - anaconda
  - pytorch
  - conda-forge
dependencies:
  - python=3.8.16
  - pip<=23.0.1
  - pip:
    - mlflow==2.3.2
    - torch==2.0
    - transformers==4.29.1
    - diffusers==0.23.0
    - accelerate==0.22.0
    - azureml-core==1.52.0
    - azureml-mlflow==1.52.0
    - azure-ai-contentsafety==1.0.0b1
    - aiolimiter==1.1.0
    - azure-ai-mlmonitoring==0.1.0a3
    - azure-mgmt-cognitiveservices==13.4.0
    - azure-identity==1.13.0
If you ever switch to a fully custom Docker image instead of relying on conda_file: environment/conda.yaml, you would need to incorporate all of these conda dependencies into that Dockerfile yourself.
In the code configuration section, we define how the inference code should work. Here is our infer.py
script:
import base64
import logging
import os
import pandas as pd
import torch
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler
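# Note: vision_utils is assumed to be a small local helper module placed next to
# infer.py in the deployment's code folder, providing image_to_base64() and save_image().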
from vision_utils import image_to_base64, save_image
def init():
    global base
    global refiner
    global g_logger
    global output_path

    g_logger = logging.getLogger("azureml")
    g_logger.setLevel(logging.INFO)

    # Log whether CUDA is available
    g_logger.info("CUDA available: " + str(torch.cuda.is_available()))

    # Output directory and path
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]
    g_logger.info("Output path: " + output_path)

    # Input directory and path
    model_dir = os.environ["AZUREML_MODEL_DIR"]
    mlflow_folder = os.listdir(model_dir)[0]
    model_path = os.path.join(model_dir, mlflow_folder, "artifacts", "INPUT_model_path")
    g_logger.info("Model path: " + model_path)

    # List all files in the model path
    files = os.listdir(model_path)
    g_logger.info("Model files: " + str(files))

    # Load base and refiner of SDXL
    base = DiffusionPipeline.from_pretrained(
        model_path,
        safety_checker=None,
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    base.scheduler = EulerAncestralDiscreteScheduler.from_config(base.scheduler.config)
    base.enable_model_cpu_offload()

    refiner = DiffusionPipeline.from_pretrained(
        model_path,
        safety_checker=None,
        text_encoder_2=base.text_encoder_2,
        vae=base.vae,
        torch_dtype=torch.float16,
        use_safetensors=True,
        variant="fp16",
    ).to("cuda")
    refiner.scheduler = EulerAncestralDiscreteScheduler.from_config(refiner.scheduler.config)
    refiner.enable_model_cpu_offload()

    g_logger.info("Init complete")
def run(files):
    g_logger.info("Input text files: " + str(files))
    output_list = []
    for data_file in files:
        input_pd = pd.read_csv(data_file, header=0)
        g_logger.info("Input data file: " + data_file.split("/")[-1])
        g_logger.info("Input shape: " + str(input_pd.shape))

        # Input columns: prompt, negative_prompt, width, height
        assert len(set(input_pd["width"])) == 1, "All width values should be the same"
        assert len(set(input_pd["height"])) == 1, "All height values should be the same"

        # Get width and height from the first row
        row = input_pd.iloc[0]
        width = row["width"]
        height = row["height"]

        # List of prompts and negative prompts
        prompts = list(input_pd["prompt"])
        negative_prompts = list(input_pd["negative_prompt"])
        g_logger.info(f"Prompts size: {len(prompts)}, Negative prompts size: {len(negative_prompts)}")

        image = base(
            prompt=prompts,
            width=width,
            height=height,
            negative_prompt=negative_prompts,
            num_inference_steps=40,
            denoising_end=0.8,
            output_type="latent",
        ).images
        generated_images = refiner(
            prompt=prompts,
            width=width,
            height=height,
            negative_prompt=negative_prompts,
            num_inference_steps=40,
            denoising_start=0.8,
            aesthetic_score=10,
            negative_aesthetic_score=2.4,
            image=image,
        ).images
        g_logger.info("Generated images size: " + str(len(generated_images)))

        for image in generated_images:
            base64_img = image_to_base64(image, "PNG")
            saved_path = save_image(output_path, image, "PNG")
            g_logger.info("Saved image path: " + saved_path)
            output_list.append((base64_img, saved_path))

    return pd.DataFrame(output_list, columns=["base64", "path"])
Based on the base image and model that we're using, there are a couple of important points to consider:
- The input path for accessing the SDXL model is os.environ["AZUREML_MODEL_DIR"].
- The output path for saving the generated images is output_path = os.environ["AZUREML_BI_OUTPUT_PATH"], which allows us to download them from the job's output in Azure.
- We pass width and height parameters when generating an image, primarily to illustrate how parameter passing can be implemented.
- When dealing with large models that barely fit into memory, it's advisable to consider optimization techniques such as quantization. Given that our compute instance has only 16 GB of GPU memory, we use enable_model_cpu_offload from the diffusers library for efficient processing.
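Putting the pieces together, the relative paths in the deployment yml (code: code, conda_file: environment/conda.yaml) imply a folder layout under base_endpoint_path roughly like the following; the vision_utils.py helper imported by infer.py is assumed to sit next to it:

endpoint/
├── generate-endpoint.yml
├── generate-deployment.yml
├── code/
│   ├── infer.py
│   └── vision_utils.py
└── environment/
    └── conda.yaml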
Invoke Endpoint
Now that the deployment is complete, we can invoke the endpoint to execute our batch job on the input CSV file we saved earlier.
job_name=$(az ml batch-endpoint invoke --name $endpoint_name \
  --deployment-name $deployment_name \
  --input $processed_chunk_path \
  --input-type uri_folder --query name --output tsv) || {
    echo "Endpoint invocation failed"; exit 1;
}
This command initiates the job. To monitor its progress and determine when it will finish, we can stream the job logs using:
az ml job stream --name $job_name || {
echo "Job streaming failed. If the job fails with an Assertion Error stating that the actual size of the CSV exceeds 100 MB, consider splitting the input CSV file into multiple files."; exit 1;
}
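If you do hit that 100 MB limit, one simple workaround (a sketch assuming a single large prompts.csv with a header row, and GNU split so that -C keeps lines intact) is to split the file and re-attach the header to every chunk:

# Hypothetical example: split prompts.csv into ~90 MB chunks, repeating the header in each
header=$(head -n 1 prompts.csv)
tail -n +2 prompts.csv | split -C 90m - chunk_
for f in chunk_*; do
  { echo "$header"; cat "$f"; } > "${f}.csv" && rm "$f"
done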
Streaming may take up to 20 minutes. Once streaming is complete, we can check the status of our job:
status=$(az ml job show -n $job_name --query status -o tsv)
echo $status
if [[ $status == "Completed" ]]
then
  echo "Job completed"
elif [[ $status == "Failed" ]]
then
  echo "Job failed"
  exit 1
else
  echo "Job status is neither failed nor completed"
  exit 2
fi
Finally, we can easily download the output by job name, since we saved the images under output_path = os.environ["AZUREML_BI_OUTPUT_PATH"] in the infer.py code:
az ml job download --name $job_name --download-path "$base_endpoint_path/generated_images" || {
echo "Job output download failed"; exit 1;
}
It's worth noting that jobs sometimes encounter errors for certain files in the input folder while the overall status still appears as Completed; Azure handles these per-file exceptions, so it's important to be aware of this. Make use of logging, as demonstrated in infer.py. These logs are stored under the path logs/user/stdout/<node_id>/processNNN.stdout.txt in the logging section of the job.
Clean Up
Azure provides us with commands for deleting the endpoint and compute as shown below:
az ml batch-endpoint delete --name $endpoint_name --yes || {
echo "Endpoint deletion failed"; exit 1;
}
az ml compute delete --name $deployment_compute --yes || {
echo "Compute deletion failed"; exit 1;
}
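Optionally, if you created a dedicated resource group just for this walkthrough, you can remove everything in it at once (use with care, since this deletes every resource in the group):

az group delete --name $resource_group --yes --no-wait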
This process may take a few minutes, but overall, you'll only be charged for the time you use, and after finishing and cleaning up, there will be no further charges.