Azure: Create Your Own LLM LangChain

Motivation

LLMs are useful for many use cases these days, and having options beyond OpenAI's API can be more cost-effective. One alternative way to access OpenAI GPT models is through the Azure ML service, which offers a pay-as-you-go option; the same service also provides access to open LLMs such as Meta's Llama, Phi-3, and others. In this blog, we show how to access these models in Azure and create a serverless endpoint to use them. We then demonstrate how to use this LLM endpoint to build a LangChain app, and finally we deploy an Azure Container App to serve the LangChain endpoint.

Requirements

To follow along, you need an Azure account with a subscription that allows you to use Azure Machine Learning. After logging into Azure, create a workspace. For the purposes of this blog, we will use workspace_name="LlmOnAzure" as the name for the workspace. Follow the steps here to create a workspace.

Additionally, we need to use the Azure CLI, so you should install it by following the instructions here. You can verify that it is working correctly by running:

az version

After installing the Azure CLI, we also need the machine learning commands, so install the ML extension:

az extension add -n ml

Also, since we are going to use Mistral from the Azure model catalog, it's important to pick a region where it is available. You can check availability here. For our purposes, swedencentral works, so we choose that.

For this blog, ensure you have the following prepared:

subscription_id="......"
resource_group="....."
workspace_name="LlmOnAzure"
location="swedencentral"
endpoint_name="....."
containerapp_name="....."
containerapp_env="....."

To avoid repeating these values in every command, you can set them as CLI defaults:

az configure --defaults group=$resource_group workspace=$workspace_name location=$location subscription=$subscription_id

Mistral Serverless Endpoint

There are two ways to create an endpoint in Azure ML: managed or serverless. Here, we use a Mistral serverless endpoint, so we don't need any compute instance; we only need to make sure our workspace is in a supported region. With the serverless option, you pay only for the tokens you use (input and output tokens), whereas with a managed endpoint you pay for a GPU server for as long as it is running. To create a serverless endpoint, we first subscribe to the model in the marketplace and then create the endpoint using the az commands shown below:

# Create subscription
az ml marketplace-subscription create -f endpoint-llm/subscription.yml --name $endpoint_name

# Create serverless endpoint
az ml serverless-endpoint create -f endpoint-llm/endpoint.yml --name $endpoint_name
endpoint_url=$(az ml serverless-endpoint show --name $endpoint_name --query "scoring_uri" -o tsv)

# Get credentials
primary_key=$(az ml serverless-endpoint get-credentials -n $endpoint_name --query "primaryKey" -o tsv)

The subscription.yml and endpoint.yml files look like the following:

# subscription.yml
name: mistral-small-llm-xxx
model_id: azureml://registries/azureml-mistral/models/Mistral-small
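
The endpoint.yml references the same model; this is a minimal sketch, assuming the standard serverless-endpoint YAML schema with just a name and a model_id:

# endpoint.yml
name: mistral-small-llm-xxx
model_id: azureml://registries/azureml-mistral/models/Mistral-small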

The name in each YAML file is overridden by --name $endpoint_name on the command line. After that, we have endpoint_url and primary_key, which are all we need to call our Mistral LLM endpoint in Azure.
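
With these two values, you can already do a quick sanity check of the endpoint from Python, for example with the same ChatMistralAI class we use later in the LangChain app. This is a minimal sketch; it assumes you export endpoint_url and primary_key as the environment variables ENDPOINT_URL and ENDPOINT_API_KEY:

import os
from langchain_mistralai.chat_models import ChatMistralAI

# Quick sanity check of the serverless Mistral endpoint.
# ENDPOINT_URL and ENDPOINT_API_KEY hold the endpoint_url and primary_key
# values captured by the az commands above.
chat_model = ChatMistralAI(
    endpoint=os.environ["ENDPOINT_URL"],
    mistral_api_key=os.environ["ENDPOINT_API_KEY"],
)

print(chat_model.invoke("Say hello in one short sentence.").content)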

Deploy LangChain Using LangServe on Azure

So far, we have created an LLM (Mistral) endpoint on Azure that we can use as a base LLM, similar to an OpenAI LLM (ChatGPT). Now, let's build a LangChain app and deploy it on Azure using this LLM endpoint. For this purpose, we create a Docker container and push it to Azure to create an endpoint. We can use the az containerapp up command from our source folder, which has the structure below:

endpoint-langserve
|
|-----Dockerfile
|-----requirements.txt
|-----server.py

The Dockerfile is responsible for creating the container and looks like this:

FROM python:3.11
WORKDIR /code
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8001
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8001"]

As you can see, it uses FastAPI as a server and runs it using Uvicorn.
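
Once requirements.txt and server.py (shown below) are in place, you can optionally build and run the image locally before deploying. This is a sketch, and the image tag langserve-mistral is only illustrative:

# Build the image and run it locally with the endpoint credentials
docker build -t langserve-mistral ./endpoint-langserve
docker run -p 8001:8001 \
  -e ENDPOINT_URL=$endpoint_url \
  -e ENDPOINT_API_KEY=$primary_key \
  langserve-mistral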

Here is the requirements.txt:

fastapi>=0.68.0,<0.69.0
pydantic>=1.8.0,<2.0.0
uvicorn>=0.15.0,<0.16.0
langchain
langserve
langchain-community
langchain-mistralai
sse_starlette

This is the code (server.py) that uses LangServe to create an API over our LangChain chain:

import os
import logging

logging.basicConfig(level=logging.INFO)

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from langserve import add_routes
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
from langchain.schema import SystemMessage
from langchain_mistralai.chat_models import ChatMistralAI

app = FastAPI()

# Set all CORS enabled origins
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
    expose_headers=["*"],
)

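# Prompt: a fixed system message, the running chat history, and the latest human input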
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="You are a chatbot having a conversation with a human."
        ),
        MessagesPlaceholder(variable_name="chat_history"),
        HumanMessagePromptTemplate.from_template("{human_input}"),
    ]
)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

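# The serverless Mistral endpoint is injected via the ENDPOINT_URL and
# ENDPOINT_API_KEY environment variables when the container app is started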
chat_model = ChatMistralAI(
    endpoint=os.environ['ENDPOINT_URL'],
    mistral_api_key=os.environ['ENDPOINT_API_KEY'],
)

chat_llm_chain = LLMChain(
    llm=chat_model,
    prompt=prompt,
    memory=memory,
    verbose=True,
)

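# Expose the chain at /chat_llm; LangServe adds /invoke, /batch, /stream, and /playground routes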
add_routes(
    app,
    chat_llm_chain,
    path="/chat_llm",
)

Where do we use the Mistral LLM endpoint from the previous section? When creating ChatMistralAI, we read both ENDPOINT_URL and ENDPOINT_API_KEY, so we need to pass these two variables from the previous section to the container when we run it. The command for that is:

az containerapp up \
  --name $containerapp_name \
  --source endpoint-langserve \
  --resource-group $resource_group \
  --environment $containerapp_env \
  --env-vars ENDPOINT_URL=$endpoint_url ENDPOINT_API_KEY=$primary_key
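
To find the public URL of the deployed app (assuming external ingress was enabled by az containerapp up), you can query its ingress FQDN; this is a sketch using the standard az containerapp show query:

# Get the public URL of the container app
langserve_endpoint="https://$(az containerapp show \
  --name $containerapp_name \
  --resource-group $resource_group \
  --query properties.configuration.ingress.fqdn -o tsv)"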

At the end, you will have a LangServe endpoint (the container app URL) that can be invoked like this in Python:

import requests

# The container app URL (ingress FQDN) retrieved above
langserve_endpoint = "https://....."

response = requests.post(
    f"{langserve_endpoint}/chat_llm/invoke",
    json={"input": {"human_input": "tell me a funny joke"}},
)

response.json()
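
Since sse_starlette is installed, LangServe also exposes a streaming route for the same chain at /chat_llm/stream. Here is a minimal sketch of consuming it with requests, reusing the langserve_endpoint value from above:

import requests

# Stream the response as server-sent events from the /chat_llm/stream route
with requests.post(
    f"{langserve_endpoint}/chat_llm/stream",
    json={"input": {"human_input": "tell me a funny joke"}},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode())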

Cost Monitoring

To monitor the cost of serverless ML models in Azure, go to the Azure portal. Under Cost Management + Billing > Cost Management, select Cost Analysis. Switch to the Resources view and find or filter for the resource type microsoft.saas/resources, which corresponds to resources created from Azure Marketplace offers; for convenience, you can filter by resource types containing the string SaaS. You should then see costs named your-model-name-paygo-inference-input-tokens and your-model-name-paygo-inference-output-tokens, which represent the charges for your input and output token usage. These services bill by token count, not by the time a GPU is running.

Delete Resources

Since we created a serverless endpoint, a marketplace subscription, and a container app, we should delete these resources when we are done:

# Delete endpoint
az ml serverless-endpoint delete --name $endpoint_name

# Delete subscription
az ml marketplace-subscription delete --name $endpoint_name

# Delete container app
az containerapp delete --name $containerapp_name --resource-group $resource_group