Using Azure Functions in Azure Data Factory

(2020-Apr-19) Creating a data solution with Azure Data Factory (ADF) may look like a straightforward process: you have incoming datasets, business rules that define how to connect and transform them, and a final destination environment to save the transformed data. Very often, however, your data transformation requires more complex business logic that can only be developed externally (scripts, functions, web services, Databricks notebooks, etc.).

In this blog post, I will share my experience of using Azure Functions in my Data Factory workflows: my highs and lows of using them, my victories and struggles to make them work. If you share the same pain points, find any mistakes, or feel the facts are misrepresented, please leave your comments; there is no better opportunity to learn than from constructive criticism :-)

Azure Functions gives you the freedom to create and execute small or moderately sized pieces of code in C#, Java, JavaScript, Python, or PowerShell. This freedom releases you from the need to create special infrastructure to host a development environment; however, you still need to provision an Azure Storage account and Application Insights to store your Azure Function code and collect metrics of its execution.


Azure Data Factory provides you with several ways to execute Azure Functions and integrate them into your data solution (this list is not complete):
- Web Activity
- Webhook Activity 
- Azure Function Activity

Web Activity
Connection and setting details:
- URL: your Azure Function REST API endpoint
- Method: REST API method of your endpoint ("GET", "POST", "PUT", "DELETE", "PATCH")
- Body: JSON request details
- Dataset: Linked Service and dataset that you want to pass to your Azure Function
- Integration runtime
- Authentication: None (anonymous), Basic (with username and password), MSI, and Client Certificate
Pros:
- Support of MSI authentication for password-less connectivity within your Azure environment
- Easy way to execute your Azure Function with just the URL, Method, and Body for your endpoint call
- Retry & Retry interval settings can help restart failed ADF activity calls to your function code
Cons:
- Web Activity can only call publicly exposed URLs; it doesn't support URLs hosted in a private virtual network.
- The Web Activity will time out after 1 minute with an error if it does not receive a response from the REST API endpoint (this timeout has nothing to do with a configured Azure Function timeout).

Webhook Activity
Connection and setting details:
- URL: your Azure Function REST API endpoint
- Method: REST API method of your endpoint ("POST" only)
- Body: JSON request details
- Timeout: the timeout within which the webhook should be called back (default value is 10 minutes)
- Authentication: None (anonymous), Basic (with username and password), MSI, and Client Certificate
Pros:
- Support of MSI authentication for password-less connectivity within your Azure environment
- Easy way to execute your Azure Function with just the URL, Method, and Body for your endpoint call
Cons:
- The Webhook Activity will time out after 1 minute with an error if it does not receive a response from the REST API endpoint (this timeout has nothing to do with a configured Azure Function timeout).
- The concept of using the callBackUri property to return a response from your Azure Function is not well explained in the official documentation and may confuse some developers like me :-)
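For comparison, a minimal Webhook activity definition might look like this (again, the names and URL are placeholders); note that ADF appends the callBackUri property to this body at runtime:

```json
{
    "name": "CallAzureFunctionWebhook",
    "type": "WebHook",
    "typeProperties": {
        "url": "https://myfunctionapp.azurewebsites.net/api/GetLocalTime?code=<function-key>",
        "method": "POST",
        "timeout": "00:10:00",
        "body": { "timezone": "Eastern Standard Time" }
    }
}
```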

Azure Function Activity
Connection and setting details:
- Azure Function Linked Service: reference point to your Azure Function App
- Azure Function Name: name of the function that you can access via the Linked Service based Function App
- Method: REST API method of your endpoint ("GET", "POST", "PUT")
- Body: JSON request details
- Authentication: Function App URL and access key for your Azure Function (configured in a Linked Service)
Pros:
- The Azure Function Linked Service function key can be sourced from a Key Vault, which simplifies both storing/accessing this secret and seamless deployment to other environments.
- Your Azure Function can run for more than 1 minute; however, it is still limited to 230 seconds regardless of your Azure Function timeout setting.
- Retry & Retry interval settings can help restart failed ADF activity calls to your function code
Cons:
- The previously mentioned 230-second Azure Function ADF activity timeout may require the use of HTTP polling if your external code needs more time to complete.
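As a sketch, an Azure Function activity definition references a Linked Service instead of a hard-coded URL (the linked service, function, and activity names below are placeholders):

```json
{
    "name": "CallAzureFunctionActivity",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "GetLocalTime",
        "method": "POST",
        "body": { "timezone": "Eastern Standard Time" }
    }
}
```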

Sample Code of Azure Function and its use in Azure Data Factory
Azure Function
I've created a very simple Azure PowerShell function that returns the current time converted to the time zone name provided as an input parameter:
using namespace System.Net

# Input bindings are passed in via param block.
param($Request, $TriggerMetadata)

# Write to the Azure Functions log stream.
Write-Host "PowerShell HTTP trigger function processed a request."

# Interact with body of the request.
$timezone = $Request.Body.timezone


if ($timezone) {
    $status = [HttpStatusCode]::OK
    $timelocal = Get-Date
    $body = [System.TimeZoneInfo]::ConvertTimeBySystemTimeZoneId($timelocal, [System.TimeZoneInfo]::Local.Id, $timezone)
    Write-Host "Local Time: $timelocal"
    Write-Host "Requested TimeZone: $timezone"
    Write-Host "Converted Time: $body"
}
else {
    $status = [HttpStatusCode]::BadRequest
    $body = "Please pass a timezone name in the request body."
}

#Start-Sleep -Seconds 100

# Associate values to output bindings by calling 'Push-OutputBinding'.
Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode = $status
    Body = $body
})
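For reference, the JSON request body this function expects is minimal; the timezone value must be a valid Windows time zone ID:

```json
{ "timezone": "Eastern Standard Time" }
```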

Web Activity execution in ADF
I pass the URL of my function, the POST method, and a JSON Body request for the "Eastern Standard Time" zone:


Output Result
Successful execution of my ADF web activity returned a time value in the Output.Response attribute: 
 

Additional notes
1) After artificially making my function run for more than 1 minute with the PowerShell "Start-Sleep" command, I received a 2108 error message, "A task was canceled. No response from the endpoint", which confirmed this Web Activity limitation in Azure Data Factory:
{ "errorCode": "2108", "message": "Error calling the endpoint ''. Response status code: ''. More details:Exception message: 'A task was canceled.'.\r\nNo response from the endpoint. Possible causes: network connectivity, DNS failure, server certificate validation or timeout.", "failureType": "UserError", "target": "Web Activity", "details": [] } 


Webhook Activity execution in ADF
I pass the URL of my function, the POST method, and a JSON Body request for the "Eastern Standard Time" zone, and set the timeout to 9 minutes, just to test:


Output Result
Successful execution of my ADF Webhook activity returned a time value in the Output.Response attribute along with the Status code: 

Additional notes
1) Webhook requires the callBackUri property to be used to return a response from your Azure Function. When you test your code, this property is added automatically to your JSON Body request; you won't see it in the activity settings, but you have to explicitly handle it in your code:
if ($Request.Body.callBackUri)
{
    $callBackUri = $Request.Body.callBackUri
    Write-Host "Received callBackUri: $callBackUri"
}

You then need to call this URI back from your function code to notify Data Factory of completion:
# Call back to Azure Data Factory
if ($callBackUri)
{
    # Build a JSON payload with the function result and status
    $OutputBody = "{ ""Response"":"""+$body+""", ""Status"":"""+$status+"""}"
    Invoke-RestMethod -Method 'Post' -Uri $callBackUri -Body $OutputBody -ContentType 'application/json'
}

2) After artificially making my function run for more than 1 minute with the PowerShell "Start-Sleep" command, I received a BadRequest error message, "The request failed with status code 'BadRequest'", which confirmed this Webhook Activity limitation in Azure Data Factory:
{ "errorCode": "BadRequest", "message": "The request failed with status code '\"BadRequest\"'.", "failureType": "UserError", "target": "WebHook", "details": "" }


Azure Function Activity execution in ADF
I connect to my Azure Function via a Linked Service connection, along with the POST method and a JSON Body request for the "Eastern Standard Time" zone:



Output Result
Successful execution of my ADF Azure Function activity returned a time value in the Output.Response attribute:



Additional notes
1) The Azure Function activity requires your output to be formatted as a JSON object (JObject); otherwise, your Azure Function activity will fail and return an error message that the Response Content is not a valid JObject.
# Associate values to output bindings by calling 'Push-OutputBinding'.
Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode = $status
    Body = "{ ""Response"":"""+$body+"""}"
})

2) After artificially making my function run for more than 230 seconds (just for testing) with the PowerShell "Start-Sleep" command, I received a 3608 error message, "Call to provided Azure function failed with status - 'BadGateway'", which confirmed this Azure Function Activity limitation in Azure Data Factory.

Conclusion: I'm being positive :-)
1) Don't use the ADF Web Activity to execute your Azure Function, due to the limitations and drawbacks highlighted above.
2) Use the ADF Webhook activity for code with a very short execution time (less than 1 minute) if you require an explicit request and a synchronous workflow.
3) Azure Function activity in ADF is the most favorable approach to execute the code of your Azure Function:
- More time to execute (still limited to 230 seconds)
- ADF pipeline code is cleaner (no use of hard-coded URLs) which makes this approach the best candidate for CI/CD pipelines in Azure DevOps.

However, if you really want to run very long Azure Functions (longer than 10, 30, or 60 minutes) and use Data Factory for this, you can: (1) create a "flag-file" A in your ADF pipeline; (2) this "flag-file" A serves as a triggering event for your Azure Function; (3) after this triggering event, your Azure Function runs and, at the end, creates another "flag-file" B; (4) which serves as a new triggering event for another pipeline in your Azure Data Factory. I have also written a blog post about this: Event-driven architecture (EDA) with Azure Data Factory - Triggers made easy.
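As a rough sketch of step (4), a Blob storage event trigger on "flag-file" B could look like this (the container path, file name, and pipeline name are my placeholders):

```json
{
    "name": "FlagFileBTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/flags/blobs/flag-file-B",
            "events": [ "Microsoft.Storage.BlobCreated" ]
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessFunctionResults",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```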

Happy Data Adventures!

Part 2: Using Durable Functions in Azure Data Factory - Support for long running processes in Azure Functions

Part 3: Using Azure Durable Functions with Azure Data Factory - HTTP Long Polling

Comments

  1. The best approach is using Durable Functions and capturing the status query URL. With this, the function can run as long as you want and you can query the status of the execution.

    1. Thanks, Petar, yes, I've looked at the Durable Functions as well. Currently, they don't support the PowerShell language.

      I already had working PowerShell code that executes another program with external DLLs to (1) send a web API request, (2) receive a file and save it in the Azure Function's internal storage account, and (3) move this file into Azure Data Lake Storage. So I didn't think that I could pull this off with C#, F#, or JavaScript in Durable Functions.

      But thanks, Petar, for your feedback. I'm just a regular developer, still a lot more to learn.

  2. Great blog, Rayis. Very objective comparisons across the three approaches.
    Do you have any pointers/references on using Azure Functions as an activity within the pipeline for doing field level data transformations.
    For example - I have a CSV and I need to compress a particular column with my own custom compression algorithm which is exposed via an Azure Function. The output data set of this activity should have the particular column compressed value.

    1. I don't think you can keep a binary column within a CSV file; you could upload your data into a SQL Server table with a binary column, for example, and then try to do your row/column based transformation.

    2. Thank you very much for your nice post!

      I have a question:
      Do you have any experience with polling Durable Functions via Azure Data Factory? On the web I see many solutions like this: https://endjin.com/blog/2019/09/azure-data-factory-long-running-functions

      But I would like to do something like this (polling only in one step):

      https://stackoverflow.com/questions/68844062/web-activity-runs-successfully-although-response-contains-failed

    3. Thanks for your question. Yes, I've also blogged about using Durable Functions and how to poll them; the links are at the end of this post and here:
      - https://datanrg.blogspot.com/2020/10/using-durable-functions-in-azure-data.html
      - https://datanrg.blogspot.com/2020/12/using-azure-durable-functions-with.html

  3. What is the role of parameters in the Azure Function linked service?

    1. I don't think "Parameters" were available when I started working with the Azure Function Linked Services in ADF. I would assume they are reference points to the parameters exposed in the Configuration of the Function App, but that's my guess, since the documentation still has the old screenshot and doesn't describe parameters.

  4. Hi, great article! However, I don't get the same output as you when using POST and the Azure Function activity; there is no response. When using GET, the output contains the response. Have the activities been changed since you wrote this article?

    1. I've just checked my Production ADF solution and Azure Function POST activity still produces its output results.

  5. Hi, I have a question about "Operation on target Execute Activity failed: Operation on target polling status 200 failed: Specified cast is not valid." I'm not sure how to solve this problem.

    1. Check your Azure Function App logs first before validating them in ADF. Perhaps there is something wrong with the Function App execution.

