(2020-May-24) It was never my plan to write a series of articles about working with JSON files in Azure Data Factory (ADF). But while working with one particular ADF component, I discovered other possible ways to use the rich and loosely constrained JSON file format, which in a nutshell is just a text file with one or more ("key" : "value") pair elements.
Sometimes you could have an array of those pair elements or an array of other arrays; now you see where I'm going with the name of my current blog post :-)
Previous blog posts on using JSON files in Azure Data Factory:
Part 1: Transforming JSON to CSV with the help of Azure Data Factory - Mapping Data Flows
Part 2: Transforming JSON to CSV with the help of Azure Data Factory - Wrangling Data Flows
There are several ways to explore the JSON way of doing things in Azure Data Factory. The first two that come to mind are:
(1) ADF activities' output - they are JSON formatted
(2) Reading JSON files - the task itself produces JSON formatted results too
It's nothing new that we can reference nested elements of an ADF activity's output, since it's represented in JSON format, or pass JSON file content to other tasks/components that can process this format.
But what if you need to pass the complete output of an ADF activity further down your pipeline? Or pass JSON array elements to another ADF activity or sub-pipeline as a parameter value? Let's explore what other options are available in Azure Data Factory for this very interesting use case.
(1) ADF activities' output
Let's say I have a "Get Metadata" task in my ADF pipeline that reads the content of a particular folder in my storage account.
The main portion of the output for this activity may look this way:
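Here is a trimmed sketch based on the activity's Child Items field; the file and folder names are hypothetical:

{
    "childItems": [
        { "name": "file1.csv", "type": "File" },
        { "name": "file2.json", "type": "File" },
        { "name": "subfolder1", "type": "Folder" }
    ]
}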
I already know that I can access the list of files/folders using this ADF expression:
@activity('List of files and folders').output.childItems
Or get to the first file by referencing the very first element [0] of this JSON array:
@activity('List of files and folders').output.childItems[0]
However, if I want to store the complete output of this activity in a separate Array variable while preserving the ability to reference all nested elements, then this can only be done by wrapping the output into an array, using the function with the same name:
@array(activity('List of files and folders').output)
Otherwise, you will get a data type mismatch error message. This becomes very helpful if you want to pass the JSON output to the next activity of your ADF pipeline for processing. It works and it's fun! :-)
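For example, here is a minimal sketch of a Set Variable activity definition that stores the wrapped output (the variable name var_activity_output is hypothetical):

{
    "name": "Save activity output",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "var_activity_output",
        "value": {
            "value": "@array(activity('List of files and folders').output)",
            "type": "Expression"
        }
    }
}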
(2) Reading JSON files
Let's get more creative and read a real JSON file where you have more flexibility and control over its content.
Let's say I want to read the following JSON file with various lists of baking ingredients:
https://opensource.adobe.com/Spry/samples/data_region/JSONDataSetSample.html
[ { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, { "id": "1003", "type": "Blueberry" }, { "id": "1004", "type": "Devil's Food" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" }, { "id": "5007", "type": "Powdered Sugar" }, { "id": "5006", "type": "Chocolate with Sprinkles" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] }, { "id": "0002", "type": "donut", "name": "Raised", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] }, { "id": "0003", "type": "donut", "name": "Old Fashioned", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] } ]
The output of my ADF Lookup activity that references this file will correctly show me the 3 elements of this JSON array.
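With the "First row only" option unchecked, the Lookup output takes a count/value form, sketched here with the nested batters and topping elements trimmed for readability:

{
    "count": 3,
    "value": [
        { "id": "0001", "type": "donut", "name": "Cake", ... },
        { "id": "0002", "type": "donut", "name": "Raised", ... },
        { "id": "0003", "type": "donut", "name": "Old Fashioned", ... }
    ]
}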
The list of items for my "For Each" loop activity can then simply be set with the following expression:
@activity('Reading JSON file').output.value
Here is the interesting part: let's say I want to execute another ADF pipeline within my "For Each" loop activity and pass one baking recipe (or list of ingredients) as a parameter. Azure Data Factory is flexible enough, and I can accomplish this with the following expression:
@array(item())
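Putting this together, here is a minimal sketch of the ForEach with a nested Execute Pipeline activity; the sub-pipeline name pl_baking_sub and the activity names are hypothetical, while param_baking_payload matches the parameter used below:

{
    "name": "For Each baking recipe",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('Reading JSON file').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Execute baking pipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "pl_baking_sub",
                        "type": "PipelineReference"
                    },
                    "parameters": {
                        "param_baking_payload": {
                            "value": "@array(item())",
                            "type": "Expression"
                        }
                    }
                }
            }
        ]
    }
}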
My sub-pipeline accepts this array parameter value and does further JSON data element referencing:
I can save the incoming parameter value into the var_baking_payload variable with this expression:
@pipeline().parameters.param_baking_payload
The list of toppings (the var_toppings variable) can be set with this expression:
@variables('var_baking_payload')[0].topping
The list of batters (the var_batters variable) can be set with this expression:
@variables('var_baking_payload')[0].batters.batter
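For completeness, here is a sketch of how the sub-pipeline's parameter and variables could be declared; the Array types are my assumption based on the expressions above:

{
    "parameters": {
        "param_baking_payload": { "type": "Array" }
    },
    "variables": {
        "var_baking_payload": { "type": "Array" },
        "var_toppings": { "type": "Array" },
        "var_batters": { "type": "Array" }
    }
}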
The thing that got me the most excited while working with JSON file outputs in Data Factory pipelines was that I could still pass JSON arrays between different tasks/activities, or pass those arrays to another sub-pipeline as a parameter, and nested element referencing still worked!
If I'm the last one to learn this, I'm still excited :-)
Happy Data Adventures!
Comments:

Where do you store your JSON file?
Azure Storage Account
What if the JSON file has column names and row values separated like below? How can you transform this into a structure that ADF can map to a table-like structure?
{
"count" : 3
"name" : json_file
"columns" :
[
"column1"
"column2"
"column3"
],
"rows" :
[
[
"row1value1"
"row1value2"
"row1value3"
],
[
"row2value1"
"row2value2"
"row2value3"
],
[
"row3value1"
"row3value2"
"row3value3"
]
]
}
Hi Leroy, I think your JSON example is not valid in the first place. Try to validate it first, and then ADF or any other tool can accept it for processing.
Hello, great article. Do you know how to extract all file names into a new array from the Get Metadata folder activity? I know you can loop over it, but in my case this is not a viable solution because of the number of elements. Any help would be highly appreciated.
1) Loop through the output of your Get Metadata activity
2) Within your loop, use [Append Variable] to populate your array variable (see the sketch below)
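A sketch of that suggestion, with hypothetical activity and variable names:

{
    "name": "Loop over child items",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata').output.childItems",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Append file name",
                "type": "AppendVariable",
                "typeProperties": {
                    "variableName": "var_file_names",
                    "value": {
                        "value": "@item().name",
                        "type": "Expression"
                    }
                }
            }
        ]
    }
}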
Thanks for your feedback, I did try that, but my JSON file has 5000+ elements; it takes a long time to loop and append.
I thought about that too. If you don't care about the order of the file names in your array, then make your ForEach parallel (isSequential = False, and batchCount set to 50). I think [Append Variable] should still support parallel appending to itself.
DeleteHi what is the conditional statement you kept in reading json file lookup activity to read the inside data
The Lookup activity doesn't have any conditional logic, just a reference to the actual JSON file. The output of this activity (@activity('Reading JSON file').output.value) is then passed to the ForEach container.
Hi, great article. Can you tell me which conditional statement you used in reading the JSON file to get the count of 3?
The count of 3 didn't come from a conditional statement. It simply came out as a result of reading a JSON file using the Lookup activity, and this JSON file contained a JSON array with 3 elements. If the same JSON file had hundreds of array elements, that would be reflected in the Lookup activity output accordingly.