How I tested Parse data transformation in Azure Data Factory

(2021-Mar-15) While watching the first MIB (Men in Black) movie with my family, it was quite clear that Will Smith's character at first had no idea what the “galaxy is on Orion's belt” phrase really meant, {spoiler alert} only to understand later that a galaxy could be as small as a marble and still contain billions of stars.

A similar idea in a database, where one table column can contain a whole universe of another subset of data, can also expand your horizons and help you see the potential volume of structured (or less structured) information.

I remember back in the early 2000s, when the XML column type was introduced in SQL Server, many battles were fought to discourage such a complex data type within a table; one of the main arguments was the slow performance of reading/writing this data.

With the JSON data type (or actually just a more capable text column, which in SQL Server terms could be defined as VARCHAR(MAX)), I have no desire to start another argument; just try to imagine how much text data could be stored in a single column of up to 2 GB in size. It is just like a whole database in a single record, isn't this exciting :-)

Recently, the Microsoft Azure Data Factory (ADF) product team delivered a new Parse transformation that can be used for both complex JSON and delimited text data: https://docs.microsoft.com/en-us/azure/data-factory/data-flow-parse. So, before using it in a real project, I decided to give it a try and test it on a simple JSON data set, and that's why I named my post this way.

The original JSON sample data set can be found here: https://opensource.adobe.com/Spry/samples/data_region/JSONDataSetSample.html

I have already used it in my attempt to expand/flatten/explode a JSON dataset with the help of the Flatten data transformation; now it's time to see what I can do with the Parse capability in ADF: http://datanrg.blogspot.com/2020/03/transforming-json-to-csv-with-help-of.html.

Originally, if a source dataset is interpreted (or presented) to ADF as JSON data, then this dataset can easily be flattened into a table in two steps: (1) flatten the JSON Toppings first and then (2) flatten the more complex JSON Batters data element, while carrying along the other regular id, type, name, and ppu data attributes.

When the dataset of your data workload is not entirely JSON-formatted (e.g. a regular table with one or more JSON-formatted columns), this new Parse data transformation becomes very practical.

Looking at the same JSON data as a text column, we need a few additional steps: (1) parse this JSON data column, then (2) expand the Toppings with the Flatten transformation, (3) parse the JSON Topping and Batters, (4) flatten the JSON array of batters and then finally (5) parse the expanded JSON batter set of attributes. It looks easy; let's see how it works in reality.

(1) Parse JSON data column

Transformation settings

Just to prove that my JSON data set is not forcefully interpreted as JSON: it is simply presented as a text file with one String column that contains 3 rows of data.


Then in the Parse settings, I have to configure my output column in this way:
Column: json_data_column_parsed
Expression: json_data_column
Output column type: (id as string, type as string, name as string, ppu as float, topping as string[], batters as string)
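
For reference, here is a rough sketch of the data flow script behind this step (the stream names JsonAsTextSource and ParseJsonColumn are my own placeholders, and the documentForm value may need adjusting to match the Single document, Document per line, or Array of documents setting of your data):

  JsonAsTextSource parse(json_data_column_parsed = json_data_column ? (id as string,
        type as string,
        name as string,
        ppu as float,
        topping as string[],
        batters as string),
    format: 'json',
    documentForm: 'singleDocument') ~> ParseJsonColumn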

Output dataset

This results in the following definition of the output of this step. I can see 4 simple data columns along with the topping string array and the simple-looking, yet even more complex, batters attribute (I will review this later). A brief observation gives me hope that the Parse transformation works well.

Data Preview


(2) Flatten Toppings

Having high hopes for the Parse transformation, I still needed to expand the Topping [...] array value, and the Flatten transformation was my next step.

Please check the Microsoft documentation on how to use it, or my personal blog post on flattening the very same dataset.

Both the batters and json_topping columns remained text fields, but now json_topping has revealed its subset of JSON elements.

Transformation settings
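
Here is a rough sketch of what the script for this Flatten step might look like, assuming the previous step was named ParseJsonColumn and this one FlattenToppings (both names are mine); it unrolls the topping array into a new json_topping column while carrying the other columns along:

  ParseJsonColumn foldDown(unroll(json_data_column_parsed.topping),
    mapColumn(
        id = json_data_column_parsed.id,
        type = json_data_column_parsed.type,
        name = json_data_column_parsed.name,
        ppu = json_data_column_parsed.ppu,
        batters = json_data_column_parsed.batters,
        json_topping = json_data_column_parsed.topping
    ),
    skipDuplicateMapInputs: false,
    skipDuplicateMapOutputs: false) ~> FlattenToppings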

Output dataset

Data Preview


(3) Parse JSON Topping and Batters

Following the expansion (or flattening) of the json_topping, it was quite noticeable that I could use the same Parse transformation to explicitly extract the id and type of my JSON topping construction. Since the Parse transformation allows me to choose multiple columns, there was nothing wrong with adding the extraction of the batter JSON elements from their JSON batters parent as well.

Transformation settings

Column: topping_id_and_type
Expression: json_topping
Output column type: (id as string, type as string)

Column: batters_expand
Expression: batters
Output column type: (batter as string[])
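
Assuming multiple parsed columns can be listed in a single parse() call in the generated data flow script (my guess, based on how a single column is scripted), this step would look roughly like the sketch below; FlattenToppings and ParseToppingAndBatters are my own stream names:

  FlattenToppings parse(topping_id_and_type = json_topping ? (id as string,
        type as string),
    batters_expand = batters ? (batter as string[]),
    format: 'json',
    documentForm: 'singleDocument') ~> ParseToppingAndBatters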

Output dataset

Data Preview


(4) Flatten JSON Batter 

Let's work on our next transformation to go a bit deeper into the batters by flattening them :-) This provides me with more visibility into the id and type of my JSON batter construction, but it will still be a text field, and I will get to parse it in my next transformation.

Transformation settings
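
A sketch of this Flatten step's script, again with my own stream names (ParseToppingAndBatters, FlattenBatter); it unrolls the batter array into the batters_topping column that the next Parse step will use:

  ParseToppingAndBatters foldDown(unroll(batters_expand.batter),
    mapColumn(
        id,
        type,
        name,
        ppu,
        topping_id_and_type,
        batters_topping = batters_expand.batter
    ),
    skipDuplicateMapInputs: false,
    skipDuplicateMapOutputs: false) ~> FlattenBatter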

Output dataset

Data Preview


(5) Parse JSON Batter

Finally, nothing more to hide, let's extract the batter's attributes and transform them into separate columns. 

Transformation settings

Column: batter_id_and_type
Expression: batters_topping
Output column type: (id as string, type as string)
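
And a sketch of the final Parse step in the same fashion (FlattenBatter and ParseBatter are my placeholder names):

  FlattenBatter parse(batter_id_and_type = batters_topping ? (id as string,
        type as string),
    format: 'json',
    documentForm: 'singleDocument') ~> ParseBatter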

Output dataset

Data Preview


Summary (or things to remind me next time I work with Parse in ADF)

  1. The JSON setting (Single document, Document per line, Array of documents) at each Parse transformation is VERY important: it defines whether your data contains a single JSON document, multiple JSON documents on separate lines, or an array of documents. I've had multiple errors when this setting wasn't set correctly.
  2. Don't forget to give different column names at each transformation step (and this is applicable to both Parse & Flatten); I had several Spark error messages when identically named source and output columns collided with each other.
  3. It helps to do some data profiling before parsing; it will help you plan your transformation steps.
  4. Sometimes Parsing in ADF was a bit slower than I expected even on a small dataset.
  5. Will I use Parse in ADF? Yes: it works well with simple JSON elements. Will I use it with more complex JSON structures, like arrays? Maybe: I still need to do more testing to see how it works on a large dataset.

Generally speaking, I liked how the Parse transformation worked with my JSON sample dataset, and it can be used in a real data project as well. Let's now find out where the real JSON “galaxy” is hidden :-)

The data flow tested in this blog post can be found in my GitHub repository:

Comments

  1. Have you tried anything similar with XML? Not having any luck with parsing XML

     Reply: No, I haven't had a case to source XML data. Based on the documentation this would still be possible in Data Flows too: https://docs.microsoft.com/en-us/azure/data-factory/format-xml
