Transforming JSON to CSV with the help of Flatten task in Azure Data Factory

(2020-Mar-19) Recently, Microsoft introduced a new Flatten task to the existing set of powerful transformations available in the Azure Data Factory (ADF) Mapping Data Flows - https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten.

What this new task does is help you transform/transpose/flatten your JSON structure into a denormalized, flattened dataset that you can load into a new or existing flat database table.

2020-Mar-26 Update:
Part 2: Transforming JSON to CSV with the help of Flatten task in Azure Data Factory - Part 2 (Wrangling data flows)

I like the analogy of the Transpose function in Excel, which rotates a vertical set of data pairs (name : value) into a table with column names and values for the corresponding objects. And when this vertical JSON structure contains several similar sets (an array), the ADF Mapping Data Flows Flatten transformation does a really good job of transforming it into a table with several rows (records).


Let's use this JSON data file as an example:

{
 "id": "0001",
 "type": "donut",
 "name": "Cake",
 "ppu": 0.55,
 "batters":
  {
   "batter":
    [
     { "id": "1001", "type": "Regular" },
     { "id": "1002", "type": "Chocolate" },
     { "id": "1003", "type": "Blueberry" },
     { "id": "1004", "type": "Devil's Food" }
    ]
  },
 "topping":
  [
   { "id": "5001", "type": "None" },
   { "id": "5002", "type": "Glazed" },
   { "id": "5005", "type": "Sugar" },
   { "id": "5007", "type": "Powdered Sugar" },
   { "id": "5006", "type": "Chocolate with Sprinkles" },
   { "id": "5003", "type": "Chocolate" },
   { "id": "5004", "type": "Maple" }
  ]
}

and create a simple ADF mapping data flow to Flatten this JSON file into a CSV sink dataset.

At a high level, my data flow will have 4 components:
1) Source connection to my JSON data file
2) Flatten transformation to transpose my Cake to Toppings
3) Further Flatten transformation to transpose my Cake > Toppings to Batters
4) Sink to output the flattened result into a CSV file


(1) Source connection to my JSON data file
The connection to my JSON file is simple; however, it's interesting to see how the consumed JSON file appears in the Data Preview tab, which shows one row with several array objects.
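
For reference, here is roughly what the source definition behind this connection looks like in data flow script (a sketch only - the SourceJson stream name is my own, and the actual script and projection are generated by the ADF UI once the JSON schema is imported):

 source(output(
     id as string,
     type as string,
     name as string,
     ppu as double,
     batters as (batter as (id as string, type as string)[]),
     topping as (id as string, type as string)[]
   ),
   allowSchemaDrift: true,
   validateSchema: false) ~> SourceJson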




(2) Flatten transformation to transpose my Cake to Toppings
My next Flatten transformation task transposes the JSON topping array (7 objects with id and type attributes) into 7 flattened rows, where the JSON batter objects are still visible as an array within each row.
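
In data flow script terms, a Flatten transformation is a foldDown with an unroll over the array being flattened. A sketch of this step could look like the following (the toppingId/toppingType output names and the idea of passing the batter array through as a column are my own choices - your generated script may differ):

 SourceJson foldDown(unroll(topping),
   mapColumn(
     id,
     type,
     name,
     ppu,
     toppingId = topping.id,
     toppingType = topping.type,
     batter = batters.batter
   ),
   skipDuplicateMapInputs: false,
   skipDuplicateMapOutputs: false) ~> FlattenToppings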


(3) Further Flatten transformation to transpose my Cake > Toppings to Batters
All 7 records that came out of the previous Flatten transformation can now be used as input for the next Flatten transformation. This unrolls the batter array and adds 2 more fields from the batter JSON subset (id, type) to each output row.
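
A sketch of this second Flatten step, assuming the batter array was carried through from the previous step as a column named batter (again, the exact mappings depend on how the first Flatten was configured):

 FlattenToppings foldDown(unroll(batter),
   mapColumn(
     id,
     type,
     name,
     ppu,
     toppingId,
     toppingType,
     batterId = batter.id,
     batterType = batter.type
   ),
   skipDuplicateMapInputs: false,
   skipDuplicateMapOutputs: false) ~> FlattenBatters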


(4) Sink to output the flattened result into a CSV file
The Data Preview in my Sink doesn't differ from the output of the previous transformation, so I decided to challenge Data Factory and replaced my initial JSON file, which contained a single object, with another file containing several similar objects:
[
 {
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters":
   {
    "batter":
     [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" },
      { "id": "1004", "type": "Devil's Food" }
     ]
   },
  "topping":
   [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
   ]
 },
 {
  "id": "0002",
  "type": "donut",
  "name": "Raised",
  "ppu": 0.55,
  "batters":
   {
    "batter":
     [
      { "id": "1001", "type": "Regular" }
     ]
   },
  "topping":
   [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
   ]
 },
 {
  "id": "0003",
  "type": "donut",
  "name": "Old Fashioned",
  "ppu": 0.55,
  "batters":
   {
    "batter":
     [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" }
     ]
   },
  "topping":
   [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
   ]
 }
]

The source file Data Preview correctly showed me 3 rows. The outputs of the subsequent Flatten transformations grew accordingly, and my final output file contained all 41 expected records (7×4 + 5×1 + 4×2 topping/batter combinations across the three donuts)!
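
For completeness, the sink at the end of the chain is the simplest part. A minimal sketch of its data flow script, assuming a delimited text (CSV) dataset and the FlattenBatters stream name from my earlier sketches:

 FlattenBatters sink(allowSchemaDrift: true,
   validateSchema: false,
   skipDuplicateMapInputs: true,
   skipDuplicateMapOutputs: true) ~> SinkCsv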



Well done, Microsoft team!

I really like this visual way to transform (flatten) a source JSON stream into a CSV file.

Comments

  1. I am new to this and I cannot get past step 2 with the JSON file you provided. I had to manually add the topping.id and topping.type mappings in the Flatten settings, and in Data Preview it tells me those 2 columns are empty.

    1. I've saved the data flow from the blog post in my personal github repository: https://github.com/NrgFly/Azure-DataFactory/blob/master/Samples/dataflow/wdf_json_file.json

  2. Could you do a similar step-by-step demo to help convert a CSV file to a JSON structure with array objects?

    1. I don't have any plans to create such a demo; however, it would be an interesting case to work on.

  3. Can we do the same in a mapping data flow? Because when I try to do it, I am not successful.

    1. The whole purpose of this post was to show that it is possible in ADF Mapping Data Flows. Could you be more specific (code repository, file you are trying to process) about where you have not been successful?

  4. Question: is this data flow linked to a pipeline in order to execute the transformations?

    1. You create a data flow first, and then you create a pipeline in the same data factory to execute this data flow task.

    2. Rayis, thank you. I have another question: to read multiple JSON files and generate CSV files, do you need to use a ForEach activity, or is there a setting in the data flow?

    3. Technically, Mapping Data Flow supports a file mask (wildcard) definition for your source dataset, so multiple files with a similar structure, e.g. from the same folder, can be read at once.
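
      For example, a source in data flow script can use the wildcardPaths option to pick up all matching files from a folder at once (a rough sketch - the 'input/donuts/*.json' path and the SourceJsonFiles stream name are just placeholders for your own container, folder and naming):

      source(allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false,
        wildcardPaths:['input/donuts/*.json']) ~> SourceJsonFiles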

  5. The most grateful blog! Great site! It looks great! Continue in the same spirit!

  6. Suppose I have a text-delimited file containing three columns - EmpId, EmpName and Detail. The Detail column contains a JSON string. Let's suppose this is the JSON:

    {
    "id": "0001",
    "type": "Permanent",
    "name": "ABC"
    }

    Would it be possible to create an output flat file using a data flow in such a scenario? E.g. an output file containing the columns EmpId, EmpName, id, type, name.

    1. Yes, this should be possible. You can try using the Copy Data activity in the ADF control flow with a JSON dataset as the source and CSV as the sink destination, or use the mapping data flow method from this blog post.

    2. I am trying to use Flatten in a data flow as described in the above approach. I have created a text-delimited dataset with pipe (|) as the delimiter. This acts as the source in my scenario. As expected, my data source projection shows EmpId as int, EmpName as string and Detail as string.

      Now when I try to use Flatten, I am unable to choose any column in "Unroll by", as all my columns are in string format. Is there a workaround for it?

    3. I don't think you need the '|' character for your file, since a JSON file is not a delimited text file but a structure with different elements. Please look at the examples of JSON-formatted files supported in ADF and adjust your source file accordingly - https://docs.microsoft.com/en-us/azure/data-factory/format-json#json-file-patterns

    4. That's the problem I am facing: I am not receiving a JSON file but a text-delimited file containing three columns - EmpId, EmpName and Detail

      This is the data that I am receiving in the source file:

      EmpId|EmpName|Detail
      100|Emp1|{"id": "0001","type": "Permanent","name": "ABC"}

      Although the 'Detail' column contains JSON data, the dataset treats it as a string

    5. This is a good case. I thought that I could use a Wrangling Data Flow in ADF to expand the one JSON column, but I received the following error: "The Power Query Spark Runtime does not support the function Table.ExpandRecordColumn.". Let me think about how else to convert a text column into JSON and then expand it in ADF. You can always use a Databricks notebook, pass a filename as a parameter, and call it from your ADF pipeline.

    6. Thank you Rayis! If you find a solution to this issue using a data flow then please let me know, otherwise I will use Databricks.

    7. I've played a little in ADF mapping data flows and couldn't make one column be treated as a JSON element. I think it's a current limitation of ADF: it takes a dataset as either CSV or JSON, but not mixed. I've blogged about comparing Databricks with ADF Mapping Data Flows for this - http://datanrg.blogspot.com/2020/07/transforming-json-data-with-help-of.html. The other thing you can try in ADF data flows is to manually parse your 'Detail' column into separate text columns; this will require a lot of string functions and may only work easily if your JSON structure is not too complicated, as sketched below.
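
      As an illustration of that string-function approach (a sketch only, not a tested solution - SourceCsv, ParseDetail and the Detail* column names are placeholders), a Derived Column with regexExtract expressions could pull the individual values out of the 'Detail' string for a flat JSON like the sample above:

      SourceCsv derive(
        DetailId = regexExtract(Detail, '"id": *"([^"]*)"', 1),
        DetailType = regexExtract(Detail, '"type": *"([^"]*)"', 1),
        DetailName = regexExtract(Detail, '"name": *"([^"]*)"', 1)
      ) ~> ParseDetail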

  7. Hi! Great tutorial, this is exactly what I was looking for to continue a particular job I am working on. However, I am noticing this is not working if I read a folder with several .json files in it (more than 500k). Will this be the best approach for such a project?

    1. Try to test this approach using only one file first, then check if there are any performance issues with bigger files. If it still doesn't work, then I would suggest using Databricks for the file transformation.

  8. Hi Rayis, thanks for helping us with different scenarios and examples of reading and flattening JSON files. I have a huge JSON file with 2 GB of data. My pipeline works fine with small JSON files of the same pattern, but when I initiate the 2 GB file it fails reading at the source itself. Could you please help me with this? Thank you. Ravi

    1. This could be a good case for the Microsoft support team. You can explain the case where similar pipelines work well, but a similarly structured (larger) JSON file fails to be processed. Their technical support team can make further recommendations based on the error message you're getting. I haven't worked with such huge JSON files in Data Factory; I was thinking that a Spark notebook approach, where you have better control, would work better. But I will let the Microsoft team advise otherwise.

