(2020-July-29) There is a well-known and broadly advertised message from Microsoft
that Azure Data Factory (ADF) is a code-free environment to help you create
your data integration solutions - https://azure.microsoft.com/en-us/resources/videos/microsoft-azure-data-factory-code-free-cloud-data-integration-at-scale/. I agree with and support this approach of using a drag-and-drop visual UI to build
and automate data pipelines without writing code. However, out of curiosity, I'm
also interested in seeing whether I can recreate certain ADF operations by
writing code.
Previously I wrote a blog post about using the ADF Data Flow
Flatten operation to transform a JSON file - Part 1:
Transforming JSON to CSV with the help of Azure Data Factory - Mapping
Data Flows.
This time I would like to check and compare the Databricks code development
experience for flattening the very same source JSON file.
[ { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, { "id": "1003", "type": "Blueberry" }, { "id": "1004", "type": "Devil's Food" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" }, { "id": "5007", "type": "Powdered Sugar" }, { "id": "5006", "type": "Chocolate with Sprinkles" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] }, { "id": "0002", "type": "donut", "name": "Raised", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] }, { "id": "0003", "type": "donut", "name": "Old Fashioned", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] } ]
(1) Data connection
First, I want to compare Databricks' understanding of this JSON file with
that of the ADF Mapping Data Flow data connector.
ADF Mapping Data Flow provides the following file structure projection of
my JSON file:
This is how Databricks understood my JSON file, so both tools are in sync
on this point.
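That schema check can be sketched in a few lines of PySpark. This is a minimal sketch, not my exact notebook code: the file path /mnt/demo/donuts.json is a hypothetical placeholder, and spark is the SparkSession that Databricks notebooks create for you automatically.

    # Read the multi-line JSON file; the multiLine option is needed because
    # each record spans several lines rather than sitting on a single line.
    df_json = spark.read.option("multiLine", True).json("/mnt/demo/donuts.json")

    # Print the inferred schema to compare it with the ADF Mapping Data Flow projection.
    df_json.printSchema()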
(2) Flattening topping JSON array column
The very first use of the Flatten data transformation in my ADF data flow
expands the topping column:
Databricks' use of the explode Spark function provides similar
results:
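Here is a sketch of that step, continuing from the df_json DataFrame above. Column names follow the source JSON, and display is the Databricks notebook helper for rendering a DataFrame as a grid.

    from pyspark.sql.functions import col, explode

    # explode() emits one output row per element of the topping array,
    # mirroring what the ADF Flatten transformation does for this column.
    df_topping = df_json.select(
        col("id"), col("type"), col("name"), col("ppu"),
        col("batters"),
        explode(col("topping")).alias("topping")
    )

    display(df_topping)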
(3) Flattening batter JSON array column
The next use of the Flatten data transformation in my ADF data flow
expands the batter column:
The result looks no different in my next Spark DataFrame, built with the
same explode function:
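A sketch of that second step, under the same assumptions as before; the batter array is nested inside the batters struct, so the explode targets batters.batter, and a final select pulls the struct fields out into flat columns.

    # The batter array sits inside the batters struct, so explode targets batters.batter.
    df_flat = df_topping.select(
        col("id"), col("type"), col("name"), col("ppu"),
        explode(col("batters.batter")).alias("batter"),
        col("topping")
    ).select(
        col("id"), col("type"), col("name"), col("ppu"),
        col("batter.id").alias("batter_id"),
        col("batter.type").alias("batter_type"),
        col("topping.id").alias("topping_id"),
        col("topping.type").alias("topping_type")
    )

    display(df_flat)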
(4) Validating results in ADF Mapping Data Flow
My ADF data flow's final transformation output matched my
Databricks notebook DataFrame result: all necessary columns have
been flattened (or "exploded"), and the JSON data file schema has
been properly interpreted in both cases.
I'm just curious how different my appreciation for the ADF data
flow Flatten transformation would be if it were named Explode after
its Spark
explode
sibling :-), because the actual Spark flatten
function doesn't expand data structures into multiple rows; it
transforms an array of arrays into a single array.
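A tiny, self-contained sketch of that difference on a toy array-of-arrays column (the DataFrame and its values are made up purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, flatten

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, [[1, 2], [3, 4]])], ["id", "nested"])

    # flatten() keeps one row and merges the inner arrays: [1, 2, 3, 4]
    df.select(flatten("nested").alias("flattened")).show()

    # explode() emits one row per inner array: [1, 2] and [3, 4]
    df.select(explode("nested").alias("exploded")).show()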
So, my journey of data adventures continues :-)
An "Explode" transformation sounded too dangerous while "Flatten" is more innocuous! :)
ReplyDeleteYes, I agree, in some cases, I've used the "Expand" term to explain similar functionality of flattering JSON datasets. Thanks, Mark!
Delete