
Tuesday, November 27, 2018

Email Notifications in Azure Data Factory: Failure is not an option

(2018-Nov-27) When something goes wrong with your data transformation process in Azure Data Factory, the last thing you want is to lose control and visibility. You want to be notified right after the error occurs, or you need your technical support team to become aware and engage to resolve the issue in a timely manner. In this case, an email alert can be a very effective way of communication.



Currently, as I'm writing this blog post, Azure Data Factory doesn't provide out-of-the-box functionality to send emails. However, we can use a Web Activity task in ADF to call a custom Azure Logic App that delivers the email notification.

In writing this blog post, I give full credit to this resource - http://microsoft-bitools.blogspot.com/2018/03/add-email-notification-in-azure-data.html by Joost van Rossum from the Netherlands, which has step-by-step instructions to create ADF notification Logic Apps in Azure. I have just made a slight modification to this solution to support the functionality of the existing ADF pipeline from my previous blog post - http://datanrg.blogspot.com/2018/11/system-variables-in-azure-data-factory.html

In my current ADF pipeline, I want to implement email notification functionality in two possible cases:
Case A - Email Notification for a failed Copy Data task (explicitly on a Failed event)
Case B - Email Notification for a logged event with Failed status (since the logging information holds data for both Failed and Succeeded activities, I want to generate email notifications only for the Failed events at the very end of my pipeline).





So, first, I would need to create a Logic App in my Azure portal that would contain two steps:
1) Trigger: When an HTTP POST request is received 
2) Action: Send Email (Gmail) 


The Trigger would have this JSON body definition:

{
    "properties": {
        "DataFactoryName": {
            "type": "string"
        },
        "PipelineName": {
            "type": "string"
        },
        "ErrorMessage": {
            "type": "string"
        },
        "EmailTo": {
            "type": "string"
        }
    },
    "type": "object"
}

and the Gmail Send Email action would then use elements of this data structure.
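For reference, here is a rough sketch of how the Send Email action can consume the posted fields through Logic Apps dynamic content, using the triggerBody() expression. The exact property names of the Gmail connector inputs may differ from this assumption; only the expressions themselves come from the trigger schema above:

{
    "To": "@{triggerBody()?['EmailTo']}",
    "Subject": "ADF failure in @{triggerBody()?['DataFactoryName']}, pipeline @{triggerBody()?['PipelineName']}",
    "Body": "@{triggerBody()?['ErrorMessage']}"
}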




So, to support my "Case A - Email Notification or failed Copy Data task":
1) I open my pipeline "Load data into SQL" container and add & connect Web Task for a failed event from the "Copy Data Blob to SQL" activity task.

2) Then I copy the HTTP POST URL from the Logic App in the Azure portal and paste its value into the Web activity's URL property.
3) I add a new header with the following values:
     Name: Content-Type
     Value: application/json
4) And then I define the body of my POST request, filling in all the data elements for the JSON schema defined earlier in the Logic App (a consolidated sketch of the whole Web activity follows below):

{
    "DataFactoryName":
        "@{pipeline().DataFactory}",
    "PipelineName":
        "@{pipeline().Pipeline}",
    "ErrorMessage":
        "@{activity('Copy Data Blob to SQL').Error.message}",
    "EmailTo":
        "@pipeline().parameters.var_email_address"
}
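Put together, the Web activity in the pipeline JSON looks roughly like the sketch below. The activity name and the placeholder URL are my assumptions; everything else mirrors the settings above (POST method, Content-Type header, the Failed dependency condition and the request body):

{
    "name": "Send Failure Email",
    "type": "WebActivity",
    "dependsOn": [
        {
            "activity": "Copy Data Blob to SQL",
            "dependencyConditions": [ "Failed" ]
        }
    ],
    "typeProperties": {
        "url": "<HTTP POST URL copied from the Logic App trigger>",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "DataFactoryName": "@{pipeline().DataFactory}",
            "PipelineName": "@{pipeline().Pipeline}",
            "ErrorMessage": "@{activity('Copy Data Blob to SQL').Error.message}",
            "EmailTo": "@{pipeline().parameters.var_email_address}"
        }
    }
}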


To support my "Case B - Email Notification for logged even with Failed Status":
1) I open my pipeline "Logging to SQL" container and add another container to check if the logged event has a Failed status by the following conditional expression @equals(split(item(),'|')[12], 'Failed')

2) Then I add two activity tasks within this Condition activity to support email notifications:


3) Similarly to Case A, I copy the HTTP POST URL from the Logic App in the Azure portal, paste its value into the Web activity, and add the same header.
4) And then I define the body of my POST request, filling in all the data elements for the JSON schema defined earlier in the Logic App:

{
    "DataFactoryName":
        "@{pipeline().DataFactory}",
    "PipelineName":
        "@{pipeline().Pipeline}",
    "ErrorMessage":
        "@variables('var_activity_error_message')",
    "EmailTo":
        "@pipeline().parameters.var_email_address"
}

where "var_activity_error_message" variable is defined by this expression @split(item(),'|')[4] in the pipeline "Set Variable" activity task.

As a result, after running the pipeline, along with the expected results I can now receive an email notification if something fails (to test the failed event notification, I temporarily changed the name of the table that I use as the destination in my Copy Data task).




It's working!

Tuesday, November 20, 2018

System Variables in Azure Data Factory: Your Everyday Toolbox

(2018-Nov-20) After working with and testing the functionality of variables within Azure Data Factory pipelines, I realized it's worth exploring the existing system variables. They can basically be my toolbox to collect and store control flow metrics of my pipelines.



Looking at the official Microsoft resource System variables supported by Azure Data Factory, you're given a modest selection of system variables that you can analyze and use at both the pipeline and the pipeline trigger level. Currently, you have three ways to monitor Azure Data Factory: visually, with the help of Azure Monitor, or by using code to retrieve those metrics.

But here is a case of how I want to monitor the control flow of my pipeline in Azure Data Factory:



This is the same data ingestion pipeline from my previous blog post - Story of combining things together - that builds a list of files from Blob storage and then copies data from those files to a SQL database in Azure. My intention is to collect and store event information for all the completed tasks, such as Get Metadata and Copy Data.

Here is the current list of pipeline system variables at my disposal:
@pipeline().DataFactory - Name of the data factory the pipeline run is running within
@pipeline().Pipeline - Name of the pipeline
@pipeline().RunId - ID of the specific pipeline run
@pipeline().TriggerType - Type of the trigger that invoked the pipeline (Manual, Scheduler)
@pipeline().TriggerId - ID of the trigger that invokes the pipeline
@pipeline().TriggerName - Name of the trigger that invokes the pipeline
@pipeline().TriggerTime - Time when the trigger invoked the pipeline. The trigger time is the actual fired time, not the scheduled time.

And after digging a bit more and testing pipeline activities, I've discovered additional metrics that I can retrieve at the level of each individual task:
PipelineName, 
JobId, 
ActivityRunId, 
Status, 
StatusCode, 
Output, 
Error, 
ExecutionStartTime, 
ExecutionEndTime, 
ExecutionDetails, 
Duration

Here is my final pipeline in ADF that can populate all these metrics into my custom logging database table:



And this is how I made it work:

1) First, I created a dbo.adf_pipeline_log table in my SQL database in Azure:


2) Then I used an [Append Variable] activity task as the "On Completion" outcome of the "Get Metadata" activity, with the following expression to populate a new array-type var_logging variable:




var_logging = 
@concat('Metadata Store 01|Copy|'
,pipeline().DataFactory,'|'
,activity('Metadata Store 01').Duration,'|'
,activity('Metadata Store 01').Error,'|'
,activity('Metadata Store 01').ExecutionDetails,'|'
,activity('Metadata Store 01').ExecutionEndTime,'|'
,activity('Metadata Store 01').ExecutionStartTime,'|'
,activity('Metadata Store 01').JobId,'|'
,activity('Metadata Store 01').Output,'|'
,pipeline().Pipeline,'|'
,activity('Metadata Store 01').ActivityRunId,'|'
,activity('Metadata Store 01').Status,'|'
,activity('Metadata Store 01').StatusCode)

where each of the system variable values is concatenated and separated by the pipe character "|".
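In pipeline JSON terms, that Append Variable activity is a sketch along these lines; the activity name is my own, the "Completed" dependency condition reflects the "On Completion" arrow, and the value holds the full concat() expression shown above:

{
    "name": "Append Logging Metadata Store 01",
    "type": "AppendVariable",
    "dependsOn": [
        {
            "activity": "Metadata Store 01",
            "dependencyConditions": [ "Completed" ]
        }
    ],
    "typeProperties": {
        "variableName": "var_logging",
        "value": "<the concat() expression shown above>"
    }
}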

I did a similar thing to populate the very same var_logging variable in the ForEach container where the actual data copy operation occurs:



3) And then I used a final task to populate my dbo.adf_pipeline_log table with data from the var_logging variable by calling a stored procedure:


The whole trick is to split each text line of the var_logging variable into another array of values using the "|" character. Then, knowing the position of each individual system variable value, I can map them to the appropriate stored procedure parameters / columns in my logging table (e.g. @split(item(),'|')[0] for the ActivityTask).
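As a rough sketch, this final step is a ForEach over the var_logging array with a Stored Procedure activity inside; the stored procedure name, the parameter names and the linked service reference are my assumptions, while the items value and the parameter expressions follow the split logic described above:

{
    "name": "ForEach Log Record",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@variables('var_logging')",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Insert Log Record",
                "type": "SqlServerStoredProcedure",
                "linkedServiceName": {
                    "referenceName": "AzureSqlDatabase",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "storedProcedureName": "[dbo].[usp_adf_pipeline_log_insert]",
                    "storedProcedureParameters": {
                        "ActivityTask": { "value": "@{split(item(),'|')[0]}", "type": "String" },
                        "ErrorMessage": { "value": "@{split(item(),'|')[4]}", "type": "String" },
                        "Status": { "value": "@{split(item(),'|')[12]}", "type": "String" }
                    }
                }
            }
        ]
    }
}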




This provided me the flexibility to see both Completed and Failed activity runs (to test a failed activity, I had to temporarily rename the target table of my Copy Data task). I can now read this data and get additional insights from the SQL Server table.



Let me know what you think about this, and have a happy data adventure!

Monday, October 29, 2018

Append Variable activity in Azure Data Factory: Story of combining things together

(2018-Oct-29) There are only a few sentences on the official Microsoft web page that describe the newly introduced activity (Append Variable) used to add a value to an existing array variable defined in Azure Data Factory - Append Variable Activity in Azure Data Factory. But it significantly improves your ability to control the workflow of the data transformation activities in your Data Factory pipeline.



Suppose you have several folders in your Azure Blob storage container where daily product inventory data from different stores is saved. You need to transfer those files with daily product stock snapshots to a database.



A natural approach is to get a list of files from all sourcing folders and then load them all into your database.

Technically:
1) We can read metadata of our sourcing folders from the Blob storage
2) Then we can extract all the file names and save them in one queue object
3) And finally, use this file list queue to read and transfer data into a SQL database. 

Let's recreate this use case in our Azure Data Factory pipeline.

1) To get metadata of our sourcing folders, we need to select "Child Items" for the output of our [Get Metadata] activity task:



This provides a list of sub-folders and files inside the given folder, along with the name and type of each child item.
Here is an output of this task using a list of files from my first store:


2) Then I need to extract the file names. To make this happen, I pass the output of the [Get Metadata] activity task to my [ForEach] container,
where the Items parameter is set to @activity('Metadata Store 01').output.childitems.
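In pipeline JSON, these two activities can be sketched roughly as the fragment below; the dataset reference name is a placeholder of mine, while the fieldList and Items settings match what's described above, and the ForEach activities collection is where the Append Variable task from the next paragraph goes:

{
    "name": "Metadata Store 01",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "BlobStore01Folder",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems" ]
    }
},
{
    "name": "ForEach FileName",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "Metadata Store 01",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Metadata Store 01').output.childItems",
            "type": "Expression"
        },
        "activities": [ ]
    }
}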



And then within this loop container, I actually start using the new [Append Variable] activity task, and I specifically choose to extract only the names, not the types, from the previous task's output set. Please note that this activity can only be used with 'Array' type variables in your pipeline.



Then all I have to do is replicate this logic for the other sourcing folders and stream the list of file names into the very same variable.
In my case it's var_file_list. Don't forget to define all the necessary variables within your pipeline.
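The Append Variable task inside the loop is then a very small sketch like this (the activity name is my own; the value expression simply takes the name of the current child item):

{
    "name": "Append FileName",
    "type": "AppendVariable",
    "typeProperties": {
        "variableName": "var_file_list",
        "value": "@item().name"
    }
}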


3) Once all the file names are extracted into my array variable, I can use it as a queue for my data load task. My next [ForEach] loop container's Items parameter will be set to this value: @variables('var_file_list')


And the internal [Copy Data] activity task within this loop container will receive the file names through the individual items of my variable loop set [@item()].
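Roughly sketched in pipeline JSON, this second loop and its copy task could look like the fragment below; the dataset names, their fileName parameter and the source/sink types are my assumptions for illustration, while the items value and the @item() reference come from the steps above:

{
    "name": "ForEach File Load",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@variables('var_file_list')",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Copy File to SQL",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "BlobInventoryFile",
                        "type": "DatasetReference",
                        "parameters": { "fileName": "@item()" }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SqlInventoryTable",
                        "type": "DatasetReference"
                    }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}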


The list of files is appended from each sourcing folder, and then all the files are successfully loaded into my Azure SQL database. Just to check the final list of file names, I copied the content of my var_file_list variable into another testing variable, var_file_list_check, to validate its content.



Azure Data Factory allows more flexibility with this new [Append Variable] activity task, and I do recommend using it more and more in your data flow pipelines! :-)

Monday, October 15, 2018

Setting Variables in Azure Data Factory Pipelines

(2018-Oct-15) Working with Azure Data Factory, you always tend to compare its functionality with well-established ETL packages in SSIS. Data Flow tasks have been recreated as Copy Data activities; logical components have found their cloud-based siblings; and new kids on the block, such as Databricks and Machine Learning activities, could boost the adoption rate of Azure Data Factory (ADF) pipelines.

Support for local variables hasn't always been available in ADF and was only recently added alongside the already available pipeline parameters. This addition makes it more flexible to create interim properties (variables) that you can adjust multiple times within the workflow of your pipeline.



Here is my case-study to test this functionality.

I have a simple SQL Database with 2 tables that hold daily and monthly sales data, which I plan to load from a sample set of CSV data files in my Blob storage in Azure.



My new ADF pipeline has an event trigger that passes file path and file name values from newly created objects in my Blob storage container:


The logic then is to check the data feed type (Daily or Monthly) based on the file name and load the data to the corresponding table in the SQL Database in Azure.


And here is where the [Set Variable] activity comes in as a very handy tool to store a value based on a defined expression for my variable:
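The expression itself was built in the pipeline UI; purely as a hypothetical sketch, assuming a fileName pipeline parameter populated by the event trigger, the Set Variable activity for the FeedType variable could look something like this:

{
    "name": "Set FeedType",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "FeedType",
        "value": "@if(contains(toLower(pipeline().parameters.fileName), 'daily'), 'Daily', 'Monthly')"
    }
}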





Then I define two sub-tasks to copy data from those flat files into corresponding tables based on the value of the FeedType variable:
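One hypothetical way to wire those two sub-tasks is an If Condition on the FeedType variable (the original pipeline may use a different arrangement); the daily Copy Data task would go into ifTrueActivities and the monthly one into ifFalseActivities:

{
    "name": "Check FeedType",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@equals(variables('FeedType'), 'Daily')",
            "type": "Expression"
        },
        "ifTrueActivities": [ ],
        "ifFalseActivities": [ ]
    }
}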



And once this is all done, I place two different data files into my Blob storage container, and the ADF pipeline trigger successfully executes the same pipeline twice to load data into the two separate tables:






Which I can further check and validate in my Azure SQL Database:


My [Set Variable] activity has been tested successfully! 
And it's one more point toward using ADF pipelines more often.

Thursday, August 30, 2018

Maps in Power BI. Do I really need them?

(2018-Aug-30) While preparing for a recent Power BI Toronto meetup session, I found a very valuable whitepaper on Maps in Power BI written by David Eldersveld from the BlueGranite consulting company, and I used some of the ideas and examples from that document in my presentation.

There is one idea that I couldn't agree with more, which can be traced in many of my own Power BI report developments (I admit) and in reports developed by others. We've integrated external source systems, created key metrics in our data model, identified that some of the data elements in our datasets could be categorized as geo locations, and... then we want to start using map visualizations right away. But do I really need them?

If my dataset contains customer location info or a list of countries where all the tweet messages from a recent marketing campaign came from, then I'm tempted to visualize them. Let's take the example of the following report with the Canadian population by province. What would be the best way to analyze the population itself: a table, a map, a bar or pie chart, or a combination of all of them?





I could easily sort the table by the Population column or look at another non-map visualization (a Donut chart, for example, developed by ZoomCharts) to realize that Quebec is the second largest province by population.




Or I could just select the least populated provinces and territories to find out that their ratio vs. the other populated areas is less than 1%.



So I wouldn't suggest using a map visualization in Power BI just because my data allows it. Use other, more effective tools or visual components to present your data; combine them with maps; and, most of all, create a story with your visualizations that will help your audience understand it better.

It only takes learning map visualizations to understand that sometimes something else could be used instead :-)

Sunday, July 29, 2018

ArcGIS Maps for Power BI: Plus subscription (2nd part); What's in it for me?

(2018-July-29) My initial attempt to explore and test the paid version of the ArcGIS Map continued after receiving a response from the ESRI support team. Initially, I couldn't see all the additional features of the Plus subscription, and I wrote about that in my last blog post - ArcGIS Maps for Power BI: Free vs. Paid version



But then I received a response from the Canadian ESRI tech support team on my request to look into the issue of non-connectivity to Plus subscription content in Power BI. Basically, it said that everything was OK with the map itself and that I needed to check further with the Microsoft Power BI support team.



Well, I was a bit disappointed with this outcome; in addition, I received a few comments from other people about their unsuccessful attempts to resolve the very same issue with the ESRI support team.

The next day I received two other messages: one with a confirmation that my ArcGIS Map Plus subscription had been activated, and another email from a business analyst on the ESRI team in the US asking whether I still had any issues connecting to my account and offering additional help if I needed it. This restored my faith in people and their willingness to help others :-) Plus, after quickly checking my original Power BI report, I was able to connect to the Plus subscription of the ArcGIS Map, yes!





So, this is my story of checking all the new features available in the paid version of the ArcGIS Map in Power BI.


Basemaps

First, basemaps add a contextual background to your map geo points. And with the Plus subscription you have access to 8 new basemaps, identified by a Plus badge on the screen:



New basemaps are:

- Imagery
- Imagery with Labels
- National Geographic
- Oceans
- Terrain with Labels
- Topographics
- USA Topo Labels
- USGS National Map (USA)

Source: http://doc.arcgis.com/en/maps-for-powerbi/design/change-the-basemap.htm

And my favorite basemap among those eight is National Geographic; it's hard to tell why. Probably because it combines the technical geocoding details of any standard map while, at the same time, bearing that graphical look of past historical sea voyages and future endeavors to explore the oceans, and I just like it :-)


Reference Layer

With the Plus subscription you have access to the Esri Living Atlas, which can enrich your map with other graphical layers on top of your existing basemap. The search option allows you to examine the available layers that you can use. Currently, most of the layers are US-based data, but I found a North America Ecoregions layer that showed Canada in so many different colors. Just take a look!



Source: http://doc.arcgis.com/en/maps-for-powerbi/design/add-a-reference-layer.htm


Infographics



Infographics are just "floating" data cards that you can add to your map with some statistical data (population, age distribution, income levels, etc.) to support either your whole map visualization or a part of it.



The beauty of those infographics data cards is that their content updates depending on what you're doing with your ArcGIS map. If you select one or more features on the map, the cards will show demographic information for an area around each selected location. See how the Toronto infographics data changes when I zoom my view in to the core downtown area.





Source: http://doc.arcgis.com/en/maps-for-powerbi/design/add-infographics.htm

I won't be making any conclusions at the end. If you prefer using MapBox in Power BI, it's your choice; if you want to give the ArcGIS Plus subscription or another map visualization in Power BI a try, go for it!

For me, this version of the ArcGIS Map also reminded me that I needed to purchase a National Geographic Magazine subscription for my daughter :-)

Happy data adventures!

Wednesday, July 25, 2018

ArcGIS Maps for Power BI: Free vs. Paid version

(2018-July-25) I like the ESRI company slogan, "The Science of Where". Hopefully, it will help someone not to get lost in the myriads of geo points on their maps. 

Working with ArcGIS map visualization in Power BI, you always hit that notification point that tells you about more features available in the non-free version of the ArcGIS map, so I wanted to explore this.



I pulled an open dataset of TTC Routes and Schedules, with the route definitions, stop patterns, stop locations, and schedules of the Toronto TTC ground buses and streetcars, into my Power BI data model.



With the following table count statistics:
Stops - 10,614 records
Trips - 134,949 records
Routes - 200 records
Stop Times - 5,584,011 records

And this data model helped me to plot all (almost all) stops' geo locations in my ArcGIS map:



My next step was to sign up for the Plus subscription (the paid version of the ArcGIS map) in Power BI. The registration went smoothly, my credit card information was collected, no warning signs. However, when I tried to sign in again and authenticate myself with my registration email address, I received an error message that something went wrong with ArcGIS Maps for Power BI, with apologies for the inconvenience and a request to refresh the page or check back later. I contacted the Canadian ESRI tech support but still haven't received any updates on this issue.

So, I was left with the option of trusting the ESRI documentation and checking the limitations of the free version of the ArcGIS Map in Power BI. If you read the documentation, you basically get an idea of "more": there are a couple of additional new features, but the Plus version of the map has all the standard features with more capacity:


Basemaps
- Included with Power BI: 4 basic basemaps
- With a Plus subscription: 4 basic basemaps and 8 others, including satellite imagery

Geocoding
- Included with Power BI: 1,500 features per map; 100,000 features per month
- With a Plus subscription: 5,000 features per map; 1 million features per month

Reference layers
- Included with Power BI: 10 reference layers that contain U.S. demographics; publicly shared feature layers on ArcGIS
- With a Plus subscription: access to Esri Living Atlas maps and layers (feature services); publicly shared feature layers on ArcGIS

Infographics
- Included with Power BI: curated gallery of U.S. demographics variables (7 categories)
- With a Plus subscription: full access to the ArcGIS GeoEnrichment data browser, including U.S. and global demographics variables
Source: https://doc.arcgis.com/en/maps-for-powerbi/get-started/about-plus.htm

More basemaps, more geo coordinates to map, more reference layers. The good thing about the reference layers is that you get exposed to the same set of publicly shared feature layers in both the free and paid versions of the map, which is a cool thing! There is a search field in the Edit mode of the map that allows you to select and view one of those public layers within your map.

So now, knowing what my limits are, I switched my map theme to clustering and realized that it also has another limit that is not listed on the official ESRI site: the clustering feature in my case was limited to 10,000 geo points vs. the actual dataset of 10,614 stop locations. You can check those numbers if you like.




I still hope that the ESRI tech support team will get back to me on my inquiry; otherwise, I will need to find a way to cancel my Plus subscription.

Happy data adventures!


(2018-July-29) Update: the ESRI support team finally reached out to me and helped to resolve the issue, please read my 2nd part of this story: http://datanrg.blogspot.com/2018/07/arcgis-maps-for-power-bi-plus.html