Tuesday, August 8, 2017

Data modeling in Power BI using XML file with Parent and Children nodes

[2017-Aug-08] Support for XML-structured files as a data source first appeared in Excel, was then built into the Power Query data-extraction engine of that same Office application, and later migrated into the Power BI data modeling tool.

My recent experience working with XML data files at the last Toronto Open Data user group meeting - https://www.meetup.com/opentoronto/events/240911274/ - challenged me to look at XML files from a different angle.

A request was made to analyze the Toronto Lobbyist Registry open data set, a very common scenario in which an XML data file holds parent (or master) node elements along with a subset of children (or nested) nodes.

Here is an example of this Toronto Lobbyist Registry open XML data file:


And here is how this file looked in the Power BI Query Editor:


Basically, the ROW parent node had the following list of elements:
- SMNumber
- Status
- Type
- SubjectMatter
- Particulars
- InitialApprovalDate
- EffectiveDate
- ProposedStartDate
- ProposedEndDate

and several sub-nodes, which are shown in the Power BI Query Editor as tables:
- Registrant
- Communications
- etc.

My challenge was to build a data model that would enable analysis of the main dataset elements along with their subset data elements. It wasn't that intuitive, but the exercise eventually proved that Power BI is a very decent data modeling environment.

So, first, I made my parent node the base table, with SMNumber as its Primary Key (PK).
Then, for each of the sub-node datasets:
1) I created a replica of the base table; let's take the Communications table as an example.
2) I removed every column except for the column holding the nested Communications table data, and I also kept SMNumber, which now became a Foreign Key (FK) to my base table's PK.


3) I then expanded the Communications table columns and gave a proper name to the new table within my data model, which now showed all the evidence of a potential one-to-many relationship between the base table and this new table (a minimal Power Query M sketch of steps 2-3 follows below):
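Here is that sketch as a minimal Power Query M query, assuming the duplicated base query is named LobbyistRegistry and that the inner column name ("Communication") is only a placeholder to be matched with the actual nested node elements in your file:

let
    // start from a replica of the base (parent) query
    Source = LobbyistRegistry,
    // step 2: keep only the key column and the nested Communications table
    KeyAndChild = Table.SelectColumns(Source, {"SMNumber", "Communications"}),
    // step 3: expand the nested table into its own columns (the inner column name is a placeholder)
    Communications = Table.ExpandTableColumn(KeyAndChild, "Communications", {"Communication"})
in
    Communications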


After repeating steps 1-2-3 for the other sub-nodes of my XML-sourced main table, I was then able to create a stable data model to accommodate my future reporting visualizations and data discovery:


A single XML file with parent and children nodes had been transformed into multiple tables, and Power BI took care of empty sub-node elements (please see the null value from the 2nd step of my child node transformation). In addition to that, referential integrity between the tables was in place, and I could start working on my visualizations.


I understand that the XML data source is just one small member of the family of data sources supported by Power BI, and new and more complicated sources keep being added to stream your data analytics from. However, it's just more proof of how many blades this Power BI Swiss Army knife has :)

Happy data adventures!

Monday, July 24, 2017

Calculating Travel Distance/Time in Power BI using Bing Map API

[2017-July-24] Map visualization in Power BI has drawn many data professionals into the field of creating visual data stories with maps. It only takes a dataset of geo-located data points to do this, and everything else is handled by the map visual controls plus one's imagination. You have a choice of native maps in Power BI, shape maps, filled maps, flow maps, etc. Here is a link to a recent David Eldersveld article that gives a glimpse into those maps and what you can do with them: 10 Ways to Create Maps in Microsoft Power BI.

However, there is no built-in functionality yet in Power BI that would calculate travel distance and time between geo points. Let's say a business owner wants to know how many customers they could deliver their product to within 5 minutes, 10 minutes, and so on.

To solve this problem, we can use the Bing Maps API to geocode addresses, calculate routes, get information about traffic incidents and issues in a specified area, and do many more things: Bing Maps REST Services.

I was inspired by Dustin Ryan's article (https://sqldusty.com/2016/04/26/power-bi-and-the-bing-maps-api/) where he explored the geo location API using textual address information. I went a bit further and used the http://dev.virtualearth.net/REST/V1/Routes/Driving? API to calculate a route between two locations; fortunately, it can provide travel time as well.

Here is the XML output from a simple web service request to calculate a route between Boston International Airport (1 Harborside Dr, Boston, MA 02128, USA) and MIT (77 Massachusetts Ave, Cambridge, MA 02139, USA):



I can then replicate this request as a web-based data source in my Power BI data model to return the Travel Distance and Duration data elements:
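A minimal sketch of what that web-based source could look like in Power Query M (BingMapsKey is a placeholder for your own key, and the two waypoint addresses are just the example above):

let
    // call the Bing Maps Routes API directly for the two example addresses
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Routes/Driving?o=xml&wp.0=1 Harborside Dr, Boston, MA 02128, USA&wp.1=77 Massachusetts Ave, Cambridge, MA 02139, USA&avoid=minimizeTolls&key=BingMapsKey"))
in
    Source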



In addition to this, I can create a function in Power BI with two parameters, the start and destination addresses, and then use it along with my dataset:



let 
    Route = (StartAddress as text, DestinationAddress as text) =>
let
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Routes/Driving?o=xml&wp.0="&StartAddress&"&wp.1="&DestinationAddress&"&avoid=minimizeTolls&key=BingMapsKey")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Copyright", type text}, {"BrandLogoUri", type text}, {"StatusCode", Int64.Type}, {"StatusDescription", type text}, {"AuthenticationResultCode", type text}, {"TraceId", type text}}),
    ResourceSets = #"Changed Type"{0}[ResourceSets],
    ResourceSet = ResourceSets{0}[ResourceSet],
    #"Changed Type1" = Table.TransformColumnTypes(ResourceSet,{{"EstimatedTotal", Int64.Type}}),
    Resources = #"Changed Type1"{0}[Resources],
    Route = Resources{0}[Route],
    #"Changed Type2" = Table.TransformColumnTypes(Route,{{"Id", type text}, {"DistanceUnit", type text}, {"DurationUnit", type text}, {"TravelDistance", type number}, {"TravelDuration", Int64.Type}, {"TravelDurationTraffic", Int64.Type}, {"TrafficDataUsed", type text}, {"TrafficCongestion", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"Id", "BoundingBox", "DistanceUnit", "DurationUnit", "TrafficDataUsed", "TrafficCongestion", "RouteLeg"})
in
    #"Removed Columns"
in 
    Route


With this function, I can then calculate travel metrics from the Boston Airport to 9 other Geo locations using the following dataset:



And then I can get my Travel Distance/Time results in Power BI by calling this new function for each of the dataset's rows:
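Here is a minimal sketch of how that per-row call could look in Power Query M, assuming the function above is saved as a query named fnTravelRoute and the dataset is a query named Locations with StartAddress and DestinationAddress columns (these query and column names are my assumptions):

let
    // Locations holds the start and destination addresses (query and column names are assumptions)
    Source = Locations,
    // invoke the route function once per row; it returns a small table of travel metrics
    AddRoute = Table.AddColumn(Source, "Route", each fnTravelRoute([StartAddress], [DestinationAddress])),
    // expand the returned metrics into regular columns
    Expanded = Table.ExpandTableColumn(AddRoute, "Route", {"TravelDistance", "TravelDuration"})
in
    Expanded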




With those calculation results in hand, I can then place them on my map visualization and use the highlighted [Time Duration] slicer to control the visibility of location points depending on their travel distance from the Boston International Airport.




I have also created two other M functions in Power BI to help with geocoding and address formatting. Feel free to use them in your geo data discovery, but please make sure to update them with your own Bing Maps key, which you will get after registering to use the Bing Maps resource.

The fnAddressLocate function takes a textual address as a parameter and returns the Latitude and Longitude coordinates for this address location:
let 
    AddressToLocate = (address as text) =>
let
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Locations/"&address&"?o=xml&key=BingMapsKey")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Copyright", type text}, {"BrandLogoUri", type text}, {"StatusCode", Int64.Type}, {"StatusDescription", type text}, {"AuthenticationResultCode", type text}, {"TraceId", type text}}),
    ResourceSets = #"Changed Type"{0}[ResourceSets],
    ResourceSet = ResourceSets{0}[ResourceSet],
    #"Changed Type1" = Table.TransformColumnTypes(ResourceSet,{{"EstimatedTotal", Int64.Type}}),
    Resources = #"Changed Type1"{0}[Resources],
    Location = Resources{0}[Location],
    #"Changed Type2" = Table.TransformColumnTypes(Location,{{"Name", type text}, {"EntityType", type text}, {"Confidence", type text}, {"MatchCode", type text}}),
    Point = #"Changed Type2"{0}[Point]
in
    Point
in 
    AddressToLocate
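
A quick usage sketch for this function, assuming the coordinate columns of the returned Point table come through as Latitude and Longitude, as described above (the address string is just an example):

let
    // geocode a single example address
    Point = fnAddressLocate("77 Massachusetts Ave, Cambridge, MA 02139, USA"),
    // convert the returned coordinate columns from text to numbers
    Coordinates = Table.TransformColumnTypes(Point, {{"Latitude", type number}, {"Longitude", type number}})
in
    Coordinates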

The fnFormatAddress function can help you cleanse and format an address for better geo location.
For example, it can change the address line "1 Harborse D, Bostn, 02128" into this formatted/corrected version: "1 Harborside Dr, Boston, MA 02128".
let 
    AddressToFormat = (address as text) =>
let
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Locations/"&address&"?o=xml&key=BingMapsKey")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Copyright", type text}, {"BrandLogoUri", type text}, {"StatusCode", Int64.Type}, {"StatusDescription", type text}, {"AuthenticationResultCode", type text}, {"TraceId", type text}}),
    ResourceSets = #"Changed Type"{0}[ResourceSets],
    ResourceSet = ResourceSets{0}[ResourceSet],
    #"Changed Type1" = Table.TransformColumnTypes(ResourceSet,{{"EstimatedTotal", Int64.Type}}),
    Resources = #"Changed Type1"{0}[Resources],
    Location = Resources{0}[Location],
    #"Changed Type2" = Table.TransformColumnTypes(Location,{{"Name", type text}, {"EntityType", type text}, {"Confidence", type text}, {"MatchCode", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"Point", "BoundingBox", "EntityType", "Address", "Confidence", "MatchCode", "GeocodePoint"})
in
    #"Removed Columns"
in 
    AddressToFormat

Overall, I really liked enriching my Power BI data model with web requests to the Bing Maps API. The only thing I didn't like was the response time. On a set of 10 records, the refresh took approximately 5 seconds; however, I've noticed that on larger datasets it can take more time per record to call and process the web request results.

What can be done to speed up this recalculation process may be a topic for another blog post. In the meantime, feel free to comment if you like the described approach to Travel Distance/Time calculation in Power BI using the Bing Maps API.

Sunday, June 25, 2017

Geo Location of Twitter messages in Power BI

[2017-June-25] There are several custom solutions that make Twitter message analysis possible in Power BI. And with the recent addition of the Campaign/Brand Management for Twitter solution template for Power BI (https://powerbi.microsoft.com/en-us/solution-templates/brand-management-twitter/), it's getting even less complicated to get into this mainstream social media data exploration.

Here is the architecture of this solution template, where your Microsoft Azure subscription and Twitter account will provide you with access to analytical data based on Twitter handles and hashtags of your choice.


- A Logic App pulls data from Twitter
- The Logic App passes each tweet into an Azure Function
- The Azure Function enriches the tweet and writes it to Azure SQL
- The Azure Function also calls an Azure ML experiment running a Python script to work out the sentiment of the tweet
- Power BI imports the data from Azure SQL and renders pre-defined reports


You can find more details on how to customize this solution template here: Brand & Campaign Management Solution Template Documentation

However, on my recent customer Power BI project implementation with this template, a request was made to add a filter for Twitter messages based on their geo locations; the customer wanted to see tweets only from the Canadian Twitter audience. I then realized that this information was not easy to obtain, because it's optional for Twitter users to populate it, and even updating their profile location is optional (Twitter Geo Guidelines). But I still decided to try parsing the user profile location in order to separate tweets from Canadian users and users from other countries.

For this blog post I used messages from the @canada150th Twitter account as well as messages from all other accounts that contained the #canada150 hashtag.

Here is my final report:




And here is how I was able to add Twitter user profile location data to my report.

1) I added a new location column to the table that holds all the processed tweet messages in my Azure SQL database:
alter table [pbist_twitter].[tweets_processed] add location nvarchar(100);


2) I made a few changes to the code of the Azure Function (TweetFunctionCSharp.cs) that processes tweet messages, first adding a new "location" key to the dictionary of processed tweet fields:

    private Dictionary<string, string> processedTweets = new Dictionary<string, string>()
        {
            {"tweetid", null},
            {"masterid", null},
            {"image_url", null},
            {"dateorig", null},
            {"authorimage_url", null},
            {"username", null},
            {"hourofdate", null},
            {"minuteofdate", null},
            {"direction", "Text"},
            {"favorited", "1"},
            {"retweet", "False"},
            {"user_followers", null},
            {"user_friends", null},
            {"user_favorites", null},
            {"user_totaltweets", null},
            {"location", null}
        };


and then populated the new entry from the tweet author's profile details:

processedTweets["location"] = tweet.UserDetails.Location;


3) I created a function in the Azure SQL database that tries to parse the text in the user profile location column and check whether that profile belongs to Canada:

CREATE FUNCTION [dbo].[ufnCheckTwitterProfileCountry]
(
    @location nvarchar(100)
)
RETURNS nvarchar(30)
AS
BEGIN
        -- Declare the return variable here
    DECLARE @Country nvarchar(30)
 SELECT @Country = 'Non-Canada'

 DECLARE @TempTable as TABLE (location nvarchar(100));
 INSERT INTO @TempTable (location)
 select ' AB' as Location
 union select ' BC'
 union select ' MB'
 union select ' NB'
 union select ' NL'
 union select ' NS'
 union select ' NT'
 union select ' NU'
 union select ' ON'
 union select ' PE'
 union select ' QC'
 union select ' SK'
 union select ' YT'
 union select ',AB'
 union select ',BC'
 union select ',MB'
 union select ',NB'
 union select ',NL'
 union select ',NS'
 union select ',NT'
 union select ',NU'
 union select ',ON'
 union select ',PE'
 union select ',QC'
 union select ',SK'
 union select ',YT'
 union select 'Alberta'
 union select 'British Columbia'
 union select 'Manitoba'
 union select 'New Brunswick'
 union select 'Newfoundland and Labrador'
 union select 'Nova Scotia'
 union select 'Northwest Territories'
 union select 'Nunavut'
 union select 'Ontario'
 union select 'Prince Edward Island'
 union select 'Quebec'
 union select 'Québec'
 union select 'Saskatchewan'
 union select 'Yukon'
 union select 'Canada'

    -- Add the T-SQL statements to compute the return value here
    SELECT @Country = 'Canada'
 WHERE 
 (
 SELECT SUM(CHARINDEX(t.location, @location)) 
 from @TempTable t
 ) > 0;
 
    -- Return the result of the function
 RETURN @Country
END
GO


4) I updated the SQL view code to bring in a Country value that is used in my Power BI report:

CREATE VIEW [pbist_twitter].[vw_tweets_processed]
AS
    SELECT tweetid    AS [Tweet Id],
           dateorig    AS [Original Date],
           Convert(date,[dateorig])  AS [Date],
           hourofdate    AS [Hours],
           minuteofdate    AS [Minutes],
           latitude    AS [Latitude],
           longitude    AS [Longitude],
           masterid    AS [Master Id],
           retweet    AS [Retweet],
           username    AS [Username],
           usernumber    AS [User Number],
           image_url    AS [Image URL],
           authorimage_url   AS [Author Image URL],
           direction    AS [Direction],
           favorited    AS [Favorited],
           user_followers   AS [User Followers],
           user_friends    AS [User Friends],
           user_favorites   As [User Favourites],
           user_totaltweets   AS [User Total Tweets],
           location    AS [location],
           dbo.ufnCheckTwitterProfileCountry(location) AS [Country]
    FROM   pbist_twitter.tweets_processed;
GO


5) And now, hoping that all Twitter users in my solution have their profile location populated, and also hoping that this free-text location field is correctly parsed by my SQL function, I can add a Country filter to the Power BI report, which you can see at the very beginning of this blog post across all of its 6 pages.

Problem solved! Feel free to comment if you find this approach to adding location details to your Power BI Twitter template solution helpful!

Happy 150th Canada Day and happy data adventures!

Monday, May 8, 2017

Power BI with Narrative Science: Look Who's Talking (Part 3: On or Off premises)

[2017-May-08] Listening to the car radio with its many different talk shows, it occurs to me more and more often that certain segments of those radio stations' broadcasting will eventually be substituted/automated by machines. I don't think it will happen overnight, but voice natural language generation software is a reality, and it's very near! Weather and road traffic news could be the first to be replaced: the data is already available, and AI can consume it with different levels of verbosity and variety, and thus it will speak like a human.


Natural language generation software already writes reports and creates communication based on data; several companies develop AI products that are able to consume user data and automatically explain the meaning of this data using natural human language. I've written two blog posts about the Narrative Science extension for Power BI:
- Power BI with Narrative Science: Look Who's Talking? 
- Power BI with Narrative Science: Look Who's Talking (Part 2: Predicting the Future) 

This time I want to briefly write about the option of using the Narrative Science extension for Power BI on-premises. This will be very helpful for organizations that work with sensitive datasets and expect to keep their data secure within their own network environment.

This is how the free Narratives for Power BI extension works:



After you place the Narrative Science extension on your Power BI report and assign a specific dataset, this data is transmitted to the Narrative Science platform over the internet, where it gets processed and narratives are generated; these are then transferred back to the Power BI report in the form of natural human communication. Please read here how Narrative Science protects your data in this case: How does Narrative Science protect client data?


There is a second option, in which the Narrative Generation Application and the Licensing Agent are hosted on separate servers within your own environment.
No client-specific data leaves your network, but internet connectivity is still required for licensing:




Here is an extract from the Narrative Science hardware/software requirements for both the Narrative Generation Application and the Licensing Agent (https://powerbi-support.narrativescience.com/hc/en-us/articles/215191443-On-Premises-Installation-Prerequisites-):

Hardware/Server Recommendations:
Linux Server (Examples: Virtual machine, Azure instance) with the minimum system specifications:
1.    Number of Cores – 2
2.    Memory - 2GB per core
3.    Disk - 20GB
If you are installing the Narrative Science licensing agent on a separate server from the narrative generation application, the minimum system specifications for the agent server would be:
1.    Number of Cores – 1
2.    Memory - 2GB
3.    Disk - 10GB
Software Requirements: Operating System - Linux CentOS 7 or RHEL 7 (with systemd)


After setting up both servers, you will need to activate your Narratives for Power BI license.



There is one drawback with the licensing, though. The number of Narrative Science licenses purchased must equal the number of Power BI Pro licenses that you have within your organization, regardless of the actual number of users who plan on working with the Narratives. Perhaps this will change in the future, especially with the recent changes to Microsoft Power BI licensing itself.

You can always check the Narrative Science web site (https://www.narrativescience.com/) and contact them directly for more information on specifications and pricing details.

Happy data adventures!

Sunday, April 30, 2017

DAX calculations with hierarchies: Set the order straight.

[2017-Apr-30] If you have ever tried to create calculated measures in SSAS Tabular or Power BI models where different levels of your model hierarchy had to be considered, then you'll understand that the DAX language doesn't have a straightforward way to identify parent and child levels in order to compute the necessary metric results. There is a good article, http://www.sqlbi.com/articles/clever-hierarchy-handling-in-dax/, by Alberto Ferrari where he explains a workaround approach to handling hierarchies in DAX.

I went one step further, based on my recent client project experience, and created a new example with one extra level for the Product dimension, using the Adventure Works Internet Sales tabular database. Here is the code to calculate the Sales Ratio to Parent level across all levels (Category, Sub Category, Model and Product Name):

RatioToParent:=
IF (
    ISFILTERED ( Product[HCategory] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) ),
    IF (
        ISFILTERED ( Product[HSubcategory] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HSubcategory] ) ),
        IF (
            ISFILTERED ( Product[HModel] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HModel] )
                ),
            IF (
                ISFILTERED ( Product[HProduct] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) )
            )
        )
    )
)

However, when I looked at the results for this new RatioToParent measure in Excel, I saw some weird values: only the top Category level showed correct calculations, and all other Product hierarchy levels showed 100%.




Then I changed the order of levels in my DAX calculated measure from (Category > Sub Category > Model > Product Name) to (Product Name > Model > Sub Category > Category):

RatioToParent_Ordered:=
IF (
    ISFILTERED ( Product[HProduct] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) ),
    IF (
        ISFILTERED ( Product[HModel] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HModel] ) ),
        IF (
            ISFILTERED ( Product[HSubcategory] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HSubcategory] )
                ),
            IF (
                ISFILTERED ( Product[HCategory] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) )
            )
        )
    )
)

And that made the whole difference and resolved the issue of miscalculated results:




So my lesson from this exercise was that the order of hierarchy levels in DAX calculations is very important: the ISFILTERED checks need to start from the lowest level and work their way up to the top. Otherwise, calculations will only be correct for the top parent level of your tabular model hierarchy (please see the first DAX calculation). Lesson learned, mistake not to be repeated!

Happy data adventures!


Friday, March 31, 2017

SSIS Web Service Task: misleading role of WSDL file

[2017-Mar-31] You can call a web service method in an SSIS package using a .NET Script Task, or you can just use the existing Web Service Task. The latter option is easier: it is UI based, doesn't take much time to set up, and its input/output parameters are very intuitive to navigate.

There is one gotcha with the use of the WSDL file, though, that is not fully documented but is worth mentioning. When you initially set up a Web Service Task in SSIS, you need to provide a web server URL for your HTTP connection (let's take this one as an example: http://www.webservicex.net/Statistics.asmx?WSDL), and then you need to point to a local WSDL web service reference file that lists all the available methods. Then you test your ETL package, it works well, and all is merry and bright. However, if you think that by changing the HTTP connection for your web service (let's say from Dev to Prod) you will be calling the Production version of your web service method, that is actually not the case: it is still going to call your Dev environment. I learned this the hard way on a recent BI project, testing a web service call in the UAT environment while, for reasons unknown at the time, I was still getting results from the original Dev environment.

<wsdl:service name="Statistics">
     <wsdl:port name="StatisticsSoap" binding="tns:StatisticsSoap">
         <soap:address location="http://www.webservicex.net/Statistics.asmx"/>
     </wsdl:port>
</wsdl:service>

Apparently, the SSIS Web Service Task uses the web server connection defined in the WSDL file's soap:address tag, regardless of what server location your HTTP connection specifies.

My way to resolve this problem was to create a separate WSDL file for my UAT environment web service and make it part of the SSIS environment configuration, i.e. each environment's HTTP web service connection is coupled with a corresponding WSDL file in order to prevent confusing data results during SSIS control flow execution; and don't forget to make both of those parameters configurable.

Please let me know how you have resolved similar issues with the SSIS Web Service Task after deploying your ETL packages to a different environment.




Tuesday, February 21, 2017

External SQL files vs. Stored procedures in SSIS development

[2017-Feb-21] I'm creating this particular blog post mostly for myself, to remember and use as a reference in related discussions in the future.

One of my client's existing SSIS solutions uses multiple Execute SQL Tasks that are configured to read complicated SQL code from external files. Such a solution architecture is not very sustainable, and my recommendation was to replace those external SQL code files with a set of stored procedures on the SQL Server side.

The client's follow-up questions about the benefits of stored procedures vs. external files, along with the preferable way of using hard-coded SQL vs. stored procedure calls in SSIS packages, led me to write this.

Stability and Security issues:
- Keeping data extraction logic in external SQL files makes SSIS data flow tasks more vulnerable to the risk of undesirable SQL script changes or, even worse, to the risk of those files being deleted, which may fail the whole data flow process. By placing the same SQL extraction logic into stored procedures, you instead rely on a very stable SQL Server engine to execute your expected set-based data logic.
- Security-wise, external SQL files require an extra safekeeping methodology to protect them from harmful changes: the folder with those scripts has to be properly secured, and an additional security policy has to be defined and maintained. With the SQL logic in SQL Server stored procedures, you don't need to rely on anything external to protect your data flow task: the existing application service account that already has read/write access to the corresponding databases in SQL Server only needs to be granted execute permission on the new stored procedures.

Basically, porting SQL code from external files into stored procedures on SQL Server helps protect your data-driven application from the so-called "SQL injection" security threat. When both your data and your complex data load logic live in the SQL Server environment, you feel safe and secure!

Recommendations for using SQL code in SSIS:
- There is no right or wrong in using hard-coded SQL within an SSIS package; in some cases it's a valid approach for running simple SQL commands against the database engine. However, when the logic of those SQL commands becomes complicated, it's recommended to wrap the code into SQL Server stored procedures and then just execute them within your SSIS control/data flows.
- A SQL Server stored procedure then becomes the better option from a development perspective, with a rich set of tools to design and test your data extraction code. Plus, with any further changes you won't need to re-deploy your SSIS packages; only the code of your stored procedure will have to be updated.

So, in most cases, using stored procedures in SSIS development is a win-win for a developer: it saves both development and deployment time for your SSIS solution!