
Tuesday, October 31, 2017

Book from a library. Learning never stops for data professionals.

[2017-Oct-31] If you have never read "Iacocca: An Autobiography" by Lee Iacocca, I highly recommend it. I first read it when I was 17 years old, after walking into a local library in search of any computer-related books. I found it on a bookshelf by accident, read a couple of pages, and was told that I couldn't borrow the book but was welcome to come and read it in the library. My next week was filled with personal stories from Lee Iacocca's book; I even carried a notebook to take notes.




It was an interesting read about Lee Iacocca's early life, his relationship with his father, and how Lee then began his career at Ford and later continued his passionate work at the Chrysler auto company. Going back to Lee's father, the book is filled with his quotes and phrases; no wonder I had a notebook to write them down. And there was one phrase of his that planted deep roots in my memory, where he said, "When times get tough, get into a food business because people will always need to eat".




During the recent Power BI World Tour in Toronto (https://www.pbiusergroup.com/pbiworldtour/locations/wt-toronto, Oct 4-5, 2017) I retold my story about the book and slightly paraphrased this quote during my Power BI presentation: we need to get ourselves into the data business because people will always need information to use. You could be a data analyst crunching big numbers or a basketball coach going through a very detailed report filled with statistics from a previous game. And I believe that the ability to work with data will always be needed.

Data can be found everywhere, and we can convert almost everything into a form of measurable facts. Power BI can help us connect to this ever-expanding world of data and enrich our information discoveries through its existing data connectors (file-based datasets, different databases, Azure and other online services). And this list of supported data connectors for Power BI keeps growing.



But I would also encourage you to become a more technology-agnostic IT professional by spending time mastering more sustainable skills: data modeling, effective data visualization, data science, etc. We need to get ourselves into the data business, and that's why learning for data professionals never stops!

Happy data adventures!

Friday, September 1, 2017

Macros in ER/Studio Data Architect are your data modeling lifebelt

[2017-Sep-01] There is a well-known phrase, "the rescue of a drowning man is the drowning man's own job", from one of the best books by Ilf and Petrov, "The Twelve Chairs". And if you have ever used macros in ER/Studio during your data modeling, then you may see how they can be your lifebelt amid a large volume of database design tasks and operations.

ER/Studio Data Architect is a great software tool with extensive data modeling capabilities across multiple relational and big data platforms, and its macros can automate repetitive tasks and save you time.



My favorite macros are:

1) Import Columns From Excel
2) Import/Export Domains From Excel
3) Name Foreign Constraints
4) Name Primary Constraints

The first two macros are a great time saver when designing tables with a large number of columns, and the other two can help you keep your entity object names aligned with a single naming convention.

Let's briefly take a look at each of those macros.

Import Columns From Excel
Your Excel files with a list of entity attributes should comply with the following structure (an example row is shown after the list):
1. Entity Name
2. Table Name
3. Attribute Name
4. Column Name
5. Data type
6. Table Definition
7. Attribute Definition
8. Nullability - 'NULL' or 'NOT NULL'
9. Primary Key - 'True' or 'False'
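For illustration only, a single row in such a file might look like this (all values are hypothetical):

Customer | Customer | Customer Name | CustomerName | varchar(100) | Stores customer master records | Full name of the customer | NOT NULL | False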

When you execute this macro, a dialog box appears for you to upload your file. Once you finish this task, ER/Studio can add or update attributes on an existing entity of your data model or create a new entity.


Import/Export Domains From Excel
This macro exports the domains and their properties to Excel. Each record in the Excel spreadsheet represents a domain from the data dictionary for all or selected entities. The domain properties can be further updated and imported back into an existing model or a new model using the "Import Domains From Excel" macro.





Name Foreign Constraints
This macro will prompt the user with a dialog to specify the naming convention for foreign constraints. It will then name all the constraints using the parent and child table names (for example, a convention might produce names like FK_ParentTable_ChildTable). It will also make sure each name is unique by appending an index to the last characters of duplicate names.



Name Primary Constraints
This macro will name all primary key constraints using the given naming convention: the table name with either a prefix or a suffix.


I hope you will find your lifebelt next time you do your data modeling in ER/Studio Data Architect!

Tuesday, August 8, 2017

Data modeling in Power BI using XML file with Parent and Children nodes

[2017-Aug-08] Support for XML structured files as a data source has long been available in Excel; it was then introduced into the Power Query data extraction engine of the same office application, which later migrated into the Power BI data modeling tool.

My recent experience working with XML data files at the last Toronto Open Data user group meeting - https://www.meetup.com/opentoronto/events/240911274/ - challenged me to look at XML files from a different angle.

A request was made to analyze the Toronto Lobbyist Registry open data set, a very common scenario where an XML data file holds parent (or master) node elements along with a subset of children (or nested) nodes.

Here is an example of this Toronto Lobbyist Registry open XML data file:


And here is how this file looked in the Power BI Query Editor:


Basically, the ROW parent node had the following list of elements:
- SMNumber
- Status
- Type
- SubjectMatter
- Particulars
- InitialApprovalDate
- EffectiveDate
- ProposedStartDate
- ProposedEndDate

and several sub-nodes which are shown in Power BI query editor as tables:
- Registrant
- Communications
- etc.

My challenge was to build a data model that would enable analysis of these main dataset elements along with their subset data elements. It wasn't that intuitive, but it eventually proved that Power BI is a very decent data modeling environment.

So, first, I made my parent node a base table with SMNumber as its Primary Key (PK).
Then for each of the sub-node datasets: 
1) I had to create a replica of the base table; let's take the Communications table as an example.
2) Then I removed every column except for the actual column with the Communications table data; I also kept the SMNumber, which had now become a Foreign Key (FK) to my base table PK.


3) Then I expanded the Communications table columns and gave a proper name to the new table within my data model, which now showed all the evidence of a potential One-to-Many relationship between the base and this new table (a Power Query sketch of these steps follows):
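To make these steps more concrete, here is a minimal Power Query (M) sketch of steps 1-3, assuming hypothetical file, node and column names (the actual navigation path produced by Xml.Tables for this file may differ):

let
    // Step 1: start from the same XML source as the base table (file path and navigation are assumptions)
    Source = Xml.Tables(File.Contents("C:\Data\lobbyactivity.xml")),
    Rows = Source{0}[Table],
    // Step 2: keep only the SMNumber key and the Communications sub-node column
    KeptColumns = Table.SelectColumns(Rows, {"SMNumber", "Communications"}),
    // Step 3: expand the nested Communications table into regular columns (child column names are assumptions)
    Expanded = Table.ExpandTableColumn(KeptColumns, "Communications", {"POH_Office", "POH_Name"})
in
    Expanded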


After repeating steps 1-2-3 for the other sub-nodes of my XML-sourced main table, I was then able to create a stable data model to accommodate my future reporting visualizations and data discovery:


One single XML file with parent and children nodes had been transformed into multiple tables, where Power BI took care of empty sub-node elements (please see the null value from the 2nd step of my child node transformation). In addition to that, referential integrity between tables was in place and I could start working on my visualizations.


I understand that the XML data source for Power BI is just a very small member of the family of other supported data sources, and they keep adding new, more complicated data sources to stream your data analytics from. However, it's just another proof of how many blades this Power BI Swiss knife can have :)

Happy data adventures!

Monday, July 24, 2017

Calculating Travel Distance/Time in Power BI using Bing Map API

[2017-July-24] Map visualization in Power BI has drawn many data professionals into the field of creating visual data stories with maps. It only takes a dataset of geo-located data points to do this, and everything else is managed by map visual controls plus one's imagination. You have a choice to use native maps in Power BI, shape maps, filled maps, flow maps, etc. Here is a link to a recent David Eldersveld article that can give a glimpse into those maps and what you can do with them: 10 Ways to Create Maps in Microsoft Power BI.

However, there is no built-in functionality yet in Power BI that would calculate travel distance and time between geo points. Let's say a business owner wants to know how many customers he could deliver his product to within 5 minutes, 10 minutes, and so on.

In order to solve this problem, we can use the Bing Maps API to geocode addresses, calculate routes, get information about traffic incidents and issues in a specified area, and many more things: Bing Maps REST Services.

I was inspired by Dustin Ryan's article (https://sqldusty.com/2016/04/26/power-bi-and-the-bing-maps-api/) where he explored a geo location API using textual address information. I went a bit further and used the http://dev.virtualearth.net/REST/V1/Routes/Driving? API to calculate a route between two locations; fortunately, it can provide me with a travel time as well.

Here is an XML output from a simple web service request to calculate a route between Boston International Airport (1 Harborside Dr, Boston, MA 02128, USA) and MIT (77 Massachusetts Ave, Cambridge, MA 02139, USA) 
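The underlying request would look similar to the line below; the wp.0 and wp.1 parameters carry the two addresses, and BingMapsKey is a placeholder for your own key, so this is only an illustrative sketch of the call:

http://dev.virtualearth.net/REST/V1/Routes/Driving?o=xml&wp.0=1 Harborside Dr, Boston, MA 02128, USA&wp.1=77 Massachusetts Ave, Cambridge, MA 02139, USA&avoid=minimizeTolls&key=BingMapsKey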



I can then replicate this as a web-based data source in my Power BI data model to return the Travel Distance and Duration data elements:



In addition to this, I can create a function in Power BI with 2 parameters, the start and destination addresses, and then use it along with my dataset:



let 
    Route = (StartAddress as text, DestinationAddress as text) =>
let
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Routes/Driving?o=xml&wp.0="&StartAddress&"&wp.1="&DestinationAddress&"&avoid=minimizeTolls&key=BingMapsKey")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Copyright", type text}, {"BrandLogoUri", type text}, {"StatusCode", Int64.Type}, {"StatusDescription", type text}, {"AuthenticationResultCode", type text}, {"TraceId", type text}}),
    ResourceSets = #"Changed Type"{0}[ResourceSets],
    ResourceSet = ResourceSets{0}[ResourceSet],
    #"Changed Type1" = Table.TransformColumnTypes(ResourceSet,{{"EstimatedTotal", Int64.Type}}),
    Resources = #"Changed Type1"{0}[Resources],
    Route = Resources{0}[Route],
    #"Changed Type2" = Table.TransformColumnTypes(Route,{{"Id", type text}, {"DistanceUnit", type text}, {"DurationUnit", type text}, {"TravelDistance", type number}, {"TravelDuration", Int64.Type}, {"TravelDurationTraffic", Int64.Type}, {"TrafficDataUsed", type text}, {"TrafficCongestion", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"Id", "BoundingBox", "DistanceUnit", "DurationUnit", "TrafficDataUsed", "TrafficCongestion", "RouteLeg"})
in
    #"Removed Columns"
in 
    Route


With this function, I can then calculate travel metrics from the Boston Airport to 9 other Geo locations using the following dataset:



And then I can get my Travel Distance/Time results in Power BI by calling this new function for each of the dataset's rows:
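Here is a minimal sketch of this step, assuming the function above is saved as a query named Route and that my dataset is a query named Locations with hypothetical StartAddress and DestinationAddress columns:

let
    Source = Locations,
    // Invoke the Route function for every row of the dataset
    AddedRoute = Table.AddColumn(Source, "RouteInfo", each Route([StartAddress], [DestinationAddress])),
    // Expand the returned table into regular columns
    Expanded = Table.ExpandTableColumn(AddedRoute, "RouteInfo", {"TravelDistance", "TravelDuration", "TravelDurationTraffic"})
in
    Expanded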




With those calculation results, I can then place them on my map visualization and use the highlighted [Time Duration] slicer to control the visibility of location points depending on their travel distance from the Boston International Airport.




I have also created two other M functions in Power BI to help with geo coding and address formatting. Feel free to use them in your geo data discovery, but please make sure to update them with your own Bing Maps Key, which you will get after registering to use the Bing Maps resource.

The fnAddressLocate function takes a textual address as a parameter and returns Latitude and Longitude coordinates for this address location:
let 
    AddressToLocate = (address as text) =>
let
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Locations/"&address&"?o=xml&key=BingMapsKey")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Copyright", type text}, {"BrandLogoUri", type text}, {"StatusCode", Int64.Type}, {"StatusDescription", type text}, {"AuthenticationResultCode", type text}, {"TraceId", type text}}),
    ResourceSets = #"Changed Type"{0}[ResourceSets],
    ResourceSet = ResourceSets{0}[ResourceSet],
    #"Changed Type1" = Table.TransformColumnTypes(ResourceSet,{{"EstimatedTotal", Int64.Type}}),
    Resources = #"Changed Type1"{0}[Resources],
    Location = Resources{0}[Location],
    #"Changed Type2" = Table.TransformColumnTypes(Location,{{"Name", type text}, {"EntityType", type text}, {"Confidence", type text}, {"MatchCode", type text}}),
    Point = #"Changed Type2"{0}[Point]
in
    Point
in 
    AddressToLocate

The fnFormatAddress function can help you cleanse and format an address for better geo location. For example, it can change this address line "1 Harborse D, Bostn, 02128" into this formatted/corrected version "1 Harborside Dr, Boston, MA 02128":
let 
    AddressToFormat = (address as text) =>
let
    Source = Xml.Tables(Web.Contents("http://dev.virtualearth.net/REST/V1/Locations/"&address&"?o=xml&key=BingMapsKey")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Copyright", type text}, {"BrandLogoUri", type text}, {"StatusCode", Int64.Type}, {"StatusDescription", type text}, {"AuthenticationResultCode", type text}, {"TraceId", type text}}),
    ResourceSets = #"Changed Type"{0}[ResourceSets],
    ResourceSet = ResourceSets{0}[ResourceSet],
    #"Changed Type1" = Table.TransformColumnTypes(ResourceSet,{{"EstimatedTotal", Int64.Type}}),
    Resources = #"Changed Type1"{0}[Resources],
    Location = Resources{0}[Location],
    #"Changed Type2" = Table.TransformColumnTypes(Location,{{"Name", type text}, {"EntityType", type text}, {"Confidence", type text}, {"MatchCode", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"Point", "BoundingBox", "EntityType", "Address", "Confidence", "MatchCode", "GeocodePoint"})
in
    #"Removed Columns"
in 
    AddressToFormat
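As a usage sketch, both functions can be applied row by row in the same way as the Route function. The query name Addresses and the column name Address are assumptions, and I'm assuming the Point table returned by fnAddressLocate exposes Latitude and Longitude columns:

let
    Source = Addresses,
    // Geocode each address; fnFormatAddress could be applied first in the same way to cleanse the raw text
    Located = Table.AddColumn(Source, "Point", each fnAddressLocate([Address])),
    Expanded = Table.ExpandTableColumn(Located, "Point", {"Latitude", "Longitude"})
in
    Expanded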

Overall, I really liked enriching my Power BI data model with web requests to the Bing Maps APIs. The only thing that I didn't like was the response time. On a set of 10 records, it took approximately 5 seconds for the refresh; however, I've noticed that on larger datasets it might take more time per record to call and process the web request results.

What can be done to speed up this recalculation process may be discussed in another blog post. In the meantime, feel free to comment if you like the described approach for Travel Distance/Time calculation in Power BI using the Bing Maps API.

Sunday, June 25, 2017

Geo Location of Twitter messages in Power BI

[2017-June-25] There are several custom solutions that make Twitter message analysis possible in Power BI. And with the recent addition of the Campaign/Brand Management for Twitter solution template for Power BI (https://powerbi.microsoft.com/en-us/solution-templates/brand-management-twitter/), it's getting even less complicated to get into this mainstream social media data exploration.

Here is the architecture of this solution template, where your Microsoft Azure subscription and Twitter account will provide you with access to analytical data based on Twitter handles and hashtags of your choice.


- Logic Apps pull data from Twitter
- Logic App passes tweet into an Azure Function
- Azure Function enriches tweet and writes it to Azure SQL
- Azure Function also calls Azure ML experiment running Python script to work out sentiment of tweet
- Power BI imports data into it from Azure SQL and renders pre-defined reports


You can find more details on how to customize this solution template here: Brand & Campaign Management Solution Template Documentation

However, during my recent customer Power BI project implementation with this template, a request was made to add a filter for Twitter messages based on their geo locations; the customer wanted to see tweets only from the Canadian Twitter audience. Then I realized that all this information was not easy to obtain, because it's optional for Twitter users to populate it, and even updating their profile location is also optional (Twitter Geo Guidelines). But I still decided to try parsing the user profile location in order to separate tweets from Canadian users and users from other countries.

For this blog post I have used @canada150th twitter account messages as well as messages from all other accounts that contained the #canada150 hashtag.

Here is my final report:




And here is how I was able to add twitter user profile location data into my report. 

1) I added a new location column to the table that holds all the processed tweet messages in my Azure SQL database:
alter table [pbist_twitter].[tweets_processed] add location nvarchar(100);


2) Made a few changes in the code of the Azure function (TweetFunctionCSharp.cs) that processes tweet messages: added a new "location" key to the processedTweets dictionary and populated it from the tweet author's profile details:

    private Dictionary<string, string> processedTweets = new Dictionary<string, string>()
        {
            {"tweetid", null},
            {"masterid", null},
            {"image_url", null},
            {"dateorig", null},
            {"authorimage_url", null},
            {"username", null},
            {"hourofdate", null},
            {"minuteofdate", null},
            {"direction", "Text"},
            {"favorited", "1"},
            {"retweet", "False"},
            {"user_followers", null},
            {"user_friends", null},
            {"user_favorites", null},
            {"user_totaltweets", null},
            {"location", null}
        };


// Populate the new dictionary key with the tweet author's profile location
processedTweets["location"] = tweet.UserDetails.Location;


3) Created a function in the Azure SQL database that tries to parse the text from the user profile location column and check whether that profile belongs to Canada:

CREATE FUNCTION [dbo].[ufnCheckTwitterProfileCountry]
(
    @location nvarchar(100)
)
RETURNS nvarchar(30)
AS
BEGIN
        -- Declare the return variable here
    DECLARE @Country nvarchar(30)
 SELECT @Country = 'Non-Canada'

 DECLARE @TempTable as TABLE (location nvarchar(100));
 INSERT INTO @TempTable (location)
 select ' AB' as Location
 union select ' BC'
 union select ' MB'
 union select ' NB'
 union select ' NL'
 union select ' NS'
 union select ' NT'
 union select ' NU'
 union select ' ON'
 union select ' PE'
 union select ' QC'
 union select ' SK'
 union select ' YT'
 union select ',AB'
 union select ',BC'
 union select ',MB'
 union select ',NB'
 union select ',NL'
 union select ',NS'
 union select ',NT'
 union select ',NU'
 union select ',ON'
 union select ',PE'
 union select ',QC'
 union select ',SK'
 union select ',YT'
 union select 'Alberta'
 union select 'British Columbia'
 union select 'Manitoba'
 union select 'New Brunswick'
 union select 'Newfoundland and Labrador'
 union select 'Nova Scotia'
 union select 'Northwest Territories'
 union select 'Nunavut'
 union select 'Ontario'
 union select 'Prince Edward Island'
 union select 'Quebec'
 union select 'Québec'
 union select 'Saskatchewan'
 union select 'Yukon'
 union select 'Canada'

    -- Add the T-SQL statements to compute the return value here
    SELECT @Country = 'Canada'
 WHERE 
 (
 SELECT SUM(CHARINDEX(t.location, @location)) 
 from @TempTable t
 ) > 0;
 
    -- Return the result of the function
 RETURN @Country
END
GO


4) Updated the SQL view code to bring in the location and a derived Country code that is used in my Power BI report:

CREATE VIEW [pbist_twitter].[vw_tweets_processed]
AS
    SELECT tweetid    AS [Tweet Id],
           dateorig    AS [Original Date],
           Convert(date,[dateorig])  AS [Date],
           hourofdate    AS [Hours],
           minuteofdate    AS [Minutes],
           latitude    AS [Latitude],
           longitude    AS [Longitude],
           masterid    AS [Master Id],
           retweet    AS [Retweet],
           username    AS [Username],
           usernumber    AS [User Number],
           image_url    AS [Image URL],
           authorimage_url   AS [Author Image URL],
           direction    AS [Direction],
           favorited    AS [Favorited],
           user_followers   AS [User Followers],
           user_friends    AS [User Friends],
           user_favorites   As [User Favourites],
           user_totaltweets   AS [User Total Tweets]
 ,location
 ,dbo.ufnCheckTwitterProfileCountry(location) as Country
    FROM   pbist_twitter.tweets_processed;
GO


5) And now, with the hope that all Twitter users in my solution have their profile location populated, and with the additional hope that this free-text location field is correctly parsed by my SQL function, I can add a Country filter to the Power BI report, which you can see at the very beginning of this blog post in all of its 6 pages.

Problem solved! Feel free to comment if you find this approach of adding location details to your Power BI Twitter template solution helpful!

Happy 150th Canada day and happy data adventures!

Monday, May 8, 2017

Power BI with Narrative Science: Look Who's Talking (Part 3: On or Off premises)

[2017-May-08] Listening to a car radio with many different talk shows available, it comes to my mind more and more often that certain segments of those radio stations' broadcasting will eventually be substituted or automated by machines. I don't think it will happen overnight, but voice natural language generation software is a reality, and it's very near! Weather and road traffic news could be the first ones to be replaced: the data is already available, and AI can consume it with different levels of verbosity and variety, thus speaking like a human.


Natural language generation software already writes reports and creates communication based on data; several companies develop AI products that are able to consume user data and automatically explain what this data means using natural human language. I've written two blog posts about the Narrative Science extension for Power BI:
- Power BI with Narrative Science: Look Who's Talking? 
- Power BI with Narrative Science: Look Who's Talking (Part 2: Predicting the Future) 

This time I want to briefly write about an option of using the Narrative Science extension for Power BI on-premises. This will be very helpful for organizations that work with sensitive datasets and expect to keep their data secure within their own network environment.

This is how the free Narratives for Power BI extension works:



After you place a Narrative Science extension on your Power BI report and assign a specific dataset, this data is transmitted to the Narrative Science platform over the internet, where it gets processed and narratives are generated. These are then transferred back to the Power BI report in the form of natural human communication. Please read here how Narrative Science protects your data in this case: How does Narrative Science protect client data?


There is a second option, where the Narrative Generation Application and Licensing Agent are hosted on separate servers.
No client-specific data leaves your network, but internet connectivity is required for licensing:




Here is an extract from the Narrative Science hardware/software requirements for both the Narrative Generation Application and the Licensing Agent (https://powerbi-support.narrativescience.com/hc/en-us/articles/215191443-On-Premises-Installation-Prerequisites-):

Hardware/Server Recommendations:
Linux Server (Examples: Virtual machine, Azure instance) with the minimum system specifications:
1.    Number of Cores – 2
2.    Memory - 2GB per core
3.    Disk - 20GB
If you are installing the Narrative Science licensing agent on a separate server from the narrative generation application, the minimum system specifications for the agent server would be:
1.    Number of Cores – 1
2.    Memory - 2GB
3.    Disk - 10GB
Software Requirements: Operating System - Linux CentOS 7 or RHEL 7 (with systemd)


After setting up both servers, you will need to activate your license for Narratives for Power BI.



There is one drawback with the licensing, though. The number of Narrative Science licenses purchased must equal the number of Power BI Pro licenses that you have within your organization, regardless of the actual number of users who plan on working with the Narratives. Perhaps this will change in the future, especially with the recent changes to Microsoft Power BI licensing itself.

You can always check the Narrative Science web site (https://www.narrativescience.com/) and contact them directly for more information on specifications and pricing details.

Happy data adventures!

Sunday, April 30, 2017

DAX calculations with hierarchies: Set the order straight.

[2017-Apr-30] If you have ever tried to create calculated measures in SSAS Tabular / Power BI models where different levels of your model hierarchy had to be considered, then you'd understand that the DAX language doesn't have a straightforward way to identify parent and child levels in order to compute the necessary metric results. There is a good article http://www.sqlbi.com/articles/clever-hierarchy-handling-in-dax/ by Alberto Ferrari where he explains a workaround approach to handle hierarchies in DAX.

I went one step further, based on my recent client project experience, and created a new example with one extra level for the Product dimension using the Adventure Works Internet Sales tabular database. Here is the code to calculate the Sales Ratio to a Parent level for all levels (Category, Sub Category, Model and Product Name):

RatioToParent:=
IF (
    ISFILTERED ( Product[HCategory] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) ),
    IF (
        ISFILTERED ( Product[HSubcategory] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HSubcategory] ) ),
        IF (
            ISFILTERED ( Product[HModel] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HModel] )
                ),
            IF (
                ISFILTERED ( Product[HProduct] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) )
            )
        )
    )
)

However, when I looked at the results for this new RatioToParent measure using Excel, I saw some weird values: only the top Category level showed me correct calculations, and all other Product hierarchy levels had 100%.




Then I changed the order of levels in my DAX calculated measure from (Category > Sub Category > Model > Product Name) to (Product Name > Model > Sub Category > Category):

RatioToParent_Ordered:=
IF (
    ISFILTERED ( Product[HProduct] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) ),
    IF (
        ISFILTERED ( Product[HModel] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HModel] ) ),
        IF (
            ISFILTERED ( Product[HSubcategory] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HSubcategory] )
                ),
            IF (
                ISFILTERED ( Product[HCategory] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) )
            )
        )
    )
)

and that made the whole difference and resolved the issue of miscalculated results:




So my lesson learned in this exercise was that the order of hierarchy levels in DAX calculations is very important: the ISFILTERED checks have to start from the most granular level, because when a child level is placed on a report its parent levels are filtered too, so the Category branch in my first measure fired at every level while the lower-level filters stayed in place, leaving everything below Category at 100%. Otherwise, calculations are only correct for the top parent level of your tabular model hierarchy (please see the first DAX calculation). Lesson learned, mistakes not to be repeated!

Happy data adventures!