Search This Blog

Sunday, June 25, 2017

Geo Location of Twitter messages in Power BI

[2017-June-25] There are several custom solutions that make Twitter message analysis possible in Power BI. And with the recent addition of the Campaign/Brand Management for Twitter solution template for Power BI (https://powerbi.microsoft.com/en-us/solution-templates/brand-management-twitter/), it's getting even easier to explore this mainstream social media data.

Here is the architecture of this solution template, where your Microsoft Azure subscription and Twitter account will provide you with access to analytical data based on Twitter handles and hashtags of your choice.


- Logic Apps pull data from Twitter
- Logic App passes tweet into an Azure Function
- Azure Function enriches tweet and writes it to Azure SQL
- Azure Function also calls Azure ML experiment running Python script to work out sentiment of tweet
- Power BI imports data into it from Azure SQL and renders pre-defined reports


You can find more details on how to customize this solution template here: Brand & Campaign Management Solution Template Documentation

However, on my recent customer Power BI project implementation with this template, a request was made to filter Twitter messages based on their geo location; the customer wanted to see tweets only from the Canadian Twitter audience. Then I realized that this information is not easy to obtain, because populating it is optional for Twitter users, and even keeping the profile location up to date is optional as well (Twitter Geo Guidelines). But I still decided to try parsing the user profile location in order to separate tweets from Canadian users and users from other countries.

For this blog post I have used messages from the @canada150th Twitter account, as well as messages from all other accounts that contained the #canada150 hashtag.

Here is my final report:




And here is how I was able to add Twitter user profile location data into my report.

1) I added a new location column to the table that holds all the processed tweet messages in my Azure SQL database:
alter table [pbist_twitter].[tweets_processed] add location nvarchar(100);


2) Made a few changes to the code of the Azure Function (TweetFunctionCSharp.cs) that processes the tweet messages:

    private Dictionary<string, string> processedTweets = new Dictionary<string, string>()
        {
            {"tweetid", null},
            {"masterid", null},
            {"image_url", null},
            {"dateorig", null},
            {"authorimage_url", null},
            {"username", null},
            {"hourofdate", null},
            {"minuteofdate", null},
            {"direction", "Text"},
            {"favorited", "1"},
            {"retweet", "False"},
            {"user_followers", null},
            {"user_friends", null},
            {"user_favorites", null},
            {"user_totaltweets", null},
            {"location", null}          // new entry: the tweet author's profile location
        };


// Inside the tweet-processing logic, capture the user profile location from the incoming tweet
processedTweets["location"] = tweet.UserDetails.Location;
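
With both the table change and the Azure Function change deployed, a quick sanity query (illustrative only) confirms that the new column is being populated:

-- Quick check that the location column is being filled by the updated Azure Function
SELECT TOP (10) tweetid, username, location
FROM [pbist_twitter].[tweets_processed];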


3) Created a function in the Azure SQL database that tries to parse the free-text user profile location and check whether that profile belongs to Canada:

CREATE FUNCTION [dbo].[ufnCheckTwitterProfileCountry]
(
    @location nvarchar(100)
)
RETURNS nvarchar(30)
AS
BEGIN
        -- Declare the return variable here
    DECLARE @Country nvarchar(30)
 SELECT @Country = 'Non-Canada'

 DECLARE @TempTable as TABLE (location nvarchar(100));
 INSERT INTO @TempTable (location)
 select ' AB' as Location
 union select ' BC'
 union select ' MB'
 union select ' NB'
 union select ' NL'
 union select ' NS'
 union select ' NT'
 union select ' NU'
 union select ' ON'
 union select ' PE'
 union select ' QC'
 union select ' SK'
 union select ' YT'
 union select ',AB'
 union select ',BC'
 union select ',MB'
 union select ',NB'
 union select ',NL'
 union select ',NS'
 union select ',NT'
 union select ',NU'
 union select ',ON'
 union select ',PE'
 union select ',QC'
 union select ',SK'
 union select ',YT'
 union select 'Alberta'
 union select 'British Columbia'
 union select 'Manitoba'
 union select 'New Brunswick'
 union select 'Newfoundland and Labrador'
 union select 'Nova Scotia'
 union select 'Northwest Territories'
 union select 'Nunavut'
 union select 'Ontario'
 union select 'Prince Edward Island'
 union select 'Quebec'
 union select 'Québec'
 union select 'Saskatchewan'
 union select 'Yukon'
 union select 'Canada'

    -- If any of the Canadian province/territory/country markers appear in the profile location, flag it as Canada
    SELECT @Country = 'Canada'
 WHERE 
 (
 SELECT SUM(CHARINDEX(t.location, @location)) 
 from @TempTable t
 ) > 0;
 
    -- Return the result of the function
 RETURN @Country
END
GO
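
To sanity-check the parsing logic, the function can be called directly with a few sample profile locations (the sample strings below are made up for illustration):

-- Smoke test of the parsing function with a few made-up profile locations
SELECT dbo.ufnCheckTwitterProfileCountry(N'Toronto, ON')                 AS [Test1], -- expected: Canada
       dbo.ufnCheckTwitterProfileCountry(N'Vancouver, British Columbia') AS [Test2], -- expected: Canada
       dbo.ufnCheckTwitterProfileCountry(N'Seattle, WA')                 AS [Test3]; -- expected: Non-Canada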


4) Updated the SQL view to expose the location column and a derived Country value that is used in my Power BI report:

CREATE VIEW [pbist_twitter].[vw_tweets_processed]
AS
    SELECT tweetid    AS [Tweet Id],
           dateorig    AS [Original Date],
           Convert(date,[dateorig])  AS [Date],
           hourofdate    AS [Hours],
           minuteofdate    AS [Minutes],
           latitude    AS [Latitude],
           longitude    AS [Longitude],
           masterid    AS [Master Id],
           retweet    AS [Retweet],
           username    AS [Username],
           usernumber    AS [User Number],
           image_url    AS [Image URL],
           authorimage_url   AS [Author Image URL],
           direction    AS [Direction],
           favorited    AS [Favorited],
           user_followers   AS [User Followers],
           user_friends    AS [User Friends],
           user_favorites   As [User Favourites],
           user_totaltweets   AS [User Total Tweets],
           location,
           dbo.ufnCheckTwitterProfileCountry(location) AS Country
    FROM   pbist_twitter.tweets_processed;
GO
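
With the view updated, the derived Country value can be checked directly in the database before refreshing the Power BI dataset; for example, a simple query like this (illustrative only) shows the tweet counts per country flag:

-- Example: count processed tweets by the country flag derived from the profile location
SELECT Country, COUNT(*) AS [Tweet Count]
FROM [pbist_twitter].[vw_tweets_processed]
GROUP BY Country;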


5) And now, hoping that the Twitter users in my solution have populated their profile location, and also hoping that this free-text field is correctly parsed by my SQL function, I can add a Country filter to the Power BI report, which you can see at the very beginning of this blog post in all of its 6 pages.

Problem solved! Feel free to comment if you find this approach of adding location details to your Power BI Twitter template solution helpful!

Happy 150th Canada Day and happy data adventures!

Monday, May 8, 2017

Power BI with Narrative Science: Look Who's Talking (Part 3: On or Off premises)

[2017-May-08] Listening to the car radio with its many different talk shows, it occurs to me more and more often that certain segments of radio broadcasting will eventually be automated by machines. I don't think it will happen overnight, but voice natural language generation software is already a reality, and it's very near! Weather and road traffic news could be the first to be replaced: the data is already available, and AI can present it with different levels of verbosity and variety, so it will speak like a human.


Natural language generation software already writes reports and creates communication based on data; several companies develop AI products that are able to consume user data and automatically explain what this data means in natural human language. I've written two blog posts about the Narrative Science extension for Power BI:
- Power BI with Narrative Science: Look Who's Talking? 
- Power BI with Narrative Science: Look Who's Talking (Part 2: Predicting the Future) 

This time I want to briefly write about the option of using the Narrative Science extension for Power BI on-premises. This will be very helpful for organizations that work with sensitive datasets and expect to keep their data secure within their own network environment.

This is how the free Narratives for Power BI extension works:



After you place the Narrative Science extension on your Power BI report and assign a specific dataset, this data is transmitted to the Narrative Science platform over the internet, where it gets processed and narratives are generated. These narratives are then transferred back to the Power BI report in the form of natural human communication. Please read here how Narrative Science protects your data in this case: How does Narrative Science protect client data?


There is a second option where the Narrative Generation Application and the Licensing Agent are hosted on separate servers within your own network. No client-specific data leaves your network, but internet connectivity is still required for licensing:




Here is an extract from the Narrative Science hardware/software requirements for both the Narrative Generation Application and the Licensing Agent (https://powerbi-support.narrativescience.com/hc/en-us/articles/215191443-On-Premises-Installation-Prerequisites-):

Hardware/Server Recommendations:
Linux Server (Examples: Virtual machine, Azure instance) with the minimum system specifications:
1.    Number of Cores – 2
2.    Memory - 2GB per core
3.    Disk - 20GB
If you are installing the Narrative Science licensing agent on a separate server from the narrative generation application, the minimum system specifications for the agent server would be:
1.    Number of Cores – 1
2.    Memory - 2GB
3.    Disk - 10GB
Software Requirements: Operating System - Linux CentOS 7 or RHEL 7 (with systemd)


After setting up both servers, you will need to activate your license for Narratives for Power BI.



There is one drawback with the licensing, though. The number of Narrative Science licences purchased must equal the number of Power BI Pro licenses within your organization, regardless of the actual number of users who plan on working with the narratives. Perhaps this will change in the future, especially with the recent changes to Microsoft Power BI licensing itself.

You can always check the Narrative Science web site (https://www.narrativescience.com/) and contact them directly for more information on specifications and pricing details.

Happy data adventures!

Sunday, April 30, 2017

DAX calculations with hierarchies: Set the order straight.

[2017-Apr-30] If you have ever tried to create calculated measures in SSAS Tabular or Power BI models where different levels of your model hierarchy had to be considered, then you'll understand that DAX doesn't have a straightforward way to identify parent and child levels in order to compute the necessary metric results. There is a good article, http://www.sqlbi.com/articles/clever-hierarchy-handling-in-dax/, by Alberto Ferrari where he explains a workaround approach to handling hierarchies in DAX.

I went one step further, based on my recent client project experience, and created a new example with one extra level for the Product dimension, using the Adventure Works Internet Sales tabular database. Here is the code to calculate the Sales Ratio to Parent level for all levels (Category, Sub Category, Model and Product Name):

RatioToParent:=
IF (
    ISFILTERED ( Product[HCategory] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) ),
    IF (
        ISFILTERED ( Product[HSubcategory] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HSubcategory] ) ),
        IF (
            ISFILTERED ( Product[HModel] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HModel] )
                ),
            IF (
                ISFILTERED ( Product[HProduct] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) )
            )
        )
    )
)

However, when I looked at the results for this new RatioToParent measure in Excel, I saw some weird values: only the top Category level showed correct calculations, and all other Product hierarchy levels showed 100%.




Then I changed the order of levels in my DAX calculated measure from (Category > Sub Category > Model > Product Name) to (Product Name > Model > Sub Category > Category):

RatioToParent_Ordered:=
IF (
    ISFILTERED ( Product[HProduct] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) ),
    IF (
        ISFILTERED ( Product[HModel] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HModel] ) ),
        IF (
            ISFILTERED ( Product[HSubcategory] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HSubcategory] )
                ),
            IF (
                ISFILTERED ( Product[HCategory] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) )
            )
        )
    )
)

and that made all the difference and resolved the issue of miscalculated results:




So my lesson from this exercise was that the order of hierarchy levels in DAX calculations is very important: the ISFILTERED checks have to start from the lowest level and work up to the top. Otherwise, the calculation is only correct for the top parent level of your tabular model hierarchy (please see the first DAX calculation). Lesson learned, mistake not to be repeated!

Happy data adventures!


Friday, March 31, 2017

SSIS Web Service Task: misleading role of WSDL file

[2017-Mar-31] You can call a web service method in an SSIS package using a .NET Script Task, or you can just use the existing Web Service Task. The latter option is easier: it is UI based, doesn't take much time to set up, and its input/output parameters are very intuitive to navigate.

There is one gotcha with the use of the WSDL file, though, that is not fully documented but is worth mentioning. When you initially set up a Web Service Task in SSIS, you provide a web server URL for your HTTP connection (let's take this for example: http://www.webservicex.net/Statistics.asmx?WSDL), and then you locate a local WSDL web service reference file with all the available methods in it. Then you test your ETL package, it works well, and all is merry and bright. However, if you think that by changing the HTTP connection for your web service (say, from Dev to Prod) you will be calling the Production version of your web service method, that's actually not the case: it is still going to be your Dev environment. I learned this the hard way on a recent BI project, testing a web service call in the UAT environment but, for unknown reasons, still getting results from the original Dev environment.

<wsdl:service name="Statistics">
     <wsdl:port name="StatisticsSoap" binding="tns:StatisticsSoap">
         <soap:address location="http://www.webservicex.net/Statistics.asmx"/>
     </wsdl:port>
</wsdl:service>

Apparently, the SSIS Web Service Task uses the web server address defined in the WSDL file's soap:address tag, regardless of what web server location your HTTP connection specifies.

My way to resolve this problem was to create a separate WSDL file for my UAT environment web service and make it part of the SSIS environment configuration, i.e. each environment's HTTP web service connection is coupled with a corresponding WSDL file in order to prevent confusing data results during SSIS control flow execution; and don't forget to make both of those parameters configurable.

Please let me know how you have resolved similar issues with the SSIS Web Service Task after deploying your ETL packages to a different environment.




Tuesday, February 21, 2017

External SQL files vs. Stored procedures in SSIS development

[2017-Feb-21] I'm creating this particular blog post more for myself, to remember it and use it as a reference for related discussions in the future.

One of my client's existing SSIS solutions contains multiple Execute SQL Tasks that are configured to use external files with complicated SQL code. Such a solution design is not very sustainable, and my recommendation was to replace those external SQL code files with a set of stored procedures on the SQL Server side.

The client's further questions about the benefits of stored procedures vs. external files, along with the preferable way to use hard-coded SQL vs. stored procedure calls in SSIS packages, led me to write this post.

Stability and Security issues:
- Keeping data extract logic in external SQL files makes SSIS data flow tasks more vulnerable to the risk of undesirable SQL script changes, or even worse, to the risk of those files being deleted, which may fail the whole data flow process. By depositing the same SQL extract logic into stored procedures, you rely instead on the very stable SQL Server engine to execute your set-based data logic.
- Security-wise, external SQL files require an extra safekeeping methodology to protect those files from harmful changes: the folder with those scripts has to be properly secured, and an additional security policy has to be defined and maintained. With the SQL logic in SQL Server stored procedures, you don't need to rely on something external to protect your data flow task: the existing application service account that already has read/write access to the corresponding databases in SQL Server only needs to be granted execute permission on the new stored procedures.

Basically, porting SQL code from external files into stored procedures on SQL Server also helps protect your data-driven application from the so-called "SQL injection" security threat. When both your data and your complex data load logic live in the SQL Server environment, you feel safe and secure!

Recommendations for using SQL code in SSIS:
- There is nothing right or wrong about using hard-coded SQL within an SSIS package; in some cases it's a valid approach to run simple SQL commands against the database engine. However, when the logic of those SQL commands becomes complicated, it's recommended to wrap the code into SQL Server stored procedures and then just execute them within your SSIS control/data flows.
- A SQL Server stored procedure then becomes the better option from a development perspective, with a rich set of tools to design and test your data extract code. Plus, with any further changes you won't need to re-deploy your SSIS packages; only the code of your stored procedure will have to be updated (see the sketch right after this list).
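
As a minimal sketch of that pattern (the schema, procedure name, parameter and table below are made up for illustration only), the complex extract logic lives in a stored procedure, and the SSIS task is reduced to a one-line parameterized call:

-- Hypothetical wrapper for a complex extract; names and logic are illustrative only
CREATE PROCEDURE [etl].[usp_LoadDailySales]
    @LoadDate date
AS
BEGIN
    SET NOCOUNT ON;
    -- ...the complex, set-based extract/transform logic is versioned and secured on the SQL Server side...
    SELECT SalesOrderID, OrderDate, TotalDue
    FROM   Sales.SalesOrderHeader
    WHERE  OrderDate = @LoadDate;
END
GO

-- An SSIS Execute SQL Task (OLE DB connection) then only needs this call,
-- so stored procedure changes no longer require an SSIS package redeployment:
EXEC [etl].[usp_LoadDailySales] @LoadDate = ?;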

So, in most cases, using stored procedures for SSIS development is a win-win solution for a developer: it saves both development and deployment time for your SSIS solution!


Sunday, January 29, 2017

How to compare two SSAS databases with a custom MDX query

[2017-Jan-29] It came as a request from one of my customers, where I had worked as a consultant, to quickly compare two similar SSAS databases located in different environments. So I quickly ran an MDX query with some key metrics against each of the SSAS databases and presented the results to the customer. Further along the way, they asked me if I could turn it into a configurable tool with options to change SSAS servers and database names and possibly adjust the main filtering criteria.

I thought it would be possible and decided to create an Excel workbook with two OLEDB data connections to the SSAS databases, custom MDX queries, and additional configurable parameters to adjust those data connections.

There is a very good article by Gilbert Quevauvilliers that explains how to create a custom MDX query based on a data connection in Excel (Excel – Using a Custom MDX Query in Excel Table), which is otherwise not very intuitive. With that covered, I also needed the full list of my cube measures to build the MDX query (manually retyping them wasn't an option). SSAS dynamic management views are very handy for this; with a lot of information about SSAS DMVs available, I'd recommend this article (SSAS: Using DMV Queries to get Cube Metadata by @BennyAustin), which already comes with some basic metadata extract examples.
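
For instance, a DMV query along these lines, run against the SSAS database (the cube/model name below is just a placeholder), returns the measure metadata needed to assemble such an MDX query:

-- Illustrative DMV query to list visible measures; replace 'Model' with your own cube name
SELECT [MEASURE_NAME],
       [MEASURE_UNIQUE_NAME],
       [MEASUREGROUP_NAME]
FROM   $SYSTEM.MDSCHEMA_MEASURES
WHERE  [CUBE_NAME] = 'Model'
  AND  [MEASURE_IS_VISIBLE];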

In order to show you this working case, I've used the Adventure Works Internet Sales tabular sample database and deployed it to my test SQL Server SSAS 2016 environment. You can adjust this test case and use your own SSAS database in either multidimensional or tabular mode, because the MDX data extract logic works in both cases.



Then I created this Excel file (SSAS Compare.xlsm) with all the dynamic SSAS data connections and MDX queries to support my data compare results; make sure to enable macro settings in your Excel.



Basically, it's all done through two SSAS OLEDB connections within this Excel file, where I change the connection string to SSAS and the command text for the MDX query.




So then it was very easy to create a macro within this Excel workbook and let the configuration magic work:


    '' Read the MDX query template from the "MDX Query" sheet
    ActiveWorkbook.Sheets("MDX Query").Select
    strMDX_Query = Range("A1").Value
            
    ActiveWorkbook.Sheets("Compare").Select
    
    '' Period settings
    strMDX_Period_1 = Range("B7").Value
    strMDX_Period_2 = Range("E7").Value
        
    '' Server settings
    strSSAS_Server_1 = Range("B5").Value
    strSSAS_Database_1 = Range("B6").Value
    strSSAS_Server_2 = Range("E5").Value
    strSSAS_Database_2 = Range("E6").Value
    
    With ActiveWorkbook.Connections("SSAS_Environment_1").OLEDBConnection
        .Connection = "OLEDB;Provider=MSOLAP.5;Integrated Security=SSPI;Persist Security Info=True;Initial Catalog=" + strSSAS_Database_1 + ";Data Source=" + strSSAS_Server_1 + ";MDX Compatibility=1;Safety Options=2;MDX Missing Member Mode=Error"
        .CommandText = Replace(strMDX_Query, strMDX_Filter, strMDX_Period_1)
    End With
    
    With ActiveWorkbook.Connections("SSAS_Environment_2").OLEDBConnection
        .Connection = "OLEDB;Provider=MSOLAP.5;Integrated Security=SSPI;Persist Security Info=True;Initial Catalog=" + strSSAS_Database_2 + ";Data Source=" + strSSAS_Server_2 + ";MDX Compatibility=1;Safety Options=2;MDX Missing Member Mode=Error"
        .CommandText = Replace(strMDX_Query, strMDX_Filter, strMDX_Period_2)
    End With
    
    ActiveWorkbook.Connections("SSAS_Environment_1").Refresh
    ActiveWorkbook.Connections("SSAS_Environment_2").Refresh
        
    MsgBox "Compare has been finished successfully.", vbInformation, "SSAS Measure Compare"

As a result, I was able to dynamically configure the SSAS servers and databases; I also had one MDX query used for both SSAS data connections, and I added a filtering criterion to enhance this scenario. Feel free to change this file (SSAS Compare.xlsm) in any way you need to accommodate your own compare test case.

And have a happy data adventure!

Friday, November 4, 2016

Open Data is like a soup

[2016-Nov-04] Open data is like a soup: some people like it hot (made with all the latest updates); some people like it spicy (probably experimenting with various data representation forms). And only on some special occasions would you prefer a c-old soup over a hot one (where old data analysis is driven by a search for some historical patterns or insights).

I'm not new to working with data, and in the corporate world most datasets are highly structured and regulated. With open data... it's a bit different: fewer regulatory norms and sometimes more difficulty working with the data (it's open, so who do you complain to about it?).

I've seen cases when a data custodian made datasets available to the public with only old data (like very old, 10 years and more), while at the same time presenting their own visualizations using data that might not have existed moments ago. In other cases, open data was made available in the form of Excel files which you could open and read, but aggregating one geo region over several years would take almost a hundred of those files; plus their data format was not very structured, and numerous data massage moves were necessary. Good examples of open data management also exist, where data owners responsibly control the availability and currency of their data.

Going back to the soup analogy, this whole idea of working with data, and especially with open data, reminded me of the ending scene of the classic movie "Coming to America" with Eddie Murphy.




In case this YouTube video gets removed, here is the text of this scene:
"Wait a minute, wait a minute, wait stop right there! Listen: Stop right there, man. A man goes into a restaurant. You listenin'? A man goes into a restaurant, and he sits down, he's having a bowl of soup and he says to the waiter, waiter come taste the soup. Waiter says: Is something wrong with the soup? He says: Taste the soup. He says: Is there something wrong with the soup? Is the soup too hot? He says: Will you taste the soup? What's wrong, is the soup too cold? Will you just taste the soup?! Allright, I'll taste the soup - where's the spoon?? Aha. Aha! ..."

Last week at the Open Data Toronto meetup (https://www.meetup.com/opentoronto/events/234322841/) there was an opportunity to review various open data sets and data visualizations built on them over the course of the last 1.5 years. Different vendor tools along with various visualization techniques were used to present data insights: from simple tables or charts to morphing maps and word clouds.

And then I realized that only when an open data inquiry intersects with the right and appropriate data exploration tool do you get that Aha moment: no more explanations are needed, and the data visualization speaks for itself. This really motivates me to explore other open datasets, try different tools and visualization techniques, and read books about effective data analysis.

Otherwise my data soup may get cold without the right data spoon :-)

Happy data adventures!