Search This Blog

Friday, November 4, 2016

Open Data is like a soup

[2016-Nov-04] Open data is like a soup, some people like it hot (or made with all the latest updates); some people like it spicy (probably experimenting with various data representation forms). And only on some special occasions you'd prefer a c-old soup over a hot one (where old data analysis is driven by a search for some historical patterns or insights).

I'm not new working with data, and in a corporate world most of the datasets are highly structured and regulated. With open data... it's bit different, less regulatory norms and sometimes more difficult to work with this data (it's open, whom do you complain to about it).

I've seen cases when a data custodian made datasets available for public with only old data (like very old, 10 years and more), at the same time presenting their own visualization using data which might not have existed moments ago. In other cases, open data was made available in a form of Excel files which you could open and read but to aggregate one geo region for several years it would take almost a hundred of those files, plus their data format was not a very structured and numerous data massage moves were necessary. Good examples of an open data management also exist where data owners responsibly control availability and updateability of their data.

Going back to the soup analogy, this whole idea working with data and especially with open data reminded me the ending scene of the classic movie "Coming to America" with Eddi Murhpy.

In case if this YouTube video gets removed, here is the text of this scene:
"Wait a minute, wait a minute, wait stop right there! Listen: Stop right there, man. A man goes into a restaurant. You listenin'? A man goes into a restaurant, and he sits down, he's having a bowl of soup and he says to the waiter, waiter come taste the soup. Waiter says: Is something wrong with the soup? He says: Taste the soup. He says: Is there something wrong with the soup? Is the soup too hot? He says: Will you taste the soup? What's wrong, is the soup too cold? Will you just taste the soup?! Allright, I'll taste the soup - where's the spoon?? Aha. Aha! ..."

Last week at the Open Data Toronto meetup ( there was an opportunity to review various open data sets and data visualizations built on them over the course of the last 1.5 years. Different vendor tools along with various visualization techniques were used to present data insights: from simple tables or charts to morphing maps and word clouds.

And then I realized that when an open data inquiry and a right and appropriate data exploration tool intersect with each other, only then you can get this Aha moment; no more explanations are needed and a data visualization speaks for itself. This really motivates me to explore other open datasets, try different tools and visualization techniques and read books about effective data analysis.

Otherwise my data soup may get cold without a right data spoon :-)

Happy data adventures!

Tuesday, October 18, 2016

Power BI with Narrative Science: Look Who's Talking (Part 2: Predicting the Future)

[2016-Oct-18] I read a book one time about the importance of using visual aids in teaching. They also used a well known illustration that a human optic nerve is thicker than the auditory one; thus, resulting to more information being consumed by brain with seeing things rather than hearing or listening to them. I still believe in this, that a picture is worth a thousand words.

However now we tend to put more visual elements to our business reports and then we spend more time trying to understand a story that those elements may tell us, rather than having a direct conversation or explanation.

In one of my previous blog posts I had already talked about the Narrative Science company and their product to enhance visual business reports with textual narratives (Power BI with Narrative Science: Look Who's Talking?). Basically the Narrative Science component takes numeric report measures and analytically generates a story that business users can read.

Recently they have released an update for the Power BI Narrative Science component (Narratives for Power BI 1.2 Release Notes) and added a few more features to enable more complex narratives customization and data analytics. Here is a brief list of those new features:
  1. Story Inclusion Thresholds
  2. Custom Formatting
  3. Predictive Analytics
  4. Additional Measure Relationships

And Predictive Analytics excited me the most, yes, I wanted to know the future! So, I needed to test this new Narrative Science functionality with an idea in mind that this could enhance my customers' Power BI reports.

There are a few prerequisites for your dataset though to start working with the Trend Analytics functionality of the Narrative Science:

- it has to be a time series of metric/metrics being measured

- it also requires at least 30 time intervals in order to start populating textual predictions if possible.

I took an open dataset that tracks water main breaks within the City of Toronto (Watermain Breaks) for the last 26 years (1990-2015). With the following data columns:
- BREAK_DATE - Date of watermain break reported
- BREAK_YEAR - Year of watermain break reported
- XCOORD - Easting in MTM NAD 27(3 degree) Projection & Lat and Long
- YCOORD - Northing in MTMNAD 27(3 degree) Projection & Lat and Long

Then I quickly aggregated this data and placed it into a column chart:

With the Narrative Science component and my (Break Date, Break Count) dataset I was able to receive a very decent textual story of the 26 years of history of the water breaks:

However along the way I receive the following note about existing limitation of working with my dataset, which I believe will be fixed in future releases:

The other thing that I couldn't solve with my dataset was the actual prediction of future results. No matter how hard I tried to either change data aggregation level or filtering options, I kept getting this message along with more helpful narratives, "A prediction could not be made because [metric name] did not have a good linear trend".

I assume that for the Trend Analytics a linear extrapolation algorithm is used and it's very peculiar to the historical data. However I do believe that this Predictive Analytics functionality of Narrative Science would be enhanced in future too, because for me seeing that a number of breaks is greater in the winter months because of low temperatures is a good trend :-).

In overall I'm very happy to see all these new features being introduced in the recent update release of the Narrative Science for the Power BI. And I will be recommending my customers to start using it more in their reports.

Because you don't say a word, it's the Narrative Science who does :-)

Happy data adventures!

Friday, September 30, 2016

Historical Reports vs. Predictive analytics: Rear view mirror look vs. Forward-looking view. Really?

[2016-Sep-30] Working with reports based on historical data is like driving a car while looking in the rear view mirror. I heard this term again when I went to the Microsoft Data Science Summit in Atlanta ( this past week. Which gave me a second thought to consider: how do we define a historical fact and what sets a current time frame from anything else (i.e. history).

Then I read good article "Myth: Managing by historical data is like driving while looking in the rear view mirror" by Allan Wille and realized that I'm not alone in this camp of doubters in the real meaning of the ineffective rear view mirror driving vs. better front driving for business metrics analytics.

In this article Allan outlined three types of a business performance analytics:
1) Current performance (real-time or near-real-time)
2) Timeframe-appropriate historical performance data
3) Predictive or conditional alerts (based on goals, history, or environmental factors).

And I would go even further, I believe all reporting that we currently poses with our modern existent technology is based on historical data. Stream analytics is based on some near-real-time events but they still happen a few moments ago, they're already history. Predictive analytics which aims to predict future events are based on analytical models which usually are trained and tested using historical data sets. Even in cognitive analysis we will have to use a large data repository that vendors may have collected in order to work with image or speech recognition technologies.

My middle school teacher of physics had planted this seed of doubt a while ago when she confronted our class with an idea that a car speedometer never showed a real-time vehicle speed. Because there is always a latency between different components of this mechanism. And speed measurement, that I can observe at the moment, may simply be a real representation of the speed fact a few moments ago.

So, if I want to become a data scientist, I will start defining what I mean by a historical fact, at least to myself. Otherwise I may assume a car in my rear-view mirror as history while it overtakes me and starts showing its rear lights in my front view instead.

Wednesday, August 31, 2016

Power BI with Narrative Science: Look Who's Talking?

[2016-Aug-31] Someone said a picture is worth a thousand words and this is true that in most cases a visual representation of a fact is absorbed better compared with a verbal or textual version of it.  However sometimes just looking at a “picture” or analytical report it also would be beneficial to have some narratives along the way. 

It’s just like when I take my family to an art gallery to observe various paintings then it helps to look at some exhibit notes on the museum walls or tiny descriptions beneath an artist painting; and then it comes, that “Aha” moment, after which I can tell to myself, that’s what he (or she) meant by drawing that piece or art.

In a business world of analytical reports, we understand this very well; we don’t just design and create reports filled with all sorts of visual elements, we provide report titles, metric descriptions and sometime we may add a text box with words that describes what users can see and interact with.

Working for more then 5 years in a marketing department of one company in the past I’ve seen many PowerPoint slides with graphs and tables where additional value was placed on what marketing managers could derive from visual elements of their slides and put into verbal sentiments besides them. And I know that sometimes it had been a very time-consuming task; I wish they had some help them during the time preparing those slides.

There is one company that actually does it, yes, you prepare your data set then create your report visualization and after adding its external component it begins textually telling stories about your data and users can see that live and interactive, isn’t that amazing!

This company is Narrative Science ( and it recently released a report component for the Power BI that does exactly this (

This component only requires a data set with numeric measures and corresponding textual dimension attributes; after this its internal engine creates textual narratives in a form of a free text of set of bullet points. And the great thing is, if you start interact with your report visuals (filters, slicers, parameters, etc.), the Narrative Science component will reflect those changes and update its narratives in real time.

At the recent Toronto Open Data book club we were discussing a Learning Opportunities Index that is calculated for all the elementary and secondary schools of the Toronto District School Board. Here is their data set for 2014 that I had used to create this Power BI ( The higher this LOI score is for a particular school the more this school is needed in additional support to provide better learning opportunities for its students.

If you notice at the top right corner of the report there is a text box with a footnote “Powered by Narrative Science”, what you see inside is a product of this component, and I didn’t write any of those words.

Here is a slightly changed version of this text box after a clicked a bar that corresponded to the Ward 11, which holds the lowest LOI score value (which is very value good by the way).

and here is a data set that I had used for those narratives:

Where only School name and associated LOI Score were used:

After this I just had to set visual attributes for this Narrative Science component. Where Type could have four different values (Discrete, Continuous, Percent of Whole, Scatterplot), Structure could take a form of Paragraph or a Bullet list, and Verbosity could be High, Medium and Low.

And the final step would be to adjust the Language settings (all other settings I left unchanged):

Now all numeric measures are properly displayed on my bar chart, report map component correctly displays school locations and Narrative Science component explains and talks about this all at the same time. I strongly believe that there is great opportunity to use this component in various analytical scenarios and user will appreciate this very much, because it does help when someone tells you a story behind a data that you try to visually analyze.

And now you don't have to guess who's talking, it's Narrative Science! :-)

Happy data adventures!

Sunday, July 31, 2016

Power BI: If you have a Cloud of Words, let them Fly!

[2016-July-31] Last week was my 2nd time participating at the local Toronto Open Data meetup ( The audience of this meetup is quite diverse and covers people with various occupations, such as: data enthusiasts, students, public service workers and other citizens who are united under one idea to explore publicly available open data sets and discuss ideas on how to improve public services of the city of Toronto.

This time we were looking at the 211 Youth Services Data data set that lists all social services available for youth in Toronto. 

Before I begin to share my experience working with this data set and building my new Power BI visualization, here is a link to a live Power BI report 211 Youth Services (Word Cloud) that I had a chance to present there

The data set itself was an Excel file with 24 worksheets representing different categories of services for 1447 agencies, where each of the agency had the following attributes:
  • AgencyName
  • LegalStatus
  • OfficePhone
  • EMail
  • Website
  • Address
  • Neighbourhood
  • Accessibility
  • Hours
  • Eligibility
  • Languages
  • Application
  • DescriptionService
  • DateUpdated

And I thought to myself, what if I can retrieve more insights out of this data in order to get better understanding of what actual work each of them do? And is there an efficient way to work work with such unstructured text data. So, I decided to specifically look at the DescriptionService column which is a free text attribute that describes different activities and programs for each of the agencies. 

And I think, that I've found a perfect visualization instrument that works really well with text data; it's the Word Cloud visual control was introduced for Power BI in November of 2015 (Visual Awesomeness Unlocked - The Word Cloud). Basically this visualization shows a set of words in a form of a cloud where font size of each word is defined by how frequently it is used within a given data set.

Usually they demonstrate a use case with Twitter data being analyzed, however I decided to challenge this with a text field from 211 data set where column size varied from 17 to 4197 characters for different agencies. Could Power BI Word Cloud handle that much of text data in a single row and could I work with at least a thousand records to build my visualization? It actually worked!

The beauty of the existent Power BI reporting approach is that your can build a report where each of the visualizations can interact with each other. So if I click the word "employment" in my Word Cloud visualization then both Bing Map and Agency count with this word are affected as well:

It's amazing that such interaction can drive further data analysis and possibilities are limitless: from better understanding of high frequently activities and outlier analysis within a set of programs that each agency provides.

A few additional thoughts if you want to work with this Power BI Word Cloud visualization:
- Spend more time cleaning and preparing your data, you wouldn't want to let your cloud of words to be filled with some incorrect text sentiments.
- Keep the "Rotate Text" option Off, otherwise the Word Cloud looks a bit messy.
- And definitely set "Stop Words" option to On; it helps to hide some common words (to, from, or, and, etc.) from your cloud, also you can add some other exceptional phrases to your stop list.

In overall, I've enjoyed working with this Word Cloud visualization and would highly recommend it for any unstructured text data analysis. 

So, If you have a Cloud of Words, let them Fly! :-)

Thursday, June 30, 2016

Power BI Pulse Chart vs. Dancing bars in Excel

[2016-Jun-30] Recently Microsoft Power BI has been enhanced with a new Pulse Chart custom visualization (Visual Awesomeness Unlocked: Pulse Chart) which is extremely good and useful for a visual data story telling. I believe that the less we have people explaining their data visualization and more we let those charts and graphs talk for themselves is better for end users. Those self explanatory visualizations work as proxies for data that stays behind them.

Basically the Power BI Pulse chart lets you build a line chart based on a time series where additional columns could describe certain events and their descriptions during a range of timestamps of your data set. And the beauty of this custom visualization is that it has an autoplay feature that starts playing and builds your line chart along with showing your defined event descriptions; I think it's very fascinating and it also attract people's attention to your dataset.

I used an Ottawa transit company open data set that shows bus routes, bus stop locations and bus stop times of different dates within a year to build my Power BI Pulse chart ( So I created a data model that contained all bus stop times within a single day. Then I added some custom events with descriptions to briefly describe various Ups & Downs for the overall time stops. So here is what I was able to create, a Pulse chart based on regular day of the OC Transpo company.

It's the same dataset that I had already used in the past for my PowerMap demo in Excel; at that time I called it as Dancing bars of OC Transpo with basically showed the same busy day with bus stop times as a fact base.

and here is a YouTube link to the complete PowerMap video: Dancing bars of OC Transpo – Busy day 3D Map of transit service for Ottawa

There is one limitation though that I had experienced building my Pulse Chart. Current version of this visualization supports only a Date hierarchy (Year-Quarter-Month-Day) based on your timestamp dataset column. However, my OC Transpo time series was only within a single day, and my expectation was to see Hours on the chart X-axis, which it didn't have.  

I sent a note to Microsoft Power BI team and they were prompt enough to suggest to add this request into their pool of other Power BI ideas which then get further reviewed and used as a base for future improvement and corrections.  So perhaps, in a near future, my wish will turn into reality :-)

I hope you will find this blog post interesting and helpful. Happy data adventure!

Tuesday, May 31, 2016

Analyzing David and Goliath datasets on a same chart; Dual Axis in Excel, Power BI and SSRS

[2016-May-31] Two different metrics, two different datasets, one may represent a very high value and the other one is very low. What's interesting is that both metrics could be very close related; like Total Product Sales vs. Product Price or Average Monthly Temperature vs. Rainfall over a period of time. 

One way to analyse how such metrics may influence each other is to look at their numbers, the other way is to visually place them on a bar or a line chart and let those graphical elements communicate to you how correlated or not those metrics are. However there is one downside of using graphical tools to analyse both very low and high data values on the same Y-axis; it's just visually difficult. Therefore many analytical tool vendors allow to portray differently scaled data with the use of Primary and Secondary Y-axes (so called Dual Axes).

Let's take a simple data set and try to check 3 different visualization tools that can show data across different data axes. There is an open dataset of National Pollutant Release Inventory (NPRI) provided by Government of Canada ( that shows different industry related pollutant releases to air, water and land for the past five years. 

I've chosen only 2 industries' air pollution data: (1) Oil and Gas Extraction and combined (2) Pipeline Transportation of Crude Oil and Natural Gas, with an idea to find if these 2 metrics' results are correlated over a period of time.

Here is an Excel version of this dataset:

Just looking at the data itself, it's quite obvious that Oil and Gas extraction generate far more air pollution than these natural resources transportation by pipelines. Let's see how Excel chart could show this data:


Excel is a good analytical tool and it even gave me an option to see linear trend-lines for each of the metrics (however it's a bit difficult to understand why with the air pollution decrease for the oil and gas extraction, the air got more polluted when this oil and gas had been in a pipeline transition; we'll talk about this later).

Power BI:

The Power BI currently doesn't allow creating a bar chart for a secondary Y-axis; thus we can use a line to show the Pipeline transportation in our case. However all other settings including good data labeling provides us with a very interactive experience for big & small data analysis.

SSRS (SQL Server Reporting Services):

It's a bit different story with the SSRS: from one hand it has a very solid dataset extraction mechanism that is shared across all other SQL Server BI tools, however its visualization part to show both Primary and Secondary axes is not very good: you will have to additionally adjust each of axis top limits in order to sync horizontal lines and I also wish SSRS data label positioning to be more flexible.

I will let you to decide on your own what tool you would prefer to use in order to analyze 2 different metrics data on dual axes. I personally like the Power BI version, then I would go with the Excel way to analyze Primary and Secondary axes data and use of the SSRS dual axes chart would only be considered if the technology is specifically requested by a customer.

As far as the early question about the pollution decrease of oil and gas extractions vs. pollution increase to transport those natural resources using pipelines, one could suggest that during the period of time of 2010 and 2014 resource company have learned to apply modern technologies to produce less air pollution. However the existing pipeline system in Canada may have slightly deteriorated and need more maintenance and modernization. But whatever the reasons are for the air pollution decrease and increase, there have to be some some facts to prove such hypothesis, which may be a part of another blog post or discussion.

Happy data adventures!

Saturday, April 30, 2016

Power BI Box and Whisker visualization vs. Box Plot chart in SSRS

[2016-Apr-30] I don't work with box plot charts very often, however when I do then it becomes a very interesting experience. First, it starts from data extraction and then playing around with various visual settings to make those boxes and whiskers to look good.

I assume that sometimes we tend to work with more familiar chart types to analyze our data, like lines and vertical or horizontal charts; however box plot charts could be very well used to review how your data is distributed within your set of metric values; and besides, it looks different as well :-)

About 3 months ago Microsoft made this type of visualization available in Power BI (Visual Awesomeness Unlocked – Box-and-Whisker Plots), so I decided to compare it with the same box plot chart that I can create using SQL Reporting Services.

My data comes from Berkely Earth ( that I had already used for one of my previous PowerMap visualization (Earth Surface Temperature on YouTube). And I took 4 years data of May monthly average temperature from all Canadian Geo stations (1997 - 2000) and showed them both in Power BI and SSRS.

Power BI:
Currently there are two version of the Box & Whisker charts: (1) created by Brad Sarsfield and (2) another created by Jan Pieter Posthuma.

(1) Brad Sarsfield box and whisker chart
- Customization for chart quantile value is available 
- Outliers visualization
- Simple data set required Values/Group (values and groups those values are divided to)
- Margins for quantiles are rounded to zero decimal places

(2) Jan Pieter Posthuma box and whisker chart
- Boxes are colored 
Margins for quantiles are not rounded to zero decimal places
- No customization for quantile value (it's preset to Min, 0.25, Mean, Median, 0.75, Max)
- Along with Values/Group settings for dataset, Samples has to be specified (which is basically an attribute for lowest granularity you want to analyze your data)


- The only SSRS chart for box plot visualization
- More settings can be adjusted compared with similar Power BI visualization
- Chart dataset has to be prepared in advance for all 6 data elements (Min, Low, Mean, Median, High, Max)
The only SSRS chart for box plot visualization.

I personally like Power BI Brad Sarsfield box & whiskers visualization with the way it looks and its customization level, however I also like that margin values for quantiles are not rounded to zero decimal places at Jan Pieter Posthuma visualization.

Thanks for reading this blog post, and happy data adventures!