Tame Arbitrary Data in Power BI

Have you ever had to prepare a proof of concept for a BI system or report with nothing more than a collection of aesthetically pleasing, user-created Excel files? Then you know how much time getting all that non-machine-friendly data into a database can consume. Time you could instead invest in preparing useful charts that give better insight into the business.

One of the very useful things about Power BI is how it lets you transform irregular data on the spot, without complex prior treatment in another tool or system. This means you can take a jumble of assorted data and turn it into a usable model quickly – an invaluable feature for the conceptualization phase, or even for the delivery phase, when you have no control over the format of your data feeds.

In this article you will see an example of how to do this in Power BI.

The Arbitrary Data

Consider the nice-looking worksheet below.

This example worksheet lists Daily Active Users and Daily Conversion numbers for two of your company’s mobile games. As you can see, the data in the worksheet looks good as far as human eyes go… Yet databases will have trouble making sense of all that pivoted data. With a classic ETL tool, you’d generally take one of these approaches…

  • Manually unpivot this data straight in Excel.
  • Load it as-is into the database, using some smart SQL to make sense of that data.
  • Use your ETL to unpivot the data. (Let’s not even go there. Just no.)
  • Argue that you need access to the source system data and hope to do things properly.
  • Some other time-consuming approach.

The reality is, when you are doing a proof of concept, you want to make the best use of your time. You want to fail fast, so you can find what works and discard what doesn’t, and present relevant options to the stakeholders while those options still have maximum value. Designing a robust system is important, but you need to make a case for it in the first place. And you want to apply any valuable feedback as quickly as possible too.

This is where the Query Editor in Power BI comes in. Its step-based engine lets you set up many transformations on the spot, some of which are very cumbersome to do in SSIS or other general ETL tools.

Let’s see how to use the Query Editor to turn this user file into usable data in a couple of minutes.

Taming That Data

As usual, you start by Getting Data in Power BI Desktop…

This particular example is an Excel file, but the general principles below will work with other data sources as well.

You can click Load to load the worksheet as-is and stay in the current window, or click Edit to do that and jump straight into the Query Editor window.

If you clicked Load, you can always get into the Query Editor window later by clicking the Edit Queries button on the Ribbon.

In the Query Editor window, you can see Power BI has loaded the worksheet pretty much as-is…

To make it usable for analysis purposes within Power BI, we have to transform this into a columnar format as best we can.

So where to start? Well, you can start by discarding all the non-essential aesthetics-only rows, such as that Game Report title and the line below it.

You can remove the top 2 rows by clicking the Remove Rows button on the Ribbon and then selecting the appropriate method.

A window will pop up, asking you for the number of rows you want to remove – in this example 2.

If you look at the Applied Steps in the Query Settings pane, you’ll see this step applied to the query.

The next step here is to change the data type of that Date column to Text.

This isn’t immediately obvious, but keep going, and you’ll see why in a second.

Now comes the big bit! Let’s Transpose the entire dataset!

The idea here is to isolate the game names into their own column. You can now see that your game names, Mysterious Adventure Game and Crazy Hack-n-Slash, are sitting happily in their columnar spot…

There are a couple of null cells between them, but before we fix that, we need to take that Date list out of the way while keeping the data consistent. We can do this by promoting that row of dates to table headers…

As you can see below, we still have some data pivoted (though in a different way than the worksheet) but at least we now have the Game column properly isolated.

Now, the reason we changed the data type of the Date column to text earlier was to make this step work in the first place. As of this writing, Power BI will not promote date/time values to headers. If you had gone straight to this step without the earlier data type conversion, you would have kept seeing Column3, Column4, Column5, etc., as the column names. Changing the data type to text beforehand allows the promotion to work fine.

Let’s finish the job on this column by Filling Down values on it to clean up those null cell values.

If you’re acquainted with Excel, Fill Down does exactly what you’d expect it to do…

Now we take care of the rest of the dataset. All that’s left to do is to unpivot all those dates again, while pivoting the measures. This is almost the opposite of what we did when we transposed the dataset, except now we want to keep that Game column nice and tidy, while distributing the measure values over the rows. You can do this by selecting all the date columns and then clicking the Unpivot Columns button on the Ribbon…

And this is the result! Almost done!
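
At this point it’s worth peeking at the Advanced Editor in the Query Editor, where you can see the M code Power BI has been writing for you behind each Applied Step. Below is a rough sketch of what it looks like for this sequence – the file path, sheet name and column names are illustrative, and the exact names will depend on your own worksheet and on how you rename things:

let
    Source = Excel.Workbook(File.Contents("C:\Data\Game Report.xlsx"), null, true),
    Worksheet = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
    // Remove Rows > Remove Top Rows (2)
    RemovedTopRows = Table.Skip(Worksheet, 2),
    // Change the Date column (still auto-named at this point) to Text
    DateAsText = Table.TransformColumnTypes(RemovedTopRows, {{"Column1", type text}}),
    // Transpose the whole dataset
    Transposed = Table.Transpose(DateAsText),
    // Promote the row of dates to column headers
    PromotedHeaders = Table.PromoteHeaders(Transposed, [PromoteAllScalars = true]),
    // Fill Down the game names over the null cells
    FilledDown = Table.FillDown(PromotedHeaders, {"Game"}),
    // Keep Game and Measure as-is and unpivot all the date columns
    Unpivoted = Table.UnpivotOtherColumns(FilledDown, {"Game", "Measure"}, "Date", "Value")
in
    Unpivoted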

You can finish by sprinkling in some general neatness, adjusting the column names and data types to something that makes more sense…

Once you’re happy, you can click Close & Apply to get back to the main window…

And now you can finally play around with the data and draw up some charts from it!

With a quick Power BI chart, you can see that while Crazy Hack-n-Slash is going bust fast, Mysterious Adventure Game is gaining good traction with players, and that the design has some room to monetize the game better and increase conversion rates.

Wrapping Up

Preparing data for analysis and reporting over user-formatted content is usually a time-consuming process, regardless of the road you choose. Power BI’s Query Editor can help speed up the process a lot when your options are few. This doesn’t remove the need for proper analysis of the source systems and attentive data modelling. However, it does hasten the concept stage and gets you up and running to show the decision power of your data, pun intended.

Central Tendency In R: What, How & Why

The first step in analysing continuous data is understanding its Central Tendency, including what caveats there are and why they matter.

Earlier in this series, you saw the difference between continuous and categorical data and how to inspect a dataset to deduce which is which. This is important because you must analyse these types of data in different ways.

Once you get comfortable with those themes, it’s time to get some good old analysing going. Today you’ll see some simple ways of getting an initial feel for all that data you hold in your hands, just begging for your attention.

As you saw before, analysing continuous data is a bit different from analysing categorical data. There are distinct techniques and indicators that you can apply to each type of data, so it’s best to focus on each type in its own way.

This post covers how to start analysing continuous data by looking at its Central Tendency, what things to watch out for and why they matter.

Aspects, Never Too Many

To get an initial feel for what a set of continuous values looks like – and let’s start calling this set a vector, by the way! – you can start by focusing on two facets of it. These are not the only facets, but they are a reliable starting point.

One is the Central Tendency. This tells you what values the data in your vector tends to converge to.

The other one is the Spread. This tells you how dispersed the data in your vector is around that Central Tendency.

There are different ways of evaluating each aspect, and you must think about the most appropriate ones for the dataset at hand.

Let’s take a look at a couple of basic measures of central tendency to get you started.

Mean (a.k.a. Average)

One way you can evaluate the central tendency is with a mean, which you may also know as the arithmetic mean or the average. A mean gives you the exact value around which all the values in your dataset converge. It is a very accurate way of finding the central tendency of a vector, assuming your data is clean, meaning it has no unexpected outliers or any kind of invalid data. If it’s not clean, two very important drawbacks materialize, which you must pay close attention to.

To understand this well, let’s go back to basic maths and look at an example. Imagine you are looking at app-store rating feedback for your super-duper new adventure game. Your brand-new marvel of mobile entertainment has so far drawn five ratings.

How do you calculate a mean for this? Well, that’s the easy bit. You can use this formula:
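
mean = (x1 + x2 + … + xn) / n

In other words: add up all the values, then divide by how many values you have.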

You have five ratings there, so you would calculate the mean as:

That’s quite a decent rating for your new game, even if you’ve only got five ratings so far. Now, this isn’t a perfect rating – not only because it isn’t five stars – but it gives you an overall view of what players think of your game, in a single number.

Now, as you use the mean in your analysis, it is very important to understand not only its benefits, but also its caveats. The mean has, in fact, two important issues that you must pay close attention to:

  • It swings towards extreme outliers in your data faster than a Lindy Hop dancer swings his partner into the ceiling.
  • It hides symmetric dispersion in your data better than Wally hides himself.

Extreme Outliers

In the example above, all the values were valid… for the sake of the example, of course. However, imagine that two of your ratings were, in fact, bad data. You know, the kind of data that pinches you in the back when you’re drinking hot coffee.

You can already foresee what’s going to happen when you run a mean over this, can’t you?

Yeah, you guessed it…

Ouch! 2.2 is a far cry from the value you had before, isn’t it? As you can see, you must pay attention to whether you are using clean data before you fully trust the result of a mean. And even when the data is clean, perfectly valid extreme outliers may still send your mean haywire when you’re not looking. There is a way to compensate for this, but first let’s take a look at the other caveat of the mean.
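
Before moving on, here’s the outlier effect as a quick R sketch, with made-up ratings (illustrative numbers only, not the ones from the screenshots above):

ratings <- c(2, 4, 5, 5, 4)   # five clean ratings (purely illustrative)
mean(ratings)                 # 4 -- a respectable overall score

ratings[c(2, 4)] <- 0         # now imagine two of them came through as bad zeros
mean(ratings)                 # 2.2 -- the mean gets dragged way down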

Symmetric Dispersion

The second caveat of the mean is that it hides dispersion in your data when that dispersion goes in opposite directions from the central value in roughly the same absolute range. This is easier to understand with an example.

Consider this rating scenario now:

Here, all the values are equal to 3 and, as you can infer, the mean is also 3. In business terms, you can tell that your game rates pretty average, yet it is also stable in that regard. Players don’t really love it, but they don’t really hate it either.

Now consider this other scenario:

Hmm, a bit different, isn’t it? Now, let’s calculate the mean for this:

Still 3. Exactly the same as the initial example. And herein lies the caveat with the mean: if you rely only on means for your analysis, you may miss out on very important insight – in this case, the extreme spread of the data. That spread tells you that, while the overall rating appears average, in reality:

  • Your game may have some features, behaviours or bugs that some users hate enough to rate it a miserable one star.
  • And it may have some features that some users love enough to rate it a brilliant five stars.
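
A quick R sketch makes the contrast obvious (again, illustrative numbers):

steady <- c(3, 3, 3, 3, 3)    # everyone thinks the game is average
divided <- c(1, 1, 3, 5, 5)   # players either hate it or love it

mean(steady)    # 3
mean(divided)   # also 3 -- the mean alone cannot tell these two apart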

It is this insight that can drive further investigation into why people either hate or love your game, help you decide what actions to take or lessons to learn, and – in subsequent updates, or even a sequel or a future game – help you attain a higher overall rating and, with it, more popularity and profit.

So you can see that while the mean is very useful for understanding the central tendency of your data, there are some points you must pay attention to in doing so. The good news is, there are other indicators that help you out with that job. But before we go into that, you must be asking: how do I calculate a mean in R, anyway?

Like most basic maths in R, it’s dead easy. Just use the mean() function, go figure!

For example, and using the iris dataset again (check here to see how to get it up and running), you can calculate the mean of all petal widths with just one call and some smart sub-setting:

mean(iris$Petal.Width)

If you’re wondering what that dollar sign does, it just sub-sets the Petal.Width vector (or column, kinda) from the iris data frame (a table, kinda). This kind of sub-setting-on-the-fly is one of the neat things in R that can save you tons of writing.

Anyway, this returns you…

1.199333

Looking roughly at the data points, this sounds about right. However, if you’re picky and you pay further attention to how the data is clustered, you will notice something important…

Look at the typical petal widths of the Setosa variety…

Now look at the respective petal widths for the Virginica variety…

That’s quite a difference, isn’t it? An order of magnitude, even.
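
You don’t have to eyeball this, by the way. A one-liner with tapply() splits the petal widths by species and runs mean() over each group:

tapply(iris$Petal.Width, iris$Species, mean)
#     setosa versicolor  virginica
#      0.246      1.326      2.026

Setosa petals average about 0.25 while Virginica petals average about 2.0, so the single overall mean of roughly 1.2 doesn’t describe either variety particularly well.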

In terms of flowers on their own, this may not matter that much. However, if these values are tied to important business indicators, such as the game ratings example from before, that order of magnitude may turn out to be very important indeed.

So in the above example, you’ve seen how to calculate a mean, how to interpret it, and most importantly, what caveats to watch out for. The good news is, there are other indicators to help you out with those caveats.

Median (a.k.a…. err… Median?)

So another way you can evaluate the central tendency is with a median. A median is nothing more than the value at the exact middle of an ordered version of your vector, without regard to how far apart each value is from the others. The advantage of the median over the mean is that it is far more resilient to extreme values at either the low or high end skewing your results. However, it also has its own caveats. All of this is easier to understand with an example.

Take a look at those star ratings again…

To calculate the median, you must first order these values from lowest to highest – or the other way around if you want to go against the flow, that works too! – and then choose the value right in the centre of the sequence.

There are five values here, an odd number, so the median is the third value of this sequence, the one right in the centre, and that’s 4. If you had an even number of values, you would just average the two in the middle. Anyway, you can argue this is a less accurate way to evaluate a central tendency compared to the mean, for a number of reasons. However, imagine that the first two values in this sequence were – somehow – zero. In this scenario, while the mean would go haywire immediately, the median would hold its ground.

Yep, still 4. This is the strength of the median over the mean. As long as most of your data is not made of extreme outliers, the median will hold up reliably, even if it is a bit less accurate overall, depending on how you define accurate to begin with. In fact, the median is the most resilient measure of central tendency when it comes to datasets with dubious data: as long as less than 50% of your data is contaminated by outliers, the median will give you a fairly reliable result.
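
Here’s the same thought experiment as a quick R sketch, again with made-up ratings (illustrative numbers only):

ratings <- c(2, 3, 4, 5, 5)   # five clean ratings, already ordered
median(ratings)               # 4 -- the value right in the middle
mean(ratings)                 # 3.8

ratings[c(1, 2)] <- 0         # the two lowest values turn out to be bad zeros
mean(ratings)                 # 2.8 -- haywire again
median(ratings)               # still 4 -- the middle value hasn't moved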

On the other hand, you can already see the one big caveat of the median, just from the example above. And it’s similar to the caveats of the mean, so no news here. Because it relies exclusively on the value at the exact middle of the sequence, the median is not sensitive at all to any dispersion in the data around that central value, even asymmetric dispersion. Whatever value happens to be in the very middle of your ordered set, that’s the value you get. Then again, evaluating dispersion is not the job of central tendency measures in the first place. That’s what spread measures are for, which is the theme of the next post.

Now, if you were paying close attention, you may be wondering something…

“Hum… Wait a minute. What if I calculate both mean and median and compare them… Won’t that give me a measure of hum… skewness?”

If you thought of this, you’re getting the hang of it! Kudos to your inquisitiveness! 🙂

Now let me bear the bad news for once and say that, nope, sadly that’s generally not true. It does sound intuitive and you may even have read it in your basic maths textbook. In reality, the apparent relationship between the two is not reliable enough for you to trust in any serious analysis. But hey, cheer up! Part of knowing what works is knowing what doesn’t!

Now, at this point, I’m sure you’re already guessing how to calculate a median in R. If everything else up until now has been a one-liner, why shouldn’t this one be too? Well, you’re right on that one!

Here’s how you can calculate the median of all petal widths in the iris dataset:

median(iris$Petal.Width)

Which gives you the not-so-different value of 1.3. Again, perhaps a bit less accurate if you assume the data is clean, but much more robust if that assumption is, in fact, not true.
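
You can see that robustness on the iris data too. Purely as an illustration, corrupt a few petal widths in a copy of the vector and watch which measure moves:

petal_widths <- iris$Petal.Width
petal_widths[1:5] <- 99       # inject five absurd, impossible values

mean(petal_widths)            # shoots up to roughly 4.5, a long way from the clean 1.199333
median(petal_widths)          # barely budges from the clean 1.3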

Is This Enough?

Not quite. While measures of central tendency give you a sense of where your data leans, on their own they are not enough to give you a clear portrayal of your continuous data. As you saw in the examples above, central tendency does not tell you about the minimum and maximum values in your data. You may have two completely different datasets with the exact same central tendency metrics. To gain a better understanding of your continuous data, you must also take a look at spread measures, which let you know how close together – or far apart – your data really is.

Make sure to keep an eye on this blog, as in the next few posts you’ll see various ways of measuring spread in your data and how to apply this knowledge in R.

See you next time!