The first step in analysing continuous data is understanding its Central Tendency, including what caveats there are and why they matter.
Earlier in this series, you saw the difference between continuous and categorical data and how to inspect a dataset to deduce which one is which. This is important because you must analyse these types of data in different ways.
Once you get comfortable with those themes, it’s time to get some good old analysing going. Today you’ll see some simple ways of getting an initial feel of all that data you hold in your hands, just begging for your attention.
As you saw before, analysing continuous data is a bit different from analysing categorical data. There are distinct techniques and indicators that you can apply to each type of data, so it’s best to focus on each type in its own way.
This post covers how to start analysing continuous data by looking at its Central Tendency, what things to watch out for and why they matter.
Aspects, Never Too Many
To get an initial feel of how a set of values – and let’s start calling this a vector, by the way! – of continuous data types looks like, you can start by focusing on two facets of it. These are not the only facets, but are a reliable starting point.
One is the Central Tendency. This tells you about what values does the data in your vector tend to converge to.
The other one is the Spread. This tells you how dispersed the data in your vector is around that Central Tendency.
There are different ways of evaluating each aspect and you must to think about the most appropriate ones for the dataset at hand.
Let’s take a look at a couple of basic measures of central tendency to get you started.
Mean (a.k.a. Average)
One way you can evaluate the central tendency is with a mean, which you may also know as either arithmetic mean or average. A mean gives the exact value around which the set of all values in your dataset converge to. A mean is a very accurate way of finding the central tendency of a vector, assuming your data is clean, meaning it has no unexpected outliers or any kind of invalid data. If it’s not clean, then two very important drawbacks will materialize, which you must pay close attention to.
To understand this well, let’s go back to basic maths and look at an example. Imagine you are looking at app-store rating feedback for your super-duper new adventure game. Your brand-new marvel of mobile entertainment has so far drawn five ratings.
How do your calculate a mean for this? Well, that’s the easy bit. You can use this formula:
You have five ratings there, so you would calculate the mean as:
That’s quite decent rating for your new game, even if you only got five ratings yet. Now this isn’t a perfect rating – not only because it isn’t five stars – but it gives you an overall view of what players think of your game, with a single number.
Now as you use the mean in your analysis, it is very import to understand not only its benefits, but also its caveats. The mean has in fact two important issues that you must pay close attention to:
- It swings after extreme outliers in your data faster than a Lindy Hop dancer swings his partner into the ceiling.
- It hides symmetric dispersion in your data better than Wally hides himself.
In the example above, all the values were valid… for the sake of the example, of course. However, imagine that two of your ratings were, in fact, bad data. You know, the kind of data that pinches you in the back when you’re drinking hot coffee.
You can already foresee what’s going to happen when you run a mean over this, can’t you?
Yeah, you guessed it…
Ouch! 2.2 is a far different value than you had before, isn’t it? As you can see, you must pay attention to whether you are using clean data before you fully trust the results of a mean. And even when that is true, perfectly valid extreme outliers may still throw off your mean haywire when you’re not looking. There is another way to compensate for this, but first let’s take a look at the other caveat of the mean.
The second caveat of the mean is that hides dispersion in your data, when that dispersion goes in opposite ways from the central value in roughly the same absolute range. But easier to understand with an example.
Consider this rating scenario now:
Here, all the values are equal to 3 and as you can infer, the mean is also 3. In business terms, you can tell that your game rates pretty average, yet it is also stable in that regard. Players doesn’t really love it but they don’t really hate it either.
Now consider this other scenario:
Hmm, a bit different, isn’t it? Now, let’s calculate the mean for this:
Still 3. The exact same as the initial example. And herein lies the caveat with the mean. If you only rely on means for your analysis, you may miss out on very important insight. In the example above, you will miss out on the extreme spread of data. The spread in the example above tells you that, while overall rating appears average, in reality:
- Your game may have some features, behaviours or bugs that some users hate enough to rate it a miserable one star…
- And it may have some features that some users love enough to rate it a brilliant five stars.
It is this insight that can drive further investigation of the reasons people either hate or love your game, decide actions to take or lessons to learn, and on subsequent updates – or even a sequel or a future game – help you attain a higher overall rating, and with it, more popularity and profit.
So you can see that while the mean is very useful to understand the central tendancy of your data, there are some points you must pay attention to in doing so. The good news is, there are other indicators that help you out with that job. But before we go into that, you must be asking how do I calculate a mean in R anyway?
Like most basic maths in R, it’s dead easy. Just use the mean() function, go figure!
For example, and using the iris dataset again (check here to see how to get it up and running), you can calculate the mean of all petal widths with just one call and some smart sub-setting:
If you’re wondering what that dollar sign does, it just sub-sets the Petal.Width vector (or column, kinda) from the iris data frame (a table, kinda). This kind of sub-setting-on-the-fly is one of the neat things in R that can save you tons of writing.
Anyway, this returns you…
Looking roughly at the data points, this sounds about right. However, if you’re picky and you pay further attention to how the data is clustered, you will notice something important…
Look at the typical petal widths of the Setosa variety…
Now look at the respective petals widths for the Virginica variety…
That’s a quite a difference, isn’t it? An order of magnitude even.
In terms of flowers on their own, this may not matter that much. However, if these values are tied to important business indicators, such as the game ratings example from before, that order of magnitude may turn out to be very important indeed.
So in the above example, you’ve seen how to calculate a mean, how to interpret it, and most importantly, what caveats to watch out for. The good news is, there are other indicators to help you out with those caveats.
Median (a.k.a…. err… Median?)
So another way you can evaluate the central tendency is with a median. A median is nothing more than the value at the exact middle of an ordered version of your vector, without regard to how far apart each value is from others. The advantage of the median over the mean is that it far more resilient to extreme values on either the low or high ends skewing your results. However, it also has its own caveats. All of this is easier to understand with an example.
Take a look at those star ratings again…
To calculate the median, you must first order these values from lowest to highest – or the other way around if you want to going against the flow, that works too! – and then choose the value right in the centre of the sequence.
There are five values here, an odd number, so the median is the third value of this sequence, the one right in the center, and that’s 4. If you had an even number of values, you would just average the two in the middle. Anyway, you can argue this is a less accurate way to evaluate a central tendency as compared to the mean, for a number of reasons. However, imagine that the first two values in this sequence were – somehow – zero. In this scenario, while the mean would go haywire immediately, the median would hold its ground.
Yep, still 4. This is the strength of the median over the mean. As long as most of your data is not made of extreme outliers, the median will hold fairly reliable, even if a bit less accurate overall, depending on how you define accurate to begin with. In fact, the median is the most resilient measure of central tendency, when it comes to datasets with dubious data. As long as less than 50% of your data is contaminated by outliers, the median will give you a fairly reliable result.
On the other hand, you can already see the one big caveat of the median, just from the example above. And it’s similar to the caveats of the mean, so no news here. Because it relies exclusively on the value at the exact middle of the sequence, the median is not sensitive at all to any dispersion in the data around that central value, even asymmetric. Whatever value happens to be in the very middle of your ordered set, that’s the value you get. Then again, evaluating dispersion is not the job of central tendency measures in the first place. That’s what spread measures are for, which is the theme of the next post.
Now, if you were paying close attention, you may be wondering something…
Hum… Wait a minute. What if I calculate both mean and median and compare them… Won’t that give me a measure of hum… skewness?”
If you thought of this, you’re getting the hang of it! Kudos to your inquisitiveness! 🙂
Now let me bear the bad news for once and say that, nope, sadly that’s generally not true. It does sound intuitive and you may even have read it in your basic maths textbook. In reality, the apparent relationship between the two is not reliable enough for you to trust in any serious analysis. But hey, cheer up! Part of knowing what works is knowing what doesn’t!
Now, at this point, I’m sure you’re already guessing how to calculate a median in R. If everything else up until now has been an one-liner, why shouldn’t this one be too? Well, you’re right on that one!
Here’s how you can calculate the median of all petal widths in the iris dataset:
Which gives you the not so different value of 1.3. Again, perhaps a bit less accurate if you assume the data is clean, but much more robust if that assumption is, in fact, not true.
Is This Enough?
Not quite. While measures of central tendency give you a sense of where data leans toward, alone they are not enough to give you a clear portrayal of your continuous data. As you saw in the example below, central tendency does not tell you about minimum and maximum values in your data. You may have two completely different datasets with the exact same central tendency metrics. To gain a better understanding of your continuous data, you must also take a look at spread measures, which let you know how close together – or far apart – your data really is.
Make sure to keep an eye on this blog, as in the next few posts you’ll see various ways of measuring spread in your data and how to apply this knowledge in R.
See you next time!