Central Tendency In R: What, How & Why

The first step in analysing continuous data is understanding its Central Tendency, including what caveats there are and why they matter.

Earlier in this series, you saw the difference between continuous and categorical data and how to inspect a dataset to deduce which one is which. This is important because you must analyse these types of data in different ways.

Once you get comfortable with those themes, it’s time to get some good old analysing going. Today you’ll see some simple ways of getting an initial feel of all that data you hold in your hands, just begging for your attention.

As you saw before, analysing continuous data is a bit different from analysing categorical data. There are distinct techniques and indicators that you can apply to each type of data, so it’s best to focus on each type in its own way.

This post covers how to start analysing continuous data by looking at its Central Tendency, what things to watch out for and why they matter.

Aspects, Never Too Many

To get an initial feel for what a set of continuous values – and let’s start calling this a vector, by the way! – looks like, you can start by focusing on two facets of it. These are not the only facets, but they are a reliable starting point.

One is the Central Tendency. This tells you what values the data in your vector tends to converge towards.

The other one is the Spread. This tells you how dispersed the data in your vector is around that Central Tendency.

There are different ways of evaluating each aspect, and you must think about the most appropriate ones for the dataset at hand.

Let’s take a look at a couple of basic measures of central tendency to get you started.

Mean (a.k.a. Average)

One way you can evaluate the central tendency is with a mean, which you may also know as either the arithmetic mean or the average. A mean gives you the exact value around which all the values in your dataset converge. It is a very accurate way of finding the central tendency of a vector, assuming your data is clean, meaning it has no unexpected outliers or any kind of invalid data. If it’s not clean, two very important drawbacks will materialise, which you must pay close attention to.

To understand this well, let’s go back to basic maths and look at an example. Imagine you are looking at app-store rating feedback for your super-duper new adventure game. Your brand-new marvel of mobile entertainment has so far drawn five ratings.

How do you calculate a mean for this? Well, that’s the easy bit. You can use this formula:

mean = (sum of all values) ÷ (number of values)

You have five ratings there, so to get the mean you would add them all up and divide the total by five.

That’s quite a decent rating for your new game, even if you’ve only got five ratings so far. Now, this isn’t a perfect rating – and not just because it isn’t five stars – but it gives you an overall view of what players think of your game, in a single number.
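The original rating screenshots aren’t reproduced here, so here’s a quick sketch in R with a made-up set of five ratings – say 3, 4, 4, 5 and 5 stars – consistent with the numbers quoted in this post:

# Hypothetical ratings -- the real ones lived in the app-store screenshots
ratings <- c(3, 4, 4, 5, 5)

# The mean is the sum of all values divided by how many values there are
sum(ratings) / length(ratings)  # 4.2

# Or just let R do it for you
mean(ratings)                   # 4.2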

Now, as you use the mean in your analysis, it is very important to understand not only its benefits but also its caveats. The mean has, in fact, two important issues that you must pay close attention to:

  • It swings after extreme outliers in your data faster than a Lindy Hop dancer swings his partner into the ceiling.
  • It hides symmetric dispersion in your data better than Wally hides himself.

Extreme Outliers

In the example above, all the values were valid… for the sake of the example, of course. However, imagine that two of your ratings were, in fact, bad data. You know, the kind of data that pinches you in the back when you’re drinking hot coffee.

You can already foresee what’s going to happen when you run a mean over this, can’t you?

Yeah, you guessed it…

Ouch! 2.2 is a far cry from the value you had before, isn’t it? As you can see, you must check whether you are working with clean data before you fully trust the result of a mean. And even when the data is clean, perfectly valid extreme outliers may still send your mean haywire when you’re not looking. There is another way to compensate for this, but first let’s take a look at the other caveat of the mean.
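To see the effect in R, imagine the two five-star ratings from the hypothetical set above somehow got recorded as zero – again a sketch, not the actual data:

# Two of the hypothetical ratings corrupted into zeroes
ratings_dirty <- c(3, 4, 4, 0, 0)
mean(ratings_dirty)  # 2.2 -- dragged way down by the bad values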

Symmetric Dispersion

The second caveat of the mean is that it hides dispersion in your data when that dispersion spreads out in opposite directions from the central value over roughly the same absolute range. This is easier to understand with an example.

Consider this rating scenario now:

Here, all the values are equal to 3 and, as you can infer, the mean is also 3. In business terms, you can tell that your game rates as pretty average, yet it is also stable in that regard. Players don’t really love it, but they don’t really hate it either.

Now consider this other scenario:

Hmm, a bit different, isn’t it? Now, calculate the mean for this one and you get…

Still 3. Exactly the same as in the previous scenario. And herein lies the caveat of the mean. If you rely only on means for your analysis, you may miss out on very important insight. In the example above, you would miss the extreme spread of the data. That spread tells you that, while the overall rating appears average, in reality:

  • Your game may have some features, behaviours or bugs that some users hate enough to rate it a miserable one star
  • And it may have some features that some users love enough to rate it a brilliant five stars.
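Here’s a minimal sketch of the two scenarios in R, again with made-up ratings – all threes versus a mix of one-star and five-star ratings:

stable_ratings  <- c(3, 3, 3, 3, 3)  # everyone rates the game as average
divided_ratings <- c(1, 1, 3, 5, 5)  # some players hate it, some love it (hypothetical)

mean(stable_ratings)   # 3
mean(divided_ratings)  # 3 -- the same mean, hiding a very different story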

It is this insight that can drive further investigation into why people either hate or love your game, help you decide which actions to take or lessons to learn, and – in subsequent updates, or even a sequel or a future game – help you attain a higher overall rating and, with it, more popularity and profit.

So you can see that while the mean is very useful for understanding the central tendency of your data, there are some points you must pay attention to when doing so. The good news is, there are other indicators that help you out with that job. But before we get into those, you must be asking: how do I calculate a mean in R, anyway?

Like most basic maths in R, it’s dead easy. Just use the mean() function, go figure!

For example, and using the iris dataset again (check here to see how to get it up and running), you can calculate the mean of all petal widths with just one call and some smart sub-setting:

mean(iris$Petal.Width)

If you’re wondering what that dollar sign does, it just sub-sets the Petal.Width vector (or column, kinda) from the iris data frame (a table, kinda). This kind of sub-setting-on-the-fly is one of the neat things in R that can save you tons of writing.

Anyway, this returns you…

1.199333

Looking roughly at the data points, this sounds about right. However, if you’re picky and you pay further attention to how the data is clustered, you will notice something important…

Look at the typical petal widths of the Setosa variety…

Now look at the respective petal widths for the Virginica variety…

That’s quite a difference, isn’t it? An order of magnitude, even.
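If you’d rather check that difference with numbers than eyeball it, one way (a sketch, not the only way) is to compute the mean petal width for each species:

# Mean Petal.Width per species in the iris dataset
tapply(iris$Petal.Width, iris$Species, mean)
#     setosa versicolor  virginica
#      0.246      1.326      2.026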

In terms of flowers on their own, this may not matter that much. However, if these values are tied to important business indicators, such as the game ratings example from before, that order of magnitude may turn out to be very important indeed.

So in the above example, you’ve seen how to calculate a mean, how to interpret it, and most importantly, what caveats to watch out for. The good news is, there are other indicators to help you out with those caveats.

Median (a.k.a…. err… Median?)

So another way you can evaluate the central tendency is with a median. A median is nothing more than the value at the exact middle of an ordered version of your vector, without regard to how far apart each value is from the others. The advantage of the median over the mean is that it is far more resilient to extreme values on either the low or high end skewing your results. However, it also has its own caveats. All of this is easier to understand with an example.

Take a look at those star ratings again…

To calculate the median, you must first order these values from lowest to highest – or the other way around if you want to go against the flow, that works too! – and then choose the value right in the centre of the sequence.

There are five values here, an odd number, so the median is the third value of this sequence, the one right in the centre, and that’s 4. If you had an even number of values, you would just average the two in the middle. Now, you could argue this is a less accurate way to evaluate a central tendency than the mean, for a number of reasons. However, imagine that the first two values in this sequence were – somehow – zero. In this scenario, while the mean would go haywire immediately, the median would hold its ground.

Yep, still 4. This is the strength of the median over the mean. As long as most of your data is not made up of extreme outliers, the median will hold up fairly reliably, even if it is a bit less accurate overall, depending on how you define accurate to begin with. In fact, the median is the most resilient measure of central tendency when it comes to datasets with dubious data. As long as less than 50% of your data is contaminated by outliers, the median will give you a fairly reliable result.
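Sticking with the hypothetical ratings from before, you can watch the median shrug those zeroes off in R:

ratings <- c(3, 4, 4, 5, 5)  # hypothetical ratings, already sorted
median(ratings)              # 4

ratings_with_zeroes <- c(0, 0, 4, 5, 5)  # the two lowest values turned into zeroes
median(ratings_with_zeroes)              # still 4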

On the other hand, you can already see the one big caveat of the median, just from the example above. And it’s similar to the caveats of the mean, so no news here. Because it relies exclusively on the value at the exact middle of the sequence, the median is not sensitive at all to any dispersion in the data around that central value, not even asymmetric dispersion. Whatever value happens to be in the very middle of your ordered set, that’s the value you get. Then again, evaluating dispersion is not the job of central tendency measures in the first place. That’s what spread measures are for, which is the theme of the next post.

Now, if you were paying close attention, you may be wondering something…

“Hmm… Wait a minute. What if I calculate both mean and median and compare them… Won’t that give me a measure of, hmm… skewness?”

If you thought of this, you’re getting the hang of it! Kudos to your inquisitiveness! 🙂

Now let me be the bearer of bad news for once and say that, nope, sadly that’s generally not true. It does sound intuitive, and you may even have read it in your basic maths textbook. In reality, the apparent relationship between the two is not reliable enough for you to trust it in any serious analysis. But hey, cheer up! Part of knowing what works is knowing what doesn’t!

Now, at this point, I’m sure you’re already guessing how to calculate a median in R. If everything else up until now has been a one-liner, why shouldn’t this one be too? Well, you’re right on that one!

Here’s how you can calculate the median of all petal widths in the iris dataset:

median(iris$Petal.Width)

Which gives you the not so different value of 1.3. Again, perhaps a bit less accurate if you assume the data is clean, but much more robust if that assumption is, in fact, not true.
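As a quick sketch of that robustness, you can inject one absurd, made-up outlier into the petal widths and watch how each measure reacts:

widths <- iris$Petal.Width
widths_dirty <- c(widths, 999)  # one ridiculous, purely illustrative outlier

mean(widths)          # 1.199333
mean(widths_dirty)    # roughly 7.8 -- thrown off by a single bad value
median(widths)        # 1.3
median(widths_dirty)  # still 1.3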

Is This Enough?

Not quite. While measures of central tendency give you a sense of where your data leans, they alone are not enough to give you a clear portrayal of your continuous data. As you saw in the examples above, central tendency does not tell you about the minimum and maximum values in your data. You may have two completely different datasets with the exact same central tendency metrics. To gain a better understanding of your continuous data, you must also take a look at spread measures, which tell you how close together – or far apart – your data really is.

Make sure to keep an eye on this blog, as in the next few posts you’ll see various ways of measuring spread in your data and how to apply this knowledge in R.

See you next time!

Types of Data In R: Before You Analyse

Let me count the ways…

When you set out to analyse some good old data, you tend to come across two general types of said data: Categorical data and Continuous data.

You can pretty much infer which is which from their names, but today is as fine a day as any to recall the basics.

Red, Green, Blue

Categorical data is what you use to describe discrete properties of things, in a meaningful sense, and where the possible values for those properties are few, or at least enumerable.

Think of a pumpkin’s variety or a pumpkin’s colour…

These categories are labels that have meaning to us meaning-seeking humans. These labels also don’t tend to overlap in what they cover, at least if you’re not very picky.

Of course, you can argue a Red Kuri squash and a Sugar pumpkin can look similar, but they still taste different enough that we can consider them distinct and use different names to group them.

The same goes for colour. Granted, there are many shades of red, green and blue, and you could make a nerdy argument for the continuity of the colour spectrum, but it’s still reasonable to use the generic label of Red to describe all reddish-looking pumpkins.

This doesn’t mean you shouldn’t pick your pumpkins well, of course!

In R, you use factor variables to hold categorical data. These are more memory- and performance-efficient than storing strings. Many R data functions produce factors by default. Many others even demand factors as input where you’d expect to pass in strings instead. Factors are the R equivalent of Enumeration types in .NET, and use the same optimisation principle. You can read more about these in the documentation.
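As a minimal sketch, here’s how a handful of made-up pumpkin colours would look as a factor:

# A few hypothetical pumpkin colours stored as a factor
pumpkin_colours <- factor(c("Red", "Green", "Red", "Blue", "Green"))

levels(pumpkin_colours)  # "Blue"  "Green" "Red"
table(pumpkin_colours)   # how many pumpkins of each colour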

1, 2, 3…

Continuous data is what you use to describe properties whose values are infinite in nature, or at least more than you care to count or name. Continuous data is also comparable.

Think of a decent pumpkin’s weight. This can vary anywhere from a couple of lbs to a whopping 2,323 freaking lbs. At least that’s the last world record for humongous pumpkins I’ve heard of. Not sure who would buy that for their pie but I’m positive it’d make an excellent Halloween prop.

Back on point, you’ll also notice weight is comparable, as it is a continuous unit. 2 lbs are more than 1 lb, which is more than 0.5 lbs, which is more than 0.45 lbs, and so on ad infinitum.

Along with weight, you can think of size, age, expiry date and even price, if you’re buying or selling.

These properties have almost limitless possible values. Almost, as you couldn’t have a pumpkin bigger or older than the planet it grows on, for obvious logistical and human-survival reasons. But ignore those improbabilities, and the possible values are only limited by the machine you store them on.

In R, you just use numeric (floating point) or integer variables to hold continuous data.
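For example, with some made-up weights:

pumpkin_weights <- c(2.5, 11.3, 14.8, 2323)  # hypothetical weights in lbs
class(pumpkin_weights)                       # "numeric"
mean(pumpkin_weights)                        # comparable values, so the maths just works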

Choices, choices…

Whether to use a categorical or continuous variable for a given property isn’t always black-and-white.

Imagine you’re working for a mass pumpkin producer and you want to figure out which pumpkins sell more and why. Maybe you’re investigating the pumpkin’s weight. Are you interested in knowing the exact weight most people tend to favour? Or are you happy identifying some more simple intervals such as Less than 10 pounds, Between 10 & 15 pounds, More Than 15 pounds, etc.? Perhaps these intervals are more meaningful to the producers than floating point values. If so, you may find it more valuable to work with categorical data instead of continuous data.
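If you decide the intervals are what matter, R’s cut() function turns continuous values into exactly that kind of categorical data. Here’s a sketch, using some hypothetical weights:

weights <- c(4.2, 9.9, 12.5, 14.0, 22.3)  # hypothetical pumpkin weights in lbs

# Bin the continuous weights into the bands the producers care about
weight_bands <- cut(weights,
                    breaks = c(0, 10, 15, Inf),
                    labels = c("Less than 10 lbs", "10 to 15 lbs", "More than 15 lbs"))

table(weight_bands)  # how many pumpkins fall into each band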

In the end, the type of data you use depends a lot on what you want to focus on.

Anyway, that’s it for good old basics review. In the next post, I’ll talk about how you can view the structure of a given dataset in R.