Central Tendency In R: What, How & Why

The first step in analysing continuous data is understanding its Central Tendency, including what caveats there are and why they matter.

Earlier in this series, you saw the difference between continuous and categorical data and how to inspect a dataset to deduce which one is which. This is important because you must analyse these types of data in different ways.

Once you get comfortable with those themes, it’s time to get some good old analysing going. Today you’ll see some simple ways of getting an initial feel for all that data you hold in your hands, just begging for your attention.

As you saw before, analysing continuous data is a bit different from analysing categorical data. There are distinct techniques and indicators that you can apply to each type of data, so it’s best to focus on each type in its own way.

This post covers how to start analysing continuous data by looking at its Central Tendency, what things to watch out for and why they matter.

Aspects, Never Too Many

To get an initial feel for what a set of continuous values – and let’s start calling this a vector, by the way! – looks like, you can start by focusing on two of its facets. These are not the only facets, but they are a reliable starting point.

One is the Central Tendency. This tells you what values the data in your vector tends to converge to.

The other one is the Spread. This tells you how dispersed the data in your vector is around that Central Tendency.

There are different ways of evaluating each aspect, and you must think about which ones are most appropriate for the dataset at hand.

Let’s take a look at a couple of basic measures of central tendency to get you started.

Mean (a.k.a. Average)

One way you can evaluate the central tendency is with a mean, which you may also know as either the arithmetic mean or the average. The mean gives you the exact value that the set of all values in your dataset converges around. It is a very accurate way of finding the central tendency of a vector, assuming your data is clean, meaning it has no unexpected outliers or any kind of invalid data. If it’s not clean, then two very important drawbacks materialize, which you must pay close attention to.

To understand this well, let’s go back to basic maths and look at an example. Imagine you are looking at app-store rating feedback for your super-duper new adventure game. Your brand-new marvel of mobile entertainment has so far drawn five ratings.

How do you calculate a mean for this? Well, that’s the easy bit. You can use this formula: add up all the values, then divide the total by the number of values.

You have five ratings there, so you would sum them all up and divide the total by five.
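Here’s a minimal sketch of that calculation in R, using hypothetical ratings (say, 5, 4, 3, 5 and 4 stars) purely for illustration, picked so the arithmetic lines up with the values quoted through the rest of this post:

# hypothetical ratings, for illustration only
ratings <- c(5, 4, 3, 5, 4)

# the mean by hand: the sum of all the values divided by how many there are
sum(ratings) / length(ratings)
# [1] 4.2

# or simply with the built-in function (more on this further down)
mean(ratings)
# [1] 4.2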

That’s quite a decent rating for your new game, even if you’ve only got five ratings so far. Now this isn’t a perfect rating – not only because it isn’t five stars – but it gives you an overall view of what players think of your game, with a single number.

Now, as you use the mean in your analysis, it is very important to understand not only its benefits, but also its caveats. The mean has, in fact, two important issues that you must pay close attention to:

  • It swings after extreme outliers in your data faster than a Lindy Hop dancer swings his partner into the ceiling.
  • It hides symmetric dispersion in your data better than Wally hides himself.

Extreme Outliers

In the example above, all the values were valid… for the sake of the example, of course. However, imagine that two of your ratings were, in fact, bad data. You know, the kind of data that pinches you in the back when you’re drinking hot coffee.

You can already foresee what’s going to happen when you run a mean over this, can’t you?

Yeah, you guessed it…
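Sticking with those hypothetical ratings, and assuming the two bad entries came through as zeros (an assumption on my part, chosen so the arithmetic matches the value quoted next):

# the two five-star ratings replaced by bogus zeros (hypothetical bad data)
bad_ratings <- c(0, 4, 3, 0, 4)

mean(bad_ratings)
# [1] 2.2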

Ouch! 2.2 is a far different value than you had before, isn’t it? As you can see, you must check whether you are using clean data before you fully trust the results of a mean. And even when that is true, perfectly valid extreme outliers may still send your mean haywire when you’re not looking. There is another way to compensate for this, but first let’s take a look at the other caveat of the mean.

Symmetric Dispersion

The second caveat of the mean is that it hides dispersion in your data when that dispersion goes in opposite directions from the central value in roughly the same absolute range. This is easier to understand with an example.

Consider this rating scenario now:

Here, all the values are equal to 3 and, as you can infer, the mean is also 3. In business terms, you can tell that your game rates as pretty average, yet it is also stable in that regard. Players don’t really love it, but they don’t really hate it either.

Now consider this other scenario:

Hmm, a bit different, isn’t it? Now, let’s calculate the mean for this:
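A rough sketch of both scenarios in R, again with hypothetical ratings standing in for the ones above:

# scenario one: every player rates the game a three
mean(c(3, 3, 3, 3, 3))
# [1] 3

# scenario two: a hypothetical mix of one-star and five-star ratings
mean(c(1, 5, 1, 5, 3))
# [1] 3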

Still 3. The exact same as the all-threes scenario. And herein lies the caveat of the mean. If you only rely on means for your analysis, you may miss out on very important insight. In the example above, you would miss out on the extreme spread of the data, which tells you that, while the overall rating appears average, in reality:

  • Your game may have some features, behaviours or bugs that some users hate enough to rate it a miserable one star
  • And it may have some features that some users love enough to rate it a brilliant five stars.

It is this insight that can drive further investigation of the reasons people either hate or love your game, help you decide which actions to take or lessons to learn, and – on subsequent updates, or even a sequel or a future game – help you attain a higher overall rating, and with it, more popularity and profit.

So you can see that while the mean is very useful for understanding the central tendency of your data, there are some points you must pay attention to when doing so. The good news is, there are other indicators that help you out with that job. But before we go into that, you must be asking: how do I calculate a mean in R, anyway?

Like most basic maths in R, it’s dead easy. Just use the mean() function, go figure!

For example, and using the iris dataset again (check here to see how to get it up and running), you can calculate the mean of all petal widths with just one call and some smart sub-setting:

mean(iris$Petal.Width)

If you’re wondering what that dollar sign does, it just sub-sets the Petal.Width vector (or column, kinda) from the iris data frame (a table, kinda). This kind of sub-setting-on-the-fly is one of the neat things in R that can save you tons of writing.
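As a quick aside – and this is plain base R, nothing specific to the mean – these spellings all pull out the same Petal.Width vector, so use whichever reads best to you:

# three equivalent ways of extracting the same column as a vector
identical(iris$Petal.Width, iris[["Petal.Width"]])
# [1] TRUE
identical(iris$Petal.Width, iris[, "Petal.Width"])
# [1] TRUE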

Anyway, that mean() call returns you…

1.199333

Looking roughly at the data points, this sounds about right. However, if you’re picky and you pay further attention to how the data is clustered, you will notice something important…

Look at the typical petal widths of the Setosa variety…

Now look at the respective petal widths for the Virginica variety…
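To put rough numbers on that difference, here’s a quick sketch using base R’s tapply() to compute the mean petal width per species (the Setosa mean comes out at roughly 0.25, the Virginica one at roughly 2.0):

# mean petal width for each species in one call
tapply(iris$Petal.Width, iris$Species, mean)
#    setosa versicolor  virginica
#     0.246      1.326      2.026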

That’s quite a difference, isn’t it? An order of magnitude, even.

In terms of flowers on their own, this may not matter that much. However, if these values are tied to important business indicators, such as the game ratings example from before, that order of magnitude may turn out to be very important indeed.

So in the above example, you’ve seen how to calculate a mean, how to interpret it, and most importantly, what caveats to watch out for. The good news is, there are other indicators to help you out with those caveats.

Median (a.k.a…. err… Median?)

So another way you can evaluate the central tendency is with a median. A median is nothing more than the value at the exact middle of an ordered version of your vector, without regard to how far apart each value is from the others. The advantage of the median over the mean is that it is far more resilient to extreme values on either the low or the high end skewing your results. However, it also has its own caveats. All of this is easier to understand with an example.

Take a look at those star ratings again…

To calculate the median, you must first order these values from lowest to highest – or the other way around if you want to go against the flow, that works too! – and then choose the value right in the centre of the sequence.

There are five values here, an odd number, so the median is the third value of this sequence, the one right in the centre, and that’s 4. If you had an even number of values, you would just average the two in the middle. Anyway, you can argue this is a less accurate way to evaluate a central tendency compared to the mean, for a number of reasons. However, imagine that the first two values in this sequence were – somehow – zero. In this scenario, while the mean would go haywire immediately, the median would hold its ground.
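In R, and still using the hypothetical ratings from earlier, that plays out like this:

# hypothetical ratings again
ratings <- c(5, 4, 3, 5, 4)

sort(ratings)
# [1] 3 4 4 5 5

median(ratings)
# [1] 4

# the first two values of the ordered sequence zeroed out
median(c(0, 0, 4, 5, 5))
# [1] 4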

Yep, still 4. This is the strength of the median over the mean. As long as most of your data is not made up of extreme outliers, the median will remain fairly reliable, even if a bit less accurate overall, depending on how you define accurate to begin with. In fact, the median is the most resilient measure of central tendency when it comes to datasets with dubious data: as long as less than 50% of your data is contaminated by outliers, the median will give you a fairly reliable result.

On the other hand, you can already see the one big caveat of the median, just from the example above. And it’s similar to the caveats of the mean, so no news here. Because it relies exclusively on the value at the exact middle of the sequence, the median is not sensitive at all to any dispersion in the data around that central value, not even asymmetric dispersion. Whatever value happens to be in the very middle of your ordered set, that’s the value you get. Then again, evaluating dispersion is not the job of central tendency measures in the first place. That’s what spread measures are for, which is the theme of the next post.

Now, if you were paying close attention, you may be wondering something…

“Hum… Wait a minute. What if I calculate both mean and median and compare them… Won’t that give me a measure of, hum… skewness?”

If you thought of this, you’re getting the hang of it! Kudos to your inquisitiveness! 🙂

Now let me be the bearer of bad news for once and say that, nope, sadly that’s generally not true. It does sound intuitive, and you may even have read it in your basic maths textbook. In reality, the apparent relationship between the two is not reliable enough for you to trust it in any serious analysis. But hey, cheer up! Part of knowing what works is knowing what doesn’t!

Now, at this point, I’m sure you’re already guessing how to calculate a median in R. If everything else up until now has been a one-liner, why shouldn’t this one be too? Well, you’re right on that one!

Here’s how you can calculate the median of all petal widths in the iris dataset:

median(iris$Petal.Width)

Which gives you the not-so-different value of 1.3. Again, perhaps a bit less accurate if you assume the data is clean, but much more robust if that assumption is, in fact, not true.

Is This Enough?

Not quite. While measures of central tendency give you a sense of where your data leans, on their own they are not enough to give you a clear portrayal of your continuous data. As you saw in the examples above, central tendency does not tell you about the minimum and maximum values in your data, and you may have two completely different datasets with the exact same central tendency metrics. To gain a better understanding of your continuous data, you must also take a look at spread measures, which let you know how close together – or far apart – your data really is.

Make sure to keep an eye on this blog, as in the next few posts you’ll see various ways of measuring spread in your data and how to apply this knowledge in R.

See you next time!

Are Query Plans All We Need?

In How Can Query Plans Help Us?, we looked at the nature of T-SQL and then at some of the benefits that understanding query plans can bring us. But are query plans a silver bullet? Can we overcome all optimization challenges with query plans alone, or are there other tools we need on our utility belt?

Continue reading “Are Query Plans All We Need?”

How To List Table Row Counts And Index Statistics

When getting an ETL performance testing system of some sort up and running, one of the Socratic questions that arises at some point is:

“When we run a workload, how much data is there in the database and how does it affect the workload itself?”

OK, so not as Socratic as “What is justice?”, as we can actually measure this one.

The starting point is, of course, to identify that data in the first place.
The snippets below will help you get it out of SQL Server.

Row Counts Only

The starting point can be to just list row counts of all tables in the target database.
The short example below will do just that.

-- tables only, clustered or heap, nothing else
SELECT
	S.name As SchemaName,
	T.name As TableName,
	SUM (P.rows) AS Rows
FROM
	sys.tables AS T
	INNER JOIN sys.schemas AS S
		ON S.schema_id = T.schema_id
	INNER JOIN sys.partitions AS P
		ON P.object_id = T.object_id
WHERE
	T.type = 'U' -- user tables only
	AND P.index_id IN (0, 1) -- HEAP and CLUSTERED only
GROUP BY
	S.name,
	T.name
ORDER BY
	SchemaName,
	TableName
;

Row Counts Per Partition

Next, it can be interesting to look at row counts per partition. This might be helpful if you’re running some crazy updates on specific partitions as part of your ETL process and you’d like to check whether it makes too big a difference to run those over lightly populated partitions vs highly populated ones.

-- tables only, clustered or heap, counts by partition
SELECT
	S.name As SchemaName,
	T.name As TableName,
	P.partition_number AS PartitionNumber,
	P.rows AS Rows
FROM
	sys.tables AS T
	INNER JOIN sys.schemas AS S
		ON S.schema_id = T.schema_id
	INNER JOIN sys.partitions AS P
		ON P.object_id = T.object_id
WHERE
	T.type = 'U' -- user tables only
	AND P.index_id IN (0, 1) -- HEAP and CLUSTERED only
ORDER BY
	SchemaName,
	TableName,
	PartitionNumber
;

Row Counts Per Partition & Index

It can also be interesting to see if ETL performance is affected by loads on top of tables with growing indexes, both in number and size.
This can go both ways, of course, but it also prepares for the next step.

-- tables, clustered or heap, plus indexes, counts by partition
SELECT
	S.name As SchemaName,
	T.name As TableName,
	P.partition_number AS PartitionNumber,
	P.index_id AS IndexId,
	I.name AS IndexName,
	I.type_desc AS IndexType,
	P.rows AS Rows
FROM
	sys.tables AS T
	INNER JOIN sys.schemas AS S
		ON S.schema_id = T.schema_id
	INNER JOIN sys.partitions AS P
		ON P.object_id = T.object_id
	INNER JOIN sys.indexes AS I
		ON I.object_id = P.object_id
		AND I.index_id = P.index_id
WHERE
	T.type = 'U' -- user tables only
ORDER BY
	SchemaName,
	TableName,
	PartitionNumber,
	IndexId
;

Row Counts Per Partition & Index, Plus Index Statistics

If we’re not yet content with the previous data, we can roll up our sleeves by digging into the index stats themselves.
Ask yourself: how many times did you see a T-SQL-based ETL process run incredibly slowly and then magically fix itself once the underlying tables got rebuilt? As in REBUILD, really.
Fragmentation can have a big impact on any query, and ETL queries are no exception.
The example below will grab basic fragmentation stats from all indexes, so you can see whether fragmentation is a possible source of red herrings in your workload.
Be careful running this query on a production system during business hours: it can be quite heavy, especially if you pump up the mode argument of the sys.dm_db_index_physical_stats call (the example passes NULL, which means the default LIMITED scan) to increase the scan accuracy.

-- tables, clustered or heap, plus indexes, and physical stats counts by partition
SELECT
	S.name As SchemaName,
	T.name As TableName,
	P.partition_number AS PartitionNumber,
	P.index_id AS IndexId,
	I.name AS IndexName,
	I.type_desc AS IndexType,
	P.rows AS Rows,
	PIS.alloc_unit_type_desc AS IndexAllocationUnitType,
	PIS.index_depth AS IndexDepth,
	PIS.index_level AS IndexLevel,
	PIS.avg_fragmentation_in_percent AS IndexAverageFragmentationPercent,
	PIS.fragment_count AS IndexFragmentCount,
	PIS.avg_fragment_size_in_pages AS IndexAverageFragmentSizeInPages,
	PIS.page_count AS IndexPageCount
FROM
	sys.tables AS T
	INNER JOIN sys.schemas AS S
		ON S.schema_id = T.schema_id
	INNER JOIN sys.partitions AS P
		ON P.object_id = T.object_id
	INNER JOIN sys.indexes AS I
		ON I.object_id = P.object_id
		AND I.index_id = P.index_id
	CROSS APPLY sys.dm_db_index_physical_stats(
		/* database_id */ DB_ID(),
		/* object_id */ T.object_id,
		/* index_id */ I.index_id,
		/* partition_number */ P.partition_number,
		/* mode */ NULL) AS PIS
WHERE
	T.type = 'U' -- user tables only
ORDER BY
	SchemaName,
	TableName,
	PartitionNumber,
	IndexId
;

All For All

At some point you might also need to grab a snapshot of some or all the databases in your server for workload documentation purposes.
Doing this on all databases one by one can be time-consuming.
The example below lets you skip the RBAR pain and grab the same stats as the previous example, but for all the databases you need at once.
Remember to add a WHERE+LIKE filter on the first block to select only the database name patterns you want.
Again, be wary of running this on a production system during business hours.

-- list databases to loop
DECLARE @Databases TABLE
(
	DatabaseKey INT IDENTITY (1, 1) NOT NULL PRIMARY KEY CLUSTERED,
	DatabaseName SYSNAME NOT NULL
)
INSERT INTO @Databases (DatabaseName)
SELECT name FROM sys.databases;

-- this will hold the final result
IF OBJECT_ID('tempdb.dbo.#Data') IS NOT NULL
DROP TABLE #Data;

CREATE TABLE #Data
(
	DatabaseName SYSNAME NOT NULL,
	SchemaName SYSNAME NOT NULL,
	TableName SYSNAME NOT NULL,
	PartitionNumber INT NOT NULL,
	IndexId INT NOT NULL,
	IndexName SYSNAME NULL,
	IndexType SYSNAME NOT NULL,
	Rows BIGINT NOT NULL,
	IndexAllocationUnitType SYSNAME NULL,
	IndexDepth TINYINT NULL,
	IndexLevel TINYINT NULL,
	IndexAverageFragmentationPercent FLOAT NULL,
	IndexFragmentCount BIGINT NULL,
	IndexAverageFragmentSizeInPages FLOAT NULL,
	IndexPageCount BIGINT NULL,

	PRIMARY KEY CLUSTERED
	(
		DatabaseName,
		SchemaName,
		TableName,
		PartitionNumber,
		IndexId
	)
);

DECLARE @DatabaseKey INT = 1;
DECLARE @MaxDatabaseKey INT = (SELECT MAX(DatabaseKey) FROM @Databases);
WHILE @DatabaseKey <= @MaxDatabaseKey
BEGIN

	DECLARE @DatabaseName SYSNAME =
	(
		SELECT
			DatabaseName
		FROM
			@Databases
		WHERE
			DatabaseKey = @DatabaseKey
	);

	DECLARE @SQL NVARCHAR(MAX) = '
		INSERT INTO #Data
		(
			DatabaseName,
			SchemaName,
			TableName,
			PartitionNumber,
			IndexId,
			IndexName,
			IndexType,
			Rows,
			IndexAllocationUnitType,
			IndexDepth,
			IndexLevel,
			IndexAverageFragmentationPercent,
			IndexFragmentCount,
			IndexAverageFragmentSizeInPages,
			IndexPageCount
		)
		SELECT
			''' + @DatabaseName + ''' AS DatabaseName,
			S.name As SchemaName,
			T.name As TableName,
			P.partition_number AS PartitionNumber,
			P.index_id AS IndexId,
			I.name AS IndexName,
			I.type_desc AS IndexType,
			P.rows AS Rows,
			PIS.alloc_unit_type_desc AS IndexAllocationUnitType,
			PIS.index_depth AS IndexDepth,
			PIS.index_level AS IndexLevel,
			PIS.avg_fragmentation_in_percent AS IndexAverageFragmentationPercent,
			PIS.fragment_count AS IndexFragmentCount,
			PIS.avg_fragment_size_in_pages AS IndexAverageFragmentSizeInPages,
			PIS.page_count AS IndexPageCount
		FROM
			' + QUOTENAME(@DatabaseName) + '.sys.tables AS T
			INNER JOIN ' + QUOTENAME(@DatabaseName) + '.sys.schemas AS S
				ON S.schema_id = T.schema_id
			INNER JOIN ' + QUOTENAME(@DatabaseName) + '.sys.partitions AS P
				ON P.object_id = T.object_id
			INNER JOIN ' + QUOTENAME(@DatabaseName) + '.sys.indexes AS I
				ON I.object_id = P.object_id
				AND I.index_id = P.index_id
			CROSS APPLY sys.dm_db_index_physical_stats(
				/* database_id */ DB_ID(''' + @DatabaseName + '''),
				/* object_id */ T.object_id,
				/* index_id */ I.index_id,
				/* partition_number */ P.partition_number,
				/* mode */ NULL) AS PIS
		WHERE
			T.type = ''U'' -- user tables only
		ORDER BY
			SchemaName,
			TableName,
			PartitionNumber,
			IndexId
		;
		';

	EXECUTE (@SQL);

	SET @DatabaseKey += 1;

END

SELECT * FROM #Data;