Let me count the ways…
When you set out to analyse some good old data, you tend to come across two general types of said data: Categorical data and Continuous data.
You can pretty much infer which is which from their names, but today is as fine a day as any to recall the basics.
Red, Green, Blue
Categorical data is what you use to describe discrete properties of things, in a meaningful sense, and where the possible values for those properties are few, or at least enumerable.
Think of a pumpkin’s variety or a pumpkin’s colour…
These categories are labels that have meaning to us meaning-seeking humans. These labels also don’t tend to overlap in what they cover, at least if you’re not very picky.
Of course, you can argue a Red Kuri squash and a Sugar pumpkin can look similar, but they still taste different enough that we can consider them distinct and use different names to group them.
The same goes for colour. Granted, there are many shades of red, green and blue, and you could make a nerdy argument for the continuity of the colour spectrum. It’s still reasonable to use the generic label of Red to describe all reddish-looking pumpkins.
This doesn’t mean you shouldn’t pick your pumpkins well, of course!
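In R, you use factor variables to hold categorical data. A factor stores each value as a small integer code that points into a list of the distinct labels, called levels. That tends to be more compact than repeating the strings themselves, although R also caches strings internally, so the saving is often smaller than you’d expect. Some R data functions produce factors by default; note that since R 4.0.0, data.frame() and read.table() only convert strings to factors if you ask them to with stringsAsFactors = TRUE. Many other functions expect factors as input where you’d expect to pass in strings instead. Factors are roughly the R equivalent of enumeration types in .NET, in that both are named labels backed by integers. You can read more about these in the documentation.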
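To make that a bit more concrete, here's a minimal sketch of building a factor by hand. The pumpkin colours are made-up values for illustration only; factor(), levels(), as.integer() and table() are all base R.

```r
# Hypothetical pumpkin colours, stored as plain strings
colours <- c("Red", "Green", "Red", "Blue", "Green", "Red")

# Convert to a factor: each distinct string becomes a level
colour_factor <- factor(colours)

# The distinct labels (sorted by default)
print(levels(colour_factor))

# The integer codes R keeps for each value, pointing into the levels
print(as.integer(colour_factor))

# How many pumpkins fall under each colour
print(table(colour_factor))
```

If you need the levels in a particular order rather than the default sorted one, you can pass them explicitly, e.g. factor(colours, levels = c("Red", "Green", "Blue")).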
1, 2, 3…
Continuous data is what you use to describe properties whose values are infinite in nature, or at least more than you care to count or name. Continuous data is also comparable.
Think of a decent pumpkin’s weight. This can vary anywhere from a couple of lbs to a whopping 2,323 freaking lbs. At least that’s the last world record for humongous pumpkins I’ve heard of. Not sure who would buy that for their pie but I’m positive it’d make an excellent Halloween prop.
Back on point, you’ll also notice weight is comparable, as it is a continuous quantity. 2 lbs is more than 1 lb, which is more than 0.5 lbs, which is more than 0.45 lbs, and so on ad infinitum.
Along with weight, you can think of size, age, expiry date and even price, if you’re buying or selling.
These properties have almost limitless possible values. Almost, as you couldn’t have a pumpkin bigger or older than the planet it grows on, for obvious logistical and human survival reasons. But ignore those improbabilities, and the possible values are only limited by the machine you store them on.
In R, you just use numeric (floating point) or integer variables to hold continuous data.
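As a quick sketch of what that looks like, here are a numeric and an integer vector, along with the kind of comparisons continuous data allows. The weights and counts are invented for illustration; everything else is base R.

```r
# Hypothetical pumpkin weights in lbs, stored as numeric (double precision)
weights <- c(0.45, 0.5, 1, 2)

# Hypothetical pumpkin counts per crate; the L suffix makes them integers
counts <- c(3L, 12L, 7L)

print(class(weights))
print(class(counts))

# Continuous values can be compared and ordered
print(weights[4] > weights[3])
print(sort(weights, decreasing = TRUE))

# ...and summarised numerically
print(range(weights))
print(mean(weights))
```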
Whether to use a categorical or continuous variable for a given property isn’t always black-and-white.
Imagine you’re working for a mass pumpkin producer and you want to figure out which pumpkins sell more and why. Maybe you’re investigating the pumpkin’s weight. Are you interested in knowing the exact weight most people tend to favour? Or are you happy identifying some simpler intervals, such as less than 10 pounds, between 10 and 15 pounds, more than 15 pounds, and so on? Perhaps these intervals are more meaningful to the producers than floating-point values. If so, you may find it more valuable to work with categorical data instead of continuous data.
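One way to turn continuous weights into intervals like these in R is the base function cut(), which returns a factor. This is only a sketch: the weights are made up, and where the boundaries fall (here, exactly 10 lbs goes into the middle band and exactly 15 lbs into the top one) is my assumption, so adjust the breaks and labels to whatever the producers actually care about.

```r
# Hypothetical pumpkin weights in lbs
weights <- c(4.2, 9.9, 10, 12.5, 15, 18.3, 31)

# Bin the weights into three bands.
# right = FALSE makes each interval closed on the left: [10, 15), [15, Inf)
weight_band <- cut(
  weights,
  breaks = c(-Inf, 10, 15, Inf),
  labels = c("Less than 10 lbs", "10 to 15 lbs", "15 lbs or more"),
  right = FALSE
)

# The result is a factor, i.e. categorical data
print(class(weight_band))

# How many pumpkins fall into each band
print(table(weight_band))
```

From here on you can treat weight_band like any other factor, for example to count or compare sales per band.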
In the end, the type of data you use depends a lot on what you want to focus on.
Anyway, that’s it for good old basics review. In the next post, I’ll talk about how you can view the structure of a given dataset in R.