Before you begin to play around with a dataset, you must first understand what data types it has.
Earlier in this series, I reviewed two main types of data:
- Continuous data.
- Categorical data.
Today, you’ll see how to quickly inspect a dataset in R, and how these two types of data become relevant during that process.
The first thing to do upon entering the fun world of playing around with data is to figure out what sort of data you’re dealing with. What kind and measure of observations does your dataset contain? What facts, attributes, features or variables do those observations expose?
Side Note: For those comfortable with SQL, think rows and column types, but don’t get too comfortable! With R, a table isn’t all that tabular, so be careful about what you call them!
The answers to such questions are not always obvious. Depending on the specifics of your work, you may or may not have control over, or prior knowledge of, the structure or design of the dataset. You might have to work with a dataset another team has made available to you, or you may not remember the contents of a dataset you worked with previously.
But regardless of the reason for your knowledge gap, the good news is that a couple of basic R functions will show you very quickly what you’re looking at.
For the examples below and in the following posts, I’m making use of the Iris dataset. This dataset comes with R and contains four different measurements pertaining to 150 flowers, with 50 flowers per species of Iris (Setosa, Versicolor, Virginica).
The Iris dataset was first introduced in Ronald Fisher’s 1936 paper The Use of Multiple Measurements in Taxonomic Problems, and has since become a popular test case for a number of statistical classification techniques. (You can also produce funky scatter plots with it, but more on that later!)
The measurements (columns, kinda) for each observation (rows, kinda) are the length and width of the petals and sepals of each flower. The dataset doesn’t contain actual pictures of the flowers… but then, it was created in 1936!
In R, you can find the Iris dataset in the datasets package which comes bundled with a regular installation of R. There’s nothing to it! All you have to do is find it and then load the package into your session, which you can do using the library() function.
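As a minimal sketch, the call looks like this. Note that, as far as I know, a default R session already attaches the datasets package at startup, so running this explicitly is harmless rather than strictly required:

```r
# Attach the datasets package, which ships with a regular R installation.
# In a default session it is usually attached already, so this is a no-op.
library(datasets)
```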
This will produce no output but will make a number of popular test datasets available to you, including the Iris dataset. You can take a look at what’s available in the package:datasets environment. Here’s the environment window in RStudio:
See that iris variable there? That’s what you want.
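If you prefer the console to the environment window, you can get the same information with ls(). This is only a sketch; the variable name available is my own choice for illustration:

```r
library(datasets)

# List everything exported by the attached datasets package.
available <- ls("package:datasets")

# Peek at the first few names, then check that iris is among them.
head(available)
"iris" %in% available  # TRUE
```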
Next, and only if you want, you can use the data() function to load the Iris dataset into your Global Environment.
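A minimal sketch of that call, with an extra line that simply confirms where the variable ended up:

```r
# Copy the iris dataset into the Global Environment.
data(iris)

# Confirm that a variable called iris now lives in the Global Environment.
exists("iris", envir = globalenv(), inherits = FALSE)
```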
This isn’t a necessary step, but doing things this way means you’ll be able to see the iris variable in your Global Environment, which could be useful depending on what you’re doing.
Columns – Kinda
Now that you have the dataset loaded, it’s time to figure out what it consists of to begin with. To discover the structure of any dataset — or any variable, for that matter — you can use the str() function:
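For the Iris dataset, the call is a one-liner:

```r
# Print a compact description of the structure of the iris data frame.
str(iris)
```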
This outputs a helpful description of the structure of the dataset, along with some sample values:
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
On each line you can see a description for each variable of this dataset, along with its data type and a list of sample values. You can also see that the dataset has 150 observations. And you can see that the variable type itself is a data frame. In R, a data frame is nothing but a list of vectors of equal length — a table, for all intents and purposes; and so, in this case, observations and variables do map to rows and columns as they would in a common database table.
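If you want to convince yourself of the "list of vectors of equal length" claim, a quick sketch like the following should do it. These are all base R functions, and nothing here changes the dataset:

```r
# A data frame is built on top of a list...
is.list(iris)
class(iris)

# ...whose elements (the columns) all have the same length.
lengths(iris)

# Rows and columns, in that order.
dim(iris)  # 150 5
```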
Also, notice how the first four attributes (lengths and widths) are num (numeric), while the species is a factor. This is where that continuous vs. categorical data chat from my previous post ties in (so you should probably read that before proceeding!).
The first four numerical variables hold continuous data, since widths and lengths are continuous quantities. The species factor variable contains categorical data because each flower belongs to exactly one of a fixed set of Iris species. A flower cannot belong to more than one species at the same time. (Unless it’s some sort of X-Men flower, of course, but that’s beyond the scope of this post!)
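You can also check the type of each column directly instead of reading it off the str() output. A small sketch, using only base R:

```r
# Report the class of every column in one go.
sapply(iris, class)

# The continuous measurements are plain numeric vectors.
is.numeric(iris$Sepal.Length)

# The categorical variable is a factor with a fixed set of levels.
is.factor(iris$Species)
levels(iris$Species)  # "setosa" "versicolor" "virginica"
```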
Anyway, the way you analyse these attributes will differ depending on what role they play, which is why it’s important to keep this in mind.
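As a rough illustration of that difference (one possible first look, not the analysis itself), a continuous variable is usually summarised with quantiles and a mean, while a categorical one is summarised with counts per category:

```r
# Continuous data: minimum, quartiles, mean and maximum.
summary(iris$Sepal.Length)

# Categorical data: number of observations per species (50 each).
table(iris$Species)
```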
Well, now that you’re a master of the str() function… chances are that you can forget about it for the most part! That is, at least, if you’re running a capable R IDE. In RStudio, for example, you can see the structure of any variable by clicking on the little arrow next to it.
Still: nice to know what’s running under the hood!
Rows – Kinda
Last, but not least, if you don’t quite like the presentation of str(), you can get a sample of results back in tabular format.
For this you can use the head() function, like this:
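With no extra arguments, the call is simply:

```r
# Return the first rows of the iris data frame (six by default).
head(iris)
```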
This will return the first six rows of the dataset, nicely formatted as a table:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
If this isn’t enough for you, then you can request more rows: for example, the top 100.
head(iris, n = 100)
Note that head(), unlike str(), won’t show you the variable types along with the names. It will, however, return a proper data frame of its own, which you can use for other purposes.
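Here is a small sketch of what that means in practice. The variable name first_rows is just something I picked for illustration:

```r
# Keep the first 100 rows in a variable of their own.
first_rows <- head(iris, n = 100)

# The result is still a data frame...
class(first_rows)
nrow(first_rows)  # 100

# ...so you can inspect it with str() like any other dataset.
str(first_rows)
```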
OK! That’s it. As I said, there’s nothing to it. Inspecting datasets in R is dead easy.
Make sure to keep an eye on my blog: in the next few posts, I’m going to start analysing this dataset more extensively with regard to its continuous and categorical data.
See you next time.