I recently found a pretty interesting dataset I thought I’d examine a little bit. I took the following data from menustat.org. You can easily download the same data by using their ‘search’ function, and having it return all data for all available years. Once done, there’s a button that lets you export the data as a csv.
The size of this dataset also means that some code I wrote off the cuff turned out to be far too slow. Changing things around a little bit resulted in a substantial speed-up. I’ll take us through some data prep and then show two pieces of code that do the same thing, one of which runs significantly faster. Typically, I don’t deal with datasets where thinking about speed offers much payoff - everything runs fast when you’re only dealing with 150 observations. In fact, I’ve previously mentioned that this idea of “powerful” code is something I’m not too concerned about. Because of its size, though, the present case is a bit different.
Read the data
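A plain read.csv() call is the obvious first move - something along these lines (the file name here is just a stand-in for whatever the export is called):

```r
# Straightforward first attempt: read the exported csv directly
menu <- read.csv("menustat_export.csv")
```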
Aw damn. Well, that didn’t work. Why not? The error message says that there’s a mismatch between the number of columns and the number of names it has for those columns. Usually, this function reads the column names from the first line of the csv, so I’m guessing that something is wrong with the first line of the file. Opening it up in a text editor shows the following:
See the big chunk of text next to the 1 on the left-hand side? That’s everything that appears in the first line, and we can see that it’s just a description of the file. We can get rid of that. Line 2 seems to have the column names we want. While we’re at it, we can also see that there are a bunch of empty cells that look like this: ‘ -’ (a tab followed by a dash). When we read this into R, that will be represented as ‘\t-’. There are also some spots where there’s a tab without the dash (i.e. just ‘\t’). We’re gonna get rid of those too. So we’re going to take a slightly different approach from just reading in the csv:
Read the file, line by line, into one big character vector.
Remove the first line.
Replace any ‘\t-’ or stray ‘\t’ with a simple blank cell.
Turn the vector into a dataframe.
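In code, those four steps look roughly like this (the file name and object names are stand-ins, and the exact patterns for the stray tab/dash cells are a guess at what the export contains):

```r
# 1. Read the file, line by line, into one big character vector
raw_lines <- readLines("menustat_export.csv")

# 2. Drop the first line (the file description); the header row is now first
raw_lines <- raw_lines[-1]

# 3. Replace any "\t-" cells, and any leftover stray tabs, with nothing
raw_lines <- gsub("\t-", "", raw_lines)
raw_lines <- gsub("\t", "", raw_lines)

# 4. Turn the cleaned vector back into a dataframe
menu <- read.csv(text = raw_lines)
```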
That last line takes a while - about a full minute on my desktop machine. If there’s a way to speed that up, I don’t know it. At any rate, now we’ve got a big, beautiful dataframe of menu data! Let’s get some information about it:
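Sticking with the menu object from the sketch above, dim() gives the row and column counts:

```r
dim(menu)  # rows, columns
```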
We’ve got 60,238 observations across 52 different variables. Let’s get a little information about those variables.
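Again assuming the dataframe is called menu, names() plus the column classes shows what we’re working with:

```r
names(menu)          # variable names
sapply(menu, class)  # how each column is stored
```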
We can now see the names for our variables. Looks like there’s one for restaurant, one for food category, one for the item, and then we have a bunch of variables which are repeated for each of 3 years: 2014, 2013, 2012. These variables describe the item, give us a serving size, and then the nutrient information (e.g. calories, fat, carbs, protein, etc.). We can also see that everything, with a couple of exceptions, is stored as a factor, which isn’t ideal. Let’s replace some of these things with characters:
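A minimal sketch of that conversion - the column names here (restaurant, food_category, item) are my guesses at how the export labels them, so check names(menu) for the real ones:

```r
# Convert the identifier columns from factor to character
# (column names are assumptions about the export's labels)
id_cols <- c("restaurant", "food_category", "item")
menu[id_cols] <- lapply(menu[id_cols], as.character)
```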
Clean the data
Now, in order to do any serious quantitative analyses, we should convert these characters to numeric variables. Unfortunately, because of the way the data are recorded, there are lots of values that say something like ‘25-30’ (e.g. for total fat in a serving) or ‘<1’. When we convert these variables, those observations will be coerced to NA and effectively lost. We could take some steps to preserve them, but it isn’t clear what such values should be replaced with, so we’ll just leave them as NA.
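One way to do it, as a rough sketch - it assumes the per-year measurement columns sit after the identifier columns, and in practice you’d leave genuinely textual columns (like serving descriptions) alone. Anything that can’t be parsed, like ‘25-30’ or ‘<1’, comes out as NA:

```r
# Assume columns 4 onward hold the per-year measurements
num_cols <- 4:ncol(menu)
menu[num_cols] <- lapply(menu[num_cols], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})
```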
The next thing I’d like to do is reshape this data a bit. Right now, we’ve got three years of observations for each menu item, all on the same row (with multiple variables measured for each year). I’d like to rearrange this so that there’s one row per item per year - otherwise known as ‘tidy data’. The first thing I need to do is melt the dataframe. I do that below and print out 5 random rows of data.
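With reshape2, the melt step looks roughly like this (again, the id column names are assumptions):

```r
library(reshape2)

# Keep the identifier columns fixed; stack all the per-year measurement
# columns into a long 'variable' / 'value' pair
menu_melted <- melt(menu, id.vars = c("restaurant", "food_category", "item"))

# Print 5 random rows to see the long format
menu_melted[sample(nrow(menu_melted), 5), ]
```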
The first thing I want to attack is the second-to-last column in this new dataframe: variable. We can see that it’s composed of two pieces of information - the variable measured and the year - and I’d like to split those into separate vectors. Here’s where I started getting annoyed at how long my initial attempt was taking.
Version 1
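Version 1 runs the regular expressions over the full variable column - a character vector of nearly 3 million elements. A sketch of the idea, assuming the levels look like ‘calories_2014’ (the exact patterns are guesses):

```r
# Operate on every value in the ~2.9-million-row column at once
var_chr <- as.character(menu_melted$variable)

# Strip the trailing "_year" to get the measure name
measure <- gsub("_\\d{4}$", "", var_chr)

# Keep only the four-digit year
year <- gsub("^.*_(\\d{4})$", "\\1", var_chr)
```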
Version 2
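Version 2 applies the same regular expressions to the 48 factor levels and then maps the results back onto the full column through the factor’s integer codes. A sketch under the same assumptions:

```r
# Operate on the 48 unique levels instead of the full column
lev <- levels(menu_melted$variable)

measure_lev <- gsub("_\\d{4}$", "", lev)
year_lev    <- gsub("^.*_(\\d{4})$", "\\1", lev)

# Expand back to full length via the factor's integer codes
measure <- measure_lev[as.integer(menu_melted$variable)]
year    <- year_lev[as.integer(menu_melted$variable)]
```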
Want to place a bet on which one runs more quickly? In version one, we’re performing the operation on vectors that are nearly 3 million observations long (length = 2891424), but we’re doing it all at once. Version two, on the other hand, boils these long vectors down to their unique values (as designated by the levels of the factor), performs the operation on that considerably shorter vector (length = 48), and then replaces all common values at once.
The winner?
System time version 1
System time version 2
I believe this is a prime example of why aiming to vectorize your code in R is a good thing. I use a hedge there because I’m not quite sure this is really an example of vectorization. I mean, sure, I applied the regular expressions to the levels of a factor (which are themselves a character vector) and replaced the originals with the modified values. But in version 1, wasn’t I also applying the regular expression to the values of a vector (even if that vector was considerably longer)?