Last time, I examined the relationship between macronutrients and total calories in the data obtained from menustat.org. If you’ll recall, there was something curious about a model I fit. While my coefficients were close to what would be expected, they were not perfect, and the way in which they weren’t perfect was intriguing. Here is the model:
Note that the estimates in the model are very close to what we would expect, based on the lawful relationship between each macronutrient and calories:
Fats: 9 kcal/gram
Proteins: 4 kcal/gram
Carbohydrates: 4 kcal/gram
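As a quick worked example of the conversion factors listed above, the calories implied by an item's macronutrients can be computed directly. This is only a minimal sketch: the function name and the example item are invented for illustration and do not come from the menustat.org data.

```python
# Conventional conversion factors (kcal per gram), as listed above.
KCAL_PER_GRAM = {"fat": 9, "protein": 4, "carbs": 4}


def expected_calories(fat_g, protein_g, carbs_g):
    """Return the calories implied by the macronutrient conversion factors."""
    return (
        KCAL_PER_GRAM["fat"] * fat_g
        + KCAL_PER_GRAM["protein"] * protein_g
        + KCAL_PER_GRAM["carbs"] * carbs_g
    )


# Hypothetical item: 10 g fat, 20 g protein, 30 g carbohydrates.
print(expected_calories(10, 20, 30))  # 9*10 + 4*20 + 4*30 = 290
```

A model that recovered these factors exactly would have coefficients of 9, 4, and 4.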
However, it isn’t perfect. Indeed, the estimates all move in the direction one would expect if there were some kind of social desirability bias at play (i.e. higher than expected protein, lower than expected carbs & fats). To investigate this more closely, I created an index of the random effects which measures how much each restaurant’s nutrition information is biased in socially desirable ways. I refer to the index as ‘reporting tendency’. Negative numbers represent nutrition information which is more socially desirable.
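To make the sign convention of the index concrete, here is one way such an index could be built. The formula below is an assumption chosen only because it matches the description (lower fat and carbs and higher protein all push the value negative); it is not necessarily how the index in this post was computed, and the function name and inputs are hypothetical.

```python
def reporting_tendency(z_fat, z_carbs, z_protein):
    """Combine standardized random effects into one index (assumed formula).

    Lower-than-average fat and carb effects and a higher-than-average
    protein effect all push the index negative, i.e. towards more
    socially desirable reporting.
    """
    return z_fat + z_carbs - z_protein


# Hypothetical restaurant: fat and carb effects below average, protein above.
print(reporting_tendency(-1.0, -0.5, 0.75))  # -2.25
```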
Note that we’ve got a couple of clear outliers. It could be these restaurants which are driving this pattern. Perhaps this is another instance of the data being mis-entered. Let’s dive in and see who these two are.
Remember that these random effects estimates have all been standardized, so when you see that the estimate for fat is -8.6, as it is for Round Table Pizza, that means that the random effect of total fat for Round Table is 8.6 standard deviations below the mean! Clearly, we should examine the raw data for these two companies. Since they both have quite a large number of items (789 for Godfather’s, 1266 for Round Table), I’m going to place them into their own dataframe and then look at a scatter of each macronutrient against total calories.
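For readers who want a reminder of what "standardized" means here, the sketch below converts a list of values to z-scores, so that each value is expressed in standard deviations from the mean. It assumes the sample standard deviation is used, which may differ from what was done for the estimates above, and the input numbers are made up.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


# Made-up values; a z-score of -1.0 means one standard deviation below the mean.
print(standardize([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```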
This doesn’t look like anything too fishy. I’m guessing that the huge effect of carbs in Godfather’s is being driven by the outlier way out at ~175 grams of carbs and 125 calories. Ditto for Round Table, which looks to be driven by the outlier out near 150 grams of fat. Interestingly, the plot for carbohydrates seems to have two distinct groups in both chains, but it is especially pronounced for Round Table. Note all the data points clustered together in a line that is below the larger cluster. It looks like if we define a line that runs through the origin and the point (x=50, y=250), we can just grab the values which fall below it to get a sense of what these items are. Let’s give that a shot.
That looks okay. Let’s get everything smaller than the values in that line.
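The cut described above can be sketched in a few lines. A line through the origin and the point (x=50, y=250) has a slope of 250 / 50 = 5 calories per gram of carbohydrate, so an item falls below it when its calories are less than 5 times its carbs. The item names and values here are hypothetical and only illustrate the filter.

```python
# Slope of the line through the origin and the point (x=50, y=250).
SLOPE = 250 / 50  # 5 kcal per gram of carbohydrate


def below_line(carbs_g, calories):
    """True when an item falls below the line calories = SLOPE * carbs."""
    return calories < SLOPE * carbs_g


# Hypothetical items as (name, carbs in grams, calories).
items = [
    ("cola, large", 65, 260),         # 4 kcal per gram of carbs
    ("cheese pizza slice", 30, 280),  # fat and protein add calories
    ("lemonade", 40, 160),
]

low_cluster = [name for name, carbs, kcal in items if below_line(carbs, kcal)]
print(low_cluster)  # ['cola, large', 'lemonade']
```

An item made up purely of carbohydrates carries 4 kcal per gram, which sits under a 5 kcal per gram line, so this cut should pick up exactly that kind of item.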
Ah. Basically, this is soda, which is typically made up of nothing but carbohydrates.
Okay, let’s remove the two crazy outliers and see about refitting the model.
Okay, items 72354 and 43760. We can remove those without too much difficulty. Let’s just double check that we won’t do too much damage by removing all carb and fat entries with those values.
Checks out okay - the other years are all NA anyway. Let’s remove them and refit the model.
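The removal step can be sketched as follows. Only the two item ids come from the text; the field names, the helper name, and the example rows are hypothetical, and blanking both the carb and fat fields for both items is an assumption about how the removal was done.

```python
OUTLIER_IDS = {72354, 43760}  # item ids named in the text


def blank_outliers(rows, ids=OUTLIER_IDS, fields=("carbs", "fat")):
    """Return copies of rows with the given fields set to None for flagged ids."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row["item_id"] in ids:
            for field in fields:
                row[field] = None
        cleaned.append(row)
    return cleaned


# Hypothetical rows; the nutrient values are invented for illustration.
rows = [
    {"item_id": 72354, "carbs": 90, "fat": 5, "calories": 100},
    {"item_id": 10001, "carbs": 30, "fat": 12, "calories": 280},
]
cleaned = blank_outliers(rows)
print(cleaned[0]["carbs"], cleaned[1]["carbs"])  # None 30
```

Setting the values to missing rather than dropping whole rows keeps the rest of each item's record available when the model is refit.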
Looks a bit better. I guess I was making a mountain out of a molehill. We could continue to play the remove-the-outlier game all day, I’m sure, but I’m not really interested in doing it.