On the Importance of Looking at Your Data
I’ve long been meaning to mess around with the Lahman database. Baseball season is right around the corner (sort of), and I thought now would be a good time to look at what I can do with this. In the process of poking around, I realized that I had encountered a very important lesson on looking at your data. I mean looking at it. Making plots, tables, and so forth. I haven’t done any real analyses of these data yet, but it seemed like a good time to post what I’ve done.
I thought I’d look at how runs scored change over time - both historically and over the course of an individual player’s career. So I’m going to start by pulling playerID, team, year, and runs from the Batting table within the Lahman database. I’ll also get playerID and year of birth from the Master table within Lahman.
But if I’m going to predict runs scored, I also need information about how old the player was - age is a crucial predictor of performance. Conventional wisdom suggests that player performance peaks in the late twenties to early thirties, and declines pretty rapidly after that. Raw age is not available, so I’ll have to compute it.
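As a rough illustration of that computation (a Python sketch, not the code from the original analysis), age can be approximated as the season year minus the birth year. The field names `playerID`, `yearID`, `teamID`, and `R` are assumed to mirror the Lahman columns, and the players and numbers below are made up.

```python
# Hypothetical stand-ins for the Batting extract and the Master birth years.
batting = [
    {"playerID": "examp01", "yearID": 1998, "teamID": "AAA", "R": 12},
    {"playerID": "examp01", "yearID": 1999, "teamID": "AAA", "R": 85},
    {"playerID": "other01", "yearID": 1999, "teamID": "BBB", "R": 40},
]
birth_year = {"examp01": 1972}  # other01 is deliberately missing


def add_age(rows, birth_year):
    """Return copies of rows with an 'age' field: season year minus birth year."""
    result = []
    for row in rows:
        born = birth_year.get(row["playerID"])
        # Leave age as None when the birth year is unknown rather than guessing.
        age = row["yearID"] - born if born is not None else None
        result.append({**row, "age": age})
    return result


with_age = add_age(batting, birth_year)
for row in with_age:
    print(row["playerID"], row["yearID"], row["age"])
```

Because only the year of birth is used, this age can be off by one depending on when in the year the player was born. Looking ages up by `playerID`, rather than relying on row order, also keeps each age attached to the right player.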
That looks about as expected. Let’s take a look at the distribution of runs scored:
Okay, also, not surprising. Most of the observations in this dataset probably come from players who had a cup of coffee and that was it. I’m not really interested in these individuals, so I think we’ll limit our sample to players who had a bit more experience. We’ll use at-bats as a proxy for this, so I’ll need to bring that into the dataframe.
Looks like another Poisson distribution. However, it also looks like we can see a bump out at around 550 or so. Let’s zoom in on those, just to gratify my curiosity.
Uh huh. Looks like we’ve got a mixture of distributions here. I’m not really sure how to handle this. I wonder what happens if I look at the number of games played too.
Okay, there’s some other odd stuff going on here. I’m not sure what that bump at around 40 is. Anyone else? Regardless, we’re seeing the same kind of bump out near the end. The corresponding bump in AB should represent players who basically played a full season. Let’s see what happens to the AB distribution if we select players who had more than 150 games.
Oh man, look at that distribution! I suppose I could cut off some of those lower values, but I think this looks pretty nice! Now, let’s look at runs scored.
Mystifying. How could someone play 150 games without scoring a single run? Let’s take a look at a few observations.
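Pulling out the offending rows is a cheap way to check a result like this. Here is a minimal Python sketch (again, not the original code): `zero_run_regulars` is a hypothetical helper, the field names `G` and `R` are assumed to match the Lahman columns for games and runs, and the rows are invented.

```python
# Hypothetical player-seasons; G is games played, R is runs scored.
seasons = [
    {"playerID": "examp01", "yearID": 2001, "G": 158, "R": 97},
    {"playerID": "examp02", "yearID": 2002, "G": 153, "R": 0},   # implausible
    {"playerID": "examp03", "yearID": 2002, "G": 12, "R": 0},    # plausible
]


def zero_run_regulars(rows, min_games=150):
    """Return rows with more than min_games games but no runs scored."""
    return [r for r in rows if r["G"] > min_games and r["R"] == 0]


for row in zero_run_regulars(seasons):
    print("check against an outside source:", row)
```

Any row this returns is worth comparing with an independent record before going further.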
Row 569 belongs to a player named Benny Agbayani, who played for the Red Sox in 2002 (as well as the Rockies). Baseball-reference tells me that he played in 61 games, had a total of 154 ABs that year, and scored a total of 15 runs. So clearly these data aren’t right. Let’s look at the original dataframe to be sure I didn’t do something wrong when I extracted this.
That confirms it. I’ve goofed somewhere. Let’s track this down.
That’s fine. Next thing I did was:
Next:
AH! That worked okay, but for some reason the rows have been shuffled around a little bit. Before, the dataframe was organized alphabetically by player ID, and chronologically by year within player. Now the years within each player are out of order, so any column I attached afterwards by row position ended up on the wrong rows. I can’t really see any logic to the new ordering within player. Maybe reverse chronological? Regardless, this is an easy fix. I’ll go ahead and recompute the age, and then extract ABs and games played in a way which will assign the values correctly. To do this, I’m going to use Hadley Wickham’s plyr package.
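The general idea behind the fix is to attach columns by matching keys rather than by row position. The sketch below shows that idea in plain Python. It is not the plyr code used here, `attach_by_key` is a hypothetical helper, and the key of player, year, and team is an assumption about what identifies a row.

```python
KEYS = ("playerID", "yearID", "teamID")

# Hypothetical source rows in their original (chronological) order.
source = [
    {"playerID": "examp01", "yearID": 2000, "teamID": "AAA", "AB": 510, "G": 152},
    {"playerID": "examp01", "yearID": 2001, "teamID": "AAA", "AB": 48, "G": 20},
    {"playerID": "examp01", "yearID": 2002, "teamID": "BBB", "AB": 300, "G": 95},
]
# The same player-seasons after an earlier step reordered them.
target = [
    {"playerID": "examp01", "yearID": 2002, "teamID": "BBB", "R": 41},
    {"playerID": "examp01", "yearID": 2000, "teamID": "AAA", "R": 88},
    {"playerID": "examp01", "yearID": 2001, "teamID": "AAA", "R": 5},
]


def attach_by_key(target, source, keys, fields):
    """Copy fields from source onto target rows that share the same key values."""
    lookup = {}
    for src in source:
        key = tuple(src[k] for k in keys)
        if key in lookup:
            # A repeated key means the key does not identify rows uniquely.
            raise ValueError(f"duplicate key in source: {key}")
        lookup[key] = src
    merged = []
    for row in target:
        match = lookup.get(tuple(row[k] for k in keys))
        extra = {f: (match[f] if match is not None else None) for f in fields}
        merged.append({**row, **extra})
    return merged


# Positional assignment silently pairs 2002 runs with 2000 at-bats.
positional = [{**t, "AB": s["AB"], "G": s["G"]} for t, s in zip(target, source)]
keyed = attach_by_key(target, source, KEYS, ["AB", "G"])

print("positional:", positional[0])
print("keyed:     ", keyed[0])
```

The positional version runs without complaint, which is exactly why the mistake was easy to miss. The keyed version raises an error if the key turns out not to be unique, so that problem would surface immediately.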
Okay. Mucho mejor! Let’s re-examine those distributions which had looked so good before. This is the bump in ABs.
Not too much of a difference, but it is slightly different. Next, limit the data to only those with more than 150 games played:
Nice! That even might be slightly more symmetric than the original one. Alright, let’s look at the runs scored for this group.
Man that’s gratifying. Look at that distribution. Just look at it! And this leaves us with a good bit of data too, weighing in with 3914 observations. I’d also made a shiny plot here, but I decided that I was spending too much time trying to find some way to get it to display anywhere other than on my local machine. If you really want, run all the code I’ve pasted above, plus the stuff down below here and you can see a nice histogram of runs scored, given some minimal qualifier of games played, from 1 to 162.
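The Shiny app itself isn’t reproduced here. As a non-interactive stand-in, this Python sketch counts seasons per runs-scored bin for a chosen minimum number of games. The helper `runs_histogram`, the bin width of 10, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical player-seasons; G is games played, R is runs scored.
seasons = [
    {"G": 160, "R": 104},
    {"G": 155, "R": 98},
    {"G": 151, "R": 101},
    {"G": 95, "R": 45},
    {"G": 20, "R": 3},
]


def runs_histogram(rows, min_games, bin_width=10):
    """Count seasons per runs bin, keeping rows with at least min_games games."""
    if not 1 <= min_games <= 162:
        raise ValueError("min_games must be between 1 and 162")
    counts = Counter()
    for row in rows:
        if row["G"] >= min_games:
            # Label each bin by its lower edge, e.g. 104 runs falls in bin 100.
            counts[(row["R"] // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


for lower, count in runs_histogram(seasons, min_games=150).items():
    print(f"{lower:3d}-{lower + 9:3d} {'#' * count}")
```

Changing `min_games` plays the role of the slider: lower values let the short-stint seasons back in, and the pile-up near zero returns.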
So, what’s the lesson here? Just to make sure you’re looking at your data. At every step of the way. You never know when you’ll have done something you didn’t intend to, or when some variable looks much different from the way you think it should. In the former case, you’ll need to retrace your steps to find the problem. In the latter case, you may have to rethink your analyses.