Albert Pujols, one of the best baseball players in the game today, did not start the 2011 baseball season well. Prior to 2011 he had a lifetime batting average (BA) of .331. That's almost one hit for every three at bats! However, much to the dismay of Cardinals fans (of whom I am one), he had a meager .245 BA for the first month of the 2011 season.
When I first saw this statistic I wept. Had the great Pujols lost his mojo?
It occurs to me now, as I look back at my teary-eyed reaction, that I had fallen victim to a classic blunder with time series data (i.e., data measured over time): I failed to consider the time period.
I believe that when practitioners work with time series data they must always keep in mind when the series begins and ends. This is critical. These boundaries directly influence what one is trying to measure. Setting the wrong boundaries can result in biased estimators, which in turn can give you faulty models and a very poor performance evaluation at the end of the year.
Consider the following table of BAs for Albert Pujols:
The table shows that the choice of start and end dates for the time series greatly impacts the statistic. Depending on which time period is chosen, one could argue that Mr. Pujols is hitting worse than, roughly equal to, or better than his lifetime BA.
Note: I included the best and worst 25 days to show that one can do some serious damage cherry-picking data.
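To make the point concrete, here is a minimal Python sketch of how the same game log yields different BAs depending on the window. The ten-day log below is made up for illustration (it is not Pujols's actual data), and the names GAME_LOG, batting_average, and window_extremes are my own hypothetical helpers, not from any library.

```python
from typing import List, Tuple

# Hypothetical daily game log: (hits, at_bats) for each day.
# These numbers are invented for illustration only.
GAME_LOG: List[Tuple[int, int]] = [
    (1, 4), (0, 4), (2, 4), (1, 3), (0, 4),
    (3, 5), (1, 4), (2, 4), (0, 3), (1, 4),
]


def batting_average(log: List[Tuple[int, int]]) -> float:
    """BA over a span of days: total hits divided by total at bats."""
    hits = sum(h for h, _ in log)
    at_bats = sum(ab for _, ab in log)
    return hits / at_bats if at_bats else 0.0


def window_extremes(log: List[Tuple[int, int]], days: int) -> Tuple[float, float]:
    """Worst and best BA over every consecutive window of `days` days."""
    if days < 1 or days > len(log):
        raise ValueError("window must be between 1 and len(log) days")
    averages = [
        batting_average(log[start:start + days])
        for start in range(len(log) - days + 1)
    ]
    return min(averages), max(averages)


if __name__ == "__main__":
    worst, best = window_extremes(GAME_LOG, 3)
    print(f"Whole log:         {batting_average(GAME_LOG):.3f}")
    print(f"Worst 3-day window: {worst:.3f}")
    print(f"Best 3-day window:  {best:.3f}")
```

The whole-log figure always sits somewhere between the worst and best windows, which is exactly why picking the window after looking at the data is so dangerous.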
Sooooo……? What data should be used? Some data? All data?
From my experience, I have found that there is no universal and tidy answer to these questions. I have learned that the correct time period is the one that best (legitimately) solves my problem. If prudent thought goes into the decision then one, I believe, is on solid footing.
In the case of El Hombre’s BA? The entire regular season would be the appropriate time period. And his mojo? He’s still got it – even during an “off” year.
I’d like to thank Sean at Sports Reference LLC (http://www.sports-reference.com/) for making my analysis so much easier.