What is a projection?

The Book Blog

Tom Tango examines some assumptions we make about forecasting player performance, and uses the race for 2009 HR champ to illustrate how the high variability of performance means that our forecasts, even if we show just one number, really outline a range of scenarios. In the most optimistic one, perhaps Albert Pujols plays the entire season, faces weaker than usual pitchers, and matches the scenario at the top of the forecaster’s scale.

In the most pessimistic scenario, Pujols throws his back out the day after the forecast is made, and doesn’t play all season long. So, if he might hit 50 at the top end and 0 at the bottom end, what is a good projection for Pujols?

I’m not a fan of assigning percentile probabilities, as PECOTA does, because the percentiles don’t correspond to anything real. From the comments on this post of Tom’s, I gather that PECOTA applies the same distribution rules to all players, which may match the knowledge we have now, but certainly doesn’t give us any more information.

To make my projections I’ve run regressions that give me a baseline formula for using the information available in a player’s past performance, modified mostly by age. The problem with this approach is that the regression absorbs the volatility of the sample and spreads it across every player. So, if I apply the formula to the Top 100 projected hitters, I get a projected number of at bats (and other stats) about 10 percent less than the Top 100 hitters produced the preceding year. That loss can be attributed mostly to injuries, since these are generally reliable players.

The problem is that applying this loss across the board makes all the projections look weak. No player gets 600 AB, nobody hits 40 homers, things just look wrong. This is exactly what Tom Tango’s “dumb” projection system, Marcel the Monkey, does. Marcel only looks at previous stats and age and applies its regression formula. This gives an excellent projection of where production will come from and how much production will come, but while it tests well it doesn’t look right.
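Here is a minimal sketch of why an across-the-board discount makes every line look weak. The scenario chances and at-bat counts are made up for illustration; they are not taken from Marcel, PECOTA, or my own formula.

```python
# Hypothetical scenarios for one reliable hitter: (chance in percent, at bats).
# These numbers are illustrative assumptions, not data from any projection system.
SCENARIOS = [
    (85, 600),  # healthy all season
    (15, 200),  # loses most of the season to injury
]


def expected_at_bats(scenarios):
    """Return the probability-weighted mean of at bats across the scenarios."""
    total_pct = sum(pct for pct, _ in scenarios)
    if total_pct != 100:
        raise ValueError("scenario chances must add up to 100 percent")
    return sum(pct * ab for pct, ab in scenarios) / total_pct


mean_ab = expected_at_bats(SCENARIOS)
healthy_ab = max(ab for _, ab in SCENARIOS)
# How far the mean sits below the healthy-season total, in percent.
shortfall_pct = 100 * (healthy_ab - mean_ab) / healthy_ab

print(f"mean projection: {mean_ab:.0f} AB")
print(f"shortfall vs. healthy season: {shortfall_pct:.0f}%")
```

Under these assumed numbers the hitter is far more likely than not to reach 600 AB, yet the mean projection sits 10 percent below that. Apply the same logic to every player and nobody projects for a full season, even though the group total comes out about right.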

Projections are pretty limited in their applicability to allied uses, like team forecasts, but they are a good way to present the information about what is expected of a player. Does he run? Does he hit for power? The projection aggregates the information we have about a player and comes up with a compromise view that helps us smooth over the ups and downs of individual statlines. But to make it look right, you have to add that 10 percent of at bats and stats back into the individual lines, even though this means projecting too many stats overall.

Nowhere did this become evident more quickly than in the chart Tom ran in his post showing Marcel’s top 13 HR projections for 2009. In the first column are the Marcel forecast HR for each player. In the second column is the player’s name. In the third column are my 2009 projected homers. In the fourth column are my projected homer totals reduced by 10 percent. In the fifth column are the actual total homers.

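| Marcel Proj | Hitter | PK Proj | PK Adj | Actual |
|---|---|---|---|---|
| 40 | Howard, Ryan | 46 | 41 | 45 |
| 32 | Rodriguez, Alex | 24 | 22 | 30 |
| 32 | Fielder, Prince | 33 | 30 | 46 |
| 32 | Dunn, Adam | 37 | 33 | 38 |
| 32 | Braun, Ryan | 40 | 36 | 32 |
| 31 | Pujols, Albert | 38 | 34 | 47 |
| 31 | Pena, Carlos | 33 | 30 | 39 |
| 30 | Thome, Jim | 34 | 31 | 23 |
| 29 | Dye, Jermaine | 34 | 31 | 27 |
| 28 | Delgado, Carlos | 27 | 24 | 4 |
| 28 | Cabrera, Miguel | 38 | 34 | 34 |
| 28 | Berkman, Lance | 36 | 32 | 25 |
| 28 | Beltran, Carlos | 31 | 28 | 10 |
| 401 | TOTAL | 451 | 406 | 400 |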
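The arithmetic behind the table can be checked with a short script. The rows are copied from the table above; the rounding rule (nearest whole homer) is my assumption about how the adjusted column was produced, though it does reproduce every value shown.

```python
# Rows copied from the table: (hitter, marcel, pk_proj, pk_adj, actual).
ROWS = [
    ("Howard, Ryan", 40, 46, 41, 45),
    ("Rodriguez, Alex", 32, 24, 22, 30),
    ("Fielder, Prince", 32, 33, 30, 46),
    ("Dunn, Adam", 32, 37, 33, 38),
    ("Braun, Ryan", 32, 40, 36, 32),
    ("Pujols, Albert", 31, 38, 34, 47),
    ("Pena, Carlos", 31, 33, 30, 39),
    ("Thome, Jim", 30, 34, 31, 23),
    ("Dye, Jermaine", 29, 34, 31, 27),
    ("Delgado, Carlos", 28, 27, 24, 4),
    ("Cabrera, Miguel", 28, 38, 34, 34),
    ("Berkman, Lance", 28, 36, 32, 25),
    ("Beltran, Carlos", 28, 31, 28, 10),
]


def reduce_by_10_percent(hr):
    """Cut a homer total by 10 percent, rounded to the nearest whole number.

    Integer arithmetic avoids floating-point surprises; the rounding rule
    itself is an assumption, not something stated in the post.
    """
    return (hr * 9 + 5) // 10


marcel_total = sum(row[1] for row in ROWS)
pk_total = sum(row[2] for row in ROWS)
pk_adj_total = sum(row[3] for row in ROWS)
actual_total = sum(row[4] for row in ROWS)

# Recompute the adjusted column from the unadjusted projections.
recomputed_adj = [reduce_by_10_percent(row[2]) for row in ROWS]

# Total of the per-player misses for Marcel, ignoring direction.
marcel_abs_error = sum(abs(row[1] - row[4]) for row in ROWS)

print("totals:", marcel_total, pk_total, pk_adj_total, actual_total)
print("adjusted column reproduced:", recomputed_adj == [row[3] for row in ROWS])
print("Marcel summed absolute error:", marcel_abs_error)
```

The group totals land within a few homers of the actual 400, but the per-player misses for Marcel add up to 111 homers across the 13 hitters, which is the gap between getting the group right and getting any one player right.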

As Tom concludes, we can get the right number for the group. The real question is: what do we want the projection to do? The post is well worth reading, as are the comments, if you’re interested in this murky side of the sabermetrics game.