Friday, September 30, 2016

Kenpom esoterica

As you may be aware, I've spent the college basketball offseason upgrading and backfilling the T-Rank website. I've now got player stats back to 2009-10 and team stats back to 2008-09.

For quality control, I checked some of the results against the stats on Kenpom. From 2013 forward, they are basically identical. But for 2012 and earlier, there are small but systematic differences. For example, the raw points per possession for each game (available on the T-Rank "results" page and the Kenpom "Game Plan" page for each team) are usually off by a point or two per 100 possessions.

Based on the nature of the discrepancies, I deduced they were likely caused by differences in calculated possessions. Although "possessions" is pretty much the foundation of tempo-free basketball statistics, it's not an officially kept stat and has to be calculated from other box score stats. The basic formula that both T-Rank and Kenpom use is:

Field Goal Attempts + Turnovers + (Free Throw Attempts * .475) - Offensive Rebounds
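
For anyone who wants to follow along in code, here is a minimal Python sketch of that formula. The function and argument names are illustrative only; they are not taken from T-Rank or Kenpom.

```python
def possessions(fga, to, fta, orb, ft_weight=0.475):
    """Estimate one team's possessions from its box score totals.

    fga = field goal attempts, to = turnovers,
    fta = free throw attempts, orb = offensive rebounds.
    """
    return fga + to + ft_weight * fta - orb


# Ohio State's complete line from the worked example later in this post
osu_poss = possessions(fga=58, to=7, fta=22, orb=16)
print(round(osu_poss, 2))
```

Note that offensive rebounds enter with a minus sign, so any undercount of offensive rebounds inflates the possession estimate.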

The first thing I wanted to check was whether I had different underlying box score data than Kenpom was using. My box scores for those older games are from ESPN, and I spot-checked them against official team box scores to feel confident that they are correct. But I couldn't check what Kenpom was using, because he doesn't publish his box score data for games prior to 2013. I took this as a clue that his box score data for 2012 and earlier is somewhat lacking.

The main stats lacking in old box scores that can affect tempo-free statistics are team rebounds (particularly team offensive rebounds) and team turnovers. Sometimes you'll see box scores that show team totals which are just the sum of player totals, and therefore don't include team rebounds and team turnovers. Most games have a few team rebounds and a couple team turnovers, and if we don't have those stats the calculated possessions will be less accurate.

The ESPN box scores I used do include team rebounds and team turnovers. Other old box scores, like those available at Basketball Reference, do not. This gave me an opportunity to see if I could somehow reproduce Kenpom's results using the incomplete box scores.

And I did! What I figured out is that Kenpom's underlying box score data from that era apparently doesn't break team rebounds out into offensive and defensive, but does have a total rebounds figure that includes team rebounds. So Kenpom knows how many team rebounds each team got, but doesn't know whether they are offensive or defensive team rebounds.

What he apparently did with this data was to assume half of the team rebounds were offensive and half were defensive. I am pretty much positive this is what he did, because doing this also solves another mystery, which is how Kenpom was calculating his rebounding percentages for these older games.
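
That apparent half-and-half assumption can be sketched in a few lines of Python. This is my reconstruction of the logic described above, not Kenpom's actual code, and the function name is hypothetical.

```python
def split_team_rebounds(player_orb, player_drb, total_reb):
    """Assign unclassified team rebounds half to offense, half to defense.

    player_orb / player_drb are the sums of the individual players' rebounds;
    total_reb is the team total that also includes team rebounds.
    """
    team_reb = total_reb - (player_orb + player_drb)
    return player_orb + team_reb / 2, player_drb + team_reb / 2


# Ohio State in the example game below: 10 + 20 player rebounds, 36 total
print(split_team_rebounds(10, 20, 36))
```

An odd number of team rebounds produces half-rebounds, which is why fractional rebound counts can show up in the estimated totals.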

Just for fun, let's walk through an example picked relatively at random: Ohio State's 62-60 loss to Kentucky in the 2011 NCAA tournament. According to T-Rank, Ohio State's PPP that game was 101.6, and Kentucky's was 105.0, based on 59.05 calculated possessions per team:

Team      FGA  FTA  ORB  DRB  TRB  TO  Pts  Poss   PPP
Ohio St.   58   22   16   20   36   7   60  59.45  101.6
Kentucky   48   14    7   25   32  11   62  58.65  105.0
Avg:                                        59.05

Ohio State's offensive rebounding percentage (ORB / (ORB + opponent's DRB)) was 39.0% and Kentucky's was 25.9%.
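
Here is a short Python sketch that reruns the arithmetic for the complete box score. The dictionary layout and helper names are my own assumptions for illustration; the inputs are the figures quoted above, and PPP is expressed per 100 possessions using the average of the two teams' possession estimates.

```python
FT_WEIGHT = 0.475  # free throw multiplier from the possession formula


def possessions(box):
    """Possession estimate for one team from its box score totals."""
    return box["fga"] + box["to"] + FT_WEIGHT * box["fta"] - box["orb"]


def orb_pct(orb, opp_drb):
    """Offensive rebounding percentage: ORB / (ORB + opponent's DRB)."""
    return 100 * orb / (orb + opp_drb)


# Complete totals, team rebounds included
osu = {"fga": 58, "fta": 22, "orb": 16, "drb": 20, "to": 7, "pts": 60}
uk = {"fga": 48, "fta": 14, "orb": 7, "drb": 25, "to": 11, "pts": 62}

# Both teams are assigned the average of the two estimates
game_poss = (possessions(osu) + possessions(uk)) / 2

osu_ppp = 100 * osu["pts"] / game_poss
uk_ppp = 100 * uk["pts"] / game_poss
osu_orb_pct = orb_pct(osu["orb"], uk["drb"])
uk_orb_pct = orb_pct(uk["orb"], osu["drb"])

print(round(game_poss, 2), round(osu_ppp, 1), round(uk_ppp, 1))
print(round(osu_orb_pct, 1), round(uk_orb_pct, 1))
```

Run as written, this should reproduce the T-Rank figures quoted above: 59.05 possessions, PPP of 101.6 and 105.0, and offensive rebounding percentages of 39.0 and 25.9.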

But if you look at Kenpom, it gives Ohio State a PPP of 99.5 and has Kentucky at 102.8 on 60 possessions. A fairly significant difference! But I can produce those numbers using the incomplete box score available at basketball reference (and at the old version of ESPN, which is secretly still accessible at ""). Here are the raw stats you can get there:

Team      FGA  FTA  ORB  DRB  TRB  TO
Ohio St.   58   22   10   20  36*   7
Kentucky   48   14    7   24  32*  11

I've put asterisks in the total rebound column because that's not actually the figure shown on the incomplete box scores I have found (which show 30 and 31, just the sum of the offensive and defensive rebounds listed). But I'm assuming that Kenpom must have had access to a total rebound figure that included team rebounds, and I have reason to believe that data used to be available. The next step is to divide those "missing rebounds" (6 for Ohio State and 1 for Kentucky) equally between the offensive and defensive columns, yielding:

Team      FGA  FTA  ORB   DRB   TRB  TO  Pts  Poss   PPP
Ohio St.   58   22  13    23     36   7   60  62.45   99.5
Kentucky   48   14   7.5  24.5   32  11   62  58.15  102.8
Avg:                                          60.3

This exactly nails the Kenpom PPP for both teams by adding an extra 1.25 possessions per team, thanks to 2.5 fewer offensive rebounds. It also matches up with the rebounding percentages Kenpom has for this game: Ohio State at 34.7% (13 / (13 + 24.5)) and Kentucky at 24.6% (7.5 / (7.5 + 23)).
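
The whole reconstruction can be sketched end to end in Python. To be clear, this is my guess at the procedure, built from the incomplete box score figures above; the helper names and data layout are hypothetical, and nothing here is Kenpom's actual code.

```python
FT_WEIGHT = 0.475  # free throw multiplier from the possession formula

# Incomplete lines: ORB/DRB are player sums only, TRB includes team rebounds
incomplete = {
    "Ohio St.": {"fga": 58, "fta": 22, "orb": 10, "drb": 20, "trb": 36, "to": 7, "pts": 60},
    "Kentucky": {"fga": 48, "fta": 14, "orb": 7, "drb": 24, "trb": 32, "to": 11, "pts": 62},
}

estimated = {}
for team, box in incomplete.items():
    # Assumed step: split the unclassified team rebounds evenly
    team_reb = box["trb"] - (box["orb"] + box["drb"])
    orb = box["orb"] + team_reb / 2
    drb = box["drb"] + team_reb / 2
    poss = box["fga"] + box["to"] + FT_WEIGHT * box["fta"] - orb
    estimated[team] = {"orb": orb, "drb": drb, "poss": poss, "pts": box["pts"]}

# Both teams are assigned the average of the two possession estimates
game_poss = sum(e["poss"] for e in estimated.values()) / len(estimated)
ppp = {team: 100 * e["pts"] / game_poss for team, e in estimated.items()}

osu, uk = estimated["Ohio St."], estimated["Kentucky"]
orb_pct = {
    "Ohio St.": 100 * osu["orb"] / (osu["orb"] + uk["drb"]),
    "Kentucky": 100 * uk["orb"] / (uk["orb"] + osu["drb"]),
}

for team in estimated:
    print(team, round(ppp[team], 1), round(orb_pct[team], 1))
```

With these inputs the estimate comes out to 60.3 possessions, and the rounded PPP and rebounding percentages should line up with the Kenpom figures quoted above.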

This game had an unusually large number of team rebounds, and all but one were offensive. As a result, the Kenpom possession estimate is quite a bit off (more than a full possession away from the figure calculated using complete data) and the rebounding percentages are even more skewed.

Moral of the story: always trust T-Rank.

