Thursday, March 17, 2016

Retrofitting T-Ranketology

I had fun with my hobby project T-Ranketology this year. The results are over at bracketmatrix.com, and I think they're acceptable -- better than major algorithmic projections like KPI and Team Rankings, but worse than most human bracketologists. This makes sense to me, because the tournament selection is a very human affair, and it's hard for a simple model like T-Ranketology to encompass all the vagaries of that process with any real fidelity. As I put it shortly after the bracket was announced: you can't model Madness.

But I will continue to try. To that end, I ran a bunch of experiments to see if I could retrofit T-Ranketology to produce a more accurate bracket. Here are the inputs to T-Ranketology:

RPI
WAB (wins against bubble)
Elo (my E-Rank, a simple elo rating seeded by T-Rank)
"Resume"

Each team was ranked in each of these categories, then their ranks were added up to get a total score (with lower being better). That was T-Ranketology this year.
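For the code-minded, here's a minimal sketch (in Python) of that rank-sum scoring. The teams and ranks are made up for illustration:

```python
# Minimal sketch of the rank-sum scoring described above. Metric names
# and ranks are made-up placeholders; rank 1 = best in that metric.

def t_ranketology_score(ranks_by_metric):
    """ranks_by_metric: {metric: {team: rank}}. Returns {team: total};
    lower total = better overall."""
    totals = {}
    for ranks in ranks_by_metric.values():
        for team, rank in ranks.items():
            totals[team] = totals.get(team, 0) + rank
    return totals

ranks = {
    "RPI":    {"TeamA": 30, "TeamB": 45, "TeamC": 60},
    "WAB":    {"TeamA": 55, "TeamB": 40, "TeamC": 35},
    "Elo":    {"TeamA": 25, "TeamB": 50, "TeamC": 65},
    "Resume": {"TeamA": 40, "TeamB": 38, "TeamC": 30},
}
print(sorted(t_ranketology_score(ranks).items(), key=lambda kv: kv[1]))
# [('TeamA', 150), ('TeamB', 173), ('TeamC', 190)]
```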

The "resume" rating needs a better name, and I've got my branding people looking into it. But it was clear to me that bracketology is impossible if you aren't paying attention to "top 50 wins" and the like. So to come up with the Resume rating I used the following point values:

Top 50 win: 10 points
other top 100 win: 3 points
sub-100 loss: -3 points
sub-200 loss: -6 points

Obviously these point values were assigned rather arbitrarily, though I did some experimentation to get a bracket that passed the eye test.
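For concreteness, here's a little sketch of how those point values turn a team's game results into a Resume score. The game results are hypothetical, and `opp_rank` stands in for the opponent's RPI-style rank:

```python
# Sketch of the original Resume score using the point values above.
# The game list is hypothetical; opp_rank is the opponent's RPI-style rank.

def resume_score(games):
    """games: list of (opp_rank, won) for one team's season."""
    score = 0
    for opp_rank, won in games:
        if won:
            if opp_rank <= 50:
                score += 10      # top-50 win
            elif opp_rank <= 100:
                score += 3       # other top-100 win
        else:
            if opp_rank > 200:
                score -= 6       # sub-200 loss
            elif opp_rank > 100:
                score -= 3       # sub-100 loss
    return score

# Two top-50 wins and a sub-200 loss:
print(resume_score([(12, True), (47, True), (230, False)]))  # 14
```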

One other possible addition to the algorithm would be T-Rank itself. It's pretty clear that efficiency rating does factor into the selection process, at least at the margins. For example, efficiency rating must have been the determining factor in Vanderbilt's inclusion. But it's also ignored a lot, particularly when it comes to seeding.

Anyhow, I've now done a bunch of experiments -- running thousands and thousands of brackets with different weights for each of the T-Ranketology inputs -- to see what the ideal weighting of the factors would be. Here are the results (a code sketch putting them together follows the lists):

Resume x 3.5
RPI x 1.5
T-Rank x 1.5
Elo x 0.25
WAB x 0.2

With Resume being calculated as follows:

Top 25 wins: 16 points
other top 50 wins: 13 points
other top 100 wins: 5 points
sub-100 losses: -1 point
sub-200 losses: -5 points
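Putting the weights and the new point values together, the retrofit amounts to a weighted rank sum, something like this sketch. The example team's ranks are made up; the retrofit experiments were essentially a search over candidate weight vectors like `WEIGHTS`, scoring each resulting bracket against the real one:

```python
# Sketch of the retrofitted weighted rank sum. The weights are the ones
# listed above; the example team's ranks are invented.

WEIGHTS = {"Resume": 3.5, "RPI": 1.5, "T-Rank": 1.5, "Elo": 0.25, "WAB": 0.2}

def weighted_score(ranks):
    """ranks: {metric: team's rank}. Lower total = better."""
    return sum(WEIGHTS[m] * r for m, r in ranks.items())

print(weighted_score({"Resume": 20, "RPI": 35, "T-Rank": 28,
                      "Elo": 40, "WAB": 35}))  # 181.5
```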

The original T-Ranketology got a score of 312 on bracketmatrix, by getting 65 teams right, nailing the seed on 31 and within 1 seed on 24 others.

This version of T-Ranketology gets a score of 351 (which would tie for first place this year), by getting 66 teams right, nailing the seed on 44 teams, and within 1 seed on 21 others. This algorithm gets Tulsa and Vanderbilt into the tournament, but leaves Providence and Wichita State as the 2nd and 3rd teams out, respectively. (St. Bonaventure is the first team out.) Saint Mary's remains in the field, though as a 10-seed instead of an 8-seed. Florida also sneaks into the tournament.

If you want to see the bracket produced by this algorithm, it's here.

Clearly, this version of the algorithm is "over-fit" to this year's results. But I think this exercise does provide some insights. Most obviously, the "resume" rank is extremely important. This is how you get Tulsa into the tournament. You have to really value those wins against "top 25" and "top 50" teams. Bad losses don't matter too much, at least not much more than they already matter for the other ranks. The elo and WAB ratings add a little, but not much, to the analysis.

So, this is the algorithm I'll go with next year. The committee will probably do something entirely different and prove, once again, that you can't model Madness.

Tuesday, March 15, 2016

Thoughts on the bracket, part 2: The Snubs

The T-Ranketology algorithm had three teams in the tournament that the selection committee found undeserving: St. Bonaventure, Saint Mary's, and South Carolina.

It's pretty obvious what they have in common: they all start with the letter "S". Frankly, anti-S discrimination is as good a theory of why the committee does what it does as any other. But let's dig a little deeper into their resumes, and that of the other cause célèbre, Monmouth.

Monmouth

T-Ranketology was not surprised by Monmouth's exclusion, as they were the 12th team out according to the algorithm: they got killed in the "resume" column by their three bad losses to sub-200 teams.

Monmouth is a tough case because they had four great wins in the non-conference, against UCLA, USC, Notre Dame, and Georgetown -- all away from home. Since high-major teams have no incentive to play true mids or low-majors on the road, the only path to an at-large bid for a team like Monmouth is to go giant-slaying on the road, and that's exactly what they did.

Unfortunately, the UCLA and Georgetown wins ended up not looking so great in the committee's eyes, because Georgetown was a sub-100 RPI team and UCLA was 99th. (Indeed, if they'd lost to Georgetown, that would have counted as a "bad loss"!) This is true even though those were both true road games, which makes them impressive wins by any measure, except any measure the committee pays attention to.

In the end, it was the three sub-200 losses that killed them. I'm pretty sure no team has ever gotten an at-large bid with three losses in that category. This is somewhat unfair to Monmouth, of course, because most at-large contenders do not play very many sub-200 teams on the road. As a result, we don't have an intuitive feel for how often at-large contenders should really lose these games. Monmouth played 11 such games and went 8-3. That's not good, but how bad is it?

Easy answer: too bad. I think Monmouth is in the tournament if they go 9-2 in those games. But it was three strikes and you're out.

South Carolina

South Carolina was the last team in according to T-Ranketology, and no one is weeping over their omission from the field. They played a crappy schedule, which got them out to a 14-0 start. They were even 20-3, but went just 3-6 in their last nine games. Although their schedule was weak, they actually performed admirably against it, compiling +2.0 WAB, which means they won two more games than you'd expect an average bubble team to win. But (as we'll see) the committee comes down hard on teams that didn't "challenge" themselves during the non-conference, and South Carolina's OOC SOS was 271st according to the RPI.
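As an aside for the code-minded: WAB as described works out to actual wins minus the wins an average bubble team would be expected to pile up against the same schedule. A minimal sketch, assuming you're handed a hypothetical bubble team's win probability for each game:

```python
# Sketch of Wins Above Bubble as described: actual wins minus the wins an
# average bubble team would be expected to get against the same schedule.
# Each game is (won, p), where p is the assumed bubble-team win probability.

def wab(games):
    return sum(won - p for won, p in games)

# e.g., winning two toss-ups and dropping a game a bubble team wins 90%
# of the time nets out to +0.1:
print(wab([(True, 0.5), (True, 0.5), (False, 0.9)]))
```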

Since South Carolina is a major conference team that has no excuse for playing such a weak schedule, and they were right on the bubble by all metrics, no one cares about saying sayonara to South Carolina. Hmm, maybe I should write a song called "Sayonara, South Carolina."

Saint Mary's

In my opinion, Saint Mary's was the real snub this year. T-Ranketology had them into the field easily as an 8-seed, and they were in a majority of final brackets at bracketmatrix.com. They had a weak nonconference schedule, but Seth Burn has already detailed how well they performed against the schedule they played, in terms of Wins Against Bubble. They also did well in other metrics traditionally associated with good tourney resumes, such as elo.

Ultimately, they were done in by their lack of "good wins." Their record against the top 100 was great -- 6-3 -- but the committee cares most about the number of wins, not winning percentage. And in the all-important "wins against top 50" category they had just two. Both of those were against Gonzaga, which only squeaked into the top 50 after beating Saint Mary's in the WCC championship game.

This is a case, I think, where the committee was forced to come face-to-face with the absurdity of its own metrics. Heading into the game against Gonzaga, Saint Mary's had a blank resume, highlighted by zero top-50 wins. They lost that game, convincingly. But now, because of that loss, they had an infinitely better resume, with two top-50 wins. How could a loss possibly improve their resume so much?!
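The root of the absurdity is that "top-50 wins" are counted against opponents' current rank, so a result that nudges an opponent into the top 50 retroactively upgrades your earlier wins over them. A toy illustration with made-up ranks:

```python
# Toy illustration with invented ranks: top-50 wins are recomputed against
# opponents' current rank, so a loss that lifts the opponent into the
# top 50 can add top-50 wins to your ledger.

def top50_wins(games, current_rank):
    """games: list of (opponent, won); current_rank: opponent -> rank."""
    return sum(1 for opp, won in games if won and current_rank[opp] <= 50)

games = [("Gonzaga", True), ("Gonzaga", True)]    # 2-0 vs Gonzaga
print(top50_wins(games, {"Gonzaga": 55}))         # before the loss: 0

games.append(("Gonzaga", False))                  # lose the WCC final...
print(top50_wins(games, {"Gonzaga": 48}))         # ...and now: 2
```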

That's a little thing called cognitive dissonance, Jack.

The committee did what anyone does when experiencing cognitive dissonance: it moved on as quickly as possible. Buh-bye, Saint Mary's.

St. Bonaventure

The Bonnies were the last of the S-nubs, and the most surprising, as they were in almost every final bracket. But upon examination, their big calling card was a high RPI. The rest of their resume metrics were bubblicious, or worse: 49th in Elo, 50th in "resume" (good wins minus bad losses), and 69th in WAB. I shed no tears for St. Bonaventure. Indeed, their exclusion is another sign that raw RPI is (appropriately) not much of an independent factor in the deliberations.

Monday, March 14, 2016

Thoughts on the bracket selection, part 1

I followed the "bracketology" debate more closely than usual this year, for two reasons: (1) until recently, the Badgers were a bubble team (at best), and (2) I used the power of T-Rank to produce an objective bracket prediction (T-Ranketology).

I was satisfied with T-Ranketology's results. It missed three at-larges (Syracuse, Vanderbilt, and Tulsa), picking Saint Mary's, St. Bonaventure, and South Carolina instead. This was about average for the brackets over at bracketmatrix.com, and if you look at the final "consensus" bracket, T-Ranketology would have had 67 of 68 (with only South Carolina not being a consensus tourney team). Seeding predictions were pretty good, too. All in all, not bad for an algorithm.

There's a lot of hootin' and hollerin' about the committee's selections, and I agree with much of it. But I wanted to take a look at the six teams T-Ranketology was wrong on, plus one other, to see what the committee might have been thinking, and whether the algorithm could be improved to reflect that thinking (or lack thereof).

Syracuse

Syracuse's overall resume was not great, but the one big trump card they had was a road win at Duke, which is worth like a million resume points. Add a couple other decent (early) wins, and they're left ranked No. 32 in T-Ranketology's "resume" ranking, which is an attempt to simulate the committee's fixation with "top 50" wins, etc. They were also 50th in T-Rank's "wins against bubble" rating, so right on the bubble there. But they were very poor in basic RPI, and had tanked coming down the stretch, leading to a bad elo ranking (which usually pretty well approximates current "sentiments" about how good a team is). All in all, this left Syracuse the sixth team out in T-Ranketology.

But I'm not going to worry about fiddling with T-Ranketology to get Syracuse into the field, because they were clearly a special case.

The committee chairman had been saying for weeks that Syracuse was going to get special consideration because Syracuse played like shit while Jim Boeheim was serving his 9-game suspension in the middle of the season. Basketball people can look at that argument and see the absurdity on its face -- Syracuse lost five games without Boeheim, and even if you think he is a super-god-coach, they almost certainly lose 4 if not all 5 of those games with him. But the committee is not composed of basketball people, so alas. In any event, the handwriting had been on the wall: barring a monumental collapse (which almost happened), Syracuse was going to be in the tournament. If I had been manually fiddling with the T-Ranketology bracket, I'd have put Syracuse in for sure.

Vanderbilt

Vanderbilt was the eighth team out in the final T-Ranketology, mainly because it was not in the top 50 of any of the four resume-based metrics the algorithm considered. On the other hand, unlike Syracuse, it was not terrible in any of those metrics, finishing top-70 in all of them.

Clearly Vanderbilt got in because of its efficiency rating. Vandy finished the season ranked 27th in the Kenpom ratings, and top-30 teams always get in nowadays. I think this is why they're slotted in the play-in round against another Kenpom darling, Wichita St.: the committee put two good teams with bad resumes into the ring against each other and said, "If you're so good, prove it."

When I first started T-Ranketology it was loosely modeled after the Easy Bubble Solver, which just averages RPI and Kenpom ranking. Of course, I used T-Rank instead of the Kenpom rating, but it was pretty much the same idea. Eventually I took out the efficiency-rating component because, for good or ill, I don't think it plays a very big role in the selection or even the seeding process, at least not systematically. But I think there's good evidence that it comes into play in edge cases, and it's pretty clear that's what got Vandy in. I may have to work in some kind of "top 30" efficiency-rating bonus to account for this.
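To be concrete, the tweak I have in mind would be something like this sketch: knock a flat bonus off a team's total score (lower is better) if its efficiency rank is inside the top 30. The bonus size here is a pure guess:

```python
# Hypothetical "top 30" efficiency bonus. Scores are lower-is-better,
# so a bonus is a subtraction; the size (25 points) is an illustrative guess.

def with_efficiency_bonus(total_score, efficiency_rank, bonus=25.0):
    return total_score - bonus if efficiency_rank <= 30 else total_score

print(with_efficiency_bonus(181.5, 27))   # 156.5 -- gets the Vandy treatment
print(with_efficiency_bonus(181.5, 55))   # 181.5 -- no bonus
```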

(By the way, I think Syracuse was also helped by a decent Kenpom rating, though I don't think it would have been enough without The Boeheim Excuse.)

Tulsa

Tulsa was the real shocker, what John Gasaway calls the committee's annual "grenade." But it wasn't a huge shocker to T-Ranketology, which had Tulsa as just the fifth team out -- ahead of both Syracuse and Vanderbilt! Indeed, before its loss to Memphis in the AAC tourney on Friday, T-Ranketology had Tulsa as the last team in the field.

Why? Like Syracuse, Tulsa scored well in the "resume" rating that approximates top 50 wins, etc. This is by far the stupidest possible measure you could come up with, but I'm pleased to say that I think I've done a pretty good job of modeling this particular madness. Tulsa's inclusion in the field of 68 shows that I probably need to weight it even a little more.

But I think there's another lesson in Tulsa's selection. The committee first gets together early in championship week to get a head start on the process. As a result, by Friday (when Tulsa got stomped by lowly Memphis) the committee had already made some provisional decisions about which teams it thinks are good enough. I'm pretty sure that Tulsa's resume had already been found deserving by Friday. Here's the way the human mind works: once it decides something, that decision sets. It's like a boulder in a divot; you need a big shove to get it moving again. When new information comes in, we don't start at square one and reevaluate the decision with a blank slate. We ask: is this new information enough of a big deal to make me go through this whole process of deciding again? Unsurprisingly, the answer is usually no. This fundamental quality of human psychology (call it laziness if you wish) got Tulsa into the tourney.

(This, by the way, is why college-football-playoff-style in-season tourney rankings are a terrible, terrible idea.)

Well, this has gotten tl;dr so I'm going to stop now. I'll try to post later today with my profound insights into the "snubs."

Thursday, March 10, 2016

Is Kenpom biased in favor of significantly lower-ranked teams? (Yes.)

***
NOTE: Kenpom changed his rating system significantly for the 2017 season (after I did this study) so the results don't necessarily apply to his current system—indeed, there's reason to think the changes he made alleviate the issues discussed below.
***


One of the big problems in college basketball—particularly with regards to NCAA tourney selection—is giving proper weight to road games. It is difficult to intuitively grasp how much harder it is to win on the road against a lesser team than it is to win at home against a better team.

This question came to a head when Greg Shaheen was asked about Wichita State’s big home win over Utah, and specifically what he thought an equivalent road win would be. He answered somewhat absurdly: “Utah.”

Seth Burn set out to answer the question using Kenpom numbers, and what he found was that Wichita State playing No. 25 Utah at home is equivalent to them playing No. 109 New Mexico St. on the road. That is, the Kenpom system gives Wichita State the same chance (about 73%) of winning each game.
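If you want to replicate that kind of search, it's conceptually simple: compute the win probability for the home game, then scan down the rankings for the road opponent that produces the closest match. A sketch, where `win_prob` and `teams_by_rank` are assumed helpers (not real Kenpom API calls):

```python
# Sketch of the road-equivalence search. `win_prob` and `teams_by_rank`
# are assumed inputs, not real Kenpom functions: win_prob(team, opp, venue)
# returns `team`'s chance of beating `opp` at that venue, and teams_by_rank
# is a list of opponents ordered by rating.

def road_equivalent(team, home_opponent, teams_by_rank, win_prob):
    """Find the opponent whose road game gives `team` about the same win
    probability as hosting `home_opponent`."""
    target = win_prob(team, home_opponent, "home")
    return min(teams_by_rank,
               key=lambda opp: abs(win_prob(team, opp, "away") - target))
```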

That finding is kind of shocking, and my own investigation confirms that it is a correct statement about the Kenpom system. But that doesn't mean it is necessarily a true fact about the universe, because Kenpom could be wrong about these things. Importantly, it could be biased in favor of lower-rated home teams.

Indeed, it has been my anecdotal observation that this is in fact the case: Kenpom seems to have a systematic bias in favor of significantly lower-rated teams when they play higher-rated teams, particularly at home. So I set out to test this hypothesis.

First, let’s look at all games where Kenpom projects the home team’s margin of victory to be within +/- one point (favored by 1 or a 1-point underdog). By definition, these will be games where the home team is rated lower than the road team, because home-court advantage is built in. Here are the results:

Games where projected MOV is +/- 1 (Kenpom)
Total games: 372
Home team avg expected MOV: 0.01
Home team actual MOV: -1.81
Home team expected win %: .500
Home team actual win %: .438

What we see in these 372 games, which Kenpom would expect to be pretty much pick ‘ems, is that the better team (the road team) actually won 56.2% of the time and outperformed the expected margin by an average of 1.8 points.

Now, this could just be a quirk of this season—maybe the home teams are just underperforming in close games. But when I run the same test using the T-Rank algorithm, the bias pretty much disappears:

Games where projected MOV is +/- 1 (T-Rank)
Total games: 345
Home team avg expected MOV: 0.06
Home team actual MOV: -0.21
Home team expected win %: .502
Home team actual win %: .487

This still shows a slight bias toward the (better) road team, but it is much less, and looks much more like random variance.
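For anyone who wants to run this kind of calibration check on their own game data, the computation is trivial once you have projected margins, projected win probabilities, and results in one place. A sketch with hypothetical field names:

```python
# Sketch of the calibration check above, assuming a list of game dicts
# with hypothetical field names: proj_mov and actual_mov are the home
# team's projected/actual margins, proj_win_prob its projected win
# probability, and home_won is 1 if the home team won.

def calibration(games, lo=-1.0, hi=1.0):
    subset = [g for g in games if lo <= g["proj_mov"] <= hi]
    n = len(subset)
    return {
        "games": n,
        "expected_mov": sum(g["proj_mov"] for g in subset) / n,
        "actual_mov": sum(g["actual_mov"] for g in subset) / n,
        "expected_win_pct": sum(g["proj_win_prob"] for g in subset) / n,
        "actual_win_pct": sum(g["home_won"] for g in subset) / n,
    }
```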

Next I wanted to test my impression that Kenpom particularly breaks down when the spread between the teams is larger (e.g., games like the nearly 100-spot spread between Wichita St. and New Mexico St.). So I looked at games where the home team was rated more than 50 spots lower than the road team. Using Kenpom projections, here are the results:

Games where home team is more than 50 spots lower than road team (Kenpom)
Total games: 1227
Home team avg expected MOV: -4.34
Home team actual MOV: -6.65
Home team expected win %: .345
Home team actual win %: .271

In this rather large set of games, the home team now actually performs 2.3 points worse, on average, than Kenpom’s system would project, and wins only 80% as often as projected. Compare this to the same experiment with T-Rank:

Games where home team is more than 50 spots lower than road team (T-Rank)
Total games: 1150
Home team avg expected MOV: -6.05
Home team actual MOV: -6.80
Home team expected win %: .311
Home team actual win %: .272

Again, we still see a slight bias in favor of the home team, but a much lower one than with the Kenpom system.

Overall, Kenpom has a very good record of prediction and projection, so if there is a systematic bias in these mismatches, we would expect it to be counterbalanced by good results between more evenly matched teams. And that is in fact the case:

Games where home and road teams are ranked within 50 spots of each other (Kenpom)
Total games: 1804
Home team avg expected MOV: 4.1
Home team actual MOV: 3.58
Home team expected win %: .645
Home team actual win %: .652

(For the record, T-Rank performs similarly in these games, though not quite as well.)

In this set of games, the Kenpom algorithm is extremely well calibrated. My speculation is that the shift from Kenpom 1.0 to Kenpom 2.0 made the algorithm more accurate in these (more common and more important) games, at the expense of some loss of calibration in more mismatched games. This seems worth it, and you can see why it would improve the overall performance of the system.

But when doing these home/road equivalency calculations, I think it’s important to consider this evidence that Kenpom’s projections are systematically biased in favor of significantly lower-ranked teams, because the results using Kenpom are not only unintuitive but in all likelihood just plain wrong.


For example, using T-Rank the road equivalent of hosting Utah for Wichita State is a hypothetical team between No. 83 Temple and No. 84 Hawaii. Utah is ranked 25th in both systems, so this is a clean comparison: Kenpom says the road equivalent is No. 109 New Mexico St., and T-Rank says it’s No. 83 Temple or No. 84 Hawaii. Since T-Rank gets better results in these kinds of games, and its results are more in line with most of our intuitions, I’m sticking with T-Rank.



***As to why these differences between T-Rank and Kenpom exist, I believe it has to do with the different "spread" the systems create, and the different "exponents" used to calculate the Pythagorean Expectation (Kenpom uses 11.5, T-Rank uses 10.25). The different spread is caused mainly by the fact that Kenpom caps margin of victory and the effect of blowouts in mismatches much more aggressively than T-Rank does. My hypothesis is that by mostly ignoring those mismatched games, his system ends up being less accurate in mismatched games, but the upside is that it may be more accurate in games between teams of similar quality.
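For reference, here's a sketch of the Pythagorean machinery in question. The pyth formula (offense^x over offense^x plus defense^x) is the standard Kenpom-style expected winning percentage, and log5 is the usual way to turn two such percentages into a head-to-head probability; the efficiency numbers in the example are invented:

```python
# Pythagorean expectation and log5, using the exponents mentioned above
# (Kenpom 11.5, T-Rank 10.25). Efficiency numbers below are invented.

def pyth(adj_oe, adj_de, exponent=10.25):
    """Expected winning pct vs. an average team, Kenpom-style."""
    return adj_oe ** exponent / (adj_oe ** exponent + adj_de ** exponent)

def log5(p_a, p_b):
    """Chance team A beats team B on a neutral floor, given each team's
    Pythagorean expectation."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

strong = pyth(118.0, 95.0)    # hypothetical top-15 type team
bubble = pyth(106.0, 103.0)   # hypothetical bubble team
print(round(log5(strong, bubble), 3))
```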

Monday, March 7, 2016

2016 Big Ten Tourney predictions

It's time for the annual Bart versus Adam Big Ten Tournament Smackdown Challenge™.

I won last year, which is what counts. Unsure of previous years.

Here goes.

Wed:
Nebraska over Rutgers
Illinois over Minnesota

Thurs:
Badgers over Nebraska
Iowa over Illinois
Northwestern over Michigan
Ohio State over Penn State

Friday:
Indiana over Northwestern
MSU over Ohio State
Badgers over Maryland
Purdue over Iowa

Saturday:
Purdue over Indiana
MSU over Badgers

Sunday:
MSU over Purdue