Monday, July 31, 2017

Badgers' Big Ten schedule

The Big Ten finally released its conference basketball schedule today—well, at least the pairings. Here's the Badgers' schedule, ordered by Torvik Thrill Quotient:


Based on the current preseason T-Rank projections, such as they are, it's the 4th toughest conference slate, and T-Rank projects the Badgers to go 9-9 and tie for 5th place. 

Here's a more subjective take:

Big favorite:


Slight Favorite:

at Rutgers
at Nebraska

Pick em:

at Illinois
at Penn St.

Slight Underdog:

at Iowa
at Maryland
at Northwestern

Book an L:

at MSU
at Purdue

Based on this, we've got four games in the "should win," two games in the "won't win," and 12 games that could easily go either way. Based on that, I'd say 10 wins is the target. That would likely be enough to get into the tournament, and could well extend the "top 4" streak. If I were setting an over/under, I'd probably go with 9.5.

What do you think, Chorlton?

Saturday, July 22, 2017

How I built a (crappy) basketball win probability model

The Internet is amazing. Given that I'm a philosophy major / lawyer, I really have no knowledge or skills whatsoever. But I've been able to put together the T-Rank website by just asking the Internet how to do it. Every time I run into a problem, I just ask google how to solve it. Usually this leads me to more or less step-by-step instructions on how to solve the problem, or at least gives me enough information to figure it out.

There has been one big exception to this: for a while I have wanted to see if I could use play-by-play data to build a "win probability model." Not a good win probability model, just a sort of functioning win probability model. Not for any good reason, just because. But my google searches came up empty. Not only were there no step-by-step guides, there weren't even any general instructions to set me down the right path. I was lost. Harumphf.

Nevertheless, I persisted. Although I have no independent "knowledge" or "skills" I do have an epidemiologist wife who does, so one day while we were waiting in the office of a pediatric specialist (don't worry, it was a nothingburger) I ran the problem by her. I had learned just enough by then to explain the problem somewhat coherently, and she was able to set me down the right path.

Now I thought I would fill the void and put a step-by-step guide onto the Internet, mainly so future lawyers can build their own deeply flawed basketball win probability models. Also, if anybody who knows something about this stuff wants to help me improve this, just because, that would be terrific.

Step 1: Get the data.

Okay, I'm not going to walk you through this part, but obviously if you're going to use play-by-play data to make a win probability model, you need play-by-play data. Luckily, there is play-by-play data on various websites, and using google and a little pluck you can figure out how to get it. I started acquiring this data last season to calculate the "game script" average lead/deficit stat. Unfortunately, to get a complete set of data I had to use three different sources, which leads to some problems later on... (This is not just a step-by-step guide, it's also kind of a suspense novel.)

Step 2: Make the model.

Now the easy part: make your model. Done! Thanks for reading my guide.

Now you feel my pain, folks.

Step 2, actually: Figure out what kind of model you need.

This is the point when you have to figure out what you're going to do with the data. What I eventually figured out is that I was going to use the data to run a "logistic regression." As I understand it (and really, I don't understand it), you can use this statistical method to take various variables (like score, time left, strength of teams) and predict the likelihood of another variable (win or loss, 1 or 0).
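For the curious, the idea can be written down in one line. This is only a sketch using the two predictors the post ends up with later on; the symbol names (d for score difference, p_0 for the pregame expected win percentage, and the betas for the fitted weights) are labels chosen here for illustration, not anything from a library:

```latex
P(\text{team 2 wins}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 d + \beta_2 p_0)}}
```

The regression's job is to pick the betas that best match the 1s and 0s in the training data. Whatever goes in, the output always lands between 0 and 1, which is what makes it usable as a probability.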

It's one thing to know that you need to "run a logistic regression" and quite another to actually do it. As we'll see below.

Step 3: Get PBP data ("training data") into usable form.

Here's what I did: I went through (most of) the play-by-play data I have for the past two seasons, and for every second of those games I recorded the following data:

1. Seconds remaining
2. Score difference (team 2 score minus team 1 score)
3. team 2 initial expected win percentage (based on T-Rank)
4. who won (team 2 win = 1, team 2 loss = 0)

I actually recorded some more data, but this is what I ended up using for the current model.
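To make the "for every second" part concrete, here is a minimal sketch of how one game's scoring plays could be expanded into one row per second. The input format (a list of scoring events with running scores), the function name, and the 2400-second regulation length with no overtime are all assumptions made for illustration; real play-by-play feeds look messier than this.

```python
def rows_for_game(scoring_events, pregame_win_prob, team2_won, game_seconds=2400):
    """Expand one game into a row per second of regulation.

    scoring_events: hypothetical list of (seconds_remaining, team1_score,
    team2_score) tuples, ordered from most to least time remaining.
    Returns rows of (seconds_remaining, score_diff, pregame_win_prob, won),
    where score_diff is team 2 score minus team 1 score.
    """
    rows = []
    diff = 0   # game starts tied
    i = 0      # index of the next scoring event not yet applied
    for sec in range(game_seconds - 1, -1, -1):
        # Apply every scoring play that has happened by this second.
        while i < len(scoring_events) and scoring_events[i][0] >= sec:
            _, team1_score, team2_score = scoring_events[i]
            diff = team2_score - team1_score
            i += 1
        rows.append((sec, diff, pregame_win_prob, int(team2_won)))
    return rows


# Toy game: three scoring plays, team 2 was a slight favorite and lost.
events = [(2380, 2, 0), (2350, 2, 3), (10, 70, 68)]
rows = rows_for_game(events, pregame_win_prob=0.55, team2_won=False)
print(len(rows), rows[0], rows[-1])
```

Every row from the same game carries the same pregame win percentage and the same final result; only the clock and the score difference change.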

One thing you might notice that's missing: who has the ball. This is a big flaw in my model, and I'll discuss it more below. (Suspense!)  But for the moment I'll just say that although this is a big flaw, I don't think it really makes much difference until the last two minutes.

Another thing that's missing is home court. This is another thing I left out mainly because it was kind of a pain to figure out based on the PBP data. But, also, home-court advantage is already built in to the third variable (expected win%), so there could be kind of a double-counting problem if I included it separately. I dunno, gimme a break.

Step 4: Run the logistic regression for each second

This might not be the best way, but what I did is run a logistic regression for each of the "seconds remaining" variables (2399, 2398, ... 2, 1, 0), with score difference and initial win percentage as the variables for predicting win/loss (I don't know the proper nomenclature for discussing regression, so bear with me).

Originally I ran a single regression with time remaining as another one of the variables, but the results were unsatisfactory, particularly at the margins. For example, it was obviously wrong very early in the game — I think because linearity was being imposed, but not sure. Anyhow, running it for every second worked out pretty nicely.
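One possible reading of the "linearity" hunch, offered as a guess rather than a diagnosis: in a single regression where time is just one more added term, a lead is worth the same number of log-odds whether there are 40 minutes left or 10 seconds left. The sketch below shows that property with made-up coefficients (none of these numbers come from the actual model).

```python
import math


def log_odds(b0, b_diff, b_time, score_diff, seconds_left):
    # A single additive model: each variable contributes on its own,
    # with no term that lets the value of a lead depend on the clock.
    return b0 + b_diff * score_diff + b_time * seconds_left


def lead_effect(b0, b_diff, b_time, lead, seconds_left):
    # How much the log-odds move when going from a tie to `lead` points up.
    tied = log_odds(b0, b_diff, b_time, 0, seconds_left)
    ahead = log_odds(b0, b_diff, b_time, lead, seconds_left)
    return ahead - tied


# Hypothetical coefficients, chosen only for illustration.
B0, B_DIFF, B_TIME = 0.0, 0.15, 0.0001

early = lead_effect(B0, B_DIFF, B_TIME, lead=5, seconds_left=2399)
late = lead_effect(B0, B_DIFF, B_TIME, lead=5, seconds_left=10)
print(math.isclose(early, late))  # the lead counts the same early and late
```

Fitting a separate model for each second sidesteps this, because each second gets its own weight on score difference.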

As I mentioned above, saying "run a logistic regression" and actually doing it are different things, so here's how I did it: I used Python, a programming language, which has a module for doing this called LogisticRegression. Here's a link to the code.
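For anyone who wants to see the shape of the per-second approach without installing anything, here is a minimal sketch. It is not the linked code. It swaps in a small hand-rolled gradient-descent fit where a library class such as LogisticRegression would normally go, and the function names, settings, and toy rows are all made up for illustration.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, epochs=2000, lr=0.01):
    """Fit weights [b0, b_diff, b_p0] by plain batch gradient descent.

    rows: list of (score_diff, pregame_win_prob, won) for ONE second.
    This is a stand-in for a library fit, not a replacement for one.
    """
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for diff, p0, won in rows:
            x = (1.0, diff, p0)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - won
            for j in range(3):
                grad[j] += err * x[j] / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def fit_per_second(training_rows):
    # Group rows by seconds remaining, then fit one small model per group.
    by_second = defaultdict(list)
    for sec, diff, p0, won in training_rows:
        by_second[sec].append((diff, p0, won))
    return {sec: fit_logistic(rows) for sec, rows in by_second.items()}


def win_probability(models, sec, diff, p0):
    # Look up the model for this second and plug in the other two inputs.
    b0, b_diff, b_p0 = models[sec]
    return sigmoid(b0 + b_diff * diff + b_p0 * p0)


# Toy rows: (seconds_remaining, score_diff, pregame_win_prob, team2_won).
toy_rows = [
    (1200, 8, 0.60, 1), (1200, -5, 0.40, 0), (1200, 3, 0.70, 1),
    (1200, -2, 0.30, 0), (1200, 1, 0.50, 0), (1200, -1, 0.55, 1),
    (60, 4, 0.50, 1), (60, -3, 0.60, 0), (60, 1, 0.45, 1),
    (60, -1, 0.65, 0), (60, 6, 0.70, 1), (60, -7, 0.35, 0),
]

models = fit_per_second(toy_rows)
print(win_probability(models, 60, 5, 0.5))
print(win_probability(models, 60, -5, 0.5))
```

With real data there would be 2,400 of these little models, each trained on thousands of rows instead of six.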

Step 5: Test it, see if it passes the smell test

The result of this model is that you can plug in "seconds remaining" (to get the right model), score (in the form of score differential), and initial expected win percentage to get an expected win probability.

For example, here's the result using Minnesota versus Middle Tennessee in last year's NCAA tourney:
For comparison, here's the Kenpom win probability graphic for that game:

Hey, not bad!

Step 6: self-loathing

Based on comparisons like the above, I'm satisfied that the model is "good enough for hobby work." But I'm also aware that the model is flawed. I took shortcuts along the way because I was just trying to see if I could get it to work. Then once I tested it and saw that it worked reasonably well, I had very little desire to perfect it. This serves no purpose and shouldn't be relied on. :(

As mentioned above, a core flaw here is that possession is not part of this model. I actually did subsequently attempt to add possession to the model, but the results were screwy. The core problem, I think, is that I'm not parsing the PBP data correctly for possession. This goes back to the fact that when I originally acquired the data I stripped out some useful info when I saved it. It's not impossible to deduce possession from what I've got, but it's not simple either. In the end, I'm not confident that my parsing was 100% accurate, and I think that led the model-with-possessions to be unstable. 

The second problem, I suspect, is that I'm not using enough data to include possession. I'm training the model with only about 10,000 games. Adding a possession variable slices the data another way which I suspect adds some craziness.

As you can see above, though, the lack of a possession variable usually doesn't matter much. It's instructive to look at the scoreless stretch starting at the 14:00 mark of the first half. In my model, that scoreless stretch is more or less a straight line, since score is really the only variable that affects things much. In the Kenpom model, there are noticeable squiggles as possession changes hands. But, the squiggles are pretty small -- looks like about a two percent change in win probability. So my model is presumably cutting that in half and is "wrong" by +/- one percent for most of the game.
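The "cutting that in half" arithmetic can be spelled out in a few lines. This sketch assumes that a possession-blind model lands at the midpoint between the with-ball and without-ball probabilities, which is a simplification, and the two probabilities below are invented for illustration.

```python
def blind_estimate(p_with_ball, p_without_ball):
    # Assumption: a model that can't see possession splits the difference.
    return (p_with_ball + p_without_ball) / 2


# Hypothetical mid-game numbers with a two-point swing on possession.
p_with, p_without = 0.61, 0.59

blind = blind_estimate(p_with, p_without)
error_with_ball = abs(p_with - blind)
error_without_ball = abs(p_without - blind)
print(blind, error_with_ball, error_without_ball)
```

Under that assumption the error is always half the size of the swing, which is why a small mid-game swing is tolerable and a large late-game swing is not.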

Of course, this will have a big effect in late game scenarios. If you're down two with ten seconds left, whether or not you have the ball makes a big difference. My model is significantly wrong in those end-game scenarios, but based on my experimentation it still gets the gist: the team down two with ten seconds left is very likely to lose whether or not it has the ball.


There you have it, googlers, that's how I built an obviously flawed basketball win probability model. May you have better ideas and more energy!

Sunday, March 12, 2017

T-Ranketology note

Well, it's Selection Sunday and the current version of the T-Ranketology algorithm has the same at-large field as the consensus (of which it is a small part) over at Bracket Matrix.

Mission accomplished.

I say this because the point of T-Ranketology isn't to try to predict the most accurate bracket on Selection Sunday. The point is to project a reasonably plausible bracket earlier in the season, so that we can see where things are reasonably likely to end up if teams keep performing like they have been. That T-Ranketology is able to basically produce the consensus field on Selection Sunday, with most teams seeded within one line of the consensus, shows that it is "good enough" to provide those useful projections earlier in the season.


Should note somewhere, so might as well be here, that I added one tweak to the algorithm on Selection Sunday: a good record bonus. One of the notable things about the bracket the algorithm was producing over the past few weeks was that it was notably down on the three Pac-12 teams. It seemed to me that this was probably a result of the fact that those teams just had really great records in their so-so conference. Whatever you want to say about the Pac-12, it's just hard not to be impressed by a team that's 29-4 or 30-4.

But I resisted adding a good record bonus—until Gonzaga fell off the one-line. I thought it was pretty clear that Gonzaga was going to get a one seed. The only drama on the one line was whether Duke or UNC would get the ACC's slot. After winning the ACC tournament, Duke did indeed sneak onto T-Ranketology's one-line—but at the expense of Gonzaga, not UNC.

So I pulled the trigger on a record bonus, really a "few losses bonus": teams got one point subtracted from their score for each loss under five. In other words, four-loss teams (like Arizona and UCLA) got one point subtracted, and teams with one loss (Gonzaga) got four points subtracted. This not only got Gonzaga back on the one line, but it was also just enough to push Arizona onto the two line, which was pretty clearly where it was going to end up.
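The rule is simple enough to sketch in a couple of lines. The function names are hypothetical, and the only thing assumed about the underlying score is what the paragraph above implies: having points subtracted moves a team up.

```python
def few_losses_bonus(losses, threshold=5):
    # One point for each loss under the threshold, never negative.
    return max(0, threshold - losses)


def adjusted_score(raw_score, losses):
    # The bonus is subtracted from the team's score.
    return raw_score - few_losses_bonus(losses)


print(few_losses_bonus(4))  # four-loss team  -> 1
print(few_losses_bonus(1))  # one-loss team   -> 4
print(few_losses_bonus(7))  # seven-loss team -> 0
```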

The result was that T-Ranketology was one of the few brackets to nail both the ones and the twos. I'll take it.

Wednesday, March 8, 2017

Big Ten Tourney madness

I think Chorlton is on a cruise in the Caribbean (oh, to be childless), so it looks like he'll lose this year's Big Ten Tourney Challenge by default. Nonetheless, I'm about to spend my requisite 20 seconds thinking about this and make my picks.

Before we start, here are the current T-Rank odds, first assuming no home-court advantage for Maryland:

Now, if we give Maryland a one-half home-court advantage:

Play-in games:

Ohio State over Rutgers
Nebraska over Penn State

Second round:
Nebraska over Michigan St.
Northwestern over Ohio State
Iowa over Indiana
Michigan over Illinois

I'll be rooting for either Nebraska or Penn St. to beat Michigan St. so they have to sweat things out a little on Selection Sunday. Although MSU will probably win this game, I don't have a good feeling that they have a run in them, so I'm taking them out early for funsies.

I'd like to root against Northwestern as well, mainly so their fans go through the ultimate Selection Sunday Experience (one way or the other) but when I search my soul I find that I just do not have it in me.

Michigan over Purdue
Wisconsin over Iowa
Maryland over Northwestern
Minnesota over Nebraska

My earlier upset is robbing us of a third Minnesota / MSU game, which would be interesting if it happens. Michigan vs. Purdue is probably the game I most want to happen, since we just saw Michigan's spread-offense attack pick Purdue apart—will Purdue be able to adjust? Or will Michigan just not hit shots this time? In any event, Michigan seems like a bad match up for Purdue, and it's a tough draw for the 1-seed in its opening round game (Michigan is actually the third best Big Ten team in terms of adjusted efficiency, and was second-best in conference play).

Badgers would love to get Iowa again, I think, and it's not a team I see them losing to twice in short succession.

Michigan over Minnesota
Wisconsin over Maryland

Michigan over Wisconsin

My pick of Wisconsin to the final is pure homer, but I would love to see another Wisconsin-Michigan game. They've played two really tight, interesting games this year, and the Wagner - Happ battles have been great.

There you have it, that's how it's going down. Chorlton, if you're able to rouse yourself from your quarters and shake off the piña colada haze, put your picks in the comments.

Sunday, February 26, 2017

Good training

I found this ad weird. I guess Karl Anthony is not living in the past. He is a very successful pro. Good to know he is still working hard to try and beat the Badgers.

Wednesday, February 15, 2017

Anatomy of a Loss: Northwestern

About eight minutes into the Northwestern game, I was pretty sure Wisconsin was going to win. Charlie Thomas had just hit a three, and the Badgers were up 14-6. Northwestern had been garbage on offense, relying on floaters and whatnot, which were predictably missing.

The Badgers would go on to score just 8 points on their next 19 possessions, and head into halftime down 31-22 thanks to a three-point barrage from Northwestern (including one lucky bank shot). How did this happen?

The main narrative coming out of the game was that Northwestern's aggressive double-teaming of Ethan Happ shut the Badgers down. This is true in a sense, but the real truth is that the Badgers just didn't make Northwestern pay. Happ was good, really good, the rest of the first half in handling the double teams. Problem was, the rest of the team did nothing. Let's break it down.

Possession number 1: Now up 14-8, Trice tries an ill-advised drive and misses badly. Vitto Brown corrals the long rebound, however. Eventually he gets an open 3:

Possession number 2: After another McIntosh miss, Charlie Thomas passes cross-court to Koenig for a relatively open three, which he misses. Note that Thomas's pass is low, which throws Koenig out of rhythm:

Possession number 3: The Badgers still lead 14-8 after McIntosh misses again. Happ is back in the game and gets aggressively doubled. He finds Koenig cross-court, again he misses. Again, the pass left something to be desired, so you could credit the double team for that.

Possession number 4: Hayes gets called for a ridiculous phantom double dribble after the post-entry pass is deflected. Inexcusably bad officiating. The kind of stuff that makes you embarrassed to be a fan of the sport.

Possession number 5: Another open 3 for Brown, this time he gets the shooter's roll.

Possession number 6: One of the evening's more depressing possessions, as Koenig fumbles a rebound out of bounds. I don't know if this counted as an official turnover or just a team rebound for Northwestern, but it was a harbinger of things to come.

Possession number 7: Another turnover, this time Showalter trying to find a cutting Hayes.

Possession number 8: This will be the only shot Ethan Happ takes in this slideshow, and it's from the top of the key. He had the line, I guess.

Possession number 9: With the shot clock running down, Nigel pulls up for a long two and banks it in. Not pretty. But note that the Badgers have now extended their lead to 7 points while scoring 5 points in 9 possessions. Still had to feel pretty good at this point, as Northwestern had scored just 12 points in 12 minutes, and you had to figure Wisconsin would snap out of their doldrums soon.

Possession number 10: After a Northwestern 3, Ethan Happ turns it over trying to find Brown before the double team arrives.

Possession number 11: This was when the game really turned. On their possession, Northwestern hit a tough two point jumper and Showalter was called for a foul for boxing out too hard on the made basket. This is a call that is pretty much never made, particularly against a home team. Really weird. So Northwestern got a make-it-take-it possession, and capitalized with another bucket to tie the game after a 4-point possession. How do the Badgers attempt to take back control? Charlie Thomas in the post.

Possession number 12: Game still tied, but Northwestern played great D on this possession (with Koenig and Happ resting) and Trice turns it over late in the shot clock.

Possession number 13: Now things are starting to slip away, as Northwestern hits another three and the Badgers again turn to Charlie Thomas to stem the tide. Alas, no.

Possession number 14: After another Northwestern 3, the Badgers again get nothing on offense. Hayes tries to create off the dribble at the end of the shot clock, but mostly is trying to draw a foul. Fruitlessly, it turns out.

Possession number 15: Northwestern has now hit three straight threes and scored 16 points on 6 possessions to take a 28-19 lead. Badgers get an open three for Showalter but he comes up short.

Possession number 16: Happ beats a weak double team and finds Showalter for another open 3, this time he nails it.

Possession number 17: Happ finds Koenig for a nice open 3, but Bronson just doesn't have it tonight.

Possession number 18: Happ again finds an open man, Hayes, but he can't score.


This is the moment when it started to seem pretty likely the Badgers would lose. Northwestern goes "2 for 1" at the end of the half, which consists of McIntosh chucking up a prayer, that gets answered by the backboard. Backbreaker. 

Possession number 19: The final possession of this sad stretch. And so it ends, not with a bang, but with a whimper.

Here's one fun thing that happened during this stretch:

Saturday, February 11, 2017

Simulation Saturday Preview!!

Edited: Crap, I forgot Louisville.

Today, FOR THE FIRST TIME EVER, the NCAA will give us an early look at how the top four seed lines would look if the season ended today.  I call this "Simulation Saturday" (as opposed to the actual "Selection Sunday") and you should too. I'm getting a trademark, probably.

It will be kind of interesting to see how this shakes out. Without the benefit of full conference seasons and conference tournaments, things are really up in the air, and this is obviously a pointless made-for-tv exercise. But that's true for all sporting events.

The 1-Seeds

Right now the clear consensus among bracketeers is that Kansas and Baylor are both deserving of a 1-seed. But I think there's also a general consensus that the Committee will probably not actually award two 1-seeds to the Big 12 on Selection Sunday (unless there's absolutely no other choice). Instead, it seems likely that the leader of the ACC—whoever that turns out to be—will be given the fourth one seed (presuming that Gonzaga and Villanova are the other two).

Right now North Carolina, Louisville, Virginia, and Florida State could all make a case for being that team. (And Duke, I guess...) But things haven't shaken themselves out yet, and at this moment picking any one of those teams for the 1-line would really be an exercise in predicting which one of them comes out on top in the ACC.

So for Simulation Saturday, will the committee grant the 1-seeds based on current resumes, or will it bend its processes a bit so that this ends up looking more like the final product?

I'm predicting they will go with current resumes. The main rationale put out for this exercise is to provide a little "transparency" into the selection process, so I think they'll want to come out armed with their "record against RPI top 50" and "conference affiliation is never mentioned in that room" factoids. Accordingly, I predict the 1 seeds will be:


The 2-seeds

One of the reasons I think they'll stick to the script on 1-seeds is that there's no clear leader for elevation among the potential 2-seeds. 

You've got the four ACC teams mentioned above, but right now they haven't really differentiated themselves. 

There are three Pac-12 teams arguably in contention—Oregon, Arizona, and UCLA—but all of them lack the magical "top 50 RPI wins" that the committee loves so damn much. (More on this later.) 

Wisconsin is sitting at No. 7 in the AP poll, so you might think they'd be in the mix. But they also lack the magical top 50 wins, and will likely be punished for it. 

Two teams from the SEC, Florida and Kentucky, should make the top 16, but they're both long shots for even the two line at this point. 

One dark horse is Butler, which is up there with Baylor for most impressive resume. For example, they are an incredible 12-3 in tournament quality tests (similar to Kenpom "Category A" games), which is three more wins of that type than anyone else. Unfortunately for them, this doesn't quite translate into the committee's stupid "magic top 50 RPI" wins, where they are a mere 7-2. The committee won't pay much attention to what are actually very impressive wins like at Marquette, at Georgetown, at Utah, vs. Indiana (neutral). Butler also has lost two of three, and it has two "bad losses" (at St. John's and at Indiana State) that are always hard to evaluate. Still, Butler's 7-2 against the top 50 is pretty good, and I think there's a chance they show up higher than people are thinking.

Another contender purely on the numbers is Creighton. But they've been on somewhat of a slide recently, corresponding to their loss of Maurice Watson for the season. I think they'll be dropped at least for purposes of this exercise and used as a talking point.

Although it would be defensible, I don't think the committee is going to come out with four ACC teams on the two-line. But North Carolina and Florida State are probably locks for a 2. Louisville is only 3-5 against the RPI top 50, so I think they'll be demoted.  So, after all that, here's my guess:

North Carolina
Florida State

That final spot on the 2-line is really hard. I think the main contenders are Florida, Kentucky, Arizona, and Oregon. (With a possible "popularity contest" slot for UCLA.) Arizona and Oregon are only in contention because it seems like the Pac-12 should get a 2, but their "resumes" (as traditionally defined by the committee, anyway) are lacking. Florida has only 4 top 50 RPI wins, but it is No. 7 in the RPI and boasts the 6th hardest non-conference SOS (though this is somewhat juiced by "neutral court" games played around Florida while their arena was being renovated). So I'm just taking a wild stab that Florida will be elevated to allow for the talking point of rewarding a tough non-conference schedule.

The 3-Seeds

Let's talk about the Pac-12. It has three really good teams: Arizona, Oregon, and UCLA. It has one other likely tourney team, USC, one bubble team, Cal, and one or two other okay teams (Utah and Colorado?). The rest of the conference is sort of like a mid-major division. As a result, the Big Three are lacking in quality wins, or even opportunities for quality wins. They are also at risk for some "bad losses" when they play the mid-major division on the road.

I'm seeing UCLA on the two-line in some places, and I'm a bit mystified. They are 3-3 against the magic top 50, 21st overall in the raw RPI, with the 280th ranked non-conference schedule. You can obviously make an argument that UCLA is a really good team, but not using objective metrics normally cited by the committee. I'm predicting they'll be demoted and used as a talking point.

Arizona and Oregon are slightly better than UCLA on the top-50 metric, and vastly better on non-con SOS. I think they'll be here on the three line, but could really see either of them anywhere between 2 and 4.


The 4-Seeds

The Big Ten has a problem similar to the Pac-12's, in that there are few opportunities for magic top-50 wins. The bottom of the conference is much better than the bottom of the Pac-12, but that typically doesn't matter much to the committee. So Wisconsin, with its 2-3 record against the top-50 and 246th rated non-con SOS, will likely be relegated to the 4-line for now—at best. Wisconsin might also suffer from application of the "eye test" given its recent inability to dominate inferior opponents.

Besides the teams already mentioned but not placed (UCLA, Creighton, Wisconsin), other possible contenders for the four-line are Cincinnati (great record, but lacking magic top-50 wins), West Virginia (great team, lacking some of the stuff the committee likes), Duke, and Purdue. Possibly Xavier, I guess. Can make a case for any of these teams, obviously, but here's my guess:


I'm betting Duke will get a Coach K bonus, and that UCLA will get a Hollywood bonus.

Edited to add: in my morning haze, I forgot to put Louisville in there on the three line when I demoted them from the two. Of the original fours, I'm demoting Cincinnati just cuz.