There has been one big exception to this: for a while I have wanted to see if I could use play-by-play data to build a "win probability model." Not a *good* win probability model, just a *sort of functioning* win probability model. Not for any good reason, just because. But my google searches came up empty. Not only were there no step-by-step guides, there weren't even any general instructions to set me down the right path. I was lost. Harumphf.

Nevertheless, I persisted. Although I have no independent "knowledge" or "skills" I do have an epidemiologist wife who does, so one day while we were waiting in the office of a pediatric specialist (don't worry, it was a nothingburger) I ran the problem by her. I had learned just enough by then to explain the problem somewhat coherently, and she was able to set me down the right path.

Now I thought I would fill the void and put a step-by-step guide onto the Internet, mainly so future lawyers can build their own deeply flawed basketball win probability models. Also, if anybody who knows something about this stuff wants to help me improve this, just because, that would be terrific.

**Step 1: Get the data.**

Okay, I'm not going to walk you through this part, but obviously if you're going to use play-by-play data to make a win probability model, you need play-by-play data. Luckily, there is play-by-play data on various websites, and using google and a little pluck you can figure out how to get it. I started acquiring this data last season to calculate the "game script" average lead/deficit stat. Unfortunately, to get a complete set of data I had to use three different sources, which leads to some problems later on... (This is not just a step-by-step guide, it's also kind of a suspense novel.)

**Step 2: Make the model.**

Now the easy part: make your model. Done! Thanks for reading my guide.

Now you feel my pain, folks.

**Step 2, actually: Figure out what kind of model you need.**

This is the point when you have to figure out what you're going to do with the data. What I eventually figured out is that I was going to use the data to run a "logistic regression." As I understand it (and really, I don't understand it), you can use this statistical method to take various variables (like score, time left, strength of teams) and predict the likelihood of another variable (win or loss, 1 or 0).
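For readers who want to see what that means in practice, here is a minimal sketch of the calculation a logistic regression performs once it has been fit: a weighted sum of the inputs gets squashed into a number between 0 and 1. The function name and the three coefficient values are invented for illustration; they are not fitted values from this model.

```python
import math

def win_probability(score_diff, initial_win_pct, intercept, b_score, b_initial):
    """Logistic function: turn a weighted sum of the inputs into a 0-1 probability."""
    z = intercept + b_score * score_diff + b_initial * initial_win_pct
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only so the numbers look plausible.
INTERCEPT, B_SCORE, B_INITIAL = -1.5, 0.15, 3.0

# Evenly matched teams (initial win percentage 0.5), tied versus up ten.
tied = win_probability(0, 0.5, INTERCEPT, B_SCORE, B_INITIAL)
up_ten = win_probability(10, 0.5, INTERCEPT, B_SCORE, B_INITIAL)
print(f"tied game: {tied:.3f}, up ten: {up_ten:.3f}")
```

"Running" the regression is the step that finds the coefficients from past games; the formula itself stays this simple.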

It's one thing to know that you need to "run a logistic regression" and quite another to actually do it, as we'll see below.

**Step 3: Get PBP data ("training data") into usable form.**

Here's what I did: I went through most of the play-by-play data I have for the past two seasons, and for every second of those games I recorded the following data:

1. Seconds remaining

2. Score difference (team 2 score minus team 1 score)

3. team 2 initial expected win percentage (based on T-Rank)

4. who won (team 2 win = 1, team 2 loss = 0)

I actually recorded some more data, but this is what I ended up using for the current model.
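The four columns listed above could be produced with something like the following sketch. It assumes a simplified, hypothetical input format (one `(seconds_remaining, team1_score, team2_score)` tuple per score change), a 2400-second regulation game, and no overtime; real play-by-play feeds are messier than this.

```python
GAME_SECONDS = 2400  # 40-minute college game; overtime is ignored in this sketch

def rows_for_game(events, team2_initial_win_pct):
    """Expand score-change events into one training row per second of the game.

    `events` is a hypothetical list of (seconds_remaining, team1_score, team2_score)
    tuples. Returns rows of
    (seconds_remaining, score_diff, team2_initial_win_pct, team2_won).
    """
    events = sorted(events, key=lambda e: -e[0])  # earliest event first
    final_t1, final_t2 = (events[-1][1], events[-1][2]) if events else (0, 0)
    team2_won = 1 if final_t2 > final_t1 else 0

    rows = []
    t1 = t2 = 0
    next_event = 0
    for sec in range(GAME_SECONDS - 1, -1, -1):
        # Apply every score change that has happened by this second.
        while next_event < len(events) and events[next_event][0] >= sec:
            _, t1, t2 = events[next_event]
            next_event += 1
        rows.append((sec, t2 - t1, team2_initial_win_pct, team2_won))
    return rows

# Toy game (made up): team 2 wins 4-3 and was a 60% favorite beforehand.
rows = rows_for_game([(2380, 0, 2), (1200, 3, 2), (5, 3, 4)], 0.6)
print(len(rows), rows[0], rows[-1])
```

Stacking these rows for every game gives one big table that can be grouped by seconds remaining.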

One thing you might notice that's missing: who has the ball. This is a big flaw in my model, and I'll discuss it more below. (Suspense!) But for the moment I'll just say that although this is a big flaw, I don't think it really makes much difference until the last two minutes.

Another thing that's missing is home court. This is another thing I left out mainly because it was kind of a pain to figure out based on the PBP data. But, also, home-court advantage is already built in to the third variable (expected win%), so there could be kind of a double-counting problem if I included it separately. I dunno, gimme a break.

**Step 4: Run the logistic regression for each second**

This might not be the best way, but what I did is run a logistic regression for each of the "seconds remaining" values (2399, 2398, ... 2, 1, 0), with score difference and initial win percentage as the variables for predicting win/loss (I don't know the proper nomenclature for discussing regression, so bear with me).

Originally I ran a single regression with time remaining as another one of the variables, but the results were unsatisfactory, particularly at the margins. For example, it was obviously wrong very early in the game — I think because linearity was being imposed, but not sure. Anyhow, running it for every second worked out pretty nicely.

As I mentioned above, saying "run a logistic regression" and actually doing it are different things, so here's how I did it: I used Python, a programming language, which has a module for doing this called LogisticRegression. Here's a link to the code.
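In case the link goes stale, here is a rough sketch of the one-regression-per-second approach. It assumes the `LogisticRegression` in question is the scikit-learn class of that name (the post doesn't say), uses default settings, and runs on synthetic stand-in data rather than real games. The helper name `fit_models_by_second` is made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_models_by_second(rows):
    """Fit one logistic regression per distinct seconds-remaining value.

    `rows` has columns (seconds_remaining, score_diff, initial_win_pct, team2_won).
    """
    rows = np.asarray(rows, dtype=float)
    models = {}
    for sec in np.unique(rows[:, 0]):
        subset = rows[rows[:, 0] == sec]
        X, y = subset[:, 1:3], subset[:, 3].astype(int)
        if len(np.unique(y)) < 2:
            continue  # cannot fit a classifier when every game had the same result
        models[int(sec)] = LogisticRegression().fit(X, y)
    return models

# Synthetic stand-in data (NOT real games): 500 fake games at three time points.
rng = np.random.default_rng(0)
fake_rows = []
for sec in (1800, 600, 60):
    score_diff = rng.integers(-15, 16, size=500)
    initial = rng.uniform(0.1, 0.9, size=500)
    true_p = 1 / (1 + np.exp(-(0.2 * score_diff + 2 * (initial - 0.5))))
    won = (rng.random(500) < true_p).astype(int)
    fake_rows.append(np.column_stack([np.full(500, sec), score_diff, initial, won]))

models = fit_models_by_second(np.vstack(fake_rows))
# predict_proba returns [P(loss), P(win)] for each input row.
prob_up_five = models[600].predict_proba([[5, 0.5]])[0, 1]
prob_down_five = models[600].predict_proba([[-5, 0.5]])[0, 1]
print(prob_up_five, prob_down_five)
```

Fitting a separate model per second lets the weight on score difference change freely over the course of the game, at the cost of 2,400 small models instead of one.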

**Step 5: Test it, see if it passes the smell test**

The result of this model is that you can plug in "seconds remaining" (to get the right model), score (in the form of score differential), and initial expected win percentage to get an expected win probability.
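The "plug in" step amounts to a lookup followed by the logistic formula. The sketch below assumes the fitted coefficients have been saved in a table keyed by seconds remaining; the three entries are invented for illustration (a real table would have one entry per second), and falling back to the nearest available second is a choice made for this example, not something the model above does.

```python
import math

# Hypothetical per-second coefficients: seconds_remaining -> (intercept, b_score, b_initial).
# Invented numbers; a real table would run from 2399 down to 0.
COEFFS = {
    2399: (-2.0, 0.05, 4.0),
    1200: (-1.5, 0.12, 3.0),
    60: (-0.5, 0.60, 1.0),
}

def predict(seconds_remaining, score_diff, initial_win_pct, coeffs=COEFFS):
    """Pick the model for this second (nearest available one) and apply it."""
    nearest = min(coeffs, key=lambda s: abs(s - seconds_remaining))
    intercept, b_score, b_initial = coeffs[nearest]
    z = intercept + b_score * score_diff + b_initial * initial_win_pct
    return 1.0 / (1.0 + math.exp(-z))

# With these made-up numbers, a five-point lead is worth more late than early.
early = predict(2399, 5, 0.5)
late = predict(60, 5, 0.5)
print(f"up five at tip-off: {early:.3f}, up five with a minute left: {late:.3f}")
```

Calling `predict` once per second of a game, with the score at that second, produces the kind of win probability line shown in the graphs.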

For example, here's the result using Minnesota versus Middle Tennessee in last year's NCAA tourney:

For comparison, here's the Kenpom win probability graphic for that game:

Hey, not bad!

**Step 6: self-loathing**

Based on comparisons like the above, I'm satisfied that the model is "good enough for hobby work." But I'm also aware that the model is flawed. I took shortcuts along the way because I was just trying to see if I could get it to work. Then once I tested it and saw that it worked reasonably well, I had very little desire to perfect it. This serves no purpose and shouldn't be relied on. :(

As mentioned above, a core flaw here is that possession is not part of this model. I actually did subsequently attempt to add possession to the model, but the results were screwy. The core problem, I think, is that I'm not parsing the PBP data correctly for possession. This goes back to the fact that when I originally acquired the data I stripped out some useful info when I saved it. It's not impossible to deduce possession from what I've got, but it's not simple either. In the end, I'm not confident that my parsing was 100% accurate, and I think that led the model-with-possessions to be unstable.

The second problem, I suspect, is that I'm not using enough data to include possession. I'm training the model with only about 10,000 games. Adding a possession variable slices the data another way which I suspect adds some craziness.

As you can see above, though, the lack of a possession variable usually doesn't matter much. It's instructive to look at the scoreless stretch starting at the 14:00 mark of the first half. In my model, that scoreless stretch is more or less a straight line, since score is really the only variable that affects things much. In the Kenpom model, there are noticeable squiggles as possession changes hands. But the squiggles are pretty small -- looks like about a two percent change in win probability. So my model is presumably cutting that in half and is "wrong" by +/- one percent for most of the game.

Of course, this will have a big effect in late game scenarios. If you're down two with ten seconds left, whether or not you have the ball makes a big difference. My model is significantly wrong in those end-game scenarios, but based on my experimentation it still gets the gist: the team down two with ten seconds left is very likely to lose whether or not it has the ball.

**Conclusion**

There you have it, googlers, that's how I built an obviously flawed basketball win probability model. May you have better ideas and more energy!

As I have the only degree more useless than philosophy, (political science) I will leave the regression analysis help to the internet nerds. Sounds like a fun project though.

Great work Bart. And heck, even with multiple degrees in nuclear engineering and a math major, this shit is hard.

Thanks Nuke!