My post for round 2 of the TeamRankings blogging competition is up on their blog. I give a full explanation of my NCAA basketball simulation method and flesh out my predictions for tonight’s games.
Edit: The complete post can now be found below. I moved it here in case TeamRankings changes their links at a future date.
Breaking Down Match Ups: Sweet Sixteen Game Simulations
In round 1 of the Stat Geek Idol competition, I described a procedure to simulate NCAA basketball games based on the few team statistics that really matter: shooting percentages, shot selection, turnovers per play, and offensive rebound percentage. These are basically Dean Oliver’s four factors, though I go a little more in depth. For this round, I’ll break down the simulation procedure and apply it to the Sweet Sixteen match ups. But first, how have my simulations performed so far? For comparison, I list the number of teams correctly predicted to reach the second and third rounds by a few different methods (I give a full summary on my blog):
- Take the higher seed: 22/32, 11/16
- Take the higher RPI: 21/32, 9/16
- Take the higher Pomeroy ranking: 22/32, 10/16
- Take the higher Sagarin ranking: 23/32, 10/16
- Take the team that wins the majority of my simulations: 23/32, 9/16
If I forgive first round mistakes and recalculate second round match ups based on who actually played in them, 13 of 16 higher seeds won, and all the other methods get 12 out of 16 correct. Not a bad start for any of these methods, though I admit that my overall simulated champion (2 seed Missouri) is out!
How did I get my picks? There are three steps:
- Use past years to estimate the relationship between regular season efficiency stats and tournament game efficiency stats.
- Plug this year’s regular season stats for each team into the estimated relationships from step 1 to get predicted efficiency stats for each tournament game.
- Use the predicted efficiency stats to simulate each game many times, possession by possession.
Here’s a little more explanation for each step:
Step 1: Using Past Data
For each statistic that I need in the simulations, I set up a regression with the stat in question as a function of relevant regular season stats. For example, I regressed offensive turnover rate for each team in each tournament game on their regular season offensive turnover rate, their tournament opponent’s regular season defensive turnover rate, and conference affiliation (to control somewhat for strength of schedule). This regression gave me the following relationship:
tourney TOs/play = -0.07 + 0.63*(my offensive TOs/poss) + 0.54*(opponent defensive TOs/poss) + conference effects
If tournament turnover rate were just the average of my offensive turnover rate and my opponent’s defensive turnover rate, then these inputs would be evenly weighted (equal coefficients). Both inputs are certainly important (the coefficients are highly statistically significant), but my turnover rate is slightly more important. It gets a weight of 0.63, compared to a weight of 0.54 for my opponent. It’s fine that the weights add to more than one – the weights are just relative.
(Note: the output variable is per play, rather than per possession. I define a play as any offensive opportunity to attempt a shot, so a possession with a miss and offensive rebound would be two plays. This isn’t too important, but helps make the simulations more “life like,” since I can simulate every possession as it develops. It’s also the reason why there is a -0.07 in the equation, since there are more plays than possessions in a game, so per play turnover rates will be lower.)
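To make step 1 concrete, here is a minimal sketch of fitting such a relationship by ordinary least squares. It is an illustration only, not the code behind the results in this post: the data are synthetic (generated from the equation above plus a little noise), conference effects are left out, and `fit_ols` is a hypothetical helper name.

```python
import random

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Synthetic "past tournament games" (assumption: NOT real data).
rng = random.Random(0)
rows, targets = [], []
for _ in range(200):
    off_to = rng.uniform(0.14, 0.26)    # team's regular season offensive TO rate
    def_to = rng.uniform(0.14, 0.26)    # opponent's regular season defensive TO rate
    rows.append([1.0, off_to, def_to])  # the leading 1.0 is the intercept column
    targets.append(-0.07 + 0.63 * off_to + 0.54 * def_to + rng.gauss(0, 0.001))

intercept, w_offense, w_defense = fit_ols(rows, targets)
print(round(intercept, 2), round(w_offense, 2), round(w_defense, 2))
```

Because the synthetic data were built from the published coefficients, the fit should land close to them. With real data, the conference indicators would enter as extra columns in each row.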
Step 2: Predicting Tournament Efficiency Stats for 2012
This is the easy step: just plug and chug. For example, for the 8 Memphis vs. 9 St. Louis match up this year, Memphis turned the ball over on 18.2% of regular season offensive possessions and St. Louis got turnovers on 22.7% of defensive possessions. Plugging these numbers into the equation above gives an expected turnover per play rate of 16.7% for Memphis, which drops to 15.8% after conference adjustments.
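The plug-in arithmetic for the Memphis example can be checked in a few lines. This sketch covers only the part of the equation printed above; the conference adjustment is not reproduced because its values are not given, and the function name is just illustrative.

```python
def predicted_to_per_play(offense_to_rate, opp_defense_to_rate):
    """Tournament turnovers per play, before conference effects."""
    return -0.07 + 0.63 * offense_to_rate + 0.54 * opp_defense_to_rate

# Memphis offense (18.2%) against the St. Louis defense (22.7%).
memphis = predicted_to_per_play(0.182, 0.227)
print(f"{memphis:.1%}")  # prints 16.7%
```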
Step 3: Simulate the Games
I went through steps 1 and 2 for the following statistics for each team in each match up, all related to Dean Oliver’s four factors:
- Factor 1: 3 pt shooting %, 2 pt shooting %, foul shooting %
- Factor 2: % of potential offensive rebs secured (including balls out of bounds)
- Factor 3: % of offensive plays ending in a turnover
- Factor 4: 3 pt attempts as a % of non-turnover plays, 2 pt attempts as a % of non-TO plays, free throw trips as a % of non-TO plays
Factor 4 is the most confusing. It’s similar to Oliver’s FTA/FGA factor, but has more value for simulations, since it tells me how often teams get a three point attempt, a two point attempt, or a trip to the line (on plays without a turnover).
With these stats in hand, here’s the decision tree for each possession of each simulated game:
At each branch point, I use the statistics above and a random number generator to determine which branch to take. When Memphis has the ball against St. Louis, for example, I take the “Turnover” branch with 15.8% probability (the conference-adjusted rate from step 2) at the start of each play. I keep track of points scored for each team the whole way, and estimate the time per play in a similar way based on regular season pace. At the end of each simulated game (when 40 minutes are up), I get a final score.
(Note: the free throw branch is oversimplified. I actually have teams shoot two free throws each time, which could still be improved.)
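A play-by-play simulation along these lines might look like the sketch below. Several things here are assumptions rather than the exact procedure: the branch order (turnover, then shot type, then make or miss, then offensive rebound), a fixed 16 seconds per play in place of a pace estimate, no overtime (tied games are simply dropped), and all function names. The inputs are the predicted Syracuse and Wisconsin stats listed later in this post.

```python
import random

# Predicted stats for Syracuse and Wisconsin (from the list later in this post).
SYRACUSE = {"to": 0.12, "att2": 0.65, "att3": 0.24, "pct2": 0.48,
            "pct3": 0.33, "ft": 0.69, "oreb": 0.31}
WISCONSIN = {"to": 0.15, "att2": 0.49, "att3": 0.38, "pct2": 0.43,
             "pct3": 0.34, "ft": 0.74, "oreb": 0.30}

def simulate_possession(team, rng, seconds_per_play):
    """Play one possession; return (points scored, seconds used)."""
    seconds = 0.0
    while True:
        seconds += seconds_per_play
        if rng.random() < team["to"]:
            return 0, seconds                      # turnover
        shot = rng.random()
        if shot < team["att3"]:
            if rng.random() < team["pct3"]:
                return 3, seconds                  # made three
        elif shot < team["att3"] + team["att2"]:
            if rng.random() < team["pct2"]:
                return 2, seconds                  # made two
        else:
            # Free throw trip: always two shots, possession ends afterwards.
            made = sum(rng.random() < team["ft"] for _ in range(2))
            return made, seconds
        # Missed field goal: keep the ball only on an offensive rebound.
        if rng.random() >= team["oreb"]:
            return 0, seconds

def simulate_game(team_a, team_b, rng, seconds_per_play=16.0):
    """Alternate possessions until 40 minutes are used; return both scores."""
    teams, score = (team_a, team_b), [0, 0]
    clock = 40 * 60.0
    offense = rng.randrange(2)                     # random first possession
    while clock > 0:
        pts, used = simulate_possession(teams[offense], rng, seconds_per_play)
        score[offense] += pts
        clock -= used
        offense = 1 - offense
    return score[0], score[1]

def win_share(team_a, team_b, n_games, seed=0):
    """Share of non-tied simulated games won by team_a."""
    rng = random.Random(seed)
    wins = decided = 0
    for _ in range(n_games):
        a, b = simulate_game(team_a, team_b, rng)
        if a != b:
            decided += 1
            wins += a > b
    return wins / decided

print(win_share(SYRACUSE, WISCONSIN, 2000))
```

Because the pace and tie handling here are simplified, the win share from this sketch should not be expected to match the percentages reported in this post.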
How Many Simulations?
For my initial predictions, I simulated each game 50 times. This is plenty for an uneven match up like 1 North Carolina vs. 16 Vermont. Just a few simulations make it clear that North Carolina will win the majority of the time. However, I quickly found that close match ups require many more simulations before each team’s odds of winning stabilize. For the second round, 200 simulations gave me 76% odds of Vanderbilt beating Wisconsin, which I used for my prediction. Bumping the simulations to 1,000 or 5,000 flipped the game to Wisconsin at about 55% odds. This was an extreme case; most shifts weren’t as big and didn’t change my second round picks. Still, I cost myself a Sweet Sixteen team by running too few simulations!
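One way to see why close games need more runs is the standard error of a simulated win percentage, sqrt(p(1-p)/n), which shrinks only with the square root of the number of simulations. The sketch below is a textbook binomial approximation added for illustration, assuming independent simulations and a true win probability of 55%.

```python
import math

def standard_error(p, n):
    """Standard error of a win share estimated from n independent simulations."""
    return math.sqrt(p * (1 - p) / n)

for n in (50, 200, 1000, 5000, 8000):
    margin = 1.96 * standard_error(0.55, n)   # approximate 95% margin of error
    print(f"{n:>5} simulations: +/- {margin:.1%}")
```

At 200 simulations the margin is roughly 7 percentage points, which is wider than the gap between a 55% favorite and a coin flip; at 8,000 it is about 1 point.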
For the Sweet Sixteen predictions below, I did 8,000 simulations for each game. I list my predicted winners below, in order of certainty:
- 1 North Carolina over 13 Ohio (North Carolina wins 87.1% of simulations)
- 3 Baylor over 10 Xavier (76.9%)
- 2 Ohio State over 6 Cincinnati (65.7%)
- 1 Michigan State over 4 Louisville (62.5%)
- 2 Kansas over 11 North Carolina State (62.1%)
- 1 Syracuse over 4 Wisconsin (56.6%, no Fab Melo adjustment)
- 7 Florida over 3 Marquette (56.3%)
- 1 Kentucky over 4 Indiana (55.3%)
If these percentages are correct and these eight games were played over and over, I should expect to get 5.2 out of 8 games right on average. The last three favorites are most likely to lose, especially Syracuse without Fab Melo.
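The 5.2 figure is just the sum of the eight win probabilities, since each pick is right with its own probability. A quick check, using the percentages listed above (the dictionary layout is only for illustration):

```python
# Simulated win probability for each predicted winner.
win_probs = {
    "North Carolina": 0.871,
    "Baylor": 0.769,
    "Michigan State": 0.625,
    "Kansas": 0.621,
    "Ohio State": 0.657,
    "Syracuse": 0.566,
    "Florida": 0.563,
    "Kentucky": 0.553,
}

# Expected number of correct picks is the sum of the individual probabilities.
expected_correct = sum(win_probs.values())
print(f"{expected_correct:.1f} of {len(win_probs)}")  # prints 5.2 of 8
```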
Accounting for Melo
Let’s look at Syracuse vs. Wisconsin in detail. Here are the predicted efficiency stats from step 2 that went into the simulations in step 3 (Syracuse listed first, then Wisconsin):
- 2 pt %: 48, 43
- 3 pt %: 33, 34
- FT %: 69, 74
- OReb %: 31, 30
- TO %: 12, 15
- 2 att %: 65, 49
- 3 att %: 24, 38
- FT att %: 11, 13
I would expect these values on average if they played this game many times. Syracuse will tend to shoot better on two pointers (48% to 43%), but the teams are nearly even on three pointers, and Wisconsin is 5 percentage points better at the line. For both teams, free throws have the highest return per play (over 1.35 points expected on two shots), followed by threes (about 1 point per shot) and twos (just under 1 point per shot). So, Syracuse’s high predicted affinity for 2 pointers (65% of non-turnover plays) actually lowers their points per play relative to Wisconsin, who I expect to shoot a LOT of threes. But Syracuse’s two point shooting percentage still helps, and I predict that Wisconsin will turn the ball over slightly more and get fewer offensive rebounds. These differences are enough to swing the simulations to Syracuse 56.6% of the time.
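The per-shot returns quoted above follow directly from the listed percentages. The sketch below recomputes them, along with expected points on the first shot of a non-turnover play. It ignores offensive rebounds and turnovers, so it is only a partial view of what the full simulation captures; the names are illustrative.

```python
TEAMS = {
    "Syracuse":  {"pct2": 0.48, "pct3": 0.33, "ft": 0.69,
                  "att2": 0.65, "att3": 0.24, "att_ft": 0.11},
    "Wisconsin": {"pct2": 0.43, "pct3": 0.34, "ft": 0.74,
                  "att2": 0.49, "att3": 0.38, "att_ft": 0.13},
}

def points_per_attempt(t):
    """Expected points for each shot type (free throw trips are two shots)."""
    return {"two": 2 * t["pct2"], "three": 3 * t["pct3"], "ft_trip": 2 * t["ft"]}

def points_per_play(shooting, mix):
    """Expected points on a non-turnover play, given shooting and shot mix."""
    value = points_per_attempt(shooting)
    return (mix["att2"] * value["two"] + mix["att3"] * value["three"]
            + mix["att_ft"] * value["ft_trip"])

syr, wis = TEAMS["Syracuse"], TEAMS["Wisconsin"]
syr_play = points_per_play(syr, syr)
wis_play = points_per_play(wis, wis)
# Syracuse's shooting percentages combined with Wisconsin's shot selection.
syr_with_wis_mix = points_per_play(syr, wis)
print(round(syr_play, 4), round(wis_play, 4), round(syr_with_wis_mix, 4))
```

Giving Syracuse the Wisconsin shot mix raises its expected points per play, which is the sense in which its preference for two pointers costs it some efficiency.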
With Melo out, what might change? We could reasonably expect Wisconsin to shoot a percentage point or two better on two pointers, and both offensive rebounding percentages should shift in Wisconsin’s favor. If I move those three numbers by 2 percentage points each, Wisconsin wins 53.8% of my simulations.
This example highlights the value of simulations as opposed to simple rankings comparisons. Match ups matter, since different combinations of strengths and weaknesses combine in important ways. Simulating based on efficiency stats for each facet of the game helps get these combinations right. It also allows me to stress test the results based on specific changes in strategy or personnel. Now we’ll see how it performs in round 3.