Note: This post was submitted for Teamrankings.com’s Stat Geek Idol competition, with a few modifications/corrections made here (including 200 simulations per game instead of 50, which generates more consistent results). Thanks to Teamrankings for the data!
A few years ago, I ran my office NCAA pool. Right at the deadline, a Swiss economist whom I worked for came over, bracket and sheepish grin in tow. He knew nothing about basketball, but someone had explained the seeding system to him. He optimized based on the only inputs he had: he filled out the bracket purely by seed (choosing randomly among the one seeds in the Final Four). He finished second, of course, which was almost as bad as the year my wife won my pool by choosing teams from her favorite places.
Maybe this Swiss fellow saw through the charade. How predictive is seeding after all? Since 2007, the higher seed has won about 72% of all tournament games (picking the winner randomly when seeds are the same). The favorite by the gambling spread has won 73% of the time, so seeding does quite well. This year, I set out to match this performance using team statistics from the regular season (all stats below are for the 2007 through 2011 tournaments). I’ll have some upset picks at the end (which are the only way to differentiate your bracket from the average in a big pool), but you have to do the legwork with me first.
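For anyone who wants to run the same check, here is a minimal sketch of the seed-accuracy calculation. It is not the code used for this post: the function name and the sample games are made up, and equal-seed games are scored as half a correct pick (the expected value of a coin flip) rather than actually flipping a coin.

```python
def higher_seed_accuracy(games):
    """Share of games won by the better (numerically lower) seed.

    Each game is a (winner_seed, loser_seed) pair. Equal-seed games
    count as 0.5, the expected value of picking at random.
    """
    correct = 0.0
    for winner_seed, loser_seed in games:
        if winner_seed < loser_seed:
            correct += 1.0
        elif winner_seed == loser_seed:
            correct += 0.5
    return correct / len(games)


# Made-up results, for illustration only (not real tournament data).
sample_games = [(1, 16), (12, 5), (2, 15), (1, 1)]
accuracy = higher_seed_accuracy(sample_games)
```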
The question to ask is: what really matters in a basketball game? The team that scores more points wins. That seems like a fair (and obvious) starting point. In fact, the team with the higher average point differential during the regular season won 68% of tournament games (using Pythagorean expectations doesn’t improve this number). Not bad, but not as good as seeding.
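If you haven’t run into Pythagorean expectations before, the idea is to turn points scored and points allowed into an expected winning percentage. Here is a minimal sketch; the exponent of 10 is a placeholder I picked for illustration (it is a tunable parameter, and the post doesn’t say which value was tried), and the function name is hypothetical.

```python
def pythagorean_expectation(points_for, points_against, exponent=10.0):
    """Expected win share: PF^k / (PF^k + PA^k).

    The exponent k is an assumption here, not a value from the post.
    """
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


# A team that scores 75 and allows 65 per game, on average.
expected_win_share = pythagorean_expectation(75.0, 65.0)
```

To use it as a pick rule, you would choose the team with the higher expected win share, exactly as with raw point differential.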
Some teams play harder schedules than others, though, which point differential doesn’t capture. A simple proxy for the strength of your schedule is your conference. I defined three levels of conferences: the biggies (Pac 12, Big Ten, Big 12, Big East, ACC, SEC), the middle (Atlantic 10, CAA, CUSA, Horizon League, Mountain West, WCC), and the dregs. Conference affiliation alone predicts 63% of games correctly, but the real gains come from the combination. Using a linear regression to predict tournament point differential as a function of the strength of your conference and your average regular season point differential gets the winner right 71% of the time.
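A regression like this can be sketched in a few lines with NumPy. This is an illustration under my own assumptions, not the exact specification behind the 71% figure: each row is one team in one game, the conference tier enters as two dummy variables (with the dregs as the baseline), and the data are synthetic, generated from made-up coefficients so you can see that the fit recovers them.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coefs


def predict(coefs, row):
    """Predicted tournament margin for one team's feature row."""
    return float(coefs[0] + np.dot(coefs[1:], row))


# Columns: regular season point differential, big-conference dummy,
# middle-conference dummy. All values are synthetic.
X = np.array([
    [10.0, 1, 0],
    [4.0, 0, 1],
    [12.0, 0, 0],
    [6.0, 1, 0],
    [8.0, 0, 1],
    [3.0, 0, 0],
])
true_coefs = np.array([-2.0, 0.5, 3.0, 1.0])  # made up for the demo
y = true_coefs[0] + X @ true_coefs[1:]        # noiseless synthetic margins

coefs = fit_linear(X, y)
margin = predict(coefs, [10.0, 1, 0])
```

To pick a game, you would compute the predicted margin for both teams and take the one with the larger value.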
That’s getting pretty close to the seeding predictions, without using much information about the teams at all. It’s also a little boring, though. There’s nothing specific about match ups in these predictions, and match ups are what make games so compelling in the first place. My last approach works on the match up angle, using regular season efficiency (i.e., per possession or per play) stats for each team to predict the same statistics in the tournament.
Per game stats (points, rebounds, turnovers, etc.) aren’t great because teams play at different speeds. This means that some teams get more possessions to rack up stats than other teams. When two teams play each other, though, they get essentially the same number of possessions, because possession alternates between them. What really matters is how many points a team scores per possession, which is driven by just a few efficiency stats: shot selection, shooting percentage, offensive rebound percentage, and turnovers per possession.
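To make “per possession” concrete, here is a rough sketch of how these stats can be computed from box-score totals. Everything in it is an assumption on my part rather than the data definitions used for this post: the possession estimate is a common box-score approximation (the free throw weight varies by source), “shot selection” is read as the share of shots taken from three, and the function names and numbers are made up.

```python
def possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Estimate possessions from box-score totals.

    A common approximation; ft_weight is an assumed value and
    differs between sources.
    """
    return fga - orb + tov + ft_weight * fta


def efficiency_stats(pts, fga, fg3a, orb, opp_drb, tov, fta):
    """Per-possession and rate stats for one team."""
    poss = possessions(fga, orb, tov, fta)
    return {
        "points_per_possession": pts / poss,
        "three_attempt_rate": fg3a / fga,              # shot selection
        "offensive_rebound_pct": orb / (orb + opp_drb),
        "turnovers_per_possession": tov / poss,
    }


# Made-up box score for one team in one game.
stats = efficiency_stats(pts=70, fga=60, fg3a=15, orb=10,
                         opp_drb=30, tov=12, fta=20)
```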
I used the regular season values of these variables for each team in simple regressions (along with conference strength) to predict their values in each tournament game. For example, I regressed each team’s two point shooting percentage in each tournament game on its regular season two point shooting percentage, its tournament opponent’s regular season two point shooting percentage allowed, and controls for each team’s conference affiliation. Take the match up between Butler and UConn in the final last year. To estimate Butler’s two point shooting percentage in that game, I multiplied the coefficients from the regression by Butler’s regular season two point percentage, UConn’s two point percentage allowed, and their conference affiliations, then added up the results (i.e., I obtained the “predicted value”). I did this for each team in each tournament game. (Crazy stat note: Butler shot a horrifying 9% on two pointers in that game, which is the worst percentage for any team in any tournament game in the last five years.)
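The “predicted value” step is just an intercept plus a sum of coefficient times input. Here is a minimal sketch with entirely hypothetical coefficients and inputs (these are not the fitted values from my regressions, and they are not Butler’s or UConn’s real numbers); the feature names are made up as well.

```python
def predicted_value(coefs, features):
    """Intercept plus the sum of coefficient * feature value."""
    return coefs["intercept"] + sum(
        coefs[name] * value for name, value in features.items()
    )


# Hypothetical coefficients for tournament two point percentage.
coefs = {
    "intercept": 0.10,
    "own_two_pct": 0.50,           # team's regular season 2P%
    "opp_two_pct_allowed": 0.30,   # opponent's regular season 2P% allowed
    "own_big_conf": 0.01,          # 1 if the team is in a big conference
    "opp_big_conf": -0.02,         # 1 if the opponent is in a big conference
}

# Hypothetical inputs for one team in one match up.
features = {
    "own_two_pct": 0.50,
    "opp_two_pct_allowed": 0.45,
    "own_big_conf": 0,
    "opp_big_conf": 1,
}

expected_two_pct = predicted_value(coefs, features)
```

Repeating this for every efficiency stat, for both teams, gives the match-up-specific inputs for a game.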
Next, I simulated each game 200 times, using the efficiency variables to determine the likelihood of each outcome (e.g., made three pointer, turnover, offensive rebound after a miss) on each play. The team that won the majority of the 200 simulations won 70% of past tournament games. There’s still work to do with both these models, but they are pretty simple, and I’ve nearly matched the performance of seeding.
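Here is a stripped-down sketch of what a play-by-play simulation like this can look like. It is not my actual simulator: it ignores free throws, gives each team a fixed 65 possessions, breaks ties with a coin flip, and uses made-up team numbers. All of the names and parameters are assumptions for illustration.

```python
import random


def simulate_possession(team, rng):
    """Points scored on one possession (free throws ignored)."""
    if rng.random() < team["turnover_rate"]:
        return 0
    while True:
        is_three = rng.random() < team["three_attempt_rate"]
        make_prob = team["three_pct"] if is_three else team["two_pct"]
        if rng.random() < make_prob:
            return 3 if is_three else 2
        if rng.random() >= team["off_reb_pct"]:
            return 0  # defensive rebound ends the possession
        # Offensive rebound: the loop continues with another shot.


def simulate_game(team_a, team_b, rng, n_possessions=65):
    """Return True if team_a wins; ties are settled by a coin flip."""
    score_a = sum(simulate_possession(team_a, rng) for _ in range(n_possessions))
    score_b = sum(simulate_possession(team_b, rng) for _ in range(n_possessions))
    if score_a == score_b:
        return rng.random() < 0.5
    return score_a > score_b


def win_probability(team_a, team_b, n_sims=200, seed=0):
    """Share of simulated games won by team_a."""
    rng = random.Random(seed)
    wins = sum(simulate_game(team_a, team_b, rng) for _ in range(n_sims))
    return wins / n_sims


# Made-up efficiency profiles, deliberately lopsided.
strong = {"turnover_rate": 0.10, "three_attempt_rate": 0.35,
          "two_pct": 0.60, "three_pct": 0.45, "off_reb_pct": 0.40}
weak = {"turnover_rate": 0.30, "three_attempt_rate": 0.35,
        "two_pct": 0.35, "three_pct": 0.25, "off_reb_pct": 0.20}

p_strong = win_probability(strong, weak)
```

In practice the inputs would be the predicted tournament values for each match up rather than fixed team profiles, and the team that wins more than half of the simulations is the pick.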
With the coefficients from the efficiency stats regressions, I can estimate the expected point differential for each of the games this year. Just like above, I multiplied the regular season efficiency stats for this year’s teams by the corresponding coefficients from my regressions to output the expected stats for the tournament. I simulated all the first round match ups using these stats, and your official upset picks are (win probabilities in parentheses): 11 N.C. State over 6 San Diego St. (67%), 9 Alabama over 8 Creighton (63%), 12 VCU over 5 Wichita State (61%), 11 Colorado St. over 6 Murray St. (56%), 9 St. Louis over 8 Memphis (54%), and 11 Texas over 6 Cincinnati (53%)! It could be a tough year for six seeds.
9 UConn over 8 Iowa St. and 10 West Virginia over 7 Gonzaga (both 46%) were the next closest, then 14 St. Bonaventure over 3 Florida St. (45%), 10 Virginia over 7 Florida (44%), 11 Colorado over 6 UNLV (43%), 10 Purdue over 7 St. Mary’s (41%), 10 Xavier over 7 Notre Dame (40%), and 13 Davidson over 4 Louisville (40%).
Coming up later, I simulate THE WHOLE 2012 TOURNAMENT!!!!!!!!