For a little over a month now I have been working on an algorithm to predict winners and losers in college basketball. I first started by looking at which stats would help me figure out which team would win. I read many articles and blog posts about what good teams do to win basketball games. I settled on the Four Factors of basketball by Dean Oliver. Those four factors are shooting (40%), turnovers (25%), rebounding (20%), and free throws (15%).
So using those factors I started looking at basketball stats. I started out with Effective Field Goal Percentage (eFG%) for shooting. The formula for eFG% is (FG + 0.5 * 3P) / FGA. The formula gives extra credit for 3-pointers, since they are worth more than 2-pointers. While doing some research I started reading about TS% (True Shooting Percentage). True Shooting Percentage takes into account 2-point field goals, 3-point field goals, and free throws. So I decided to use TS% instead of eFG%.
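Here is a minimal sketch of both shooting stats in plain Python. The eFG% function follows the formula above. The TS% function uses the commonly published formula, PTS / (2 * (FGA + 0.44 * FTA)), which is not spelled out in this post, so treat it as an assumption about what the model uses. The function names and the sample totals are made up for illustration.

```python
def efg_pct(fg, three_p, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    return (fg + 0.5 * three_p) / fga


def ts_pct(pts, fga, fta):
    """True shooting percentage, assuming the commonly published formula:
    PTS / (2 * (FGA + 0.44 * FTA))."""
    return pts / (2 * (fga + 0.44 * fta))


# Made-up season totals for a hypothetical team (not real data)
efg = efg_pct(fg=800, three_p=200, fga=1800)
ts = ts_pct(pts=2300, fga=1800, fta=600)
print(f"eFG%: {efg:.3f}  TS%: {ts:.3f}")
```

Because the functions only use basic arithmetic, they should also work if you pass in pandas columns instead of single numbers.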
TOV% (Turnover Percentage) estimates how many times a team will turn the ball over per 100 plays. The formula for TOV% is 100 * TOV / (FGA + 0.44 * FTA + TOV).
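The turnover formula translates to code almost word for word. This is just a sketch; the function name and the sample numbers are made up.

```python
def tov_pct(tov, fga, fta):
    """Turnover percentage: 100 * TOV / (FGA + 0.44 * FTA + TOV)."""
    return 100 * tov / (fga + 0.44 * fta + tov)


# Made-up season totals for a hypothetical team (not real data)
tov = tov_pct(tov=400, fga=1800, fta=600)
print(f"TOV%: {tov:.1f}")
```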
ORB% and DRB% are Offensive and Defensive Rebound Percentage. For this model I used DRB%. While writing this I am wondering how much of a difference it would make if I used TRB% (Total Rebound Percentage). Defensive Rebound Percentage is an estimate of the percentage of available defensive rebounds a team gets. The formula is 100 * (DRB * (TmMP / 5)) / (MP * (TmDRB + OppORB)).
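A sketch of the rebounding formula is below. The formula above includes minutes terms (MP and TmMP), which is how it is usually written for an individual player. For a whole team those terms should cancel, leaving 100 * TmDRB / (TmDRB + OppORB), so the sketch includes that simplified version as well. Which of the two the model actually uses is an assumption on my part, and the function names and numbers are made up.

```python
def drb_pct(drb, mp, tm_mp, tm_drb, opp_orb):
    """Defensive rebound percentage as written in the post:
    100 * (DRB * (TmMP / 5)) / (MP * (TmDRB + OppORB))."""
    return 100 * (drb * (tm_mp / 5)) / (mp * (tm_drb + opp_orb))


def team_drb_pct(tm_drb, opp_orb):
    """Team-level simplification (an assumption): the share of available
    defensive rebounds the team collected."""
    return 100 * tm_drb / (tm_drb + opp_orb)


# Made-up numbers: treating the team as one "player" whose minutes are
# TmMP / 5 gives the same answer as the simplified team formula.
full = drb_pct(drb=750, mp=40, tm_mp=200, tm_drb=750, opp_orb=250)
simple = team_drb_pct(tm_drb=750, opp_orb=250)
print(full, simple)
```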
Lastly I calculate FTR (Free Throw Rate), which is a measure of how often a team gets to the line and how often they actually make free throws. Again, while typing this out I noticed something: my formula for free throw rate may be off. I was using FT/FGA. I think the formula should be FTA/FGA. I will update and post if there are any major changes.
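To see how much the two candidate formulas differ, here is a small sketch that computes both side by side. FT/FGA rewards making free throws, while FTA/FGA only measures how often a team gets to the line. The function names and the sample totals are made up.

```python
def ft_per_fga(ft, fga):
    """Free throws made per field goal attempt (FT / FGA)."""
    return ft / fga


def fta_per_fga(fta, fga):
    """Free throw attempts per field goal attempt (FTA / FGA)."""
    return fta / fga


# Made-up season totals for a hypothetical team (not real data)
made_rate = ft_per_fga(ft=450, fga=1800)
attempt_rate = fta_per_fga(fta=600, fga=1800)
print(f"FT/FGA: {made_rate:.3f}  FTA/FGA: {attempt_rate:.3f}")
```

Since a team can never make more free throws than it attempts, FT/FGA will always be less than or equal to FTA/FGA for the same team.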
For my data analysis I use Python and the pandas library. I will post the notebook on Kaggle for anyone who may be interested.
I use Python to go out to Sports Reference and download the latest stats. I then clean up the data and calculate the TS%, TOV%, DRB%, and FTR for each team. There are sites that do this and post the stats. My problem was importing those stats, so I just use the raw data and calculate them myself.
Once the stats have been calculated I calculate what I call a c_score: I assign each stat a weighted value and add TS%, DRB%, and FTR while subtracting TOV%.
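A rough sketch of a c_score calculation is below. The post does not say which weights are used, so this assumes the Four Factors weights mentioned earlier (40% shooting, 25% turnovers, 20% rebounding, 15% free throws). It also assumes all four stats are expressed on the same 0 to 100 scale before weighting. Both are assumptions, not necessarily what the actual model does, and the names and numbers are made up.

```python
# Assumed weights, borrowed from the Four Factors percentages
WEIGHTS = {"ts": 0.40, "tov": 0.25, "drb": 0.20, "ftr": 0.15}


def c_score(ts, tov, drb, ftr, weights=WEIGHTS):
    """Weighted sum of the four stats: TS%, DRB% and FTR are added,
    TOV% is subtracted. All inputs are assumed to be on a 0-100 scale."""
    return (
        weights["ts"] * ts
        + weights["drb"] * drb
        + weights["ftr"] * ftr
        - weights["tov"] * tov
    )


# Made-up stats for a hypothetical team (not real data)
score = c_score(ts=55.0, tov=16.0, drb=75.0, ftr=30.0)
print(f"c_score: {score:.2f}")
```

Subtracting TOV% means a team that turns the ball over more ends up with a lower score, all else being equal.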
Each team is assigned a c_score. If a team has a higher c_score than another team, I predict that team will win. I currently predict 69% correct for the top 20 teams in the current AP poll. That means I have made a prediction for each of those teams' games to this point. So for all the games NC State has played so far this season, my model has predicted the winner correctly 52% of the time.
NC State 52% (Model predicted Syracuse to win. The Orange’s top player got hurt in the first 5 minutes and did not play the rest of the game.)
UNC 39%
Penn State 35%
Colorado 75%
Marquette 71%
Butler 72%
Houston 76%
Illinois 88%
Duke 83%
Gonzaga 85%
Baylor 45%
Kansas 55%
Dayton 86%
FSU 50%
Maryland 59%
Villanova 86%
Auburn 86%
Seton Hall 55%
West Virginia 59%
Oregon 57%
Kentucky 78%
Michigan St 75%
Iowa 75%
LSU 74%
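Percentages like the ones above come from comparing c_scores game by game and counting how often the pick was right. Here is a minimal sketch of that bookkeeping. The team names, scores, and results are placeholders, not real data, and the tie-breaking rule (a tie goes to the second team) is an arbitrary choice for the example.

```python
def predict_winner(team_a, team_b, c_scores):
    """Pick the team with the higher c_score (ties go to team_b)."""
    return team_a if c_scores[team_a] > c_scores[team_b] else team_b


def accuracy(games, c_scores):
    """Percentage of games where the predicted winner actually won.
    Each game is a tuple: (team_a, team_b, actual_winner)."""
    correct = sum(
        1 for a, b, winner in games if predict_winner(a, b, c_scores) == winner
    )
    return 100 * correct / len(games)


# Placeholder c_scores and results (not real data)
c_scores = {"Team A": 40.0, "Team B": 35.0, "Team C": 30.0}
games = [
    ("Team A", "Team B", "Team A"),
    ("Team A", "Team C", "Team C"),
    ("Team B", "Team C", "Team B"),
    ("Team A", "Team B", "Team B"),
]
pct = accuracy(games, c_scores)
print(f"Correct: {pct:.0f}%")
```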
The model does not take into account injuries, home vs away or other factors.
In the list above, the model's accuracy ranges from a low of 35% with Penn State to a high of 88% with Illinois. My goal is to have a c_score calculated for all the teams in the NCAA tournament bracket using data through the conference tournaments.
If you have any questions or comments please let me know.