
NCAA Basketball – To Score More, You Should Miss More

Econometrics


Using a multivariate analysis, I studied which factors affect the average points per game of men's NCAA basketball teams. The independent variables were the average field goals made per game, the average field goals attempted per game, and the field goal percentage over the season for the top 64 teams.

[Figure: men's NCAA basketball points-per-game regression output]

Looking at the model, it appears that a 1 percent increase in field goal percentage decreases a team's average points per game by 0.09 points, holding all else constant. In other words, the better a team shoots, the fewer points it scores per game on average.

That doesn't make much sense: as your shooting percentage goes up, your points per game go down?

There is a good reason for this.

In econometrics, we run multivariate regression analyses all the time. This is the process of estimating the relationship between multiple independent variables and the dependent variable we are focused on.
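As a rough illustration, here is a minimal sketch of how such a multivariate regression might be run in Python with statsmodels. The file name and the column names (ppg, fgm, fga, fg_pct) are hypothetical stand-ins for the team-level basketball data described above.

```python
# A minimal sketch of the multivariate regression described above, assuming a
# hypothetical CSV of the top 64 teams with columns ppg, fgm, fga and fg_pct.
import pandas as pd
import statsmodels.api as sm

teams = pd.read_csv("ncaa_top64.csv")                  # hypothetical data file

X = sm.add_constant(teams[["fgm", "fga", "fg_pct"]])   # independent variables + intercept
y = teams["ppg"]                                       # dependent variable: points per game

model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, standard errors, t-values, p-values
```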

We often run into a problem with multivariate regression models: multicollinearity, also known as collinearity.

Multicollinearity occurs when two or more independent variables in a regression model are correlated.

This is not what we want in our regression models because our independent variables, well, should be independent of one another.

Types of Multicollinearity

Multicollinearity isn’t always a bad thing. In fact, there are multiple types of multicollinearity. Sometimes it can be useful in our regressions.

Structural Multicollinearity

When running a regression model, you want to find the best fit for your data. This might mean, for example, squaring the values of one of your independent variables to create a new variable. So if X is one of your variables, you would also create and test X² to see whether it gives a better fit.

This creates collinearity between the independent variables X and X². That makes sense: the new variable is built directly from the original X, so the two are correlated by construction.
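A small sketch of how this plays out, using simulated data rather than the basketball data:

```python
# Sketch of structural multicollinearity with simulated data: the squared term
# is built directly from x, so the two regressors are correlated by construction.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)   # some hypothetical positive predictor
df = pd.DataFrame({"x": x, "x_squared": x ** 2})

print(df.corr())   # the off-diagonal correlation is close to 1
```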

This is an intentional, structural case of multicollinearity; we are more interested in data multicollinearity.

Data Multicollinearity

This kind of collinearity arises from poor regression setup. It occurs when there is insufficient data, when dummy variables are missing, or when variables are repetitive or identical.

Insufficient data can make the interpretation of the model unreliable because the estimators do not have enough observations to be accurate. This can often be counteracted simply by gathering more data.

A dummy variable is a binary variable that takes the value 0 or 1. For gender, for example, males can be coded as 0 and females as 1. This is how you add categorical variables to your regression. Forgetting to include a relevant dummy variable will significantly skew the estimators of the variables that were included.
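A minimal sketch of coding a 0/1 dummy; the column and category labels here are hypothetical:

```python
# Sketch of coding a categorical variable as a 0/1 dummy; the column and
# category labels are hypothetical.
import pandas as pd

people = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
people["female"] = (people["gender"] == "female").astype(int)   # female = 1, male = 0

print(people)
```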

Perfect Multicollinearity

Repeating variables is also a cause of multicollinearity, for example if one of your variables is length in inches and another is length in centimeters. These are the same variable measured in different units; this is what we call perfect multicollinearity.

Similarly, adding a variable that is the sum of two other variables already in your regression will also lead to multicollinearity. It is repetitive and unnecessary.
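Here is a small sketch of why inches and centimeters amount to perfect multicollinearity, using made-up lengths:

```python
# Sketch of perfect multicollinearity: centimeters is an exact multiple of
# inches, so the two columns carry the same information.
import numpy as np

inches = np.array([70.0, 72.5, 68.0, 75.0])
centimeters = inches * 2.54   # exact linear function of inches

print(np.corrcoef(inches, centimeters)[0, 1])   # exactly 1.0

# A design matrix with an intercept plus both columns is rank-deficient,
# so OLS cannot separate their effects.
X = np.column_stack([np.ones_like(inches), inches, centimeters])
print(np.linalg.matrix_rank(X))   # 2, not 3
```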

Potential Problem

The idea of a regression model is to regress back to the root cause of the dependent variable. We do this by isolating the impact of each variable while holding all other variables constant.

The problem occurs when two independent variables are correlated. It is difficult to isolate one variable and hold the others constant when changes in one cause changes in the other; the variables tend to move in unison, making them hard to separate. The more strongly two independent variables are correlated, the harder they are to isolate.

The Analysis

Having this correlation in the model can cause significant problems when analyzing it. The coefficient estimates swing wildly and are very sensitive to small changes in the model. It also reduces the precision of the estimates, meaning your p-values aren't very trustworthy, so you can't make any meaningful statistically significant claims.

It is quite easy to spot multicollinearity in your model. A large change in the value of an estimator, or a change in its sign, after only a small change to the model is a good indication that you may have multicollinearity.
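A quick simulation (not the basketball data) of how an estimator can swing and flip sign once a nearly identical regressor is added:

```python
# Simulated illustration of the symptom described above: y truly depends only
# on x1, but adding a nearly identical regressor x2 makes the individual
# estimates unstable and can even flip their signs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is almost a copy of x1
y = 2.0 * x1 + rng.normal(size=n)          # true effect of x1 is +2, x2 plays no role

# x1 alone: the estimate lands near +2.
print(sm.OLS(y, sm.add_constant(x1)).fit().params)

# x1 and x2 together: their estimates swing wildly (only their sum stays near +2).
print(sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params)
```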

Men’s NCAA Basketball Regression

Looking back at the initial claim that as shooting percentage increases, points per game decrease, it is clear that it was a very poor regression and a very poor claim.

[Figure: men's NCAA basketball points-per-game regression output]

Set aside the fact that the t-value is statistically insignificant and the standard error is quite large. The crucial thing to point out is the other independent variables included in the model.

They are highly correlated with one another, because field goal percentage is, by definition, field goals made divided by field goals attempted. It is very difficult to isolate these variables from one another to test their effects. You can't hold average field goals attempted and field goal percentage constant while changing average field goals made; as the average field goals made per game changes, the field goal percentage or the average field goals attempted must change as well.
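As a sketch, using the same hypothetical file and column names as before, the correlation matrix of the three shooting variables makes the overlap obvious:

```python
# Sketch of the overlap between the three shooting variables, using the same
# hypothetical file and column names as above; fg_pct is fgm / fga by definition.
import pandas as pd

teams = pd.read_csv("ncaa_top64.csv")
shooting = teams[["fgm", "fga"]].copy()
shooting["fg_pct"] = shooting["fgm"] / shooting["fga"]

print(shooting.corr())   # expect large pairwise correlations
```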

There is multicollinearity between the variables. To make it even clearer, here is another regression, this time of field goal percentage alone on average points per game.

[Figure: men's NCAA basketball field goal percentage on points-per-game regression output]

Look at the estimator. It switched signs and is now positive at 0.5452, which checks both boxes for indications of multicollinearity: a sign change and a wild swing in the estimator's value.

In general, this is also not a great regression model, but that is not the point of this article. The point is to show that even in multivariate analysis you have to watch out for glaring problems in the setup and analysis of the model, otherwise you might be caught believing that increasing your shooting percentage will decrease the number of points you score per game.
