Model Selection Schema

There are various model selection criteria in use for choosing variables in linear regression, and some of them apply to models beyond linear regression as well.

Akaike Information Criterion

The Akaike Information Criterion, or AIC, measures the strength of a given model at describing the data relative to competing models. As such, if all of the candidate models fit poorly, AIC gives no indication of that. AIC was developed by Hirotugu Akaike in 1974 and draws its justification from information theory.

    \begin{equation*} \text{AIC} = 2p - 2\ln(L) \end{equation*}

where p is the number of parameters estimated in the model, and L is the maximized value of the model's likelihood function.

AIC effectively penalizes the model for having too many predictor variables. The “best fit” is the one that minimizes the AIC, because it offers the best tradeoff between maximizing the log-likelihood and minimizing the number of predictors in the model. Additional predictors are included only if they add enough information to justify their inclusion.
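
As a minimal sketch of the computation, the hypothetical helper below fits an ordinary least squares model with NumPy and evaluates AIC from the Gaussian log-likelihood. Note that conventions for what p counts (for example, whether the error variance is included) vary between texts and software packages.

    import numpy as np

    def gaussian_aic(y, X):
        """AIC = 2p - 2 ln(L) for an OLS fit with Gaussian errors.

        Hypothetical helper: here p counts the regression coefficients plus
        the error variance; some sources count only the coefficients.
        """
        n, k = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
        resid = y - X @ beta
        sigma2 = resid @ resid / n                     # ML estimate of the error variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        p = k + 1                                      # coefficients + variance term
        return 2 * p - 2 * loglik

Candidate models fit to the same response can then be ranked by this value, with the smallest AIC preferred.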

Coefficient of Determination

R^2 is the coefficient of determination. It is the proportion of the variation in the dependent variable that is explained by variation in the independent variables. We must first define some notation:

  • \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i is the mean of the observed y values.
  • \hat{y}_i is the predicted value for the dependent variable y at the ith observation.
  • Sums of Squares – the following sums of squared deviations:
    • SS_{tot} = \sum_{i=1}^n(y_i - \bar{y})^2 is the total deviation from the mean, or the total variation in the data.
    • SS_{reg} = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 is the variation explained by the regression model. This is the value we are trying to maximize, with a theoretical ceiling at SS_{tot}.
    • SS_{err} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 is the unexplained variation in the model, that is, the deviation from the predicted values. This is the value we are trying to minimize.

    \begin{equation*} R^2 = \frac{SS_{reg}}{SS_{tot}} =  1-\frac{SS_{err}}{SS_{tot}} \end{equation*}
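
As a sketch, assuming y and the fitted values are NumPy arrays, the sums of squares and R^2 above can be computed directly:

    import numpy as np

    def r_squared(y, y_hat):
        """Coefficient of determination from observed and fitted values."""
        y_bar = y.mean()
        ss_tot = np.sum((y - y_bar) ** 2)      # total variation about the mean
        ss_reg = np.sum((y_hat - y_bar) ** 2)  # variation explained by the model
        ss_err = np.sum((y - y_hat) ** 2)      # unexplained variation
        # For OLS with an intercept, SS_tot = SS_reg + SS_err, so the two
        # forms of R^2 coincide; the 1 - SS_err/SS_tot form is used here.
        return 1 - ss_err / ss_tot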

R^2 is an absolute measure of “Goodness of Fit” of a model. However, R^2 never decreases when predictors are added, so on its own it cannot penalize overly large models. To get a measure of fit that penalizes a high number of predictors, we can use the Adjusted R^2, denoted R_{adj}^2.

    \begin{equation*} R_{adj}^2 = 1-\frac{SS_{err}/df_{err}}{SS_{tot}/df_{tot}} \end{equation*}

where df_{err}=n-p-1 is the degrees of freedom of the error term (p being the number of predictors), and df_{tot}=n-1 is the total degrees of freedom. (We subtract one degree of freedom for the calculation of the mean, \bar{y}.)
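
Continuing the same sketch, the adjusted R^2 only needs the two degree-of-freedom corrections; here p is the number of predictors, not counting the intercept.

    import numpy as np

    def adjusted_r_squared(y, y_hat, p):
        """Adjusted R^2 for a model with p predictors (intercept not counted in p)."""
        n = len(y)
        ss_err = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        df_err = n - p - 1                     # error degrees of freedom
        df_tot = n - 1                         # total degrees of freedom
        return 1 - (ss_err / df_err) / (ss_tot / df_tot)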

R_{adj}^2 is a useful criterion for comparing models and determining how well a given model represents the data. It penalizes adding more predictors to the model in a similar vein to AIC, but does so in a different manner. The models selected as “best” will not necessarily be the same, and AIC is generally preferred. In some disciplines (e.g., econometrics) both of these criteria are of limited use.

Variance Inflation Factor

The Variance Inflation Factor (VIF) gives a numerical value to the severity of multicollinearity in a dataset. Multicollinearity is the degree to which some of the predictors in your dataset can be approximated as linear combinations of the other predictors; ideally, the predictors would be independent of one another. The VIF gives us a quantitative way of describing the multicollinearity associated with a given variable.

For a given predictor variable in a predictive model, the VIF for that variable is a function of the R^2 value of a regression predicting that variable from all of the other predictor variables in the model. For variable k:

    \begin{equation*} VIF_k = \frac{1}{1-R_k^2}  \end{equation*}

where R_k^2 is the R^2 of the model \hat{x}_k = \beta_0 + \beta_1x_1+\ldots+\beta_{k-1}x_{k-1}+\beta_{k+1}x_{k+1}+\ldots+\beta_px_p. Since R^2 is the coefficient of determination, the higher R_k^2 is, the better x_k can be approximated by the other variables in the model, and the VIF is a function of this. The higher the predictive ability of the other variables on x_k, the larger the variance of that variable's estimated coefficient becomes, and the less reliable any inference made with that coefficient is. As a general rule, we try to omit variables with a VIF higher than 5; that may be difficult to do in practice, so we treat this as a guideline rather than a hard cutoff.
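
A minimal sketch of this calculation, assuming the predictors are the columns of a NumPy array X, regresses column k on the remaining columns (plus an intercept) and applies the formula above; the function name is illustrative.

    import numpy as np

    def vif(X, k):
        """Variance inflation factor for column k of the predictor matrix X."""
        xk = X[:, k]
        others = np.delete(X, k, axis=1)                      # all other predictors
        others = np.column_stack([np.ones(len(xk)), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(others, xk, rcond=None)
        xk_hat = others @ beta
        r2_k = 1 - np.sum((xk - xk_hat) ** 2) / np.sum((xk - xk.mean()) ** 2)
        return 1.0 / (1.0 - r2_k)

If statsmodels is available, its variance_inflation_factor function (in statsmodels.stats.outliers_influence) performs an equivalent computation on a design matrix.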

None of these criteria are absolute, and often they do not agree with each other on what the best model is. You as a statistician must use your best judgement in model selection.