In this tutorial, we will work through linear regression and variable selection using the cirrhosis dataset. We want to predict the death rate from cirrhosis in a population using a few descriptor variables from that population. The variables we have are:
- urbanpop – the size of the urban population
- lowbirth – the reciprocal of the number of births to women between 45 and 49, times 100
- wine – consumption of wine per capita
- liquor – consumption of hard liquor per capita
- cirrhosis – the death rate from cirrhosis
The first step in the analysis is, of course, to read the data into SAS. Download the dataset and replace filepath below with the location of the file on your machine (try shift-right-clicking the drinking.txt file to copy its path).
filename fn 'filepath\drinking.txt'; /* replace filepath to run code */

data liquor;   /* new dataset 'liquor' */
   infile fn;  /* read the file at the previously declared filepath */
   input id one urbanpop lowbirth wine liquor cirrhosis;
run;

/* id and one are unnecessary */
data liquor;
   set liquor(drop=id one); /* dropping the unnecessary variables */
run;
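It can be worth a quick check that the file read in correctly and that the drop worked. A minimal sketch, using nothing beyond PROC CONTENTS and PROC PRINT on the liquor data set we just created:

/* Confirm the variable list (id and one should be gone) */
proc contents data=liquor;
run;

/* Eyeball the first few observations */
proc print data=liquor(obs=5);
run;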
Declaring a filename isn’t necessary, but it does make the resulting code look cleaner. The variable id is a unique identifier for each observation and isn’t really needed. Similarly, the variable one is just a column of ones. We can safely eliminate both. The next step is preliminary data analysis. A good idea is to see how correlated the predictors are. To do this we try
proc corr data=liquor;
   var urbanpop lowbirth wine liquor;
run;
We can see that the predictors are heavily correlated. Since we are trying to avoid multicollinearity, we can expect that some of the predictors will be dropped from the model.
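If you want a visual check as well, a scatter plot matrix gives a quick picture of these relationships. This is a sketch using PROC SGSCATTER, which is not part of the original analysis:

/* Scatter plot matrix of the predictors and the response */
proc sgscatter data=liquor;
   matrix urbanpop lowbirth wine liquor cirrhosis;
run;

With the correlations in mind, we fit the full model with all four predictors.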
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor;
run;
The t values for both urbanpop and liquor are insignificant with the other three variables in the model. Before doing variable selection, there are some other things we can look at. A good one is the VIF, or Variance Inflation Factor. The higher the VIF, the better that variable can be predicted from the other variables in the model, and thus the greater the multicollinearity. The VIF estimates how much the variance of a coefficient is inflated by that variable’s correlation with the other predictors. A common rule of thumb is to keep the VIF for any particular variable under 5.
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor / vif covb;
run;
The VIF for lowbirth is the highest, so we might consider removing that variable from the model. However, urbanpop also has a high VIF and is not significant in the model, so it would be a good idea to remove it before continuing. Because I’m also going to demonstrate stepwise variable selection, I’m going to leave it in.
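To see where the VIF numbers come from, you can compute one by hand: regress a predictor on the remaining predictors and take 1/(1 − R-square). A sketch for lowbirth (the exact value will depend on the data, but it should match what PROC REG reports):

/* VIF for lowbirth by hand: regress it on the other predictors */
proc reg data=liquor;
   model lowbirth = urbanpop wine liquor;
run;
/* VIF(lowbirth) = 1 / (1 - R-square from the model above) */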
The covb option gives the variance-covariance matrix of the beta coefficients. It lets you see how closely related the estimates for any two covariates are. This is not quite as useful as the VIF, because it only checks for pairwise correlation, whereas the VIF captures true multicollinearity (a variable being predictable from a combination of the others). Still, it is a good idea to look at this matrix to understand your data better.
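If the raw covariances are hard to read, PROC REG can also print the correlation matrix of the estimates with the corrb option, which puts everything on a −1 to 1 scale. A sketch:

/* Correlation (rather than covariance) matrix of the coefficient estimates */
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor / corrb;
run;

Next we turn to observation-level diagnostics.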
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor / p r influence;
run;
The p option gives the predicted value for each observation. The r option gives each observation's residual, along with a graphical way of viewing those residuals. This provides another way of finding outliers.
The influence option lets us see how much influence any given observation had on the overall fit, or on any of the beta values. Observations with a Cook's Distance greater than .10 are generally flagged as influential. In keeping with the principles of linear regression, we don't want any individual points having too much influence on the fit of the regression or on the individual betas.
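If you would rather work with the diagnostics as data instead of scanning the printed output, the OUTPUT statement can save them. A sketch that flags observations above the .10 Cook's Distance cutoff mentioned above (the data set names diag and flagged are arbitrary):

/* Save predictions, residuals, and influence statistics to a data set */
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor;
   output out=diag p=pred r=resid rstudent=rstud cookd=cookd;
run;

/* Keep only the potentially influential observations */
data flagged;
   set diag;
   if cookd > 0.10;
run;

proc print data=flagged;
run;

Next, we move on to variable selection.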
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor / selection=stepwise;
run;
Here we finally run stepwise variable selection. PROC REG's stepwise method adds and removes variables based on F-test significance levels (by default, a variable must be significant at 0.15 to enter and stay significant at 0.15 to remain). In doing variable selection, SAS removes both urbanpop and liquor from the model, confirming our earlier assessment that they were not helpful in fitting the model. Looking at the R-square, the value did not change much; however, the adjusted R-square is slightly better. Of course, this does not really showcase the value of stepwise variable selection, as it tells us what we already knew.
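The entry and stay thresholds can be set explicitly if you want to be stricter or looser than the defaults. A sketch that just spells out the default 0.15 levels:

/* Stepwise selection with the entry/stay significance levels written out */
proc reg data=liquor;
   model cirrhosis = urbanpop lowbirth wine liquor / selection=stepwise slentry=0.15 slstay=0.15;
run;

We now refit the selected model to get predictions and intervals.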
proc reg data=liquor;
   model cirrhosis = lowbirth wine / p cli clm;
run;
This is our final model. As we already went over, p gives the predicted value for each observation. cli gives the 95% prediction interval for that observation (interpret it as: in what interval would a single new observation with those covariate values fall?), while clm gives the 95% confidence interval for the mean response (interpret it as: in what interval do we expect the average of observations with those covariate values to fall?).
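If you would rather have these intervals in a data set than in the printed output, the OUTPUT statement can save them as well. A sketch (the data set name pred_out and the variable names below are arbitrary):

/* Save predictions, prediction limits (lcl/ucl), and mean limits (lclm/uclm) */
proc reg data=liquor;
   model cirrhosis = lowbirth wine;
   output out=pred_out p=pred lcl=pred_lo ucl=pred_hi lclm=mean_lo uclm=mean_hi;
run;

proc print data=pred_out;
run;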