# Random Forests (R)

We will apply the random forest method to the Adult dataset here.

We will begin by importing the data, doing some pre-filtering, combining categories into classes, and generating two subsets of the data: the training set, which we will use to train the random forest model, and the evaluation set, which we will use to evaluate the performance of the model.

```
source("http://scg.sdsu.edu/wp-content/uploads/2013/09/dataprep.r")
```

For more information on what this source code does to prepare the data, visit the tutorial: “Data Preparation with R (Adult Dataset)”. This allows us to move on to model specification.

Random forests are an extension of tree methods. Whereas in the CART methodology the inherent variability of the tree model and its susceptibility to the data were considered weaknesses, in random forests that variability is treated as a strength. In fact, we take special steps to ensure as much variability in the individual trees as possible. The belief is that by averaging across many high-variance, low-bias trees, we will end up with a low-bias, low-variance estimator. We hope that the random forest will do better than any single tree could. We need to load the libraries we'll be using.

```
library(randomForest)
library(ROCR)
```
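The variance-reduction intuition above can be sketched with a toy simulation. This is my own illustration, not part of the tutorial's code, and it uses independent samples for simplicity; real forests average correlated bootstrap trees, which limits the gain somewhat.

```
# Toy sketch: each "tree" is a noisy estimate of a quantity;
# averaging many of them gives a far more stable estimator.
set.seed(1)
single <- replicate(1000, mean(rnorm(10)))                       # one high-variance estimate
forest <- replicate(1000, mean(replicate(25, mean(rnorm(10)))))  # average of 25 such estimates
var(single)   # variance of a single estimate
var(forest)   # much smaller: averaging shrinks the variance
```

With independent estimates the variance drops by roughly the number of estimates averaged; the point of the random-selection tricks in random forests is to decorrelate the trees so the averaging comes as close to this ideal as possible.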

Next we'll need to find the optimal number of variables to try splitting on at each node.

```
bestmtry <- tuneRF(data$train[-13], data$train$income, ntreeTry=100,
    stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE, doBest=FALSE)
```

For this, the syntax I'm using for the tuneRF function gives the function the first 12 columns of the training set (all variables but income) as independent variables. The specification data$train[-13] means "all but the 13th column of the set data$train". As the dependent variable, I give the algorithm data$train$income, which is the income column of the set data$train. ntreeTry specifies the number of trees to build while trying out different values of mtry. See the help file (help(tuneRF)) for more information.
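As a quick aside on the negative-index convention, here is a small hypothetical data frame (not the Adult data) showing how dropping a column by position works:

```
# Negative column indices drop columns: df[-3] keeps every column but the third
df <- data.frame(age=c(25, 40), hours=c(38, 45), income=c("<=50K", ">50K"))
df[-3]          # same rows, but without the income column
names(df[-3])   # "age" "hours"
```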

The algorithm has chosen 2 as the optimal value of mtry for each tree. Now all that is left to do is run the algorithm.

```
adult.rf <- randomForest(income~., data=data$train, mtry=2, ntree=1000,
    keep.forest=TRUE, importance=TRUE)
```

In the formula for the random forest, we specify income as the response and all other variables in the set as predictors. We specify the mtry value to be 2, matching the optimal value we found earlier. The number of trees we use here is 1000, though computers are fast now and it's not unreasonable to use 20,000 or more trees to build the model. We ask the algorithm to keep the forest, as well as calculate the variable importance so we can see what contributed most to estimating income classification.
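As an aside, not part of the original command sequence, the randomForest package can also chart these importance measures directly with its varImpPlot function once the model above has been fit:

```
# Dot chart of each predictor's mean decrease in accuracy
# and mean decrease in Gini impurity for the fitted forest
varImpPlot(adult.rf, main="Variable importance for income classification")
```

Predictors near the top of the chart are the ones the forest leaned on most heavily when classifying income.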

Finally, we can evaluate the performance of the random forest for classification. Since we have been eschewing the confusion matrix in favor of the ROC curve, we should probably continue that. The commands to generate the ROC curve using the ROCR package are as follows:

```
adult.rf.pr <- predict(adult.rf, type="prob", newdata=data$val)[,2]
```

This stores the predicted probability of the positive income class (the second column of the probability matrix) for each observation in the validation set. We can also print the variable importance measures we asked the algorithm to calculate:

```
importance(adult.rf)
```
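The predict call above yields only the scores; to actually draw the curve with ROCR, a sketch along these lines should work. The object names adult.rf.pred and adult.rf.perf are my own, and data$val$income is assumed to hold the true labels for the validation set.

```
adult.rf.pred <- prediction(adult.rf.pr, data$val$income)  # pair scores with true labels
adult.rf.perf <- performance(adult.rf.pred, "tpr", "fpr")  # true/false positive rates
plot(adult.rf.perf, main="ROC curve for the random forest")
abline(a=0, b=1, lty=2)  # reference line for a random classifier
```

The further the curve bows toward the upper-left corner away from the dashed reference line, the better the forest separates the two income classes.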