Tag Archives: Performance

Random Forests (R)

We will apply the random forest method to the Adult dataset here. We will begin by importing the data, doing some pre-filtering and combining into classes, and generating two subsets of the data: the training set, which we will use to train the random forest model, and the evaluation set, which we will use […]
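The training/evaluation split described in this excerpt can be sketched in base R as follows. This is only an illustration under stated assumptions: the built-in iris data stands in for the Adult dataset, and the 70/30 proportion and the seed are arbitrary choices, not values taken from the post.

```r
# Sketch: split a data frame into training and evaluation subsets.
# Assumption: iris stands in for the Adult dataset; 70/30 is illustrative.
set.seed(42)                                   # make the random split reproducible
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # row indices for the training set

train_set <- iris[train_idx, ]                 # rows used to fit the model
eval_set  <- iris[-train_idx, ]                # held-out rows used to evaluate it
```

Because the evaluation rows are never seen during fitting, performance measured on them gives a fairer picture of how the model generalizes.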

Classification Trees (R)

Classification trees are non-parametric methods that recursively partition the data into more “pure” nodes, based on splitting rules. See the guide on classification trees in the theory section for more information. Here, we’ll be using the rpart package in R to accomplish the classification task on the Adult dataset. We’ll begin by loading the dataset […]
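As a rough idea of what fitting a classification tree with rpart looks like, here is a minimal sketch. It assumes the kyphosis data bundled with rpart as a stand-in for the Adult dataset, and the choice of predictors is purely illustrative.

```r
library(rpart)  # recommended package that ships with standard R installations

# Assumption: kyphosis (bundled with rpart) stands in for the Adult dataset.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class")   # "class" requests a classification tree

# Predicted class labels for the rows the tree was fitted on
pred <- predict(fit, newdata = kyphosis, type = "class")

# Cross-tabulate predicted against observed labels
table(predicted = pred, observed = kyphosis$Kyphosis)
```

Predicting on the same rows used for fitting overstates accuracy; in practice the predictions would be made on a held-out evaluation set.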

Model Selection Schema

There are various model selection criteria in use for picking variables in linear regression. Some also apply to models other than linear regression. Akaike’s Information Criterion (AIC): a useful criterion for indicating the amount of information contained within variables, and for deciding whether to omit certain variables. AIC draws its justification from information theory. Coefficient […]
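To make the idea of using AIC to decide whether to omit a variable concrete, here is a small sketch using the AIC function from R's stats package. The built-in mtcars data and the particular predictors are assumptions chosen for illustration only.

```r
# Assumption: mtcars is used purely for illustration.
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)   # same model with qsec omitted

# AIC = 2k - 2 * log-likelihood, where k is the number of estimated
# parameters; the model with the lower AIC is preferred.
AIC(full, reduced)
```

If the reduced model has the lower AIC, the omitted variable did not improve the fit enough to justify the extra parameter.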

Hypothesis Testing

Hypothesis testing allows us to evaluate a hypothesis, or compare two hypotheses. I include it here as part of the theoretical framework necessary for validation of the statistical models offered. In hypothesis testing, there are two hypotheses. H0 is the null hypothesis. This is generally the hypothesis we are trying to disprove. We attempt to mount […]
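A one-sample t-test is one simple instance of testing a null hypothesis. The sketch below assumes simulated data and a null hypothesis that the population mean is zero; neither comes from the post.

```r
# Assumption: simulated data, for illustration only.
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)   # sample drawn with a true mean of 0.5

# Null hypothesis: the population mean is 0.
# Alternative (the default): the population mean is not 0.
res <- t.test(x, mu = 0)

res$statistic   # observed t statistic
res$p.value     # probability of a result at least this extreme if the null holds
```

A small p-value is read as evidence against the null hypothesis; a large one means only that the data do not contradict it.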

Classification Systems

Statistical classification involves the use of various methods and metrics to discriminate outcome variables into their correct groups using input variables. The algorithms used to do this are called classification systems, or classifiers. There are various metrics we use to gauge the performance of our classification systems. If we are referring exclusively to binary outputs, […]
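For binary outputs, the usual performance metrics are computed from the four cells of a confusion matrix. The sketch below uses made-up label vectors (hypothetical data, not results from any post) and shows a few common metrics; it is not necessarily the set the full post covers.

```r
# Hypothetical labels, for illustration only (1 = positive, 0 = negative)
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 1, 0, 0, 1)

tp <- sum(predicted == 1 & actual == 1)   # true positives
fp <- sum(predicted == 1 & actual == 0)   # false positives
tn <- sum(predicted == 0 & actual == 0)   # true negatives
fn <- sum(predicted == 0 & actual == 1)   # false negatives

accuracy    <- (tp + tn) / (tp + fp + tn + fn)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual positives detected
specificity <- tn / (tn + fp)                   # share of actual negatives detected
precision   <- tp / (tp + fp)                   # share of predicted positives that are correct
```

Accuracy alone can mislead when one class is much more common than the other, which is why sensitivity and specificity are usually reported alongside it.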