# Nested Data

Nested data is data in which a variable (or set of variables) marks each observation as belonging to a group. We might refer to simple nesting (with one layer of groups) as categorical. The use of categorical data with linear models is called ANOVA (Analysis of Variance). To illustrate, let’s start with the linear model in […]

# Logistic Regression

Logistic Regression is a method of classification using the regression framework. In logistic regression, the output (or target, or dependent) variable is binary, taking one of two values (conventionally coded 0 and 1). The predictor (or input, or independent) variables are not limited in this way, and can take any value. Logistic regression is based on the logistic […]
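To make the role of the logistic function concrete, here is a minimal Python sketch. It assumes the usual form 1 / (1 + e^(-z)) applied to a linear combination of the predictors. The names `logistic` and `predict_class`, the example weights, and the 0.5 cut-off are illustrative assumptions, not taken from the original post.

```python
import math


def logistic(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(x, weights, intercept, threshold=0.5):
    """Return (probability, class) for one observation.

    The weights, intercept, and threshold are hypothetical; in practice
    they would come from fitting the model to data.
    """
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    p = logistic(z)
    return p, int(p >= threshold)


# Hypothetical coefficients, for illustration only.
probability, label = predict_class([1.0, 2.0], weights=[0.5, -0.25], intercept=0.0)
```

The predictors can take any real value, but the output of `logistic` always stays between 0 and 1, which is what lets it be read as a class probability.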

# Logistic Regression (R)

Logistic Regression is a type of classification model. In classification models, we attempt to predict the outcome of categorical dependent variables, using one or more independent variables. The independent variables can be either categorical or numerical. Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the dependent […]

# Bagging

Ensemble methods combine multiple classifiers into a single output. Some ensemble methods may combine different types of classifiers, but the ones we will focus on here combine multiple iterations of the same type of classifier. These methods belong to a family of ensemble methods called “Perturb and Combine”.

Perturb and Combine

Some methods of classification […]
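As a rough illustration of perturbing and combining, the sketch below bags a very simple classifier in Python. It assumes bootstrap resampling for the "perturb" step and a majority vote for the "combine" step. The base learner (`fit_stump`, a single threshold on one feature), the toy data, and all function names are hypothetical choices made for this example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Perturb: draw len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Hypothetical base classifier: the best single threshold on a 1-D feature."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(data, n_models=25, seed=0):
    """Train the same type of classifier on many bootstrap samples."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine: majority vote over the individual classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy (feature, label) pairs, made up for illustration.
toy_data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
            (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = bagging_fit(toy_data)
```

Each stump sees a slightly different sample, so individual thresholds vary; the vote in `bagging_predict` averages out that variation.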

# Random Forests (R)

We will apply the random forest method to the Adult dataset here. We will begin by importing the data, doing some pre-filtering and combining into classes, and generating two subsets of the data: the training set, which we will use to train the random forest model, and the evaluation set, which we will use […]

# Classification Trees (R)

Classification trees are non-parametric methods to recursively partition the data into more “pure” nodes, based on splitting rules. See the guide on classification trees in the theory section for more information. Here, we’ll be using the rpart package in R to accomplish the classification task on the Adult dataset. We’ll begin by loading the dataset […]

This tutorial will guide you through a moderately complex data cleaning process. Data mining and data analysis are art forms, and a lot of the steps I take are arbitrary, but I will lead you through my reasoning on them. Hopefully from that you will be able to see the logic and apply it in your […]

# Decision Trees

Contents: Introduction to Tree Methods; Terminology; CART Methodology; Grow a Large Initial Tree; Binary Questions; Goodness of Split Criterion; Goodness of Split Measure; Pruning the Tree; Cost Complexity Measure; Tree Size Selection; Test Sample Method; Cross-Validation Method; v-Fold Cross-Validation.

Introduction to Tree Methods

Tree methods are supervised learning methods. This means that there is a […]
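The excerpt only names a goodness of split measure without defining it, so the Python sketch below is an assumption: it uses the Gini index as the impurity function and scores a binary split by the decrease in impurity, i(t) - p_L i(t_L) - p_R i(t_R), a common choice in CART. The function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 0 when the node is pure, larger when mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def goodness_of_split(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    p_left = len(left) / n
    p_right = len(right) / n
    return gini(parent) - p_left * gini(left) - p_right * gini(right)


# A split that separates the two classes perfectly removes all impurity.
perfect = goodness_of_split([0, 0, 1, 1], left=[0, 0], right=[1, 1])
# A split that leaves both children as mixed as the parent gains nothing.
useless = goodness_of_split([0, 0, 1, 1], left=[0, 1], right=[0, 1])
```

When growing the tree, the binary question with the largest value of this measure would be chosen at each node.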

# Artificial Neural Network

Artificial Neural Networks are methods of classification and/or regression meant to emulate our belief about how a brain or nervous system functions. There exists a network of nodes, or neurons, each of which performs a calculation on various input values. If the end value matches some condition, the neuron fires. Network topology refers to the structure of […]
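A single neuron of this kind can be sketched in a few lines of Python. The sketch assumes the calculation is a weighted sum of the inputs plus a bias, and that the firing condition is that sum reaching a threshold; other calculations and conditions are possible. The name `neuron_fires` and the example weights are hypothetical.

```python
def neuron_fires(inputs, weights, bias, threshold=0.0):
    """Return True if the neuron's computed value meets the firing condition.

    Assumed calculation: bias + sum of weight * input.
    Assumed condition: the result is at least `threshold`.
    """
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation >= threshold


# Hypothetical weights that make the neuron behave like a logical AND:
# it fires only when both inputs are 1.
and_weights = [1.0, 1.0]
and_bias = -1.5
both_on = neuron_fires([1, 1], and_weights, and_bias)
one_on = neuron_fires([1, 0], and_weights, and_bias)
```

A network connects many such neurons, feeding the outputs of some as inputs to others.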

# Classification Systems

Statistical classification involves the use of various methods and metrics to sort observations into their correct outcome groups using input variables. The algorithms used to do this are called classification systems, or classifiers. There are various metrics we use to gauge the performance of our classification systems. If we are referring exclusively to binary outputs, […]
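For binary outputs, a few widely used performance metrics can be computed from the counts of correct and incorrect predictions. The excerpt does not say which metrics the post covers, so the selection below (accuracy, sensitivity, specificity, precision) is an assumption, as are the function names and the toy labels.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels coded 0 and 1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Avoid a ZeroDivisionError when a class never occurs."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Return a dict of common metrics for a binary classifier."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe_ratio(tp, tp + fn),  # true positive rate
        "specificity": safe_ratio(tn, tn + fp),  # true negative rate
        "precision": safe_ratio(tp, tp + fp),
    }


# Made-up labels, for illustration only.
metrics = binary_metrics(y_true=[1, 1, 0, 0, 1, 0], y_pred=[1, 0, 0, 1, 1, 0])
```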