# Nested Data

Nested data is data for which a variable (or set of variables) signifies an observation as belonging to a group. We might refer to simple nesting (with 1 layer of groups) as categorical. Use of categorical data with linear models is called ANOVA. [i](Analysis of Variance)[/i]

• is the vector of dependent observations.
• is the detailing the observed values of independent variables. The first column is the mean or independent variable intercept.
• describes the parameters of the model. It is a vector of coefficients, meant to be multiplied against to get the vector of predicted values, .
• is the vector of random errors. We assume that .

ANOVA
Classical Effects Model with One Way ANOVA

I had said before that use of categorical variables within the linear regression framework can be termed as ANOVA. Let’s look at how that works. Suppose we have a categorical variable, which defines what group an observation is in. Also suppose that variable can take one of 4 possible levels, signifying 4 possible groups. Within each group, we have 2 observations.

Membership within each group is signified by a dummy variable, taking on the values of either 0 or 1. Because we are treating the model parameters as fixed but unknown, we call the Classical Effects model a Fixed Effects Model.

• is the coefficient associated with group .
• is the per-observation error associated with group , individual .

This method of defining the variable presents a problem, though. When we attempt to take the affine transformation and solve the standard regression problem, we must take the inverse of the matrix . Using the above matrix, the Affine transformation is singular, and thus not invertible.

The solution is in realizing that in order to know what of 4 groups an observation belongs to, we only need 3 dummy variables. If we treat one group as control, and all other groups as deviations from that control, we can accomplish the task.

Thus, we interpret as deviation from the control group. The matrix is now invertible, and thus we can solve for ‘s 2,3,4 as a deviation from .

An alternate, though less used parameterization would be to remove the intercept and have individual intercepts per group.

This parameterization is called the Means Model. The weakness of this parameterization is we are given no information about the significance of differences between any particular group and the control group, though it does allow simple interpretation of the means of the various groups.

Two Way ANOVA

Dimensions of ANOVA are defined by the way in which the categories are stacked. One Way ANOVA refers to a single layer of groups. Two Way ANOVA refers to two layers, nesting sets of groups within groups.

As an example, let’s think of students in a classroom. Each individual student is an observation. These observations are correlated by the fact they belong to the same class. At this level, they are uncorrelated to students in other classes. We identify individual observations with the subscript notation , denoting the ‘th group and ‘th observation within that group.

Let’s say a school has 5 classrooms, with 30 kids per class. In this case, we can make comparitive statements between classrooms using One-Way ANOVA. If we decided to include other schools, and wanted to make comparitive statements between schools, then we would need to nest the groups of classrooms per school within each school. This adds another layer, and we now call this [i]Two Way ANOVA[/i]. We identify individual observations using the subscript , denoting the ‘th top level group, the ‘th nested group, and the ‘th observation within that group.

This idea of nesting can be extended ad-infinitum.

# Workshop on knitr, LaTeX and R

The SCG is putting on a workshop on LaTeX, R, and integrating the two with knitr.

The workshop covers the structure of the LaTeX document, how to insert R code and results in LaTeX, and get an output document.

Source Code

R and Sweave Presentation

# Linear Regression in R: Abalone Dataset

This tutorial will perform linear regression on a deceptively simple dataset. The abalone dataset from UCI Machine Learning Arvhives comes with the goal of attempting to predict abalone age (through the number of rings on the shell) given various descriptive attributes of the abalone (Shell sizes, weights of whole abalone and parts of shucked abalone). In following this goal, we will attempt to predict rings using the the shell sizes (height, length, width), and weights (shell weight, shucked weight, viscera weight, whole weight). The problem associated with this dataset is that these descriptive attributes are all heavily correlated.

First we need to load the data.

aburl = 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abnames = c('sex','length','diameter','height','weight.w','weight.s','weight.v','weight.sh','rings')
abalone = read.table(aburl, header = F , sep = ',', col.names = abnames)


Then we load the libraries we’ll be using.

library(MASS)
library(rms)

##### Step 1: Verify Data Integrity

The first step of analysis is to examine the data integrity. We can’t draw conclusions on heavily suspect data.

summary(abalone)


We can tell right off that there exists a problem with the abalone height, where some values are registered as 0’s. This is not possible. We will need to investigate, so we take a look at those values.

abalone[abalone$height==0,]  We see two abalone for which the height is obviously not 0, but is not properly recorded. We will have to set these to null values, and exclude these observations from the analysis (if height turns out to be significant) abalone$height[abalone$height==0] = NA  The minimum weights are also a bit low compared to other measurements, so We should take a look at them. abalone[abalone$weight.w < .01,]


It seems these abalone are legitimately really small, so this is probably not a data entry error. Thus we cannot exclude it. Now let’s examine the structure of the data. Specifically, we’re looking at pairwise correlation.

as.matrix(cor(na.omit(abalone[,-1])))


As we can see, the data is heavily correlated. This is a problem if we attempt analysis off these raw numbers. We’ll use various methods later on to try and reduce the apparent correlation.

This is only a preliminary approach to verifying data integrity. We’ll be getting into more detail later after we do some fits.

##### Step 2: Initial Fit
abfit1 = lm(rings ~ sex + length + diameter + height + weight.w
+ weight.s + weight.v + weight.sh, data = abalone)
abfit2 = stepAIC(abfit1)
summary(abfit2)


The sex being used as the baseline in this case is female. Since male is not significantly different, and infant is the only difference, we can redefine this feature to define whether gender has been expressed (infant vs non-infant)

abalone$sex = as.character(abalone$sex)
abalone$sex[abalone$sex != 'I'] = 'K'
abalone$sex = as.factor(abalone$sex)


A bit worrying is that the AIC is picking as significant all four of the weight measures, despite that they should be linear functions of eachother.

Whole Weight = Shucked Weight + Viscera Weight + Shell Weight + Unknown mass of water/blood lost from shucking process

We should investigate this abalone weight problem, and ensure that the variables in the model are not functions of eachother.

abalone$weight.diff = abalone$weight.w -
(abalone$weight.v + abalone$weight.s + abalone$weight.sh) par(mfrow=c(1,1)) hist(abalone$weight.diff,breaks=50)


It would appear that there are some instances where the whole weight is less than the sum of the components, which does not stand up to logic. We will examine the worst offenders of these.

abalone[abalone$weight.diff < -.1,]  This is somewhat frustrating, as we know these are wrong. In cases where the shucked weight is greater than the whole weight, we might postulate that the person entering the data got them backwards. Because of this lack of faith in the measurements, and realizing that they should be a linear function of eachother, we should stick to only using one or two of the weight measurements. We’ll pick which weight measure to use. abfit2.1 = lm(rings ~ sex + diameter + height + weight.w, data = abalone) abfit2.2 = lm(rings ~ sex + diameter + height + weight.s, data = abalone) # ! abfit2.3 = lm(rings ~ sex + diameter + height + weight.v, data = abalone) abfit2.4 = lm(rings ~ sex + diameter + height + weight.sh, data = abalone)  Now let’s examine the best fit model to try and identify any potential outliers. The best model uses the shucked weight as a predictor. par(mfrow=c(2,2)) plot(abfit2.2)  We see observation 2052 as a potential outlier. We shall investigate to see why that is. abalone[2052,] summary(abalone[abalone$sex=='K',])


For comparison, we also print out summary information for known genders. length, diameter, and weight say this abalone is slightly small, but otherwise unremarkable. Height on the other hand would seem to indicate that this abalone is exceptional. When we look at the height, it would seem that there was a data entry error. 0.130 height seems believable, 1.130 does not for an abalone that is otherwise small. We will change the entered height.

abalone$height[2052] = 0.130  Also, we’ll try picking some functions of the various weights to see if we can get a more compact result. First, the geometric mean of the sub-weights. abalone$weight.mean1 = (abalone$weight.s*abalone$weight.v*abalone$weight.sh)^(1/3) abalone$weight.mean2 = (abalone$weight.w*abalone$weight.s*abalone$weight.sh*abalone$weight.v)^(1/4)
abalone$weight.norm1 = sqrt(abalone$weight.s^2 + abalone$weight.v^2 + abalone$weight.sh^2)
abalone$weight.norm2 = sqrt(abalone$weight.w^2 + abalone$weight.s^2 + abalone$weight.v^2 + abalone$weight.sh^2)  We’ll also define some other measures of size that incorporate all 3 dimensions. This allows us to describe 3 dimensions with 1 (hopefully). We pick the euclidean norm of its size, along with the geometric mean of size. We’ll pick one of these measures to use as well. abalone$size.norm = sqrt(abalone$length^2 + abalone$diameter^2 + abalone$height^2) # Norm of vectors abalone$size.mean = (abalone$length*abalone$diameter*abalone\$height)^(1/3)         # Geometric Mean


Because we’re looking at many transformations of the variables, and we won’t accept certain variables in the model at the same time, we have to make comparisons manually.

abfit3 = lm(rings ~ sex + size.norm + weight.w + weight.s + weight.v + weight.sh, data = abalone)

abfit3.1 = lm(rings ~ sex + size.norm + weight.w, data = abalone)
abfit3.2 = lm(rings ~ sex + size.norm + weight.s, data = abalone)
abfit3.3 = lm(rings ~ sex + size.norm + weight.v, data = abalone)
abfit3.4 = lm(rings ~ sex + size.norm + weight.sh, data = abalone) # best

abfit4.1 = lm(rings ~ sex + size.norm + weight.norm1, data = abalone)
abfit4.2 = lm(rings ~ sex + size.norm + weight.norm2, data = abalone)
abfit4.3 = lm(rings ~ sex + size.norm + weight.mean1, data = abalone)
abfit4.4 = lm(rings ~ sex + size.norm + weight.mean2, data = abalone)

abfit5.1 = lm(rings ~ sex + size.mean + weight.norm1, data = abalone)
abfit5.2 = lm(rings ~ sex + size.mean + weight.norm2, data = abalone)
abfit5.3 = lm(rings ~ sex + size.mean + weight.mean1, data = abalone)
abfit5.4 = lm(rings ~ sex + size.mean + weight.mean2, data = abalone)

abfit6 = lm(rings ~ sex + size.mean + weight.w + weight.s + weight.v + weight.sh, data = abalone)
abfit6.1 = lm(rings ~ sex + size.mean + weight.s + weight.v + weight.sh, data = abalone)
abfit6.2 = lm(rings ~ sex + size.mean + weight.s + weight.sh, data = abalone)
abfit6.3 = lm(rings ~ sex + size.mean + I(size.mean^2) + weight.s + weight.sh, data = abalone)


We will want to visually check the fit and validity of our model.

par(mfrow=c(2,2))
plot(abfit6.3)


Thus far, we’ve been approaching the dataset with a very linear method in mind. We have pretty much exhausted our options dealing with this problem without changing the dependent variable. Thus far we have been unsucessful in getting the QQ plot line to merge to the diagonal.

When we look at the residuals vs fitted values plot, we see a fan shape indicating that as the fitted values increase, so to do the scale of the residuals. One way to account for that is to use a log transformation of the dependent variable. Subsequent analysis will take log(rings) as the dependent variable to account for this.

abfit7 = lm(log(rings) ~ sex + size.mean + weight.s + weight.sh, data = abalone)
plot(abfit7)


We’re closing in on a final model. Here at least we have two separate measures of weight that don’t measure the same thing.

vif(abfit7)


The VIF is a bit worrying, because a VIF higher than 5 is cause for concern. + This is a highly correllated data set. There are methods of dealing with multicollinearity within a data set which involve declaring new variables as linear combinations of existing variables. One of these methods is called “Principle Components Analysis”. We won’t be using it here, though. We’re going to do the best we can by reasoning through the model.

abfit7.1 = lm(log(rings) ~ sex + size.mean + I(size.mean^2) + weight.s + weight.sh, data = abalone)
abfit7.2 = lm(log(rings) ~ sex + log(size.mean) + weight.s + weight.sh, data = abalone)
abfit7.3 = lm(log(rings) ~ sex + log(size.mean) + weight.s*weight.sh, data = abalone)
abfit7.4 = lm(log(rings) ~ sex + log(size.mean) + weight.w, data = abalone)


In going through these models, it becomes clear that 7.2 and 7.3 offer the best visually fitting model. Performing an anova on 7.2 and 7.3 results in the additional interaction term in 7.3 not being significant, so we don’t include it. We check the VIF for 7.2, and find all elements with a VIF under 6. While that is high, it’s the best we can do with this dataset without bringing in more advanced methods.

As such, our resulting model is 7.2

# R 3.1.0 Released

R 3.1.0 (Spring Dance) released!

http://cran.rstudio.com/bin/windows/base/release.htm

# Power and Sample Size for Two-sample Tests

An often used method in applied statistics is determining the sample size necessary to view statistically significant results. Given the intended power, we can calculate the required sample size. Given the intended sample size, we can calculate the resulting power. Before we go in to how this works, we need to define a few things.
Error Types

Truth
H0 H1
Test Negative
Don’t Reject
True Negative False Negative
β
Positive
Reject
False Positive
α
True Positive
Power = 1 – β

In looking at a two-sample test, what we’re actually focusing on is the difference of the sample means. Let’s say we have 2 samples.

For purposes of this demonstration, we will assume that the data in these two samples follow a normal distribution. Therefore, their sample mean follows a normal distribution. We also assume that and have the same variance. And we also assume that is independent of . Therefore,

Furthermore, with the equivariance assumption, we are assuming that .

We know the difference of population means to be . As we don’t know either population mean, we approximate that difference with sample means. The variance of the difference of sample means is simply the sum of variance of individual sample means.

Since we see the variance of the difference, we can calculate the standard error very simply.

Furthermore, we can declare to be a function of . We might choose to sample more from one group over the other because the cost of sampling from that group is cheaper. In order to get the greatest value for the money, we’re better off sampling more from the cheaper group. We set to be the ratio of , such that . Therefore, the standard error is calculated as:

Now that we have our standard error for the difference, we can proceed with calculating the sample size.

One Sided Test
In the one sided test, we are establishing whether or not a particular population’s mean is greater than the other.

Alternatively, if we were trying to establish the converse,

Two Sided Test

The only thing that changes for the two sided test is we use instead of . Therefore,

# Power and Sample size for Proportion Data

An often used method in applied statistics is determining the sample size necessary to view statistically significant results. Given the intended power, we can calculate the required sample size. Given the intended sample size, we can calculate the resulting power. Before we go in to how this works, we need to define a few things.

Error Types

Truth
H0 H1
Test Negative
Don’t Reject
True Negative False Negative
β
Positive
Reject
False Positive
α
True Positive
Power = 1 – β
• = False Positive Rate.
This is the chance of rejecting the null hypothesis , given that the null hypothesis is true.
• = False Negative Rate.
This is the chance of failing to reject the null hypothesis, given the alternative hypothesis was true.
• Power is viewed as the complement of , the false negative rate. The power of the test is the chance to reject the null hypothesis, given the null hypothesis is false. (Given the alternative hypothesis is true)

Using these error types, we can make guesses as to the sample size necessary to achieve significant results to support our alternative hypotheses. The actual calculation for power and sample size is a little different from the normally distributed data, because in proportional data the variance is a function of the proportion, rather than being independent of the mean.

Sample Size Calculation

• Case 1: One Sided Test
Given ,
Given

In this calculation we’re using . We will show later why the direction is not important, merely that we’re only considering values on one side of . Because follows a Bernoulli distribution, is a good estimator for .

Remember that in a one-sided test with , we’re going to reject if .

• Case 2: 2-sided Test
In the two-sided test, we reject if . The calculation for the 2-sided test follows very similarly to the one-sided test, however we change the to to reflect that we’re allowing values on both sides of the null hypothesis. The formula for sample size is thusly:

All else remains the same.

# Power and Sample Size for Normally Distributed Data

An often used method in applied statistics is determining the sample size necessary to view statistically significant results. Given the intended power, we can calculate the required sample size. Given the intended sample size, we can calculate the resulting power. Before we go in to how this works, we need to define a few things.

Error Types

Truth
H0 H1
Test Negative
Don’t Reject
True Negative False Negative
β
Positive
Reject
False Positive
α
True Positive
Power = 1 – β
• = False Positive Rate.
This is the chance of rejecting the null hypothesis , given that the null hypothesis is true. Note that this value does not depend on the existence of an alternative hypothesis.
• = False Negative Rate.
This is the chance of failing to reject the null hypothesis, given the alternative hypothesis was true. Note that this value does depend on the alternative hypothesis.
• Power is viewed as the complement of , the false negative rate. The value for power is . The power of the test is the probability of rejecting the null hypothesis, given the null hypothesis is false. (Given the alternative hypothesis is true)

Using these error types, we can make guesses as to the sample size necessary to achieve significant results to support our alternative hypotheses.

Sample Size Calculation

• Case 1: One Sided Test: In a one-sided test, we will only reject the null hypothesis if our sample mean falls to the side of it that we specified in advance. That means if we have a null hypothesis, , our alternative hypothesis is that , (exclusive) or .
Given

We’re using , though this calculation will work for as well. We’re only concerned here that this is a one-sided test.

Due to the fact and can be interchanged with the quantity in the denominator always remaining positive, we can use this calculation for the case where as well.

• Case 2: Two-sided Test: In a two-sided test, we will accept sample means that fall to either side of the null hypothesis as evidence against the null hypothesis. By this, if we have a null hypothesis , our alternative hypothesis is that , and that . That means that we will reject the null hypothesis if the sample mean is sufficiently far away from the null hypothesis, in either direction.

In the event of a two-sided test, we are required to correct the value at which we reject the null hypothesis, to reflect that we are accepting values on both sides of the null hypothesis. This shifts out the value correspondingly.

# Pretty Graphs with ggplot2 (R)

The native graphics options in R are very powerful and useful for generating output. However, the packages available for R extend your capability far beyond what is natively available in R. The most commonly used package for non-native graphics is ggplot2.

• Getting Started with qplot – A brief introduction to qplot(), the training wheels plotting function of ggplot2. The syntax of qplot() is similar to the familiar plot() function available in base R.
• ggplot Symposeum – This hour-long presentation was given at Vanderbilt University. Some of the code used has since been deprecated, however this should give you an idea of the power of ggplot2. The code used in the presentation is here.

# Proc SQL – SAS

SQL stands for “Structured Query Language”. It’s a language created for interacting with relational databases. SAS implements a version of this inside proc sql. This allows us to leverage the power of SQL to solve our problems.

proc sql; create table out.projdata2 as select a,b,c,d, /* vars from source */ a^2 as aa, /* New vars */ b^2 as bb, mean(c) as cbar /* Summary statistics */ from out.projdata; /* Input Dataset */ quit;

SQL is guided by an ANSI standard, but it is up to the implementers to make their respective syntaxes match ANSI standard or not. The SQL syntax in proc SQL differs somewhat from other implementations of SQL.

• Variables – Variables, or columns in the data, are imported using the select function. This can occur a couple ways. First, if we want to select all the variables in a dataset, then we can use the asterisk (*). Else, we call variables by name. Case doesn’t matter, but the first instance a variable name is used will cause SAS to remember that case for all output of the variable. So if you want the output to look pretty, use the most appropriate case when you first reference the variable.
We can declare new variables by setting up some function of other variables, and giving them a name using as. We can do more than just line-level functions. Using SQL, we can make summary statistics, by-group summary statistics, indices, and more. We’ll go into by-group processing later on.
• Joins – Joins are methods of merging two datasets. Typically speaking, we use a key, being a variable or set of variables that appear in both datasets on which we will merge. Joins are an integral feature of SQL, and are further separated by the ways in which they can be done. ANSI Standards specify 5 possible join types: inner, outer left, outer right, full outer, and cross.
• Inner Joins allow us to capture only rows that appear in both datasets, based on the key.
• Outer Left Joins keep all rows that were on the left dataset in the join, and only update those rows for which the key was in the right dataset as well. If a given row was in the left but not in the right, all the added columns will hold null values. Outer right joins work the same way, just keeping the right table intact instead of the left.
• Full Outer Joins keep all rows from both datasets. For a data row in 1 set that doesn’t have a matching row in the other set, the missing data is assigned null values.
• Cross Joins are an interesting beast. They do not use a key, rather they create a Kronecker Cross-Product of the two datasets, multiplying every row in one dataset with every row in the other dataset. For instance, a 10 row set cross-joined with a 20 row set would produce an output set of 200 rows. The functionality is there, but I’ve never had cause to use this type of join.