R is a powerful, free statistical programming language. It runs on a wide variety of operating systems and architectures, and has a large ecosystem of add-on packages made possible by its open-source nature and simple language structure. New ideas can be prototyped and released very quickly, which keeps R at the forefront of innovation in statistics.
To use R, we should begin by learning how to program in general. Statistical programming is about the manipulation of data (stored in variables), and the methods we use to manipulate that data. Thus, we will separate what we need to learn into those two groups.
Variables are logical names for pieces of data. Just like the use of variables in algebra, using a variable name to contain a value allows us to abstract or generalize away from the data. Variables can be given any name made of letters, digits, periods, and underscores that starts with a letter, provided it is not one of R's reserved words (such as TRUE, if, or function). Reusing the name of an existing function is allowed, but it invites confusion and is best avoided. I will separate variables into two types: simple and composite.
Simple Variables contain a single value. We can further separate these variables by the types of values they contain.
- Boolean Variables (called logicals in R) are the simplest type of variables. Conceptually they hold a single bit, and indicate a binary response: (0,1), (no,yes), (false,true) and so on. They can be declared as such:
a = TRUE
b = FALSE
- Integers are whole numbers with no fractional part (no decimals, no parts of numbers). Integers are used for counting and iterating. In R, a bare number such as 1 is stored as a numeric, not an integer. To declare a single integer, append the suffix L (as in 1L) or use as.integer().
- Numerics are numbers stored in floating point, so they can hold fractional values. They can still hold whole numbers, though. They can be declared as such:
a = 1
b = 3.342
- Characters are variables which contain non-numeric data. Most programming languages make a distinction between characters and strings, but R does not. I should specify that this data is simply stored as non-numeric. That doesn't mean it can't contain numeric characters. They are assigned using quotation marks. R does not care if you use single or double quote marks, but within a single string the opening and closing marks must match.
a = "a"
b = "6"
- Strings are variables that contain one or more characters. They are assigned identically to characters (rather, characters in R are just strings).
a = "abcd6"
b = 'pretty fly for a white guy'
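To check which of the simple types above a value actually has, R's class() function can be asked directly. The following is a minimal sketch; the variable names are made up for illustration.

```r
# Inspect how R classifies a few single values
x_logical = TRUE
x_numeric = 1        # a bare number is stored as a numeric
x_integer = 1L       # the L suffix asks for an integer
x_string  = "6"      # quoted digits are stored as text, not as a number

class(x_logical)     # "logical"
class(x_numeric)     # "numeric"
class(x_integer)     # "integer"
class(x_string)      # "character"

# Even a "single" value is a vector of length one
length(x_numeric)    # 1
```

Running class() on a variable is a quick way to find out whether a number was stored as an integer or as a numeric.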
Composite Variables are variables that contain more than just a single data value. In R, we can create progressively more complex and layered variables ad infinitum, so it’s a bit difficult to explain complex data structures in total. I will mention a few types and show examples of how to layer them.
- Vectors are single-dimensional composite data structures. They contain multiple values in order. All values in a vector share one type; if you mix types, R coerces everything to the most general one (mixing numbers and strings yields a vector of strings). Vectors can be used in matrix algebra, bound to matrices, or bound together to form data frames.
a = 1:6 # creates a vector of integers: 1,2,3,4,5,6
a = c(1:6) # same result; c() keeps the values as integers
a = c(1,2,3,4,5,6) # creates a vector of numerics: 1,2,3,4,5,6
a = c("a","b","6","d","feg") # creates a vector of strings
a = c(1,2,3,"no") # the numbers are coerced, giving a vector of strings
a = rnorm(1000,0,1) # 1000 draws from standard normal distribution
We can call or manipulate explicit parts of the vector using bracket notation. Indexing in R starts with 1 at the first element (rather than 0).
a = b
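We can also use conditional matching in the brackets. It's important to realize that whatever is in the brackets is itself a vector: either a vector of positions, or a logical vector (the result of a condition) marking which elements to keep.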
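A short sketch of bracket notation and conditional matching on a vector follows. The values are chosen purely for illustration.

```r
a = c(10, 20, 30, 40, 50)   # illustrative values

a[1]            # first element: 10
a[2:4]          # elements 2 through 4: 20 30 40
a[c(1, 5)]      # first and last elements: 10 50

a > 25          # logical vector: FALSE FALSE TRUE TRUE TRUE
which(a > 25)   # the matching positions: 3 4 5
a[a > 25]       # elements greater than 25: 30 40 50

a[2] = 99       # replace the second element
a               # 10 99 30 40 50
```

The condition a > 25 produces a logical vector, and which() converts it to positions; both forms can be used inside the brackets.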
- Matrices are two-dimensional composite data structures. They can be created by binding together a number of equal-length vectors using the rbind() or cbind() functions, or created directly with the matrix() function by feeding in a long vector and declaring the number of rows or columns. Matrices can be used in matrix algebra.
a = matrix(c(1:20),byrow=TRUE,ncol=5)
a = rbind(c(1:5),c(6:10),c(11:15),c(16:20))
Note that elements of the matrix can be accessed with bracket notation.
a[i,j] # Single value in i'th row, j'th column
a[i,] # Vector of values from i'th row
      # - Single observation about a subject
a[,j] # Vector of values from j'th column
      # - All observations of a given variable
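To make the matrix notation concrete, here is a small sketch that builds the same 4x5 matrix both ways and pulls out a value, a row, and a column. The names m1 and m2 are illustrative.

```r
# The same 4x5 matrix built two ways
m1 = matrix(1:20, byrow = TRUE, ncol = 5)
m2 = rbind(1:5, 6:10, 11:15, 16:20)

dim(m1)         # 4 5
all(m1 == m2)   # TRUE

m1[2, 3]        # row 2, column 3: 8
m1[2, ]         # second row: 6 7 8 9 10
m1[, 3]         # third column: 3 8 13 18
```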
- Data Frames are similar to matrices in that they contain vectors of matched length. The difference is that in data frames, the columns are named and can be accessed by those names, and each column can hold a different type. Data frames can be assembled by putting together vectors (in this case, the default column names are the vector names), or by coercing a matrix into a data frame with as.data.frame() (in which case the default names for the columns of an unnamed matrix are "V1", "V2", and so on).
b = c(1:10) # 1 to 10
c = c(10:1) # 10 to 1
d = rnorm(10,0,1) # 10 draws from standard normal distribution
a = data.frame(b,c,d) # bind vectors of equal length into data frame
An advantage of data frames is that we can manipulate or insert data by calling specific names in the data frame, or using the matrix notation in brackets previously discussed. For example, if we want to add another vector to the data frame a:
a$e = rexp(10,1) # 10 draws from exponential distribution
a["f"] = rexp(10,1) # also 10 draws from exponential distribution
We can access the column vectors of a data frame in several ways. We can call them explicitly by name using the $ notation, we can call them again explicitly by name using the bracket notation, or we can call them by position using the matrix notation. The $ notation is easiest, but the bracket notation will help you deal with poorly named variables that were previously declared.
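The access styles just described can be compared side by side. This is a minimal sketch on a made-up data frame; the column names, including the awkward "my var", are illustrative only.

```r
# A small illustrative data frame (column names are made up)
df = data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

df$y          # by name with $
df[["y"]]     # by name with double brackets: also a plain vector
df[, 2]       # by position with matrix notation
df["y"]       # single brackets with a name keep the data frame wrapper

# Bracket notation handles names that $ cannot use directly
df[["my var"]] = c("a", "b", "c")
names(df)     # "x" "y" "my var"
```

Note that df["y"] returns a one-column data frame, while the other three forms return the column as a plain vector.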
The default output of the read.table() function is a data frame. If the file has no header row to supply the column names, we can specify them manually with the names() function. Alternatively, we can use names() to read the current column names of the data frame.
data = read.table("http://scg.sdsu.edu/wp-content/uploads/2013/09/brain_body.txt",skip=13,header=FALSE)
names(data) = c("Index","Brain_Weight","Body_Weight")
Here data is a data frame created by the read.table() function. The names() function expects a vector of strings the same length as the number of columns. We don't actually need the Index variable, but it is included with the table, so we have to import it and can simply ignore it afterwards.
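The same read-then-rename pattern can be tried without a network connection by handing read.table() its input through the text argument. In this sketch the three data rows are invented for illustration and do not come from the file above.

```r
# Inline stand-in for a file; the three rows are invented for illustration
raw = "1 3.4 44.5
2 0.5 15.5
3 1.4 8.1"

d = read.table(text = raw, header = FALSE)
names(d)      # default names: "V1" "V2" "V3"

names(d) = c("Index", "Brain_Weight", "Body_Weight")
d = d[, c("Brain_Weight", "Body_Weight")]   # drop the unneeded Index column
names(d)      # "Brain_Weight" "Body_Weight"
```

Dropping the Index column is optional; leaving it in place and never referring to it works just as well.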
- Models are the structures generated by the fitting methods we use. They too can be stored and manipulated as variables. Parts of the model can be accessed with $ notation. If you are ever confused about the structure of the object, the str() function will tell you how to access any part of the object.
a = lm(Brain_Weight ~ Body_Weight, data = data)
a$residuals
str(a)
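As a self-contained sketch of poking around inside a model object, the following fits a line to a tiny invented data set (the numbers are made up, and the names toy and fit are illustrative) and then reads a few components.

```r
# Invented data, for illustration only
toy = data.frame(
  Body_Weight  = c(1, 2, 3, 4, 5),
  Brain_Weight = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit = lm(Brain_Weight ~ Body_Weight, data = toy)

names(fit)              # the components available through $
fit$coefficients        # intercept and slope
coef(fit)               # the same values through an accessor function
length(fit$residuals)   # one residual per observation: 5
```

names() gives a compact list of what can follow the $, while str() shows the full nested structure.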
Coercion refers to R's ability to convert variables of a given type into variables of another type. For example, if you have variables of type integer, and a function expects numerics, then the integers can be easily coerced into numerics without any extra effort on the part of the programmer. Functions that expect data frames can generally be supplied with matrices (and vice versa).
If you encounter an error in running code (and you will) where an object isn't in the correct variable form, and isn't being coerced properly, then you can manually coerce the object using the appropriate as.*() function (as.numeric(), as.character(), as.matrix(), and so on).
a = 1:5 # a is now an integer vector
a = as.numeric(a) # a is now a numeric vector
a = as.character(a) # a is now a character vector
a = as.matrix(a) # a is now a character matrix, 5x1
a = as.numeric(a) # a is now a numeric vector (as.numeric() drops the dimensions)
a = t(a) # a is now a numeric matrix, 1x5
# t(a) takes transpose of a, treating the vector as a column
a = as.vector(a) # a is now a numeric vector
a = data.frame(a) # a is now a single column dataframe
a = a$a # a is back to the numeric vector
a = as.integer(a) # a is back to the integer vector
When coercing into matrices and dataframes, R will assume that vectors are single column matrices. When there is any doubt as to your intent, R will attempt to do what makes sense. This isn’t always what you had in mind, so be cautious.
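A few cases where coercion may not do what you had in mind are sketched here. The values are illustrative; suppressWarnings() is used only to keep the output quiet, since as.numeric() warns when it cannot parse a string.

```r
as.integer(3.9)       # truncates rather than rounds: 3

x = c("1", "2", "three")
y = suppressWarnings(as.numeric(x))   # "three" cannot be parsed as a number
y                     # 1 2 NA

dim(as.matrix(1:5))   # a vector becomes a single column: 5 1
```

Checking the result with class(), dim(), or str() after a manual coercion is a cheap way to confirm that R did what was intended.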