
R Notes

The following links provide introductions to the use of R:
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
http://cran.r-project.org/doc/contrib/usingR.pdf
Other contributed books about R can be found on the CRAN site. Use the Contributed link under Documentation on the left side of the CRAN web page. Additional notes are provided below.

The S language was developed at Bell Labs as a high-level computer language for statistical computations and graphics. It has some similarities with Matlab, but it also has structures that Matlab does not have, such as data frames, that are natural data structures for statistical models. There are two implementations of this language currently available: a commercial product, S-Plus, and a freely available open-source product, R. R is available at
http://cran.r-project.org
These implementations are mostly, but not completely, compatible.

Note: in the examples below, the R prompt, > , is included, but this would not be typed on the command line. Lines that do not begin with this prompt represent what is returned by R.

  1. R is started by entering
    R
    at a shell prompt and it is ended by entering
    q()
    This will generate a query from R whether to save the workspace image. Enter n (details about workspace images are given on page 6 of R-intro).

  2. The Workspace. The Workspace contains all the objects created or loaded during an R session. These objects exist only in the computer's memory, not on the physical hard drive, and will disappear when R is exited. R offers a choice to the user when exiting: save the workspace or do not save it. If the Workspace is not saved, all objects created during the session will be lost. That's no problem if you are using R only as a mathematical or statistical calculator. If you are performing an analysis, but must exit before completing it, then you don't want to lose what you have already done. There is an alternative that I recommend instead of saving the workspace: write the commands you wish to enter into a text file and then copy/paste from the edit window into the R console. Even though this may seem like extra work, it has three advantages: 1) any mistakes can be corrected immediately with the editor; 2) you won't have to remember what the objects in a workspace represent since the file contains the commands that created those objects; 3) if you need to perform a similar analysis at a later time, you can just copy the original file to a new name and modify/extend the commands in the file to complete the later analysis. You must use a plain text editor to edit command files, not a document editor like Word.
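
    As a sketch of this workflow, the commands in a text file can also be run all at once with the function source() instead of copy/paste. The file name and its contents are hypothetical, and the file is written to a temporary directory only so that the example is self-contained.

    ```r
    # hypothetical command file, written to a temporary location for illustration
    cmdfile = file.path(tempdir(), "analysis1.R")
    writeLines(c("x = seq(2,20,length=5)",
                 "s = sum(x)"), cmdfile)
    # run every command in the file; the objects x and s are created in the workspace
    source(cmdfile)
    s
    ```

    In practice the file would be created and edited with a plain text editor rather than with writeLines().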

  3. Help is obtained by entering
    help.start()
    at the prompt. This will start a browser with the entry help page.
  4. Assignment is performed with the character = (the operator <- does the same thing). The value of an assignment is not automatically echoed to the terminal. Lines with no assignment do result in the value of the expression being echoed to the terminal.
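
    A small illustration of the echo rule, with arbitrary object names a and b:

    ```r
    a = 3 + 4      # assignment: nothing is printed
    a              # no assignment: the value 7 is printed
    (b = a * 2)    # parentheses around an assignment also print the value
    ```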

  5. Sequences of integers can be generated by the colon expression,
    > x = 2:20
    > y = 15:1
    
    More general sequences can be generated with the seq() function. These operations produce vectors. Some examples:
    > seq(5)
    [1] 1 2 3 4 5
    > x = seq(2,20,length=5)
    > x
    [1]  2.0  6.5 11.0 15.5 20.0
    > y = seq(5,18,by=3)
    > y
    [1]  5  8 11 14 17
    
    The function c can be used to combine different vectors into a single vector.
    > c(x,y)
     [1]  2.0  6.5 11.0 15.5 20.0  5.0  8.0 11.0 14.0 17.0
    
    All vectors have a length, which can be obtained by the function length():
    > length(c(x,y))
    [1] 10
    

  6. Vectors in R can have elements that are numeric, logical, or strings, but all elements of a vector must be the same type. A useful function for creating strings is paste(). This function combines its arguments into strings. If all arguments have length 1, then the result is a single string. If all arguments are vectors with the same length, then the pasting is done element-wise and the result is a vector with the same length as the arguments. However, if some arguments are vectors with length greater than 1, and the others all have length 1, then the other arguments are replicated to have the same length and then pasted together element-wise. Numeric arguments are coerced to strings before pasting. Floating point values usually need to be rounded to control the number of decimal digits that are used. The default separator between arguments is a single space, but a different separator can be specified with the argument, sep=.
    > s = sum(x)
    > paste("Sum of x =",s)
    [1] "Sum of x = 55"
    > paste(x,y,sep=",")
    [1] "2,5"     "6.5,8"   "11,11"   "15.5,14" "20,17"
    > paste("X",seq(x),sep="")
    [1] "X1" "X2" "X3" "X4" "X5"
    
    Note that the last example uses the expression seq(x). If the argument to seq() is a vector with more than one element, then this expression is equivalent to seq(length(x)). A single number n is treated differently: seq(n) returns 1:n.
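
    The remark about rounding floating point values before pasting can be sketched as follows. The value 22/7 and the names m and v are arbitrary choices for illustration. The base R function seq_along() is shown as an alternative to seq() that always counts positions, even when the vector has length 1.

    ```r
    m = 22/7
    paste("Mean =", m)                # prints many decimal digits
    paste("Mean =", round(m, 2))      # "Mean = 3.14"
    v = c(10, 20, 30)
    paste("V", seq_along(v), sep="")  # "V1" "V2" "V3"
    seq(7)                            # 1 2 3 4 5 6 7
    seq_along(7)                      # 1, since the vector 7 has length 1
    ```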

  7. Vectors can have names, which are useful for printing and for referencing particular elements of a vector. The function names() returns the names of a vector and can also be used to assign names to a vector.
    > names(x) = paste("X",seq(x),sep="")
    > x
      X1   X2   X3   X4   X5 
     2.0  6.5 11.0 15.5 20.0
    
    Elements of a vector are referenced by the function []. Arguments can be a vector of indices that refer to specific positions within the vector:
    > x[2:4]
      X2   X3   X4 
     6.5 11.0 15.5 
    > x[c(2,5)]
      X2   X5 
     6.5 20.0
    
    Elements can be referenced by their names or by a logical vector
    > x[c("X3","X4")]
      X3   X4 
    11.0 15.5
    > xl = x > 10
    > xl
       X1    X2    X3    X4    X5 
    FALSE FALSE  TRUE  TRUE  TRUE
    > x[xl]
      X3   X4   X5 
    11.0 15.5 20.0
    
    The length of the referencing vector can be larger than the length of the vector that is being referenced as long as the referencing vector is either a vector of indices or names.
    > ndx = rep(seq(x),2)
    > ndx
     [1] 1 2 3 4 5 1 2 3 4 5
    > x[ndx]
      X1   X2   X3   X4   X5   X1   X2   X3   X4   X5
     2.0  6.5 11.0 15.5 20.0  2.0  6.5 11.0 15.5 20.0
    
    This is useful for table lookups. Suppose for example that Gender is a vector of elements that are either Male or Female:
    > Gender
    [1] "Male"   "Male"   "Female" "Male"   "Female"
    
    and Gcol is a vector of two colors whose names are the two unique elements of Gender
    > Gcol = c("blue","red")
    > names(Gcol) = c("Male","Female")
    > Gcol
      Male Female 
    "blue"  "red"
    > GenderCol = Gcol[Gender]
    > GenderCol
      Male   Male Female   Male Female 
    "blue" "blue"  "red" "blue"  "red"
    
    This will be useful for plotting data.

  8. R supports matrices and arrays of arbitrary dimensions. These can be created with the matrix and array functions. Arrays and matrices are stored internally in column-major order. For example,
    X = 1:10
    
    assigns to the object X the vector consisting of the integers 1 to 10.
    M = matrix(X,nrow=5)
    
    puts the entries of X into a matrix named M that has 5 rows and 2 columns. The first column of M contains the first 5 elements of X and the second column of M contains the remaining 5 elements. If a vector does not fit exactly into the dimensions of the matrix, then a warning is returned.
    > M
         [,1] [,2]
    [1,]    1    6
    [2,]    2    7
    [3,]    3    8
    [4,]    4    9
    [5,]    5   10
    
    The dimensions of a matrix are obtained by the function dim(), which returns the number of rows and number of columns in a vector of length 2:
    > dim(M)
    [1] 5 2
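
    The following sketch shows two things mentioned above but not illustrated: filling a matrix by rows, and creating an array with more than two dimensions. The object names Mr and A are arbitrary.

    ```r
    X = 1:10
    # byrow=TRUE fills the matrix one row at a time instead of one column at a time
    Mr = matrix(X, nrow=5, byrow=TRUE)
    Mr[1,]          # first row: 1 2
    # a 3-dimensional array with dimensions 2 x 3 x 4, filled in column-major order
    A = array(1:24, dim=c(2,3,4))
    dim(A)          # 2 3 4
    A[2,1,1]        # 2: the first index varies fastest
    A[1,2,1]        # 3
    ```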
    

  9. Elements of matrices and arrays are referenced using [] but with the number of arguments equal to the number of dimensions. A matrix has two dimensions, so M[2,1] refers to the element in row 2 and column 1.
    > X = matrix(runif(100),nrow=20)
    > X[2:5,2:4]
                [,1]      [,2]      [,3]
    [1,] 0.731622617 0.6578677 0.7446229
    [2,] 0.023472598 0.2111300 0.7775343
    [3,] 0.001858455 0.2887734 0.8103568
    [4,] 0.269611100 0.7527248 0.2127048
    
    Note: the function runif(n) returns a vector of n random numbers between 0 and 1. If one of the arguments to [,] is empty, then all elements of that dimension are returned. So X[2:4,] gives all columns of rows 2,3,4 and so is a matrix with 3 rows and the same number of columns as X.
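
    Because runif() gives different values on every run, the same indexing rules are easier to check on a small matrix with known entries. This is a minimal sketch; the argument drop=FALSE is a standard option of [] that is not discussed above.

    ```r
    M = matrix(1:10, nrow=5)
    M[2,1]            # single element: 2
    M[2:4,]           # rows 2,3,4 with all columns: a 3 x 2 matrix
    dim(M[2:4,])      # 3 2
    M[,2]             # all rows of column 2, returned as a vector: 6 7 8 9 10
    M[,2,drop=FALSE]  # drop=FALSE keeps the result as a 5 x 1 matrix
    ```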

  10. Lists. A list is a structure whose components can be any type of object of any length. Lists can be created by the list function, and the components of a list can be accessed by appending a $ to the name of the list object followed by the name of the component. Components of a list also can be accessed by position using the [[]] function. The dimension names of a matrix or array are a list with components that are the vectors of names for the respective dimensions. The example below ends by using the row name Texas to extract one row of the matrix state.x77.
    > X = seq(20)/2
    > Y = 2+6*X + rnorm(length(X),0,.5)
    > Z = matrix(runif(9),3,3)
    > All.data = list(Var1=X,Var2=Y,Zmat=Z)
    > names(All.data)
    [1] "Var1" "Var2" "Zmat"
    > All.data$Var1[1:5]
    [1] 0.5 1.0 1.5 2.0 2.5
    > data(state)
    > state.x77["Texas",]
    Population     Income Illiteracy   Life Exp     Murder    HS Grad
       12237.0     4188.0        2.2       70.9       12.2       47.4
          Frost       Area 
           35.0   262134.0
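
    Access by position with [[]] and the list structure of dimension names can be sketched as follows. The list L and its contents are made up for illustration; state.x77 is the matrix used above.

    ```r
    L = list(Var1=c(0.5,1.0,1.5), Zmat=matrix(1:4,2,2))
    L[[1]]            # first component by position, same as L$Var1
    L[["Zmat"]][2,1]  # component by name, then matrix indexing: 2
    # the dimnames of a matrix are a list with one component per dimension
    data(state)
    dn = dimnames(state.x77)
    is.list(dn)       # TRUE
    length(dn)        # 2
    dn[[2]][1:3]      # first three column names
    ```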
    

  11. The dimension names of a matrix can be set or accessed by the function dimnames(). For example, the row names for state.x77 are given by
    dimnames(state.x77)[[1]]
    and the column names are given by
    dimnames(state.x77)[[2]]
    These also can be used to set the dimension names of a matrix. For example, instead of using the full state names for this matrix, suppose we wanted to use just the 2-letter abbreviations:
    > StateData = state.x77
    > dimnames(StateData)[[1]] = state.abb
    

  12. Example. Suppose we wanted to find out which states have higher Illiteracy rates than Texas. We can do this by creating a logical vector that indicates which elements of the Illiteracy column are greater than the Illiteracy rate for Texas. Then that vector can be used to extract the names of states with higher Illiteracy rates.
    > txill = state.x77["Texas","Illiteracy"]
    > highIll = state.x77[,"Illiteracy"] > txill
    > state.name[highIll]
    [1] "Louisiana"      "Mississippi"    "South Carolina"
    

  13. Matrix Operations. Matrix-matrix multiplication can be performed only when the two matrices are conformable, that is, their inner dimensions are the same. For example, if A is $n\times r$ and B is $r\times m$, then matrix-matrix multiplication of A and B is defined and results in a matrix C whose dimensions are $n\times m$. Elementwise multiplication of two matrices can be performed when both dimensions of the two matrices are the same. If for example D,E are $n\times m$ matrices, then
    F = D*E
    
    results in an $n\times m$ matrix F whose elements are

    \begin{displaymath}
F[i,j] = D[i,j]*E[i,j],\ 1\le i\le n,\ 1\le j\le m.
\end{displaymath}

    These two different types of multiplication operations must be differentiated by using different symbols, since both types would be possible if the matrices have the same dimensions. Matrix-matrix multiplication is denoted by $A\%*\%B$ and elementwise multiplication is denoted by $A*B$. The same situation occurs if A and B are both vectors that have the same length n. In that case, $A\%*\%B$ represents the dot product of these vectors,

    \begin{displaymath}
A\%*\%B = \sum_{i=1}^n A_iB_i.
\end{displaymath}

    Note that mathematically this result is a scalar; R returns it as a matrix with 1 row and 1 column. $A*B$ represents elementwise multiplication. The result is a vector C with $C[i] = A[i]*B[i]$.
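
    A minimal sketch of the two kinds of multiplication, using small matrices and vectors with made-up entries:

    ```r
    A = matrix(1:6, nrow=2)      # 2 x 3
    B = matrix(1:12, nrow=3)     # 3 x 4
    C = A %*% B                  # matrix-matrix product, 2 x 4
    dim(C)
    D = matrix(1:6, nrow=2)
    E = matrix(6:1, nrow=2)
    D * E                        # elementwise product, 2 x 3
    u = c(1,2,3)
    v = c(4,5,6)
    u %*% v                      # dot product 32, returned as a 1 x 1 matrix
    sum(u * v)                   # the same value as an ordinary number
    ```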

  14. Factors. A factor is a special type of character vector that is used to represent categorical variables. This structure is especially useful in statistical models such as ANOVA or general linear models. Associated with a factor variable are its levels, the set of unique character values in the vector. Although print methods for a factor will by default print a factor as a character vector, it is stored internally using integer positions of the values corresponding to the levels.
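
    The relationship between a factor, its levels, and its internal integer codes can be sketched as follows, using the Gender values from the table lookup example above. The names g and f are arbitrary.

    ```r
    g = c("Male","Male","Female","Male","Female")
    f = factor(g)
    levels(f)        # "Female" "Male": sorted alphabetically by default
    as.integer(f)    # 2 2 1 2 1: positions of each value within the levels
    table(f)         # counts for each level
    ```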

  15. A fundamental structure in the S language is the data frame. A data frame is like a matrix in that it is a two-dimensional array, but the difference is that the columns can be different data types. The following code generates a data frame named SAMP that has two numeric columns, one column created from a character vector, and one logical column. It uses the function rnorm(), which generates a random sample from the standard normal distribution (bell curve). The matrix y is given the column names Y1 and Y2 so that those names are carried into the data frame. Each time this code is run, different values will be obtained since each use of runif() and rnorm() produces new random samples.
    > y = matrix(rnorm(20),ncol=2,dimnames=list(NULL,c("Y1","Y2")))
    > x = rep(paste("A",1:2,sep=""),5)
    > z = runif(10) > .5
    > SAMP = data.frame(y,x,z,stringsAsFactors=TRUE)
    > SAMP
               Y1         Y2  x     z
    1   0.2402750  1.3561348 A1 FALSE
    2   0.3669875 -1.4239780 A2 FALSE
    3  -1.5042563  1.2929657 A1  TRUE
    4   1.2329026  0.3838835 A2  TRUE
    5  -0.1241536 -0.5596217 A1  TRUE
    6  -0.1784147  1.2920853 A2 FALSE
    7  -1.2848231  1.7107087 A1  TRUE
    8   0.7731956  0.6520663 A2 FALSE
    9  -0.3515564  0.3169168 A1  TRUE
    10 -1.3513955  1.3663698 A2  TRUE
    
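    Note that the rows and columns have names, referred to as dimnames. Arrays and data frames can be addressed through their names in addition to their position. Also note that variable x is a character vector, but the component SAMP$x is a factor. Versions of R before 4.0.0 perform this coercion automatically; in R 4.0.0 and later, data.frame() keeps character vectors as they are unless the argument stringsAsFactors=TRUE is included: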
    > is.factor(x)
    [1] FALSE
    > is.factor(SAMP$x)
    [1] TRUE
    

  16. The S language is an object-oriented language. Many fundamental operations behave differently for different types of objects. For example, if the argument to the function sum() is a numeric vector, then the result will be the sum of its elements, but if the argument is a logical vector, then the result will be the number of TRUE elements. Also, the plot function will produce an ordinary scatterplot if its x,y arguments are both numeric vectors, but will produce a boxplot if the x argument is a factor:
    > plot(SAMP$Y1,SAMP$Y2)
    > plot(SAMP$x,SAMP$Y2)
    
    If the argument to plot is the result of a linear model fit using lm(), then the plot function will produce a set of diagnostic residual plots. (More about this later in the course.)
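
    The behavior of sum() on numeric and logical vectors can be checked directly. The sketch below also uses class(), which reports the type of an object that functions such as plot() use to decide what to do. The data frame cars is one of the datasets included with R.

    ```r
    x = c(2.0, 6.5, 11.0, 15.5, 20.0)
    sum(x)           # 55: sum of the elements
    sum(x > 10)      # 3: number of TRUE elements
    mean(x > 10)     # 0.6: proportion of TRUE elements
    class(x)                      # "numeric"
    class(factor(c("A1","A2")))   # "factor"
    fit = lm(dist ~ speed, data = cars)
    class(fit)                    # "lm"
    ```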

  17. Reading Data from files. The two main functions to read data that is contained in a file are scan() and read.table().
    scan(Fname) reads a file whose name is the value of Fname. All values in the file must be the same type (numeric, string, logical). By default, scan() reads numeric data. If the values in this file are not numeric, then the optional argument what= must be included. For example, if the file contains strings, then
    x = scan(Fname,what=character(0))
    will read this data. Note that Fname as used here is an R object whose value is the name of the file that contains the data.
    Note: if the file is not located in the working directory, then the full path name must be used to specify the file. R uses Unix conventions for path names (forward slashes) regardless of the operating system. So, for example, in Windows a file located on the C-drive in folder StatData named Data1.txt would be scanned by
    x = scan("c:/StatData/Data1.txt")
    
    The file name argument also can be a web address.
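
    A self-contained sketch of both uses of scan(). The file name and its contents are hypothetical, and the file is written to a temporary directory so that the example does not depend on any existing data.

    ```r
    # hypothetical data file, written to a temporary location for illustration
    Fname = file.path(tempdir(), "Data1.txt")
    writeLines(c("3.1 4.7", "5.0 2.2"), Fname)
    x = scan(Fname)                       # numeric by default
    x                                     # 3.1 4.7 5.0 2.2
    # replace the file contents with strings
    writeLines(c("red blue", "green"), Fname)
    s = scan(Fname, what=character(0))    # strings require what=
    s                                     # "red" "blue" "green"
    ```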

    Data Frames and read.table(). Tabular data contained in a file can be read by R using the read.table() function. Each column in the table is treated as a separate variable and variables can be numeric, logical, or character (strings). That is, different columns can be different types, but each column must be the same type. An example of such a file is
    http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data
    Note that the first few lines begin with the character #. This is the comment character. R ignores that character and the remainder of the line. The first non-comment line contains names for the columns, so we must include the optional argument header=TRUE as follows:

    Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data",
      header=TRUE)
    
    The first column in this file is not really data, but just gives the name of each city in the data set. These can be used as row names:
    Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data",
      header=TRUE,
      row.names=1)
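
    The effect of comment lines, header=TRUE, and row.names=1 can be tried without a network connection by passing the lines of a table through the text= argument of read.table(). The values below are invented for illustration and are not the contents of Temperature.data; the name Toy is arbitrary.

    ```r
    # invented values for illustration only
    lines = c("# lines starting with # are ignored",
              "City JanTemp Lat Long",
              "A 44 31.2 88.5",
              "B 38 32.9 86.8",
              "C 35 33.6 112.5")
    Toy = read.table(text=lines, header=TRUE, row.names=1)
    dim(Toy)              # 3 3: the City column became the row names
    dimnames(Toy)[[1]]    # "A" "B" "C"
    Toy["B","Lat"]        # 32.9
    ```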
    

  18. The value returned by read.table() is a data.frame. This type of object can be thought of as an enhanced matrix. It has a dimension just like a matrix, the value of which is a vector containing the number of rows and number of columns. However, a data frame is intended to represent a data set in which each row is the set of variables obtained for each subject in the sample and each column contains the observations for each variable being measured. In the case of the Temperature data, these variables are:
    JanTemp, Lat, Long
    Unlike a matrix, a data frame can have different types of variables, but each variable (column) must contain the same type.

    Individual variables in a data frame can be accessed several ways.

    1. Using $:
      Latitude = Temp$Lat
      
    2. Name:
      Latitude = Temp[["Lat"]]
      
    3. Number:
      Latitude = Temp[[2]]
      
    Note that the object named Latitude is a vector. If you want to extract a subset of the variables with all rows included, then use []. The result is a data frame. If the original data frame has names, these are carried over to the new data frame. If you only want some of the rows, then specify these the way it is done with matrices:
    LatLong = Temp[2:3] #extract variables 2 through 3
    LatLong = Temp[c("Lat","Long")] #extract Lat and Long
    LatLong1 = Temp[1:20,c("Lat","Long")] #extract first 20 rows for Lat and Long
    
    Although it may seem like more work to use names, the advantage is that one does not need to know the index of the desired column, just its name.

    Additional variables can be added to a data frame as follows.

    #create new variable named Region with same length as other variables in Temp
    Region = rep("NE",dim(Temp)[1])
    # NE is defined to be Lat >= 39.75 and Long < 90
    # SE is defined to be Lat < 39.75 and Long < 90
    # SW is defined to be Lat < 39.75 and Long >= 90
    # NW is defined to be Lat >= 39.75 and Long >= 90
    Region[Temp$Lat < 39.75 & Temp$Long < 90] = "SE"
    Region[Temp$Lat < 39.75 & Temp$Long >= 90] = "SW"
    Region[Temp$Lat >= 39.75 & Temp$Long >= 90] = "NW"
    #give Region the same row names as Temp
    names(Region) = dimnames(Temp)[[1]]
    #make Region a factor
    Region = factor(Region)
    #add Region to Temp
    Temp1 = data.frame(Temp,Region)
    #plot January Temperature vs Region
    #since Region is a factor, this gives a boxplot
    plot(JanTemp ~ Region,data=Temp1)
    

  19. Accessing data in a spreadsheet. If a table of data is contained in a spreadsheet like Excel, then the easiest way to import it into R is to save the table as a comma-separated-values file. Then use read.table() to read the file with separator argument sep=",". The file
    http://www.utdallas.edu/~ammann/SmokeCancer.csv
    can be read into R by
    Smoke =  read.table("http://www.utdallas.edu/~ammann/SmokeCancer.csv",
      header=TRUE,sep=",",row.names=1)
    
    Note that 2 of the entries in this table are NA. These denote types of cancer that were not reported in that state during the time period covered by the data. We can change those entries to 0 as follows.
    Smoke[is.na(Smoke)] = 0
    

    There is a companion function, write.table(), that can be used to write a matrix or data frame to a file that then can be imported into a spreadsheet.
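
    A sketch of a round trip with write.table() and read.table(). The data frame Toy, its values, and the file name are made up for illustration. The argument col.names=NA is a documented option of write.table() that writes an empty header cell above the row names, which keeps the columns aligned when the file is opened in a spreadsheet.

    ```r
    # small made-up table; the file name is hypothetical
    Toy = data.frame(Lung=c(17.05, NA), Bladder=c(2.9, 3.5),
                     row.names=c("AK","AL"))
    Toy[is.na(Toy)] = 0
    csvfile = file.path(tempdir(), "Toy.csv")
    write.table(Toy, csvfile, sep=",", col.names=NA)
    # read the file back the same way SmokeCancer.csv is read above
    Back = read.table(csvfile, header=TRUE, sep=",", row.names=1)
    Back
    ```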

  20. Saving graphics. By default R uses a separate graphical window for the display of graphic commands. A graphic can be saved to a file using any of several different graphical file types. The most commonly used are pdf() and png() since these types can be imported into documents created by Word or LaTeX. The first argument for these functions is the filename. Arguments width=,height= give the dimensions of the graphic. For pdf() the dimension units are inches, for png() the units are pixels. pdf() supports multi-page graphics, but png() only allows one page per file unless the file name has the form Myplot%d.png. For example,
    pdf("TempPlot.pdf",width=6,height=6)
    plot(JanTemp ~ Lat,data=Temp)
    plot(JanTemp ~ Region,data=Temp1)
    graphics.off()
    #creates a 2-page pdf document
    png("TempPlot%d.png",width=480,height=480)
    plot(JanTemp ~ Lat,data=Temp)
    plot(JanTemp ~ Region,data=Temp1)
    graphics.off()
    #creates two files: TempPlot1.png and TempPlot2.png
    
    The function graphics.off() writes any closing material required by the graphic file type and then closes the graphics file.

Exercise. Work the exercises given in Sections 2.4 and 3.4 of the Owen-TheRGuide document referenced above.

There are a number of datasets included in the R distribution along with examples of their use in the help pages. One example is given below.

# load cars data frame
data(cars)
# plot braking distance vs speed with custom x-labels and y-labels,
# and axis numbers horizontal
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1)
# add lowess line (smoothed trend line)
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
# add plot title
title(main = "cars data")
# new plot of same variables on a log scale for both axes
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, log = "xy")
# add plot title
title(main = "cars data (logarithmic scales)")
# add lowess line (smoothed trend line) for the log scale
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
# fit a regression model using log(speed) to predict log(dist) and
# print a summary of the fit
fm1 = lm(log(dist) ~ log(speed), data = cars)
summary(fm1)
# save the current plotting parameters and then setup a new plot
# region that puts 4 plots on the same page, 2 rows and 2 columns.
# use custom margins for the plot region.
opar = par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0),
            mar = c(4.1, 4.1, 2.1, 1.1))
# plot the diagnostic residual plots associated with a regression fit.
plot(fm1)
# restore the original plotting parameters
par(opar)


Larry Ammann
2014-10-14