
R Notes

The following links provide an excellent introduction to the use of R: (somewhat more advanced)
Other contributed books about R can be found on the CRAN site. Use the Contributed link under Documentation on the left side of the CRAN web page. Additional notes are provided below.

The S language was developed at Bell Labs as a high-level computer language for statistical computations and graphics. It has some similarities with Matlab, but it also has structures that Matlab does not have, such as data frames, that are natural data structures for statistical models. There are two implementations of this language currently available: a commercial product, S-Plus, and a freely available open-source product, R. R is available at
These implementations are mostly, but not completely, compatible.

Note: in the examples below the R prompt, > , is included but this would not be typed on the command line. It is used here to differentiate between input to R and output that is returned to the console after a command is entered.

On Linux or Macs, R can be run from a shell by entering
    R
at a shell prompt. The R session is ended by entering
    > q()
This will generate a query from R asking whether to save the workspace image. Enter n.

On Windows and Macs, R is packaged as a windowed application that starts with a command window. RStudio also is a windowed application that includes a window for entering commands, a window that describes the properties of objects that have been created during the session, and a window for graphics.

R's Workspace. The Workspace contains all the objects created or loaded during an R session. These objects only exist in the computer's memory, not on the physical hard drive and will disappear when R is exited. R offers a choice to the user when exiting: save the workspace or do not save it. If the Workspace is not saved, all objects created during the session will be lost. That's no problem if you are using it only as a mathematical or statistical calculator. If you are performing an analysis, but must exit before completing it, then you don't want to lose what you have already done. There is an alternative that I recommend instead of saving the workspace: write the commands you wish to enter into a text file and then copy/paste from the edit window into the R console. Even though this may seem like extra work, it has three advantages:

You must use a plain text editor to edit command files, not a document editor like Word. Both R and RStudio include an editor for scripts that is accessed from their File menu.

RStudio has an extensive set of resources to help users. Go to the Help tab on the right-hand window and click on An Introduction to R under Manuals. See sections 2.1-2.7 for details about the following.

  1. The basic data structure in R is a vector. This is a set of objects all of which must have the same mode, either numeric, logical, character, or complex.

  2. Assignment is performed with the character = or the two characters <-. The second assignment operator is older but = is used more commonly now since it is just a single character. When an assignment is made, its value is not echoed to the terminal. Lines with no assignment do result in the value of the expression being echoed to the terminal.
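The echo behavior described in item 2 can be seen in a short sketch. The object names a, b, and total are arbitrary and chosen only for illustration.

```r
# assignment with = : the value is stored but nothing is printed
a = 3 + 4
# assignment with <- is equivalent; again nothing is printed
b <- a * 2
# an expression with no assignment is echoed to the console
a + b
# enclosing an assignment in parentheses stores the value and echoes it
(total = a + b)
```

Wrapping an assignment in parentheses is a convenient way to check a value as it is created.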

  3. Sequences of integers can be generated by the colon expression,
    > x = 2:20
    > y = 15:1
    More general sequences can be generated with the seq() function. These operations produce vectors. Some examples:
    > seq(5)
    [1] 1 2 3 4 5
    > x = seq(2,20,length=5)
    > x
    [1]  2.0  6.5 11.0 15.5 20.0
    > y = seq(5,18,by=3)
    > y
    [1]  5  8 11 14 17
    The function c can be used to combine different vectors into a single vector.
    > c(x,y)
     [1]  2.0  6.5 11.0 15.5 20.0  5.0  8.0 11.0 14.0 17.0
    All vectors have an attribute named length which can be obtained by the function length()
    > length(c(x,y))
    [1] 10
    A scalar is just a vector of length 1.

  4. A useful function for creating strings is paste(). This function combines its arguments into strings. If all arguments have length 1, then the result is a single string. If all arguments are vectors with the same length, then the pasting is done element-wise and the result is a vector with the same length as the arguments. However, if some arguments are vectors with length greater than 1, and the others all have length 1, then the other arguments are replicated to have the same length and then pasted together element-wise. Numeric arguments are coerced to strings before pasting. Floating point values usually need to be rounded to control the number of decimal digits that are used. The default separator between arguments is a single space, but a different separator can be specified with the argument, sep=.
    > s = sum(x)
    > paste("Sum of x =",s)
    [1] "Sum of x = 55"
    > paste(x,y,sep=",")
    [1] "2,5"     "6.5,8"   "11,11"   "15.5,14" "20,17"
    > paste("X",seq(length(x)),sep="")
    [1] "X1" "X2" "X3" "X4" "X5"
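The remark about rounding floating point values before pasting can be illustrated as follows. This is only a sketch; the values and the names m, msg, and lab are made up for the example.

```r
# the mean of these three values has a long decimal expansion
m = mean(c(1, 2, 2))
# without rounding, many digits appear in the string
paste("Mean =", m)
# round to 2 decimal places before pasting
msg = paste("Mean =", round(m, 2))
msg
# sep="" removes the default space between the arguments
lab = paste("Group", 1:3, sep="")
lab
```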

  5. Vectors can have names which is useful for printing and for referencing particular elements of a vector. The function names() returns the names of a vector as well as assigning names to a vector.
    > names(x) = paste("X",seq(x),sep="")
    > x
      X1   X2   X3   X4   X5 
     2.0  6.5 11.0 15.5 20.0
    Elements of a vector are referenced by the function []. Arguments can be a vector of indices that refer to specific positions within the vector:
    > x[2:4]
      X2   X3   X4 
     6.5 11.0 15.5 
    > x[c(2,5)]
      X2   X5 
     6.5 20.0
    Elements also can be referenced by their names or by a logical vector in addition to their index:
    > x[c("X3","X4")]
      X3   X4 
    11.0 15.5
    > xl = x > 10
    > xl
       X1    X2    X3    X4    X5 
    FALSE FALSE  TRUE  TRUE  TRUE
    > x[xl]
      X3   X4   X5 
    11.0 15.5 20.0
    The length of the referencing vector can be larger than the length of the vector that is being referenced as long as the referencing vector is either a vector of indices or names.
    > ndx = rep(seq(x),2)
    > ndx
     [1] 1 2 3 4 5 1 2 3 4 5
    > x[ndx]
      X1   X2   X3   X4   X5   X1   X2   X3   X4   X5
     2.0  6.5 11.0 15.5 20.0  2.0  6.5 11.0 15.5 20.0
    This is useful for table lookups. Suppose for example that Gender is a vector of elements that are either Male or Female:
    > Gender
    [1] "Male"   "Male"   "Female" "Male"   "Female"
    and Gcol is a vector of two colors whose names are the two unique elements of Gender
    > Gcol = c("blue","red")
    > names(Gcol) = c("Male","Female")
    > Gcol
      Male Female 
    "blue"  "red"
    > GenderCol = Gcol[Gender]
    > GenderCol
      Male   Male Female   Male Female 
    "blue" "blue"  "red" "blue"  "red"
    This will be useful for plotting data.

  6. R supports matrices and arrays of arbitrary dimensions. These can be created with the matrix and array functions. Arrays and matrices are stored internally in column-major order. For example,
    X = 1:10
    assigns to the object X the vector consisting of the integers 1 to 10.
    M = matrix(X,nrow=5)
    puts the entries of X into a matrix named M that has 5 rows and 2 columns. The first column of M contains the first 5 elements of X and the second column of M contains the remaining 5 elements. If a vector does not fit exactly into the dimensions of the matrix, then a warning is returned.
    > M
         [,1] [,2]
    [1,]    1    6
    [2,]    2    7
    [3,]    3    8
    [4,]    4    9
    [5,]    5   10
    The dimensions of a matrix are obtained by the function dim() which returns the number of rows and number of columns as a vector of length 2.
    > dim(M)
    [1] 5 2

  7. Elements of matrices and arrays are referenced using [] but with the number of arguments equal to the number of dimensions. A matrix has two dimensions, so M[2,1] refers to the element in row 2 and column 1.
    > X = matrix(runif(100),nrow=20)
    > X[2:5,2:4]
                [,1]      [,2]      [,3]
    [1,] 0.731622617 0.6578677 0.7446229
    [2,] 0.023472598 0.2111300 0.7775343
    [3,] 0.001858455 0.2887734 0.8103568
    [4,] 0.269611100 0.7527248 0.2127048
    Note: the function runif(n) returns a vector of n random numbers between 0 and 1. Each time the function runif is called it will return a new set of values. So if the runif() function is run again, a different set of values will be returned.
    Note: if one of the arguments to [,] is empty, then all elements of that dimension are returned. So X[2:4,] gives all columns of rows 2,3,4 and so is a matrix with 3 rows and the same number of columns as X.
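A minimal sketch of the indexing rules in item 7 is given here. It uses set.seed(), which is not part of the notes above, so that the random values are the same on every run; the names sub, rows, and col1 are arbitrary.

```r
# fix the random seed so the same values are produced on every run
set.seed(1)
X = matrix(runif(100), nrow=20)   # 20 rows, 5 columns
dim(X)
# rows 2 through 5 of columns 2 through 4: a 4 x 3 matrix
sub = X[2:5, 2:4]
dim(sub)
# empty column argument: all columns of rows 2, 3, 4
rows = X[2:4, ]
dim(rows)
# empty row argument: all rows of column 1, returned as a plain vector
col1 = X[, 1]
length(col1)
```

Note that extracting a single column returns a vector rather than a one-column matrix.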

  8. Example. The file contains yearly sunspot numbers since 1700. Note that the first row of this file is not data but represents names for the columns. This file is an example of tabular data. Such data can be imported into R using the function read.table(). Further details about this function are given below.
    Sunspots = read.table("",header=TRUE)
    Note that the filename argument in this case is a web address. The argument also can be the name of a file on your computer. The second argument indicates that the first row of this file contains names for the columns. These are accessed by
    names(Sunspots)
    Suppose we wish to plot sunspot numbers versus year. There are several ways to accomplish this.
    plot(Sunspots[,1],Sunspots[,2], type="l")
    plot(Number ~ Year, data=Sunspots, type="l")
    The last method uses what is referred to as the formula interface for the plot function. Now let's add a title to make the plot more informative.
    title("Yearly mean total sunspot numbers")
    To be more informative, add the range of years contained in this data set.
    title("Yearly mean total sunspot numbers, 1700-2016")
    The title can be split into two lines as follows
    title("Yearly mean total sunspot numbers\n1700-2016")
    using the newline character \n. Note that this requires that we already know the range of years contained in the data. Alternatively, we could obtain that range from the data. That would make our command file more general. The following file contains these commands:
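One way to obtain the range of years from the data, rather than typing it, is sketched below. Because the address of the sunspot file is not reproduced here, the sketch uses a small stand-in data frame with made-up values; only the column names Year and Number are taken from the plot commands above, and the names yrs and ttl are arbitrary.

```r
# stand-in data frame with made-up values (NOT the real sunspot data)
Sunspots = data.frame(Year = 1700:1709, Number = round(runif(10, 0, 100)))
# smallest and largest year contained in the data
yrs = range(Sunspots$Year)
# build a two-line title from the data instead of typing the years
ttl = paste("Yearly mean total sunspot numbers\n", yrs[1], "-", yrs[2], sep="")
plot(Number ~ Year, data=Sunspots, type="l")
title(ttl)
```

With the real data frame in place of the stand-in, the same commands would pick up whatever range of years the file contains.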

  9. Lists. A list is a structure whose components can be any type of object of any length. Lists can be created by the list function, and the components of a list can be accessed by appending a $ to the name of the list object followed by the name of the component. The dimension names of a matrix or array are a list with components that are the vectors of names for the respective dimensions. Components of a list also can be accessed by position using the [[]] function
    > X = seq(20)/2
    > Y = 2+6*X + rnorm(length(X),0,.5)
    > Z = matrix(runif(9),3,3)
    > XYZ = list(Var1=X,Var2=Y,Zmat=Z)   # XYZ is a placeholder name for the list
    > names(XYZ)
    [1] "Var1" "Var2" "Zmat"
    > XYZ$Var1[1:5]
    [1] 0.5 1.0 1.5 2.0 2.5
    > data(state)
    > state.x77["Texas",]
    Population     Income Illiteracy   Life Exp     Murder    HS Grad
       12237.0     4188.0        2.2       70.9       12.2       47.4
          Frost       Area 
           35.0   262134.0
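The different ways of accessing list components can be compared in a short sketch. The list Info and its component names are hypothetical; state.x77 is the built-in matrix used above.

```r
# a list whose components have different types and lengths
Info = list(id="A1", scores=c(90, 85, 77), m=matrix(1:4, 2, 2))
names(Info)
# access a component by name with $
Info$scores
# access the same component by position or by name with [[ ]]
Info[[2]]
Info[["scores"]]
# the dimension names of a matrix are a list: row names, then column names
data(state)
dn = dimnames(state.x77)
class(dn)
dn[[2]]
```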

  10. The dimension names of a matrix can be set or accessed by the function dimnames(). For example, the row names for state.x77 are given by
    > dimnames(state.x77)[[1]]
    and the column names are given by
    > dimnames(state.x77)[[2]]
    These also can be used to set the dimension names of a matrix. For example, instead of using the full state names for this matrix, suppose we wanted to use just the 2-letter abbreviations, which are contained in the vector state.abb that is loaded by data(state):
    > StateData = state.x77
    > dimnames(StateData)[[1]] = state.abb

  11. Example. Suppose we wanted to find out which states have higher Illiteracy rates than Texas. We can do this by creating a logical vector that indicates which elements of the Illiteracy column are greater than the Illiteracy rate for Texas. That vector can be used to extract the names of states with higher Illiteracy rates.
    > txill = state.x77["Texas","Illiteracy"]
    > highIll = state.x77[,"Illiteracy"] > txill
    > dimnames(state.x77)[[1]][highIll]
    [1] "Louisiana"      "Mississippi"    "South Carolina"

  12. Matrix Operations. Matrix-matrix multiplication can be performed only when the two matrices are conformable, that is, their inner dimensions are the same. For example, if A is $n\times r$ and B is $r\times m$, then matrix-matrix multiplication of A and B is defined and results in a matrix C whose dimensions are $n\times m$. Elementwise multiplication of two matrices can be performed when both dimensions of the two matrices are the same. If for example D,E are $n\times m$ matrices, then
    F = D*E
    results in an $n\times m$ matrix F whose elements are

    $$F[i,j] = D[i,j]*E[i,j],\quad 1\le i\le n,\ 1\le j\le m.$$

    These two different types of multiplication operations must be differentiated by using different symbols, since both types would be possible if the matrices have the same dimensions. Matrix-matrix multiplication is denoted by $A\%*\%B$ and returns a matrix.
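The two kinds of multiplication can be compared in a small sketch. The matrices and their dimensions are chosen only for illustration, and the name H is used for the elementwise product to avoid masking R's built-in F.

```r
A = matrix(1:6, nrow=2)     # 2 x 3
B = matrix(1:12, nrow=3)    # 3 x 4
# matrix-matrix multiplication: inner dimensions (3 and 3) agree
C = A %*% B
dim(C)                      # 2 x 4
# elementwise multiplication: both matrices are 2 x 3
D = matrix(1:6, nrow=2)
E = matrix(6:1, nrow=2)
H = D * E
# each element of H is the product of the corresponding elements
H[1, 2] == D[1, 2] * E[1, 2]
```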

  13. Factors. A factor is a special type of character vector that is used to represent categorical variables. This structure is especially useful in statistical models such as ANOVA or general linear models. Associated with a factor variable are its levels, the set of unique character values in the vector. Although print methods for a factor will by default print a factor as a character vector, it is stored internally using integer positions of the values corresponding to the levels.
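The relation between a factor, its levels, and its internal integer codes can be seen in a brief sketch. The values in g are made up for illustration.

```r
# a character vector representing a categorical variable
g = c("low", "high", "medium", "low", "high")
f = factor(g)
# the levels are the unique values, sorted alphabetically by default
levels(f)
# internal storage: the position of each value within levels(f)
as.integer(f)
# counts for each level
table(f)
```

Because the levels are sorted alphabetically by default, "high" is level 1, "low" is level 2, and "medium" is level 3 in this sketch.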

  14. A fundamental structure in the S language is the data frame. A data frame is like a matrix in that it is a two-dimensional array, but the difference is that the columns can be different data types. The following code generates a data frame named SAMP that has two numeric columns, one character column, and one logical column. It uses the function rnorm(), which generates a random sample from the standard normal distribution (bell curve). Each time this code is run, different values will be obtained since each use of runif() and rnorm() produces new random samples. The columns of the matrix y are given the names Y1 and Y2 so that these names carry over to the data frame, and stringsAsFactors=TRUE is specified because, beginning with R 4.0.0, data.frame() no longer converts character vectors to factors by default.
    > y = matrix(rnorm(20),ncol=2,dimnames=list(NULL,c("Y1","Y2")))
    > x = rep(paste("A",1:2,sep=""),5)
    > z = runif(10) > .5
    > SAMP = data.frame(y,x,z,stringsAsFactors=TRUE)
    > SAMP
               Y1         Y2  x     z
    1   0.2402750  1.3561348 A1 FALSE
    2   0.3669875 -1.4239780 A2 FALSE
    3  -1.5042563  1.2929657 A1  TRUE
    4   1.2329026  0.3838835 A2  TRUE
    5  -0.1241536 -0.5596217 A1  TRUE
    6  -0.1784147  1.2920853 A2 FALSE
    7  -1.2848231  1.7107087 A1  TRUE
    8   0.7731956  0.6520663 A2 FALSE
    9  -0.3515564  0.3169168 A1  TRUE
    10 -1.3513955  1.3663698 A2  TRUE
    Note that the rows and columns have names, referred to as dimnames. Arrays and data frames can be addressed through their names in addition to their position. Also note that variable x is a character vector, but with stringsAsFactors=TRUE the data.frame function coerces that component to be a factor:
    > is.factor(x)
    [1] FALSE
    > is.factor(SAMP$x)
    [1] TRUE

  15. The S language is an object-oriented language. Many fundamental operations behave differently for different types of objects. For example, if the argument to the function sum() is a numeric vector, then the result will be the sum of its elements, but if the argument is a logical vector, then the result will be the number of TRUE elements. Also, the plot function will produce an ordinary scatterplot if its x,y arguments are both numeric vectors, but will produce a boxplot if the x argument is a factor:
    > plot(SAMP$Y1,SAMP$Y2)
    > plot(SAMP$x,SAMP$Y2)
    A better way to produce these plots is to use the formula interface along with the data= argument if the variables are contained within a data frame.
    > plot(Y2 ~ Y1, data=SAMP)
    > plot(Y2 ~ x, data=SAMP)
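The behavior of sum() for numeric and logical arguments can be checked directly. The vector v is made up for illustration.

```r
v = c(3, 8, 12, 1, 20)
# numeric vector: sum of the elements
sum(v)
# logical vector: TRUE where the element exceeds 5
big = v > 5
# sum of a logical vector counts the TRUE elements
sum(big)
# mean of a logical vector gives the proportion of TRUE elements
mean(big)
# the class of an object determines how such functions treat it
class(v)
class(big)
```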

  16. Reading Data from files. The two main functions to read data that is contained in a file are scan() and read.table().
    scan(Fname) reads a file whose name is the value of Fname. All values in the file must be the same type (numeric, string, logical). By default, scan() reads numeric data. If the values in this file are not numeric, then the optional argument what= must be included. For example, if the file contains strings, then
    x = scan(Fname,what=character(0))
    will read this data. Note that Fname as used here is an R object whose value is the name of the file that contains the data.
    Note: if the file is not located in the working directory, then full path names must be used to specify the file. R uses unix conventions for path names regardless of the operating system. So, for example, in Windows a file located on the C-drive in folder StatData named Data1.txt would be scanned by
    x = scan("c:/StatData/Data1.txt")
    The file name argument also can be a web address.
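A self-contained sketch of scan() is given here. It writes two small temporary files with tempfile() and writeLines(), which are not part of the notes above, so that there is something to read; the file contents are made up.

```r
# create a temporary file containing numeric values
Fname = tempfile(fileext=".txt")
writeLines(c("3 1 4", "1 5 9"), Fname)
# numeric data is the default for scan()
x = scan(Fname)
x
# create a temporary file containing strings
Sname = tempfile(fileext=".txt")
writeLines(c("red green", "blue"), Sname)
# what=character(0) tells scan() to read strings
s = scan(Sname, what=character(0))
s
```

Note that scan() returns a single vector regardless of how the values are arranged on the lines of the file.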

    Data Frames and read.table(). Tabular data contained in a file can be read by R using the read.table() function. Each column in the table is treated as a separate variable and variables can be numeric, logical, or character (strings). That is, different columns can be different types, but each column must be the same type. An example of such a file is
    Note that the first few lines begin with the character #. This is the comment character. R ignores that character and the remainder of the line. The first non-comment line contains names for the columns. In that case we must include the optional argument header=TRUE as follows:

    Temp = read.table("",
      header=TRUE)
    The first column in this file is not really data, but just gives the name of each city in the data set. These can be used as row names:
    Temp = read.table("",
      header=TRUE, row.names=1)

    The value returned by read.table() is a data.frame. This type of object can be thought of as an enhanced matrix. It has a dimension just like a matrix, the value of which is a vector containing the number of rows and number of columns. However, a data frame is intended to represent a data set in which each row is the set of variables obtained for each subject in the sample and each column contains the observations for each variable being measured. In the case of the Temperature data, these variables are:
    JanTemp, Lat, Long
    Unlike a matrix, a data frame can have different types of variables, but each variable (column) must contain the same type.

    Individual variables in a data frame can be accessed several ways.

    1. Using $:
      Latitude = Temp$Lat
    2. Name:
      Latitude = Temp[["Lat"]]
    3. Number:
      Latitude = Temp[[2]]
    Note that the object named Latitude is a vector. If you want to extract a subset of the variables with all rows included, then use []. The result is a data frame. If the original data frame has names, these are carried over to the new data frame. If you only want some of the rows, then specify these the way it is done with matrices:
    LatLong = Temp[2:3] #extract variables 2 through 3
    LatLong = Temp[c("Lat","Long")] #extract Lat and Long
    LatLong1 = Temp[1:20,c("Lat","Long")] #extract first 20 rows for Lat and Long
    Although it may seem like more work to use names, the advantage is that one does not need to know the index of the desired column, just its name.
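Since the temperature file is not reproduced here, the same access methods can be tried on the built-in cars data frame, which has columns speed and dist. This is only a sketch; the object names are arbitrary.

```r
data(cars)
# three equivalent ways to extract one variable as a vector
s1 = cars$speed
s2 = cars[["speed"]]
s3 = cars[[1]]
identical(s1, s2)
identical(s2, s3)
# single brackets return a data frame, even for one column
sub = cars["dist"]
class(sub)
# first 5 rows of selected columns, specified as with matrices
first5 = cars[1:5, c("speed", "dist")]
dim(first5)
```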

    Additional variables can be added to a data frame as follows.

    #create new variable named Region with same length as other variables in Temp
    Region = rep("NE",dim(Temp)[1])
    # NE is defined to be Lat >= 39.75 and Long < 90
    # SE is defined to be Lat < 39.75 and Long < 90
    # SW is defined to be Lat < 39.75 and Long >= 90
    # NW is defined to be Lat >= 39.75 and Long >= 90
    Region[Temp$Lat < 39.75 & Temp$Long < 90] = "SE"
    Region[Temp$Lat < 39.75 & Temp$Long >= 90] = "SW"
    Region[Temp$Lat >= 39.75 & Temp$Long >= 90] = "NW"
    #give Region the same row names as Temp
    names(Region) = dimnames(Temp)[[1]]
    #make Region a factor
    Region = factor(Region)
    #add Region to Temp
    Temp1 = data.frame(Temp,Region)
    #plot January Temperature vs Region
    #since Region is a factor, this gives a boxplot
    plot(JanTemp ~ Region,data=Temp1)

  17. The plot() function is a top-level function that generates different types of plots depending on the types of its arguments. The formula interface is the recommended way to use this function, especially if the variables you wish to plot are contained within a data frame. When a plot() command (or any other top-level graphics function) is entered, R starts a new plot on the active graphics device, erasing whatever was displayed there, or opens a new graphics window if no device is open. Optional arguments include:

  18. Other functions add components to an existing graphic. These functions include
    title() Add a main title to the top of an existing graphic. Optional argument sub= adds a subtitle to the bottom.
    points(x,y) Add points at locations specified by the x,y coordinates. Optional arguments include pch= to use different plotting symbols, col= to use different colors for the points.
    lines(x,y) Add lines that join the points specified by x,y arguments. Optional arguments include lty= to use different line types, col= to use different colors for the lines.
    text(x,y,labels=) Add strings at the locations specified by x,y arguments.
    mtext() Add text to margins of a plot.
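The functions listed in item 18 can be combined as in the following sketch, which uses the built-in cars data. The choice of symbols, colors, and label text is arbitrary, and the least-squares line from lm() is included only to have something to draw with lines().

```r
data(cars)
# set up the axes without drawing anything (type="n")
plot(dist ~ speed, data=cars, type="n")
# add the points with a filled plotting symbol in blue
points(cars$speed, cars$dist, pch=19, col="blue")
# add a dashed red line joining the fitted values of a least-squares fit
fit = lm(dist ~ speed, data=cars)
lines(cars$speed, fitted(fit), lty=2, col="red")
# add a string inside the plot, left-justified at the given location
text(min(cars$speed), max(cars$dist), labels="cars data", adj=0)
# add text to the top margin
mtext("dashed line: least-squares fit", side=3, line=0.3)
# add a main title and a subtitle
title("Stopping distance vs speed", sub="built-in cars data")
```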

  19. Accessing data in a spreadsheet. If a table of data is contained in a spreadsheet like Excel, then the easiest way to import it into R is to save the table as a comma-separated-values file. Then use read.table() to read the file with separator argument sep=",". The file
    can be read into R by
    Smoke =  read.table("",
      header=TRUE, sep=",")
    Note that 2 of the entries in this table are NA. These denote types of cancer that were not reported in that state during the time period covered by the data. We can change those entries to 0 as follows.
    Smoke[is.na(Smoke)] = 0

    There is a companion function, write.table(), that can be used to write a matrix or data frame to a file that then can be imported into a spreadsheet.
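A round trip through a comma-separated file can be sketched as follows. The data frame d contains made-up values, and a temporary file created by tempfile() stands in for a file that would be exchanged with a spreadsheet.

```r
# small data frame with one missing value (made-up numbers)
d = data.frame(a=c(1, NA, 3), b=c(4, 5, 6))
# replace the missing entry with 0
d[is.na(d)] = 0
# write the data frame as comma-separated values without row names
f = tempfile(fileext=".csv")
write.table(d, f, sep=",", row.names=FALSE)
# read the file back; the first row contains the column names
d2 = read.table(f, header=TRUE, sep=",")
d2
```

The argument row.names=FALSE keeps R's row numbers out of the file so that the header row has one name per column.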

  20. Saving graphics. By default R uses a separate graphical window for the display of graphic commands. A graphic can be saved to a file using any of several different graphical file types. The most commonly used are pdf() and png() since these types can be imported into documents created by Word or LaTeX. The first argument for these functions is the filename. Arguments width=,height= give the dimensions of the graphic. For pdf() the dimension units are inches, for png() the units are pixels. pdf() supports multi-page graphics, but png() only allows one page per file unless the file name has the form Myplot%d.png. For example,
    pdf("TempPlot.pdf")   #file name chosen here for illustration
    plot(JanTemp ~ Lat,data=Temp)
    plot(JanTemp ~ Region,data=Temp1)
    dev.off()
    #creates a 2-page pdf document
    png("TempPlot%d.png")
    plot(JanTemp ~ Lat,data=Temp)
    plot(JanTemp ~ Region,data=Temp1)
    dev.off()
    #creates two files: TempPlot1.png and TempPlot2.png
    The function dev.off() writes any closing material required by the graphic file type and then closes the graphics file.
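A self-contained version of the pdf case is sketched below using the built-in cars data. The file name CarsPlot.pdf and the dimensions are arbitrary choices, and the file is written to R's temporary directory; dev.off() is the standard function for closing the current graphics device.

```r
data(cars)
# write the file to R's temporary directory
f = file.path(tempdir(), "CarsPlot.pdf")
# open a pdf graphics file that is 6 inches wide and 4 inches high
pdf(f, width=6, height=4)
# each top-level plot command starts a new page in the file
plot(dist ~ speed, data=cars)
plot(log(dist) ~ log(speed), data=cars)
# close the file; until this is done the pdf is incomplete
dev.off()
file.exists(f)
```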

  21. RStudio includes a plot tab where plots are displayed. After creating a plot, it can be exported to a graphic file that can be added to a Word document. This is done via the Export link on the plot tab using the Save as image selection. The most widely used image file type is png.

There are a number of datasets included in the R distribution along with examples of their use in the help pages. One example is given below.

# load cars data frame
data(cars)
# plot braking distance vs speed with custom x-labels and y-labels,
# and axis numbers horizontal
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1)
# add plot title
title(main = "Cars data")
# new plot of same variables on a log scale for both axes
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, log = "xy")
# add plot title
title(main = "Cars data (logarithmic scales)")
# fit a regression model using log(speed) to predict log(dist) and
# print a summary of the fit
fm1 = lm(log(dist) ~ log(speed), data = cars)
summary(fm1)
# save the current plotting parameters and then setup a new plot
# region that puts 4 plots on the same page, 2 rows and 2 columns.
# use custom margins for the plot region.
opar = par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0),
            mar = c(4.1, 4.1, 2.1, 1.1))
# plot the diagnostic residual plots associated with a regression fit.
plot(fm1)
# restore the original plotting parameters
par(opar)
