The following links provide introductions to the use of **R**:

http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf

http://cran.r-project.org/doc/contrib/usingR.pdf

Other contributed books about **R** can be found on the **CRAN** site. Use the
**Contributed** link under Documentation on the left side of the **CRAN** web page.
Additional notes are provided below.

The **S** language was developed at Bell Labs as a high-level computer
language for statistical computations and graphics. It has some similarities with
**Matlab**, but has some structures that **Matlab** does not have,
such as *data frames*, that are natural data structures for statistical
models. There are two implementations of this language currently available:
a commercial product, *S-Plus*, and a freely available open-source
product, **R**. **R** is available at

http://cran.r-project.org

These implementations are mostly, but not completely, compatible.

**Note**: in the examples below, the **R** prompt, `> `, is included, but it
would not be typed on the command line. Lines that do not begin with this
prompt represent what is returned by **R**.

**R** is started by entering `R` at a shell prompt, and it is ended by entering `q()`. This will generate a query from **R** asking whether to save the workspace image. Enter `n` (details about workspace images are given on page 6 of *R-intro*).

**The Workspace**. The *Workspace* contains all the objects created or loaded during an **R** session. These objects exist only in the computer's memory, not on the physical hard drive, and will disappear when **R** is exited. **R** offers a choice to the user when exiting: save the workspace or do not save it. If the workspace is not saved, all objects created during the session will be lost. That is no problem if you are using **R** only as a mathematical or statistical calculator. If you are performing an analysis but must exit before completing it, then you don't want to lose what you have already done.

There is an alternative that I recommend instead of saving the workspace: write the commands you wish to enter into a text file and then copy/paste from the edit window into the **R** console. Even though this may seem like extra work, it has three advantages: 1) any mistakes can be corrected immediately with the editor; 2) you won't have to remember what the objects in a workspace represent, since the file contains the commands that created those objects; 3) if you need to perform a similar analysis at a later time, you can just copy the original file to a new name and modify/extend the commands in the file to complete the later analysis. **You must use a plain text editor to edit command files, not a document editor like** `Word`.

- Help is obtained by entering
`help.start()` at the prompt. This will start a browser with the entry help page.
- Assignment is performed with the character `=`. The value of an assignment is not automatically echoed to the terminal. Lines with no assignment do result in the value of the expression being echoed to the terminal.
- Sequences of integers can be generated by the colon expression,

    > x = 2:20
    > y = 15:1

More general sequences can be generated with the `seq()` function. These operations produce vectors. Some examples:

    > seq(5)
    [1] 1 2 3 4 5
    > x = seq(2,20,length=5)
    > x
    [1]  2.0  6.5 11.0 15.5 20.0
    > y = seq(5,18,by=3)
    > y
    [1]  5  8 11 14 17

The function `c` can be used to combine different vectors into a single vector.

    > c(x,y)
     [1]  2.0  6.5 11.0 15.5 20.0  5.0  8.0 11.0 14.0 17.0

All vectors have a length, which can be obtained by the function `length()`:

    > length(c(x,y))
    [1] 10

- Vectors in **R** can have elements that are numeric, logical, or strings, but all elements of a vector must be the same type. A useful function for creating strings is `paste()`. This function combines its arguments into strings. If all arguments have length 1, then the result is a single string. If all arguments are vectors with the same length, then the pasting is done element-wise and the result is a vector with the same length as the arguments. However, if some arguments are vectors with length greater than 1 and the others all have length 1, then the length-1 arguments are replicated to the common length and then pasted together element-wise. Numeric arguments are coerced to strings before pasting. Floating point values usually need to be rounded to control the number of decimal digits that are used. The default separator between arguments is a single space, but a different separator can be specified with the argument `sep=`.

    > s = sum(x)
    > paste("Sum of x =",s)
    [1] "Sum of x = 55"
    > paste(x,y,sep=",")
    [1] "2,5"     "6.5,8"   "11,11"   "15.5,14" "20,17"
    > paste("X",seq(x),sep="")
    [1] "X1" "X2" "X3" "X4" "X5"

Note that the last example uses the expression `seq(x)`. If the argument to `seq()` is a vector with more than one element, then this expression is equivalent to `seq(length(x))`.
- Vectors can have names, which is useful for printing and for referencing particular elements of a vector. The function `names()` returns the names of a vector and can also be used to assign names to a vector.

    > names(x) = paste("X",seq(x),sep="")
    > x
      X1   X2   X3   X4   X5
     2.0  6.5 11.0 15.5 20.0
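The remark about rounding before pasting can be sketched as follows. The vector `v` and the label text are made up for illustration and are not part of the examples in these notes.

```r
# v is an illustrative vector, not data from these notes
v = c(2, 3, 6)
m = mean(v)                               # 3.666667 when printed
lab = paste("Mean of v =", round(m, 2))   # round first, then paste
print(lab)

# without rounding, paste() keeps up to 15 significant digits
print(paste("Mean of v =", m))

# element-wise pasting builds labels that can then be used as names
names(v) = paste("V", seq_along(v), sep = "")
print(v["V2"])
```

Here `seq_along(v)` plays the same role as `seq(v)` in the examples above, but it also behaves as expected when the vector has length 1.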

Elements of a vector are referenced by the function `[]`. Arguments can be a vector of indices that refer to specific positions within the vector:

    > x[2:4]
      X2   X3   X4
     6.5 11.0 15.5
    > x[c(2,5)]
      X2   X5
     6.5 20.0

Elements can be referenced by their names or by a logical vector:

    > x[c("X3","X4")]
      X3   X4
    11.0 15.5
    > xl = x > 10
    > xl
       X1    X2    X3    X4    X5
    FALSE FALSE  TRUE  TRUE  TRUE
    > x[xl]
      X3   X4   X5
    11.0 15.5 20.0

The length of the referencing vector can be larger than the length of the vector that is being referenced, as long as the referencing vector is either a vector of indices or a vector of names.

    > ndx = rep(seq(x),2)
    > ndx
     [1] 1 2 3 4 5 1 2 3 4 5
    > x[ndx]
      X1   X2   X3   X4   X5   X1   X2   X3   X4   X5
     2.0  6.5 11.0 15.5 20.0  2.0  6.5 11.0 15.5 20.0

This is useful for table lookups. Suppose for example that `Gender` is a vector of elements that are either *Male* or *Female*:

    > Gender
    [1] "Male"   "Male"   "Female" "Male"   "Female"

and `Gcol` is a vector of two colors whose names are the two unique elements of `Gender`:

    > Gcol = c("blue","red")
    > names(Gcol) = c("Male","Female")
    > Gcol
      Male Female
    "blue"  "red"
    > GenderCol = Gcol[Gender]
    > GenderCol
      Male   Male Female   Male Female
    "blue" "blue"  "red" "blue"  "red"
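The lookup above assumes that `Gender` already exists. A self-contained sketch, in which `Gender` is created explicitly for illustration, might look like this:

```r
# Gender is created here for illustration; the notes assume it already exists
Gender = c("Male", "Male", "Female", "Male", "Female")

# a named vector acts as the lookup table
Gcol = c(Male = "blue", Female = "red")

# indexing by a character vector returns one color per element of Gender
GenderCol = Gcol[Gender]
print(GenderCol)

# unname() drops the names when only the values are wanted
print(unname(GenderCol))
```

The result has the same length as `Gender`, not the same length as `Gcol`, which is what makes it suitable as a per-point color vector.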

This will be useful for plotting data.

**R** supports matrices and arrays of arbitrary dimensions. These can be created with the `matrix` and `array` functions. Arrays and matrices are stored internally in column-major order. For example,

    X = 1:10

assigns to the object `X` the vector consisting of the integers 1 to 10.

    M = matrix(X,nrow=5)

puts the entries of `X` into a matrix named `M` that has 5 rows and 2 columns. The first column of `M` contains the first 5 elements of `X` and the second column of `M` contains the remaining 5 elements. If a vector does not fit exactly into the dimensions of the matrix, then a warning is returned.

    > M
         [,1] [,2]
    [1,]    1    6
    [2,]    2    7
    [3,]    3    8
    [4,]    4    9
    [5,]    5   10

The dimensions of a matrix are obtained by the function `dim()`, which returns the number of rows and number of columns in a vector of length 2:

    > dim(M)
    [1] 5 2
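To make the column-major fill order concrete, the following sketch compares the default with the `byrow=TRUE` argument of `matrix()`. The object name `Mr` is introduced here only for the comparison.

```r
X = 1:10
M  = matrix(X, nrow = 5)                 # default: fill down the columns
Mr = matrix(X, nrow = 5, byrow = TRUE)   # fill across the rows instead

print(M[1, ])    # first row of M
print(Mr[1, ])   # first row of Mr
print(dim(Mr))   # both matrices have 5 rows and 2 columns
```

With the default fill, the first row of `M` holds elements 1 and 6 of `X`; with `byrow=TRUE`, the first row of `Mr` holds elements 1 and 2.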

- Elements of matrices and arrays are referenced using `[]`, but with the number of arguments equal to the number of dimensions. A matrix has two dimensions, so `M[2,1]` refers to the element in row 2 and column 1.

    > X = matrix(runif(100),nrow=20)
    > X[2:5,2:4]
                [,1]      [,2]      [,3]
    [1,] 0.731622617 0.6578677 0.7446229
    [2,] 0.023472598 0.2111300 0.7775343
    [3,] 0.001858455 0.2887734 0.8103568
    [4,] 0.269611100 0.7527248 0.2127048

**Note**: the function `runif(n)` returns a vector of *n* random numbers between 0 and 1. If one of the arguments to `[,]` is empty, then all elements of that dimension are returned. So `X[2:4,]` gives all columns of rows 2,3,4 and so is a matrix with 3 rows and the same number of columns as `X`.

**Lists**. A *list* is a structure whose components can be any type of object of any length. Lists can be created by the *list* function, and the components of a list can be accessed by appending a `$` to the name of the list object followed by the name of the component. The dimension names of a matrix or array are a list with components that are the vectors of names for the respective dimensions. Components of a list also can be accessed by position using the `[[]]` function.

    > X = seq(20)/2
    > Y = 2+6*X + rnorm(length(X),0,.5)
    > Z = matrix(runif(9),3,3)
    > All.data = list(Var1=X,Var2=Y,Zmat=Z)
    > names(All.data)
    [1] "Var1" "Var2" "Zmat"
    > All.data$Var1[1:5]
    [1] 0.5 1.0 1.5 2.0 2.5
    > data(state)
    > state.x77["Texas",]
    Population     Income Illiteracy   Life Exp     Murder    HS Grad
       12237.0     4188.0        2.2       70.9       12.2       47.4
         Frost       Area
          35.0   262134.0
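The list example above shows access with `$` but not access by position. A brief sketch of the difference between `[[]]` and `[]` on the same list follows; `set.seed()` is added here only so that the random components are reproducible.

```r
set.seed(1)   # only so the random components are reproducible
X = seq(20)/2
Y = 2 + 6*X + rnorm(length(X), 0, .5)
Z = matrix(runif(9), 3, 3)
All.data = list(Var1 = X, Var2 = Y, Zmat = Z)

first = All.data[[1]]   # the component itself (a numeric vector)
sub   = All.data[1]     # a list of length 1 that contains Var1

print(identical(first, All.data$Var1))
print(class(sub))
print(dim(All.data[["Zmat"]]))   # a name can also be used inside [[ ]]
```

Double brackets extract a single component, while single brackets return a shorter list.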

- The dimension names of a matrix can be set or accessed by the function `dimnames()`. For example, the row names for `state.x77` are given by

    dimnames(state.x77)[[1]]

and the column names are given by

    dimnames(state.x77)[[2]]

These also can be used to set the dimension names of a matrix. For example, instead of using the full state names for this matrix, suppose we wanted to use just the 2-letter abbreviations:

    > StateData = state.x77
    > dimnames(StateData)[[1]] = state.abb

**Example**. Suppose we wanted to find out which states have higher Illiteracy rates than Texas. We can do this by creating a logical vector that indicates which elements of the Illiteracy column are greater than the Illiteracy rate for Texas. Then that vector can be used to extract the names of the states with higher Illiteracy rates.

    > txill = state.x77["Texas","Illiteracy"]
    > highIll = state.x77[,"Illiteracy"] > txill
    > state.name[highIll]
    [1] "Louisiana"      "Mississippi"    "South Carolina"

**Matrix Operations**. Matrix-matrix multiplication can be performed only when the two matrices are conformable, that is, their inner dimensions are the same. For example, if *A* is $m \times n$ and *B* is $n \times p$, then matrix-matrix multiplication of *A* and *B* is defined and results in a matrix *C* whose dimensions are $m \times p$. Elementwise multiplication of two matrices can be performed when both dimensions of the two matrices are the same. If for example *D,E* are $m \times n$ matrices, then

    F = D*E

results in an $m \times n$ matrix *F* whose elements are

$$F_{ij} = D_{ij} E_{ij}.$$

These two different types of multiplication operations must be differentiated by using different symbols, since both types would be possible if the matrices have the same dimensions. Matrix-matrix multiplication is denoted by `A %*% B` and elementwise multiplication is denoted by `A*B`. The same situation occurs if *A* and *B* are both vectors that have the same length *n*. In that case, `A %*% B` represents the *dot product* of these vectors,

$$\sum_{i=1}^{n} A_i B_i.$$

Note that this result is a scalar. `A*B` represents elementwise multiplication. The result is a vector *C* with $C_i = A_i B_i$.

**Factors**. A *factor* is a special type of character vector that is used to represent categorical variables. This structure is especially useful in statistical models such as ANOVA or general linear models. Associated with a factor variable are its levels, the set of unique character values in the vector. Although print methods for a factor will by default print a factor as a character vector, it is stored internally using integer positions of the values corresponding to the levels.

- A fundamental structure in the
**S** language is the *data frame*. A data frame is like a matrix in that it is a two-dimensional array, but the difference is that the columns can be different data types. The following code generates a data frame named `SAMP` that has two numeric columns, one character column, and one logical column. It uses the function `rnorm`, which generates a random sample from the standard normal distribution (bell curve). Each time this code is run, different values will be obtained since each use of `runif()` and `rnorm()` produces new random samples.

    > y = matrix(rnorm(20),ncol=2)
    > colnames(y) = c("Y1","Y2")   # name the columns so they appear as Y1, Y2
    > x = rep(paste("A",1:2,sep=""),5)
    > z = runif(10) > .5
    > SAMP = data.frame(y,x,z,stringsAsFactors=TRUE)
    > SAMP
               Y1         Y2  x     z
    1   0.2402750  1.3561348 A1 FALSE
    2   0.3669875 -1.4239780 A2 FALSE
    3  -1.5042563  1.2929657 A1  TRUE
    4   1.2329026  0.3838835 A2  TRUE
    5  -0.1241536 -0.5596217 A1  TRUE
    6  -0.1784147  1.2920853 A2 FALSE
    7  -1.2848231  1.7107087 A1  TRUE
    8   0.7731956  0.6520663 A2 FALSE
    9  -0.3515564  0.3169168 A1  TRUE
    10 -1.3513955  1.3663698 A2  TRUE

Note that the rows and columns have names, referred to as `dimnames`. Arrays and data frames can be addressed through their names in addition to their position. Also note that variable *x* is a character vector, but the *data.frame* function coerces that component to be a factor. This coercion was automatic in versions of **R** before 4.0.0; in later versions it must be requested with the argument `stringsAsFactors=TRUE`, as is done above.

    > is.factor(x)
    [1] FALSE
    > is.factor(SAMP$x)
    [1] TRUE
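The description of factors as integer codes plus a set of levels can be checked directly. The sketch below reuses the character vector `x` from the data frame example; the object names `f` and `d` are introduced here for illustration.

```r
x = rep(paste("A", 1:2, sep = ""), 5)
f = factor(x)

print(levels(f))        # the unique values
print(as.integer(f))    # integer codes that index into the levels
print(table(f))         # counts per level

# stringsAsFactors = TRUE requests the character-to-factor conversion explicitly
d = data.frame(x, stringsAsFactors = TRUE)
print(is.factor(d$x))
```

Asking for the conversion explicitly gives the same result regardless of which default a particular version of **R** uses.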

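Returning to the two kinds of multiplication described under **Matrix Operations**: in **R** the operator `%*%` is the matrix product and `*` is the elementwise product. The small matrices below are arbitrary illustrations, and the name `Fmat` is used instead of `F` because `F` is also a built-in shorthand for `FALSE`.

```r
A = matrix(1:6, nrow = 2)        # 2 x 3
B = matrix(1:12, nrow = 3)       # 3 x 4
C = A %*% B                      # matrix product, 2 x 4
print(dim(C))

D = matrix(1:6, nrow = 2)
E = matrix(6:1, nrow = 2)
Fmat = D * E                     # elementwise product, same dimensions as D
print(Fmat)

a = 1:3
b = 4:6
print(a %*% b)      # dot product, returned as a 1 x 1 matrix
print(sum(a * b))   # the same value as a plain number
print(a * b)        # elementwise product, a vector of length 3
```

Note that `a %*% b` returns a matrix with one row and one column rather than a bare number; `drop()` or `sum(a * b)` gives the scalar.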
- The **S** language is an object-oriented language. Many fundamental operations behave differently for different types of objects. For example, if the argument to the function `sum()` is a numeric vector, then the result will be the sum of its elements, but if the argument is a logical vector, then the result will be the number of TRUE elements. Also, the *plot* function will produce an ordinary scatterplot if its *x,y* arguments are both numeric vectors, but will produce a boxplot if the *x* argument is a factor:

    > plot(SAMP$Y1,SAMP$Y2)
    > plot(SAMP$x,SAMP$Y2)

If the argument to *plot* is the result of a linear model fit using `lm()`, then the plot function will produce a set of diagnostic residual plots. (More about this later in the course.)

**Reading Data from files**. The two main functions to read data that is contained in a file are *scan()* and *read.table()*. *scan(Fname)* reads a file whose name is the value of `Fname`. All values in the file must be the same type (numeric, string, logical). By default, *scan()* reads numeric data. If the values in this file are not numeric, then the optional argument `what=` must be included. For example, if the file contains strings, then

    x = scan(Fname,what=character(0))

will read this data. Note that `Fname` as used here is an **R** object whose value is the name of the file that contains the data.

**Note**: if the file is not located in the working directory, then full path names must be used to specify the file. **R** uses unix conventions for path names regardless of the operating system. So, for example, in Windows a file located on the C-drive in folder StatData named Data1.txt would be scanned by

    x = scan("c:/StatData/Data1.txt")
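The two uses of `scan()` described above can be tried without any external file. In this sketch the files are temporary and their contents are made up for illustration.

```r
# write a few values to temporary files so the example is self-contained
Fnum = tempfile(fileext = ".txt")
Fchr = tempfile(fileext = ".txt")
writeLines("1.5 2 3.25 4", Fnum)
writeLines("red green blue", Fchr)

x = scan(Fnum)                        # numeric is the default
s = scan(Fchr, what = character(0))   # strings need the what= argument
print(x)
print(s)
```

Calling `scan(Fchr)` without `what=` would stop with an error, because the default expects numeric values.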

The file name argument to `scan()` also can be a web address.

**Data Frames and read.table()**. Tabular data contained in a file can be read by **R** using the *read.table()* function. Each column in the table is treated as a separate variable and variables can be numeric, logical, or character (strings). That is, different columns can be different types, but each column must be the same type. An example of such a file is

http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data

Note that the first few lines begin with the character `#`. This is the comment character. **R** ignores that character and the remainder of the line. The first non-comment line contains names for the columns. In that case we must include the optional argument `header=TRUE` as follows:

    Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data", header=TRUE)

The first column in this file is not really data, but just gives the name of each city in the data set. These can be used as row names:

    Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data", header=TRUE, row.names=1)
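The effect of `header=TRUE` and `row.names=1` can be sketched with a small temporary file laid out like the description above. The city names, the header label `City`, and all numbers are placeholders, not the course data.

```r
# toy file; the city names and numbers are placeholders, not the course data
Fname = tempfile(fileext = ".data")
writeLines(c("# lines starting with # are comments",
             "City JanTemp Lat Long",
             "CityA 44 31.2 88.5",
             "CityB 38 32.9 86.8",
             "CityC 35 33.6 112.5"), Fname)

Toy = read.table(Fname, header = TRUE, row.names = 1)
print(dim(Toy))             # the name column is not counted as a variable
print(rownames(Toy))
print(sapply(Toy, class))   # each column has its own type
```

The comment line is skipped, the first remaining line supplies the column names, and the first column becomes the row names instead of a variable.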

- The value returned by *read.table()* is a *data.frame*. This type of object can be thought of as an enhanced matrix. It has a *dimension* just like a matrix, the value of which is a vector containing the number of rows and number of columns. However, a data frame is intended to represent a data set in which each row contains the values of the variables obtained for one subject in the sample and each column contains the observations of one variable being measured. In the case of the Temperature data, these variables are: *JanTemp, Lat, Long*

Unlike a matrix, a data frame can have different types of variables, but each variable (column) must contain the same type.

Individual variables in a data frame can be accessed in several ways.

- Using `$`:

    Latitude = Temp$Lat

- Name:

    Latitude = Temp[["Lat"]]

- Number:

    Latitude = Temp[[2]]

In each case `Latitude` is a vector. If you want to extract a subset of the variables with all rows included, then use single brackets, `[]`. The result is a data frame. If the original data frame has names, these are carried over to the new data frame. If you only want some of the rows, then specify these the way it is done with matrices:

    LatLong = Temp[2:3]                      #extract variables 2 through 3
    LatLong = Temp[c("Lat","Long")]          #extract Lat and Long
    LatLong1 = Temp[1:20,c("Lat","Long")]    #extract first 20 rows for Lat and Long

Although it may seem like more work to use names, the advantage is that one does not need to know the index of the desired column, just its name.

Additional variables can be added to a data frame as follows.

    #create new variable named Region with same length as other variables in Temp
    Region = rep("NE",dim(Temp)[1])
    # NE is defined to be Lat >= 39.75 and Long < 90
    # SE is defined to be Lat < 39.75 and Long < 90
    # SW is defined to be Lat < 39.75 and Long >= 90
    # NW is defined to be Lat >= 39.75 and Long >= 90
    Region[Temp$Lat < 39.75 & Temp$Long < 90] = "SE"
    Region[Temp$Lat < 39.75 & Temp$Long >= 90] = "SW"
    Region[Temp$Lat >= 39.75 & Temp$Long >= 90] = "NW"
    #give Region the same row names as Temp
    names(Region) = dimnames(Temp)[[1]]
    #make Region a factor
    Region = factor(Region)
    #add Region to Temp
    Temp1 = data.frame(Temp,Region)
    #plot January Temperature vs Region
    #since Region is a factor, this gives a boxplot
    plot(JanTemp ~ Region,data=Temp1)
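The access patterns and the add-a-variable step above can also be tried on a data frame that ships with **R**. This sketch uses the built-in `cars` data frame; the variable `Fast` and the cutoff of 15 mph are arbitrary choices made for illustration.

```r
data(cars)                     # built-in data frame with variables speed, dist

s1 = cars$speed                # access by $
s2 = cars[["speed"]]           # access by name
s3 = cars[[1]]                 # access by position
print(identical(s1, s2) && identical(s2, s3))

sub1 = cars["speed"]           # single brackets keep the data frame class
sub2 = cars[1:20, c("speed", "dist")]
print(class(sub1))
print(dim(sub2))

# add a derived factor variable, in the same way Region was added to Temp
Fast = rep("slow", dim(cars)[1])
Fast[cars$speed >= 15] = "fast"
cars1 = data.frame(cars, Fast = factor(Fast))
print(table(cars1$Fast))
```

The three extraction forms return the same vector, while single brackets return a data frame even when only one variable is selected.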

**Accessing data in a spreadsheet**. If a table of data is contained in a spreadsheet like *Excel*, then the easiest way to import it into **R** is to save the table as a *comma-separated-values* file. Then use `read.table()` to read the file with separator argument `sep=","`. The file

http://www.utdallas.edu/~ammann/SmokeCancer.csv

can be read into **R** by

    Smoke = read.table("http://www.utdallas.edu/~ammann/SmokeCancer.csv", header=TRUE,sep=",",row.names=1)

Note that 2 of the entries in this table are `NA`. These denote types of cancer that were not reported in that state during the time period covered by the data. We can change those entries to 0 as follows.

    Smoke[is.na(Smoke)] = 0
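The replacement works because `is.na()` applied to a data frame returns a logical matrix that can be used as an index. A minimal sketch on a toy data frame follows; the row names, column names, and values are placeholders and not the contents of the file above.

```r
# toy data frame with two missing entries (all values are placeholders)
Toy = data.frame(Lung = c(12, NA, 9), Bladder = c(3, 4, NA),
                 row.names = c("S1", "S2", "S3"))

print(sum(is.na(Toy)))                     # number of missing entries
print(which(is.na(Toy), arr.ind = TRUE))   # their row and column positions

Toy[is.na(Toy)] = 0                        # logical matrix used as an index
print(Toy)
```

Counting and locating the missing entries first is a cheap check that the replacement touches only the cells that were expected to be missing.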

There is a companion function, `write.table()`, that can be used to write a matrix or data frame to a file that then can be imported into a spreadsheet.

**Saving graphics**. By default **R** uses a separate graphical window for the display of graphic commands. A graphic can be saved to a file using any of several different graphical file types. The most commonly used are *pdf()* and *png()* since these types can be imported into documents created by **Word** or LaTeX. The first argument for these functions is the file name. Arguments `width=,height=` give the dimensions of the graphic. For *pdf()* the dimension units are inches; for *png()* the units are pixels. *pdf()* supports multi-page graphics, but *png()* allows only one page per file. To save several plots with *png()*, use a file name of the form `Myplot%d.png`; each plot is then written to its own numbered file. For example,

    pdf("TempPlot.pdf",width=6,height=6)
    plot(JanTemp ~ Lat,data=Temp)
    plot(JanTemp ~ Region,data=Temp1)
    graphics.off()
    #creates a 2-page pdf document

    png("TempPlot%d.png",width=480,height=480)
    plot(JanTemp ~ Lat,data=Temp)
    plot(JanTemp ~ Region,data=Temp1)
    graphics.off()
    #creates two files: TempPlot1.png and TempPlot2.png

The function *graphics.off()* writes any closing material required by the graphic file type and then closes the graphics file.
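A version of the pdf example that does not depend on the Temperature data can be sketched with the built-in `cars` data frame. The temporary file path is used only so the example leaves nothing behind, and `dev.off()` is used here as an alternative that closes only the current device, whereas `graphics.off()` closes all open devices.

```r
data(cars)
pdfFile = tempfile(fileext = ".pdf")        # temporary path, for illustration only

pdf(pdfFile, width = 6, height = 6)         # dimensions in inches
plot(dist ~ speed, data = cars)             # page 1
plot(log(dist) ~ log(speed), data = cars)   # page 2 of the same pdf
dev.off()                                   # closes just this device

print(file.exists(pdfFile))
print(file.info(pdfFile)$size > 0)
```

Nothing is written to the file in final form until the device is closed, so forgetting the closing call is a common reason for an empty or unreadable graphics file.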

**Exercise**. Work the exercises given in Sections 2.4 and 3.4 of the *Owen-TheRGuide*
document referenced above.

There are a number of datasets included in the **R** distribution along with examples of their use
in the help pages. One example is given below.

    # load cars data frame
    data(cars)

    # plot braking distance vs speed with custom x-labels and y-labels,
    # and axis numbers horizontal
    plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1)
    # add lowess line (smoothed trend line)
    lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
    # add plot title
    title(main = "cars data")

    # new plot of same variables on a log scale for both axes
    plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1, log = "xy")
    # add plot title
    title(main = "cars data (logarithmic scales)")
    # add lowess line (smoothed trend line) for the log scale
    lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")

    # fit a regression model using log(speed) to predict log(dist) and
    # print a summary of the fit
    # (the assignment must be a separate statement: inside a function call,
    # fm1 = ... would be read as a named argument, not as an assignment)
    fm1 = lm(log(dist) ~ log(speed), data = cars)
    summary(fm1)

    # save the current plotting parameters and then setup a new plot
    # region that puts 4 plots on the same page, 2 rows and 2 columns.
    # use custom margins for the plot region.
    opar = par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0), mar = c(4.1, 4.1, 2.1, 1.1))
    # plot the diagnostic residual plots associated with a regression fit.
    plot(fm1)
    # restore the original plotting parameters
    par(opar)
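As a possible follow-up to the fit above, the fitted object can also be queried directly instead of only printed. This is a sketch rather than part of the original example; the speed of 15 mph is an arbitrary value chosen for illustration.

```r
data(cars)
fm1 = lm(log(dist) ~ log(speed), data = cars)

b = coef(fm1)    # named vector: intercept and slope on the log scale
print(b)

# predicted stopping distance (ft) at 15 mph, back-transformed from the log scale
pred = exp(predict(fm1, newdata = data.frame(speed = 15)))
print(pred)
```

Because the model is fit on the log scale, `predict()` returns a log distance, which is why the result is passed through `exp()`.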

2014-09-16