The following links provide an excellent introduction to the use of R:
https://cran.r-project.org/doc/contrib/Robinson-icebreaker.pdf (somewhat more advanced)
Other contributed books about R can be found on the CRAN site. Use the Contributed link under Documentation on the left side of the CRAN web page. Additional notes are provided below.
The S language was developed at Bell Labs as a high-level computer
language for statistical computations and graphics. It has some similarities with
Matlab, but has some structures that Matlab does not have,
such as data frames, that are natural data structures for statistical
models. There are two implementations of this language currently available:
a commercial product, S-Plus, and a freely available open-source
product, R. R is available from the CRAN site (https://cran.r-project.org).
These implementations are mostly, but not completely, compatible.
Note: in the examples below the R prompt, >, is included, but it is not
typed on the command line. It is used here to differentiate between input to R and output that
is returned to the console after a command is entered.
On Linux or Macs, R can be run from a shell by entering

R

at a shell prompt. The R session is ended by entering

> q()

This will generate a query from R asking whether to save the workspace image. Enter n.
On Windows and Macs, R is packaged as a windowed application that starts with a command window. RStudio is also a windowed application; it includes a window for entering commands, a window that describes the properties of objects that have been created during the session, and a window for graphics.
R's Workspace. The Workspace contains all the objects created or loaded during an R session. These objects only exist in the computer's memory, not on the physical hard drive and will disappear when R is exited. R offers a choice to the user when exiting: save the workspace or do not save it. If the Workspace is not saved, all objects created during the session will be lost. That's no problem if you are using it only as a mathematical or statistical calculator. If you are performing an analysis, but must exit before completing it, then you don't want to lose what you have already done. There is an alternative that I recommend instead of saving the workspace: write the commands you wish to enter into a text file and then copy/paste from the edit window into the R console. Even though this may seem like extra work, it has three advantages:
RStudio has an extensive set of resources to help users. Go to the Help tab in the right-hand window and click on An Introduction to R under Manuals. See sections 2.1-2.7 for details about the following.
Values are assigned to objects with either = or the two characters <-. The second assignment operator is older, but = is used more commonly now since it is just a single character. When an assignment is made, its value is not echoed to the terminal. Lines with no assignment do result in the value of the expression being echoed to the terminal.
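A small sketch of the echo behavior described above. The object names a, b, and total are made up for this illustration.

```r
# assignment with = ; nothing is echoed
a = 3 * 4
# assignment with <- ; equivalent to = at the top level
b <- 3 * 4
# no assignment, so the value of the expression is printed
a + b
# wrapping an assignment in parentheses assigns and also prints the value
(total = a + b)
```

The parenthesized form is convenient when checking intermediate results while entering commands interactively.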
> x = 2:20
> y = 15:1

More general sequences can be generated with the seq() function. These operations produce vectors. Some examples:

> seq(5)
[1] 1 2 3 4 5
> x = seq(2,20,length=5)
> x
[1]  2.0  6.5 11.0 15.5 20.0
> y = seq(5,18,by=3)
> y
[1]  5  8 11 14 17

The function c can be used to combine different vectors into a single vector.

> c(x,y)
 [1]  2.0  6.5 11.0 15.5 20.0  5.0  8.0 11.0 14.0 17.0

All vectors have an attribute named length which can be obtained by the function length()

> length(c(x,y))
[1] 10

A scalar is just a vector of length 1.
> s = sum(x)
> paste("Sum of x =",s)
[1] "Sum of x = 55"
> paste(x,y,sep=",")
[1] "2,5"     "6.5,8"   "11,11"   "15.5,14" "20,17"
> paste("X",seq(length(x)),sep="")
[1] "X1" "X2" "X3" "X4" "X5"
> names(x) = paste("X",seq(x),sep="")
> x
  X1   X2   X3   X4   X5
 2.0  6.5 11.0 15.5 20.0

Elements of a vector are referenced by the function []. Arguments can be a vector of indices that refer to specific positions within the vector:

> x[2:4]
  X2   X3   X4
 6.5 11.0 15.5
> x[c(2,5)]
  X2   X5
 6.5 20.0

Elements also can be referenced by their names or by a logical vector in addition to their index:

> x[c("X3","X4")]
  X3   X4
11.0 15.5
> xl = x > 10
> xl
   X1    X2    X3    X4    X5
FALSE FALSE  TRUE  TRUE  TRUE
> x[xl]
  X3   X4   X5
11.0 15.5 20.0

The length of the referencing vector can be larger than the length of the vector that is being referenced as long as the referencing vector is either a vector of indices or names.

> ndx = rep(seq(x),2)
> ndx
 [1] 1 2 3 4 5 1 2 3 4 5
> x[ndx]
  X1   X2   X3   X4   X5   X1   X2   X3   X4   X5
 2.0  6.5 11.0 15.5 20.0  2.0  6.5 11.0 15.5 20.0

This is useful for table lookups. Suppose for example that Gender is a vector of elements that are either Male or Female:

> Gender
[1] "Male"   "Male"   "Female" "Male"   "Female"

and Gcol is a vector of two colors whose names are the two unique elements of Gender

> Gcol = c("blue","red")
> names(Gcol) = c("Male","Female")
> Gcol
  Male Female
"blue"  "red"
> GenderCol = Gcol[Gender]
> GenderCol
  Male   Male Female   Male Female
"blue" "blue"  "red" "blue"  "red"

This will be useful for plotting data.
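The same lookup idea can be sketched with any named vector. The codes and names below are hypothetical and are not part of the course data.

```r
# hypothetical data: short codes and a lookup table of full names
code = c("b", "a", "b", "c", "a")
fullname = c(a = "apple", b = "banana", c = "cherry")
# index the lookup table by name; the result has one entry per element of code
fruit = fullname[code]
fruit
# unname() drops the names if only the values are wanted
unname(fruit)
```

Because the lookup is driven by names, the order of the entries in the lookup table does not matter.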
X = 1:10

assigns to the object X the vector consisting of the integers 1 to 10.

M = matrix(X,nrow=5)

puts the entries of X into a matrix named M that has 5 rows and 2 columns. The first column of M contains the first 5 elements of X and the second column of M contains the remaining 5 elements. If a vector does not fit exactly into the dimensions of the matrix, then a warning is returned.

> M
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

The dimensions of a matrix are obtained by the function dim() which returns the number of rows and number of columns as a vector of length 2.

> dim(M)
[1] 5 2
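The following sketch illustrates how matrix() fills its entries and what happens when the vector does not fit. The byrow argument and the use of tryCatch() to capture the warning are additions for illustration; they are not used elsewhere in these notes.

```r
X = 1:10
# default: entries are filled column by column
M = matrix(X, nrow = 5)
# byrow=TRUE fills row by row instead
Mr = matrix(X, nrow = 5, byrow = TRUE)
Mr[1, ]
# 10 values do not fit evenly into 3 rows, so R issues a warning;
# tryCatch is used here only to capture the warning message as text
msg = tryCatch(matrix(X, nrow = 3), warning = function(w) conditionMessage(w))
msg
```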
Elements of a matrix are referenced with [] in the same way as elements of a vector, but with the number of arguments equal to the number of dimensions. A matrix has two dimensions, so M[2,1] refers to the element in row 2 and column 1.
> X = matrix(runif(100),nrow=20)
> X[2:5,2:4]
            [,1]      [,2]      [,3]
[1,] 0.731622617 0.6578677 0.7446229
[2,] 0.023472598 0.2111300 0.7775343
[3,] 0.001858455 0.2887734 0.8103568
[4,] 0.269611100 0.7527248 0.2127048

Note: the function runif(n) returns a vector of n random numbers between 0 and 1. Each time runif() is called it returns a new set of values, so running the commands above again will give different numbers.
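If repeatable values are needed, the random number generator can be seeded with set.seed(). This is a sketch only; set.seed() is not used in the notes above, and the seed value is arbitrary.

```r
set.seed(3355)          # arbitrary seed value chosen for this illustration
u1 = runif(3)
u2 = runif(3)           # a new call gives new values
set.seed(3355)          # resetting the seed restarts the sequence
u3 = runif(3)
identical(u1, u3)       # same seed, same values
```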
Sunspots = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/sunspots.txt",header=TRUE)

Note that the filename argument in this case is a web address. The argument also can be the name of a file on your computer. The second argument indicates that the first row of this file contains names for the columns. These are accessed by

names(Sunspots)

Suppose we wish to plot sunspot numbers versus year. There are several ways to accomplish this.

plot(Sunspots[,1],Sunspots[,2])
plot(Sunspots[,1],Sunspots[,2], type="l")
plot(Number ~ Year, data=Sunspots, type="l")

The last method uses what is referred to as the formula interface for the plot function. Now let's add a title to make the plot more informative.

title("Yearly mean total sunspot numbers")

To be more informative, add the range of years contained in this data set.

title("Yearly mean total sunspot numbers, 1700-2016")

The title can be split into two lines as follows

title("Yearly mean total sunspot numbers\n1700-2016")

using the newline character \n. Note that this requires that we already know the range of years contained in the data. Alternatively, we could obtain that range from the data. That would make our command file more general. The following file contains these commands:
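As a rough sketch of obtaining the range from the data (this is not the contents of the author's file), the title can be built with range() and paste(). The data frame Toy below is a made-up stand-in for Sunspots so that the example runs without a network connection.

```r
# stand-in for the Sunspots data frame; these values are made up
Toy = data.frame(Year = 1990:1995, Number = c(10, 20, 15, 30, 25, 5))
yr = range(Toy$Year)                      # smallest and largest year
main.title = paste("Yearly mean total sunspot numbers\n",
                   yr[1], "-", yr[2], sep = "")
main.title
# the title can then be added to a plot:
# plot(Number ~ Year, data = Toy, type = "l"); title(main.title)
```

With this approach the same commands keep working when the data file is updated with additional years.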
> X = seq(20)/2
> Y = 2+6*X + rnorm(length(X),0,.5)
> Z = matrix(runif(9),3,3)
> All.data = list(Var1=X,Var2=Y,Zmat=Z)
> names(All.data)
[1] "Var1" "Var2" "Zmat"
> All.data$Var1[1:5]
[1] 0.5 1.0 1.5 2.0 2.5
> data(state)
> state.x77["Texas",]
Population     Income Illiteracy   Life Exp     Murder    HS Grad
   12237.0     4188.0        2.2       70.9       12.2       47.4
     Frost       Area
      35.0   262134.0
> StateData = state.x77
> dimnames(StateData)[[1]] = state.abb
> txill = state.x77["Texas","Illiteracy"]
> highIll = state.x77[,"Illiteracy"] > txill
> state.name[highIll]
[1] "Louisiana"      "Mississippi"    "South Carolina"
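A sketch of the same kind of selection using the row names of state.x77 directly. The object names ill and worst are made up for this illustration; state.x77 and state.name come from the datasets package that ships with R.

```r
ill = state.x77[, "Illiteracy"]            # named numeric vector, one entry per state
highIll = ill > state.x77["Texas", "Illiteracy"]
sum(highIll)                               # number of states above Texas
names(ill)[highIll]                        # same states, taken from the row names
worst = names(which.max(ill))              # state with the largest value
worst
```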
F = D*E

results in a matrix F with the same dimensions as D and E whose elements are the elementwise products F[i,j] = D[i,j]*E[i,j].
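A small sketch contrasting elementwise multiplication with the matrix product operator %*%. The matrices D and E below are made-up 2 by 2 examples.

```r
D = matrix(1:4, nrow = 2)             # columns (1,2) and (3,4)
E = matrix(c(2, 0, 1, 3), nrow = 2)   # columns (2,0) and (1,3)
F = D * E                             # elementwise: F[i,j] = D[i,j]*E[i,j]
P = D %*% E                           # matrix product
F
P
```

The two results differ in general, so it is worth checking which operation is intended whenever two matrices are multiplied.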
> y = matrix(rnorm(20),ncol=2,dimnames=list(NULL,c("Y1","Y2")))
> x = rep(paste("A",1:2,sep=""),5)
> z = runif(10) > .5
> SAMP = data.frame(y,x,z,stringsAsFactors=TRUE)
> SAMP
           Y1         Y2  x     z
1   0.2402750  1.3561348 A1 FALSE
2   0.3669875 -1.4239780 A2 FALSE
3  -1.5042563  1.2929657 A1  TRUE
4   1.2329026  0.3838835 A2  TRUE
5  -0.1241536 -0.5596217 A1  TRUE
6  -0.1784147  1.2920853 A2 FALSE
7  -1.2848231  1.7107087 A1  TRUE
8   0.7731956  0.6520663 A2 FALSE
9  -0.3515564  0.3169168 A1  TRUE
10 -1.3513955  1.3663698 A2  TRUE

Note that the rows and columns have names, referred to as dimnames. Arrays and data frames can be addressed through their names in addition to their position. Also note that variable x is a character vector, but with stringsAsFactors=TRUE the data.frame function coerces that component to be a factor (this was the default behavior before R 4.0.0):

> is.factor(x)
[1] FALSE
> is.factor(SAMP$x)
[1] TRUE
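Whether data.frame() converts character columns to factors depends on its stringsAsFactors argument, whose default changed from TRUE to FALSE in R 4.0.0. The sketch below sets the argument explicitly so that the result does not depend on the R version; the names d.chr and d.fac are made up for this illustration.

```r
x = rep(c("A1", "A2"), 5)
d.chr = data.frame(x, stringsAsFactors = FALSE)   # keep character strings
d.fac = data.frame(x, stringsAsFactors = TRUE)    # convert to a factor
is.factor(d.chr$x)
is.factor(d.fac$x)
levels(d.fac$x)
```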
> plot(SAMP$Y1,SAMP$Y2)
> plot(SAMP$x,SAMP$Y2)

A better way to produce these plots is to use the formula interface along with the data= argument if the variables are contained within a data frame.

> plot(Y2 ~ Y1, data=SAMP)
> plot(Y2 ~ x, data=SAMP)
x = scan("c:/StatData/Data1.txt")

The file name argument also can be a web address.
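A self-contained sketch of scan() that does not depend on a particular file being present. The temporary file and its contents are made up for this illustration.

```r
tmp = tempfile(fileext = ".txt")           # temporary file used only for this sketch
writeLines(c("1.5 2.5 3.5", "4.5 5.5"), tmp)
x = scan(tmp)                              # reads all numbers, across lines
x
unlink(tmp)                                # remove the temporary file
```

By default scan() expects numeric values separated by white space and returns them as a single vector, regardless of how they are arranged on lines.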
Data Frames and read.table(). Tabular data contained in a file can
be read by R using the read.table() function. Each column in the
table is treated as a separate variable and variables can be numeric, logical,
or character (strings). That is, different columns can be of different types, but
all entries within a column must be of the same type. An example of such a file is
Note that the first few lines begin with the character #. This is the comment character. R ignores that character and the remainder of the line. The first non-comment line contains names for the columns. In that case we must include the optional argument header=TRUE as follows:
Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data", header=TRUE)

The first column in this file is not really data, but just gives the name of each city in the data set. These can be used as row names:

Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data", header=TRUE, row.names=1)
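The effect of comment lines, header=TRUE, and row.names=1 can be sketched without a network connection by passing the file contents through the text argument of read.table(). The lines below are made up and only imitate the layout of the Temperature file.

```r
# made-up file contents standing in for a real data file
lines = c("# comment lines are skipped",
          "City JanTemp Lat",
          "Alpha 30 40.1",
          "Beta 55 29.8")
Toy = read.table(text = lines, header = TRUE, row.names = 1)
Toy
dim(Toy)          # the City column became row names, so 2 columns remain
```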
The value returned by read.table() is a data.frame. This type of object can
be thought of as an enhanced matrix. It has a dimension just like a matrix, the value of
which is a vector containing the number of rows and number of columns. However, a data frame is
intended to represent a data set in which each row is the set of variables obtained for each subject
in the sample and each column contains the observations for each variable being measured. In the
case of the Temperature data, these variables are:
JanTemp, Lat, Long
Unlike a matrix, a data frame can have different types of variables, but each variable (column) must contain the same type.
Individual variables in a data frame can be accessed several ways.
Latitude = Temp$Lat
Latitude = Temp[["Lat"]]
Latitude = Temp[[2]]

LatLong = Temp[2:3] #extract variables 2 through 3
LatLong = Temp[c("Lat","Long")] #extract Lat and Long
LatLong1 = Temp[1:20,c("Lat","Long")] #extract first 20 rows for Lat and Long

Although it may seem like more work to use names, the advantage is that one does not need to know the index of the desired column, just its name.
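The equivalence of these access methods can be checked on the cars data frame that is included with R. The object names s1, s2, s3, and sub are made up for this sketch.

```r
s1 = cars$speed           # by name with $
s2 = cars[["speed"]]      # by name with [[ ]]
s3 = cars[[1]]            # by position
identical(s1, s2) && identical(s2, s3)
# rows 1 to 5 of two variables, selected by name in a chosen order
sub = cars[1:5, c("dist", "speed")]
sub
```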
Additional variables can be added to a data frame as follows.
#create new variable named Region with same length as other variables in Temp
Region = rep("NE",dim(Temp)[1])
# NE is defined to be Lat >= 39.75 and Long < 90
# SE is defined to be Lat < 39.75 and Long < 90
# SW is defined to be Lat < 39.75 and Long >= 90
# NW is defined to be Lat >= 39.75 and Long >= 90
Region[Temp$Lat < 39.75 & Temp$Long < 90] = "SE"
Region[Temp$Lat < 39.75 & Temp$Long >= 90] = "SW"
Region[Temp$Lat >= 39.75 & Temp$Long >= 90] = "NW"
#give Region the same row names as Temp
names(Region) = dimnames(Temp)[[1]]
#make Region a factor
Region = factor(Region)
#add Region to Temp
Temp1 = data.frame(Temp,Region)
#plot January Temperature vs Region
#since Region is a factor, this gives a boxplot
plot(JanTemp ~ Region,data=Temp1)
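An alternative sketch of the same region coding, using ifelse() and assigning the new variable directly with $. The data frame Toy and its coordinates are made up; only the cutoffs 39.75 and 90 come from the comments above.

```r
# hypothetical coordinates, not the Temperature data
Toy = data.frame(Lat = c(45, 30, 33, 47), Long = c(70, 81, 112, 122),
                 row.names = c("A", "B", "C", "D"))
NS = ifelse(Toy$Lat >= 39.75, "N", "S")    # north or south of the cutoff
EW = ifelse(Toy$Long < 90, "E", "W")       # east or west of the cutoff
# paste the two codes together and add the result as a factor
Toy$Region = factor(paste(NS, EW, sep = ""))
Toy
```

Assigning with Toy$Region adds the column in place, so no separate call to data.frame() is needed.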
sep=",". The file
Smoke = read.table("http://www.utdallas.edu/~ammann/SmokeCancer.csv", header=TRUE,sep=",",row.names=1)

Note that 2 of the entries in this table are NA. These denote types of cancer that were not reported in that state during the time period covered by the data. We can change those entries to 0 as follows.

Smoke[is.na(Smoke)] = 0
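The replacement works because is.na() applied to a data frame returns a logical matrix of the same shape, which can then be used as an index. A sketch with a made-up data frame:

```r
Toy = data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
is.na(Toy)               # logical matrix marking the missing entries
sum(is.na(Toy))          # number of missing entries before replacement
Toy[is.na(Toy)] = 0      # replace every NA by 0
Toy
```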
There is a companion function, write.table(), that can be used to write a matrix or data frame to a file that then can be imported into a spreadsheet.
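A minimal sketch of write.table(), writing a few rows of the built-in cars data frame to a temporary comma-separated file and reading them back. The file name and the choice of arguments (sep, row.names, quote) are assumptions made for this illustration.

```r
out = tempfile(fileext = ".csv")     # temporary file for illustration
write.table(cars[1:3, ], file = out, sep = ",", row.names = FALSE, quote = FALSE)
readLines(out)                       # the raw text that was written
back = read.table(out, header = TRUE, sep = ",")
back
unlink(out)                          # remove the temporary file
```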
pdf("TempPlot.pdf",width=6,height=6)
plot(JanTemp ~ Lat,data=Temp)
plot(JanTemp ~ Region,data=Temp1)
graphics.off()
#creates a 2-page pdf document

png("TempPlot%d.png",width=480,height=480)
plot(JanTemp ~ Lat,data=Temp)
plot(JanTemp ~ Region,data=Temp1)
graphics.off()
#creates two files: TempPlot1.png and TempPlot2.png

The function graphics.off() writes any closing material required by the graphic file type and then closes the graphics file.
There are a number of datasets included in the R distribution along with examples of their use in the help pages. One example is given below.
# load cars data frame
data(cars)
# plot braking distance vs speed with custom x-labels and y-labels,
# and axis numbers horizontal
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1)
# add plot title
title(main = "Cars data")
# new plot of same variables on a log scale for both axes
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1, log = "xy")
# add plot title
title(main = "Cars data (logarithmic scales)")
# fit a regression model using log(speed) to predict log(dist) and
# print a summary of the fit
# (the fit is assigned first: inside a function call, fm1 = ... would be
# read as an argument name rather than an assignment)
fm1 = lm(log(dist) ~ log(speed), data = cars)
summary(fm1)
# save the current plotting parameters and then setup a new plot
# region that puts 4 plots on the same page, 2 rows and 2 columns.
# use custom margins for the plot region.
opar = par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0), mar = c(4.1, 4.1, 2.1, 1.1))
# plot the diagnostic residual plots associated with a regression fit.
plot(fm1)
# restore the original plotting parameters
par(opar)
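Once the model has been fit, its pieces can be extracted for further use. The sketch below uses coef() and predict(); the speed of 15 mph is a hypothetical value chosen for illustration, and the names b and pred are made up.

```r
fm1 = lm(log(dist) ~ log(speed), data = cars)
b = coef(fm1)                 # named vector: intercept and slope
b
# prediction of log(dist) at a hypothetical speed of 15 mph,
# transformed back to the original scale with exp()
pred = exp(predict(fm1, newdata = data.frame(speed = 15)))
pred
```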