The following links provide introductions to the use of R:
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
http://cran.r-project.org/doc/contrib/usingR.pdf
Other contributed books about R can be found on the CRAN site. Use the
Contributed link under Documentation on the left side of the CRAN web page.
Additional notes are provided below.
The S language was developed at Bell Labs as a high-level computer
language for statistical computations and graphics. It has some similarities with
Matlab, but has some structures that Matlab does not have,
such as data frames, that are natural data structures for statistical
models. There are two implementations of this language currently available:
a commercial product, S-Plus, and a freely available open-source
product, R. R is available at
http://cran.r-project.org
These implementations are mostly, but not completely, compatible.
Note: in the examples below, the R prompt, >, is included, but this
would not be typed on the command line. Lines that do not begin with this
prompt represent what is returned by R.
Values are assigned to objects with the = operator.
The value of an assignment is not automatically echoed to the terminal.
Lines with no assignment do result in the value of the expression
being echoed to the terminal.
> x = 2:20
> y = 15:1

More general sequences can be generated with the seq() function. These operations produce vectors. Some examples:

> seq(5)
[1] 1 2 3 4 5
> x = seq(2,20,length=5)
> x
[1]  2.0  6.5 11.0 15.5 20.0
> y = seq(5,18,by=3)
> y
[1]  5  8 11 14 17

The function c() can be used to combine different vectors into a single vector.

> c(x,y)
 [1]  2.0  6.5 11.0 15.5 20.0  5.0  8.0 11.0 14.0 17.0

All vectors have a length, which can be obtained by the function length().

> length(c(x,y))
[1] 10
> s = sum(x)
> paste("Sum of x =",s)
[1] "Sum of x = 55"
> paste(x,y,sep=",")
[1] "2,5"     "6.5,8"   "11,11"   "15.5,14" "20,17"
> paste("X",seq(x),sep="")
[1] "X1" "X2" "X3" "X4" "X5"

Note that the last example uses the expression seq(x). If the argument to seq() is a vector of length greater than 1, then this expression is equivalent to seq(length(x)).
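The equivalence between seq(x) and seq(length(x)) does not hold when x has length 1, because seq() then treats its argument as an upper limit. The following sketch shows the difference and uses the base R function seq_along() as a safer alternative; the object names are illustrative only.

```r
# seq() applied to a vector of length > 1 returns its index positions
x = c(2.0, 6.5, 11.0, 15.5, 20.0)
idx_long = seq(x)             # 1 2 3 4 5

# seq() applied to a single number counts from 1 up to that number
w = 4
idx_short = seq(w)            # 1 2 3 4, not the single index 1

# seq_along() always returns the index positions of its argument
idx_safe = seq_along(w)       # 1
```

For this reason seq_along(x) is often preferred over seq(x) inside functions, where the length of x may not be known in advance.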
> names(x) = paste("X",seq(x),sep="")
> x
  X1   X2   X3   X4   X5
 2.0  6.5 11.0 15.5 20.0

Elements of a vector are referenced by the function []. Arguments can be a vector of indices that refer to specific positions within the vector:

> x[2:4]
  X2   X3   X4
 6.5 11.0 15.5
> x[c(2,5)]
  X2   X5
 6.5 20.0

Elements can be referenced by their names or by a logical vector:

> x[c("X3","X4")]
  X3   X4
11.0 15.5
> xl = x > 10
> xl
   X1    X2    X3    X4    X5
FALSE FALSE  TRUE  TRUE  TRUE
> x[xl]
  X3   X4   X5
11.0 15.5 20.0

The length of the referencing vector can be larger than the length of the vector that is being referenced as long as the referencing vector is either a vector of indices or names.
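Logical referencing can be extended in a few ways that are not covered in the examples above. The sketch below recreates the named vector x and shows conditions combined with &, the base R function which(), and counting with sum(); the object names mid, pos and n_big are illustrative only.

```r
x = c(X1 = 2.0, X2 = 6.5, X3 = 11.0, X4 = 15.5, X5 = 20.0)

# combine two conditions elementwise with &
mid = x[x > 5 & x < 16]       # keeps X2, X3, X4

# which() converts a logical vector into the positions that are TRUE
pos = which(x > 10)           # positions 3, 4, 5

# sum() of a logical vector counts the TRUE entries
n_big = sum(x > 10)           # 3
```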
> ndx = rep(seq(x),2)
> ndx
 [1] 1 2 3 4 5 1 2 3 4 5
> x[ndx]
  X1   X2   X3   X4   X5   X1   X2   X3   X4   X5
 2.0  6.5 11.0 15.5 20.0  2.0  6.5 11.0 15.5 20.0

This is useful for table lookups. Suppose for example that Gender is a vector of elements that are either Male or Female:

> Gender
[1] "Male"   "Male"   "Female" "Male"   "Female"

and Gcol is a vector of two colors whose names are the two unique elements of Gender:

> Gcol = c("blue","red")
> names(Gcol) = c("Male","Female")
> Gcol
  Male Female
"blue"  "red"
> GenderCol = Gcol[Gender]
> GenderCol
  Male   Male Female   Male Female
"blue" "blue"  "red" "blue"  "red"

This will be useful for plotting data.
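Two details of this kind of lookup are worth checking before relying on it. The sketch below, which recreates Gender and Gcol with the values shown above, drops the carried-over names with unname() and shows what happens for a key that is not in the lookup table; the key "Unknown" and the object unknownCol are hypothetical.

```r
Gender = c("Male", "Male", "Female", "Male", "Female")
Gcol = c(Male = "blue", Female = "red")

# look up one color per observation; unname() drops the carried-over names
GenderCol = unname(Gcol[Gender])

# a key that is absent from the lookup table yields NA rather than an error
unknownCol = Gcol["Unknown"]
is.na(unknownCol)
```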
X = 1:10

assigns to the object X the vector consisting of the integers 1 to 10.

M = matrix(X,nrow=5)

puts the entries of X into a matrix named M that has 5 rows and 2 columns. The first column of M contains the first 5 elements of X and the second column of M contains the remaining 5 elements. If a vector does not fit exactly into the dimensions of the matrix, then a warning is returned.

> M
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

The dimensions of a matrix are obtained by the function dim(), which returns the number of rows and number of columns in a vector of length 2.

> dim(M)
[1] 5 2
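The column-by-column fill order and the warning mentioned above can be checked directly. The following sketch uses the byrow argument of matrix() to fill by rows instead, and captures the warning with tryCatch() so that its message can be inspected; the object names Mrow and msg are illustrative only.

```r
X = 1:10

# default: fill column by column
M = matrix(X, nrow = 5)

# byrow = TRUE fills row by row instead
Mrow = matrix(X, nrow = 5, byrow = TRUE)
Mrow[1, ]                     # 1 2

# 10 values do not fit evenly into 3 rows, so matrix() issues a warning;
# tryCatch() returns the text of that warning
msg = tryCatch(matrix(X, nrow = 3), warning = function(w) conditionMessage(w))
```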
Elements of a matrix are referenced by [], as with vectors,
but with the number of arguments equal to the number of dimensions.
A matrix has two dimensions, so M[2,1] refers to the element
in row 2 and column 1.
> X = matrix(runif(100),nrow=20)
> X[2:5,2:4]
            [,1]      [,2]      [,3]
[1,] 0.731622617 0.6578677 0.7446229
[2,] 0.023472598 0.2111300 0.7775343
[3,] 0.001858455 0.2887734 0.8103568
[4,] 0.269611100 0.7527248 0.2127048

Note: the function runif(n) returns a vector of n random numbers between 0 and 1. If one of the arguments to [,] is empty, then all elements of that dimension are returned. So X[2:4,] gives all columns of rows 2,3,4 and so is a matrix with 3 rows and the same number of columns as X.
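One case that the description above does not cover is the selection of a single row or column, where R by default returns a plain vector rather than a matrix. The sketch below illustrates this and the drop = FALSE argument that keeps the matrix structure; the object names are illustrative only.

```r
X = matrix(runif(100), nrow = 20)   # 20 rows, 5 columns

# an empty column argument returns all columns of the selected rows
rows = X[2:4, ]
dim(rows)                           # 3 5

# selecting a single row gives a plain vector by default
one_row = X[2, ]
is.matrix(one_row)                  # FALSE

# drop = FALSE keeps the result as a matrix with one row
one_row_m = X[2, , drop = FALSE]
dim(one_row_m)                      # 1 5
```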
> X = seq(20)/2
> Y = 2+6*X + rnorm(length(X),0,.5)
> Z = matrix(runif(9),3,3)
> All.data = list(Var1=X,Var2=Y,Zmat=Z)
> names(All.data)
[1] "Var1" "Var2" "Zmat"
> All.data$Var1[1:5]
[1] 0.5 1.0 1.5 2.0 2.5
> data(state)
> state.x77["Texas",]
Population     Income Illiteracy   Life Exp     Murder    HS Grad
   12237.0     4188.0        2.2       70.9       12.2       47.4
     Frost       Area
      35.0   262134.0
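Components of a list such as All.data can be extracted in more than one way, and the different forms do not return the same kind of object. The following sketch uses a smaller list with the same component names; the component n added at the end is hypothetical.

```r
All.data = list(Var1 = seq(20)/2, Zmat = matrix(1:9, 3, 3))

# $ and [[ ]] both extract the component itself
v1 = All.data$Var1
v2 = All.data[["Var1"]]
identical(v1, v2)                   # TRUE

# single brackets return a list of length 1 that contains the component
sub = All.data["Var1"]
class(sub)                          # "list"

# a new component can be added by assignment
All.data$n = length(All.data$Var1)
names(All.data)                     # "Var1" "Zmat" "n"
```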
> StateData = state.x77
> dimnames(StateData)[[1]] = state.abb

> txill = state.x77["Texas","Illiteracy"]
> highIll = state.x77[,"Illiteracy"] > txill
> state.name[highIll]
[1] "Louisiana"      "Mississippi"    "South Carolina"
> y = matrix(rnorm(20),ncol=2)
> colnames(y) = c("Y1","Y2")
> x = rep(paste("A",1:2,sep=""),5)
> z = runif(10) > .5
> SAMP = data.frame(y,x,z,stringsAsFactors=TRUE)
> SAMP
           Y1         Y2  x     z
1   0.2402750  1.3561348 A1 FALSE
2   0.3669875 -1.4239780 A2 FALSE
3  -1.5042563  1.2929657 A1  TRUE
4   1.2329026  0.3838835 A2  TRUE
5  -0.1241536 -0.5596217 A1  TRUE
6  -0.1784147  1.2920853 A2 FALSE
7  -1.2848231  1.7107087 A1  TRUE
8   0.7731956  0.6520663 A2 FALSE
9  -0.3515564  0.3169168 A1  TRUE
10 -1.3513955  1.3663698 A2  TRUE

Note that the rows and columns have names, referred to as dimnames. Arrays and data frames can be addressed through their names in addition to their position. Also note that variable x is a character vector, but the data.frame function coerces that component to be a factor. This coercion was automatic in versions of R before 4.0.0; in later versions it must be requested with the argument stringsAsFactors=TRUE, as is done here:

> is.factor(x)
[1] FALSE
> is.factor(SAMP$x)
[1] TRUE
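Whether data.frame() converts character vectors to factors depends on the version of R, since the default value of its stringsAsFactors argument changed from TRUE to FALSE in R 4.0.0. A minimal sketch that states the choice explicitly, so that the result does not depend on the version, is given below; the object names d1 and d2 are illustrative only.

```r
x = rep(paste("A", 1:2, sep = ""), 5)

# request the conversion explicitly
d1 = data.frame(x, stringsAsFactors = TRUE)
is.factor(d1$x)                     # TRUE
levels(d1$x)                        # "A1" "A2"

# without the conversion the column stays a character vector
d2 = data.frame(x, stringsAsFactors = FALSE)
is.character(d2$x)                  # TRUE
```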
> plot(SAMP$Y1,SAMP$Y2)
> plot(SAMP$x,SAMP$Y2)

If the argument to plot is the result of a linear model fit using lm(), then the plot function will produce a set of diagnostic residual plots.
x = scan("c:/StatData/Data1.txt")

The file name argument also can be a web address.
Data Frames and read.table(). Tabular data contained in a file can
be read by R using the read.table() function. Each column in the
table is treated as a separate variable and variables can be numeric, logical,
or character (strings). That is, different columns can have different types, but
all entries within a column must have the same type. An example of such a file is
http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data
Note that the first few lines begin with the character #. This is the
comment character. R ignores that character and the remainder of the line. The
first non-comment line contains names for the columns. In that case we must
include the optional argument header=TRUE as follows:
Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data", header=TRUE)

The first column in this file is not really data, but just gives the name of each city in the data set. These can be used as row names:
Temp = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/Temperature.data", header=TRUE, row.names=1)
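The effect of comment lines, header=TRUE and row.names=1 can be tried without a network connection by passing the file contents through the text argument of read.table(). The sketch below uses a small made-up table, not the actual contents of Temperature.data; the object names txt and Toy and all the numbers are hypothetical.

```r
# hypothetical file contents: one comment line, one header line, two data lines
txt = "# made-up city temperature data
City JanTemp Lat Long
Mobile 44 31.2 88.5
Boston 23 42.7 71.4"

# the comment line is skipped, the header supplies the column names,
# and the first column becomes the row names
Toy = read.table(text = txt, header = TRUE, row.names = 1)
rownames(Toy)                       # "Mobile" "Boston"
names(Toy)                          # "JanTemp" "Lat" "Long"
```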
Individual variables in a data frame can be accessed in several ways.
Latitude = Temp$Lat
Latitude = Temp[["Lat"]]
Latitude = Temp[[2]]
LatLong = Temp[2:3] #extract variables 2 through 3
LatLong = Temp[c("Lat","Long")] #extract Lat and Long
LatLong1 = Temp[1:20,c("Lat","Long")] #extract first 20 rows for Lat and Long

Although it may seem like more work to use names, the advantage is that one does not need to know the index of the desired column, just its name.
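The different access forms can be compared on a small example. The sketch below builds a made-up data frame with the same variable names as Temp and checks which forms return a vector and which return a data frame; the object Toy and its values are hypothetical.

```r
Toy = data.frame(JanTemp = c(44, 23), Lat = c(31.2, 42.7), Long = c(88.5, 71.4),
                 row.names = c("Mobile", "Boston"))

# three ways to extract one variable as a vector
a = Toy$Lat
b = Toy[["Lat"]]
c2 = Toy[, "Lat"]
identical(a, b) && identical(b, c2)   # TRUE

# single brackets with one name keep the data frame structure
d = Toy["Lat"]
is.data.frame(d)                      # TRUE
```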
Additional variables can be added to a data frame as follows.
#create new variable named Region with same length as other variables in Temp
Region = rep("NE",dim(Temp)[1])
# NE is defined to be Lat >= 39.75 and Long < 90
# SE is defined to be Lat < 39.75 and Long < 90
# SW is defined to be Lat < 39.75 and Long >= 90
# NW is defined to be Lat >= 39.75 and Long >= 90
Region[Temp$Lat < 39.75 & Temp$Long < 90] = "SE"
Region[Temp$Lat < 39.75 & Temp$Long >= 90] = "SW"
Region[Temp$Lat >= 39.75 & Temp$Long >= 90] = "NW"
#give Region the same row names as Temp
names(Region) = dimnames(Temp)[[1]]
#make Region a factor
Region = factor(Region)
#add Region to Temp
Temp1 = data.frame(Temp,Region)
#plot January Temperature vs Region
#since Region is a factor, this gives a boxplot
plot(JanTemp ~ Region,data=Temp1)
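The same four regions can also be built from two ifelse() calls, one for north/south and one for east/west. This is an alternative sketch, not the method used above, and it is run here on made-up coordinates; the objects Toy, NS and EW are hypothetical.

```r
# hypothetical coordinates, one city in each region
Toy = data.frame(Lat = c(42.7, 31.2, 30.1, 47.6),
                 Long = c(71.4, 80.0, 97.7, 122.3))

# same cut points as in the definitions above
NS = ifelse(Toy$Lat >= 39.75, "N", "S")
EW = ifelse(Toy$Long < 90, "E", "W")

# paste the two letters together and store the result as a factor
Toy$Region = factor(paste(NS, EW, sep = ""))
table(Toy$Region)
```

With this form each cut point is written once, which may reduce the chance of an inconsistent boundary between regions.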
Smoke = read.table("http://www.UTDallas.edu/~ammann/SmokeCancer.csv", header=TRUE,sep=",",row.names=1)

Note that 2 of the entries in this table are NA. These denote types of cancer that were not reported in that state during the time period covered by the data. We can change those entries to 0 as follows.
Smoke[is.na(Smoke)] = 0
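Before replacing missing values it can help to count them and see where they occur. The following sketch does this on a small made-up data frame, not the actual SmokeCancer data; the object names and all values are hypothetical.

```r
# hypothetical table with two missing entries
Toy = data.frame(Lung = c(17.1, NA, 22.0), Bladder = c(2.9, 4.1, NA),
                 row.names = c("AK", "AL", "AZ"))

sum(is.na(Toy))                     # total number of NA entries: 2
colSums(is.na(Toy))                 # number of NA entries in each column

# replace the NA entries in a copy, leaving the original unchanged
Toy0 = Toy
Toy0[is.na(Toy0)] = 0
sum(is.na(Toy0))                    # 0
```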
There is a companion function, write.table(), that can be used to write a matrix or data frame to a file that then can be imported into a spreadsheet.
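A minimal sketch of such a round trip is given below, assuming a comma-separated file is wanted. It writes a small made-up data frame to a temporary file and reads it back; the object names and values are hypothetical, and tempfile() is used only so that the example does not leave a file behind in the working directory.

```r
Toy = data.frame(JanTemp = c(44.5, 23.5), Lat = c(31.2, 42.7),
                 row.names = c("Mobile", "Boston"))

# col.names = NA writes a blank header cell above the row names,
# so that the columns line up when the file is opened in a spreadsheet
f = tempfile(fileext = ".csv")
write.table(Toy, file = f, sep = ",", col.names = NA)

# read the file back to check the round trip
Back = read.table(f, header = TRUE, sep = ",", row.names = 1)
all.equal(Toy, Back)
```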
pdf("TempPlot.pdf",width=6,height=6)
plot(JanTemp ~ Lat,data=Temp)
plot(JanTemp ~ Region,data=Temp1)
graphics.off()
#creates a 2-page pdf document
png("TempPlot%d.png",width=480,height=480)
plot(JanTemp ~ Lat,data=Temp)
plot(JanTemp ~ Region,data=Temp1)
graphics.off()
#creates two files: TempPlot1.png and TempPlot2.png

The function graphics.off() writes any closing material required by the graphic file type and then closes the graphics file.
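graphics.off() closes every open graphics device. When only the file that is currently being written should be closed, the base R function dev.off() can be used instead. The sketch below writes a one-page pdf to a temporary file and checks that it exists; the plotted values and the object name f are illustrative only.

```r
# open a pdf device that writes to a temporary file
f = tempfile(fileext = ".pdf")
pdf(f, width = 6, height = 6)

plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")

# close only the current device, which completes the pdf file
dev.off()

file.exists(f)                      # TRUE
```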
Exercise. Work the exercises given in Sections 2.4 and 3.4 of the Owen-TheRGuide document referenced above.
There are a number of datasets included in the R distribution along with examples of their use in the help pages. One example is given below.
# load cars data frame
data(cars)
# plot braking distance vs speed with custom x-labels and y-labels,
# and axis numbers horizontal
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1)
# add lowess line (smoothed trend line)
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
# add plot title
title(main = "cars data")
# new plot of same variables on a log scale for both axes
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1, log = "xy")
# add plot title
title(main = "cars data (logarithmic scales)")
# add lowess line (smoothed trend line) for the log scale
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
# fit a regression model using log(speed) to predict log(dist) and
# print a summary of the fit
# (the fit is assigned first; an = inside the call to summary()
# would be read as a named argument, not as an assignment)
fm1 = lm(log(dist) ~ log(speed), data = cars)
summary(fm1)
# save the current plotting parameters and then setup a new plot
# region that puts 4 plots on the same page, 2 rows and 2 columns.
# use custom margins for the plot region.
opar = par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0), mar = c(4.1, 4.1, 2.1, 1.1))
# plot the diagnostic residual plots associated with a regression fit.
plot(fm1)
# restore the original plotting parameters
par(opar)
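Once a model has been fit, its estimates can be extracted for further use instead of being read from the printed summary. The sketch below refits the same model and uses the base R functions coef() and predict(); the speed of 15 mph and the object names b and pred15 are chosen only for illustration, and the back-transformed value is a rough prediction on the original scale.

```r
data(cars)
fm1 = lm(log(dist) ~ log(speed), data = cars)

# estimated intercept and slope on the log-log scale
b = coef(fm1)
names(b)                            # "(Intercept)" "log(speed)"

# predicted log stopping distance at 15 mph, transformed back to feet
pred15 = exp(predict(fm1, newdata = data.frame(speed = 15)))
```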