The basic function in **R** to fit least squares regression models is *lm()*. To
illustrate the use of this function, consider the data set,

http://www.utdallas.edu/~ammann/stat6341scripts/cars.data

The goal here is to determine how mileage (**mpg**) depends on **displacement**,
**horsepower**, and **weight** for cars with displacement less than 120. Also, cars
with diesel engines will be removed.

Cars = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/cars.data", header=TRUE,row.names=1) diesel.ndx = grep("diesel",dimnames(Cars)[[1]]) Cars = Cars[-diesel.ndx,] mpgcut = 200 #parameterize cut point for mpg Cars1 = Cars[Cars$displacement < mpgcut,] Cars.lm = lm(mpg ~ displacement + horsepower + weight, data=Cars1) # residual plots plot(Cars.lm) #standard plots for lm # redo residuals vs fitted with annotation for this data plot(fitted(Cars.lm),residuals(Cars.lm),pch=19) abline(h=0,col="red") title("Residuals vs Fitted for Cars data") # same text will be used for several plots so assign it a name info.txt = paste("Diesel cars removed, Displacement <",mpgcut) mtext(info.txt,line=.25) #looks OK plot(residuals(Cars.lm) ~ displacement, data=Cars1,pch=19) abline(h=0,col="red") title("Residuals vs Displacement for Cars data") mtext(info.txt,line=.25) #looks OK plot(residuals(Cars.lm) ~ horsepower, data=Cars1,pch=19) abline(h=0,col="red") title("Residuals vs Horsepower for Cars data") mtext(info.txt,line=.25) #looks OK plot(residuals(Cars.lm) ~ weight, data=Cars1,pch=19) abline(h=0,col="red") title("Residuals vs Weight for Cars data") #looks OK qqnorm(residuals(Cars.lm),main="Normal Q-Q Plot for Cars Data") qqline(residuals(Cars.lm),col="red") mtext(info.txt,line=.25) #looks OK #print summary of LS fit print(summary(Cars.lm)) #print() is only needed if using source() #now consider a quadratic model that just uses horsepower and horsepower^2 as predictors horsepower2 = Cars1$horsepower^2 Cars2.lm = lm(mpg ~ horsepower + horsepower2, data=Cars1) #diagnostic plots plot(Cars2.lm) #looks good print(summary(Cars2.lm)) #plot data plot(mpg ~ horsepower, data=Cars1,pch=19) #define colors for various lines that will be superimposed on plot mpg.col = c("blue","ForestGreen","DarkRed") names(mpg.col) = c("fit","conf","pred") #obtain fitted values over the range of horsepower values #construct data frame for predict function HP = seq(min(Cars1$horsepower),max(Cars1$horsepower),length=400) HP.data = data.frame(horsepower=HP,horsepower2=HP^2) mpg.conf = predict(Cars2.lm,newdata=HP.data,interval="confidence") lines(HP,mpg.conf[,"fit"],col=mpg.col["fit"]) title("MPG vs Horsepower: Quadratic Fit") #add 95\% confidence bands lines(HP,mpg.conf[,"upr"],col=mpg.col["conf"]) lines(HP,mpg.conf[,"lwr"],col=mpg.col["conf"]) #now add prediction bands mpg.pred = predict(Cars2.lm,newdata=HP.data,interval="prediction") lines(HP,mpg.pred[,"upr"],col=mpg.col["pred"]) lines(HP,mpg.pred[,"lwr"],col=mpg.col["pred"]) #note that part of upper prediction band is outside plot #redo with expanded y-axis limits mpg.lim = range(c(mpg.pred,Cars1$mpg)) plot(mpg ~ horsepower, data=Cars1,pch=19,ylim=mpg.lim) lines(HP,mpg.conf[,"fit"],col=mpg.col["fit"]) title("MPG vs Horsepower: Quadratic fit") mtext(info.txt,line=.25) lines(HP,mpg.conf[,"upr"],col=mpg.col["conf"]) lines(HP,mpg.conf[,"lwr"],col=mpg.col["conf"]) lines(HP,mpg.pred[,"upr"],col=mpg.col["pred"]) lines(HP,mpg.pred[,"lwr"],col=mpg.col["pred"]) legend(min(Cars1$horsepower),mpg.lim[1], legend=c("Fitted","95% Confidence","95% Prediction"), lty=1,col=mpg.col,yjust=0)

The *lm()* function is most conveniently applied with the data contained in a data frame.
This data frame is specified in *lm()* by the argument `data=`. If transformations
are required for some variables, then a new data frame can be constructed that contains the
transformed variables along with needed untransformed variables. Use of the `data=` argument
allows use of some shortcuts for the model formula. For example, if the model contains more than a
few variables, then the formula

Y ~ .is a shortcut for using all variables in the specified data frame except

It is critically important that assumptions are checked before using a model; otherwise we are
just doing arithmetic, not statistical analysis. The *plot()* function applied to a
linear model object by default produces four plots:

- residuals vs fitted: checks for nonlinearity and marks potential outliers
- normal quantile-quantile plot of residuals: checks normality assumption
- scale-location: checks for homogeneity of variance
- Cook's D: checks for observations with high influence on the fit

Predicted values are obtained by the *predict()* function. By default *predict()*
with an *lm* object as its only argument returns predicted values for the data set used for
fitting the model. Optional arguments produce additional information.

The *predict()* function in **R** is a generic function that calls a specific function
depending on the *class* attribute of its main argument. For example `class(Cars.lm)`
returns *lm*. That instructs **R** to use the specific function *predict.lm*.
Other classes use other specific functions for *predict()*. A list of the specific
predict functions is given by `methods(predict)`.

Some generic functions include a function with the suffix *default*. For example
*print.default()* is defined for the generic *print()* function. For these generic
functions the default method is used if the object's class does not match any of the suffixes
for the generic function. This process can be circumvented by using a specific function instead of
the generic function. For example,

summary(Cars.lm) #uses summary.lm() since class(Cars.lm) is lm summary.aov(Cars.lm) #force use of summary.aov()

A summary of a linear model fit is obtained by the *summary()* function. This function
returns a list whose components include residuals, coefficient estimates, r-squared, and the
estimated residual s.d. Helper functions *residuals(), coefficients()* extract the
corresponding components of a fit. These are generic functions with specific functions defined for
the appropriate type of model object specified in their argument. To reduce typing,
*resid(), coef()* are the same functions. The *coef()* function returns a matrix
whose columns include coefficient estimates, their standard errors, t-values and corresponding
p-values. The p-values are equivalent to a partial F-test for significance of the parameter when
it is added to a model with all other parameters included. In the language of **SAS** these
correspond to Type III tests. However, those p-values should not be used for variable selection.
That topic is discussed in more detail below.

To obtain predicted values for a linear model fit using predictor values other than the data used for fitting, it is necessary to define a new data frame that contains variables with the same names as those in right hand side of the model formula. An example is given in the code above:

HP = seq(min(Cars1$horsepower),max(Cars1$horsepower),length=400) HP.data = data.frame(horsepower=HP,horsepower2=HP^2) mpg.conf = predict(Cars2.lm,newdata=HP.data,interval="confidence")Confidence and prediction bands are obtained with the

uses the standard error associated with prediction of the mean response and the argument

uses the standard error for prediction of an individual response. These arguments return a matrix with columns named

2017-12-10