Next: Factor variables in linear Up: stat6341 Previous: Least Squares and the

# Linear Models in R

The basic function in R to fit least squares regression models is lm(). To illustrate the use of this function, consider the data set,
http://www.utdallas.edu/~ammann/stat6341scripts/cars.data
The goal here is to determine how mileage (mpg) depends on displacement, horsepower, and weight for cars with displacement less than 120. Also, cars with diesel engines will be removed.

```Cars = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/cars.data",
diesel.ndx = grep("diesel",dimnames(Cars)[[1]])
Cars = Cars[-diesel.ndx,]
mpgcut = 200 #parameterize cut point for mpg
Cars1 = Cars[Cars\$displacement < mpgcut,]
Cars.lm = lm(mpg ~ displacement + horsepower + weight, data=Cars1)
# residual plots
plot(Cars.lm) #standard plots for lm
# redo residuals vs fitted with annotation for this data
plot(fitted(Cars.lm),residuals(Cars.lm),pch=19)
abline(h=0,col="red")
title("Residuals vs Fitted for Cars data")
# same text will be used for several plots so assign it a name
info.txt = paste("Diesel cars removed, Displacement <",mpgcut)
mtext(info.txt,line=.25)
#looks OK
plot(residuals(Cars.lm) ~ displacement, data=Cars1,pch=19)
abline(h=0,col="red")
title("Residuals vs Displacement for Cars data")
mtext(info.txt,line=.25)
#looks OK
plot(residuals(Cars.lm) ~ horsepower, data=Cars1,pch=19)
abline(h=0,col="red")
title("Residuals vs Horsepower for Cars data")
mtext(info.txt,line=.25)
#looks OK
plot(residuals(Cars.lm) ~ weight, data=Cars1,pch=19)
abline(h=0,col="red")
title("Residuals vs Weight for Cars data")
#looks OK
qqnorm(residuals(Cars.lm),main="Normal Q-Q Plot for Cars Data")
qqline(residuals(Cars.lm),col="red")
mtext(info.txt,line=.25)
#looks OK
#print summary of LS fit
print(summary(Cars.lm)) #print() is only needed if using source()
#now consider a quadratic model that just uses horsepower and horsepower^2 as predictors
horsepower2 = Cars1\$horsepower^2
Cars2.lm = lm(mpg ~ horsepower + horsepower2, data=Cars1)
#diagnostic plots
plot(Cars2.lm)
#looks good
print(summary(Cars2.lm))
#plot data
plot(mpg ~ horsepower, data=Cars1,pch=19)
#define colors for various lines that will be superimposed on plot
mpg.col = c("blue","ForestGreen","DarkRed")
names(mpg.col) = c("fit","conf","pred")
#obtain fitted values over the range of horsepower values
#construct data frame for predict function
HP = seq(min(Cars1\$horsepower),max(Cars1\$horsepower),length=400)
HP.data = data.frame(horsepower=HP,horsepower2=HP^2)
mpg.conf = predict(Cars2.lm,newdata=HP.data,interval="confidence")
lines(HP,mpg.conf[,"fit"],col=mpg.col["fit"])
lines(HP,mpg.conf[,"upr"],col=mpg.col["conf"])
lines(HP,mpg.conf[,"lwr"],col=mpg.col["conf"])
mpg.pred = predict(Cars2.lm,newdata=HP.data,interval="prediction")
lines(HP,mpg.pred[,"upr"],col=mpg.col["pred"])
lines(HP,mpg.pred[,"lwr"],col=mpg.col["pred"])
#note that part of upper prediction band is outside plot
#redo with expanded y-axis limits
mpg.lim = range(c(mpg.pred,Cars1\$mpg))
plot(mpg ~ horsepower, data=Cars1,pch=19,ylim=mpg.lim)
lines(HP,mpg.conf[,"fit"],col=mpg.col["fit"])
mtext(info.txt,line=.25)
lines(HP,mpg.conf[,"upr"],col=mpg.col["conf"])
lines(HP,mpg.conf[,"lwr"],col=mpg.col["conf"])
lines(HP,mpg.pred[,"upr"],col=mpg.col["pred"])
lines(HP,mpg.pred[,"lwr"],col=mpg.col["pred"])
legend(min(Cars1\$horsepower),mpg.lim[1],
legend=c("Fitted","95% Confidence","95% Prediction"),
lty=1,col=mpg.col,yjust=0)
```

The lm() function is most conveniently applied with the data contained in a data frame. This data frame is specified in lm() by the argument data=. If transformations are required for some variables, then a new data frame can be constructed that contains the transformed variables along with needed untransformed variables. Use of the data= argument allows use of some shortcuts for the model formula. For example, if the model contains more than a few variables, then the formula

```Y ~ .
```
is a shortcut for using all variables in the specified data frame except Y as predictor variables in an additive model.

It is critically important that assumptions are checked before using a model; otherwise we are just doing arithmetic, not statistical analysis. The plot() function applied to a linear model object by default produces four plots:

1. residuals vs fitted: checks for nonlinearity and marks potential outliers
2. normal quantile-quantile plot of residuals: checks normality assumption
3. scale-location: checks for homogeneity of variance
4. Cook's D: checks for observations with high influence on the fit
Models with multiple predictors also should include plots of residuals vs each predictor to check for possible nonlinearity in specific predictors. These are not needed for a polynomial model with no other predictors. In such cases the basic residual plot is sufficient.

Predicted values are obtained by the predict() function. By default predict() with an lm object as its only argument returns predicted values for the data set used for fitting the model. Optional arguments produce additional information.

The predict() function in R is a generic function that calls a specific function depending on the class attribute of its main argument. For example class(Cars.lm) returns lm. That instructs R to use the specific function predict.lm. Other classes use other specific functions for predict(). A list of the specific predict functions is given by methods(predict).

Some generic functions include a function with the suffix default. For example print.default() is defined for the generic print() function. For these generic functions the default method is used if the object's class does not match any of the suffixes for the generic function. This process can be circumvented by using a specific function instead of the generic function. For example,

```summary(Cars.lm) #uses summary.lm() since class(Cars.lm) is lm
summary.aov(Cars.lm) #force use of summary.aov()
```

A summary of a linear model fit is obtained by the summary() function. This function returns a list whose components include residuals, coefficient estimates, r-squared, and the estimated residual s.d. Helper functions residuals(), coefficients() extract the corresponding components of a fit. These are generic functions with specific functions defined for the appropriate type of model object specified in their argument. To reduce typing, resid(), coef() are the same functions. The coef() function returns a matrix whose columns include coefficient estimates, their standard errors, t-values and corresponding p-values. The p-values are equivalent to a partial F-test for significance of the parameter when it is added to a model with all other parameters included. In the language of SAS these correspond to Type III tests. However, those p-values should not be used for variable selection. That topic is discussed in more detail below.

To obtain predicted values for a linear model fit using predictor values other than the data used for fitting, it is necessary to define a new data frame that contains variables with the same names as those in right hand side of the model formula. An example is given in the code above:

```HP = seq(min(Cars1\$horsepower),max(Cars1\$horsepower),length=400)
HP.data = data.frame(horsepower=HP,horsepower2=HP^2)
mpg.conf = predict(Cars2.lm,newdata=HP.data,interval="confidence")
```
Confidence and prediction bands are obtained with the interval= argument. The default value for that argument is none. The argument
interval=confidence
uses the standard error associated with prediction of the mean response and the argument
interval=prediction
uses the standard error for prediction of an individual response. These arguments return a matrix with columns named fit,lwr,upr.

Subsections

Next: Factor variables in linear Up: stat6341 Previous: Least Squares and the
ammann
2017-12-10