In the first part of the course we discussed how to fit a least squares regression line from data, and how to use that line to make predictions. In this section we will continue that discussion with the goal of deriving confidence intervals for these predictions. As is the case with every statistical method, estimates of population characteristics are only as good as the statistical models that are the bases for the estimation process.

The situation we are addressing here is one in which we are interested in two characteristics, or variables, associated with a population, and we would like to use one of these variables to predict the value of the other for some individuals in the population. The process we must follow to accomplish this is outlined here.

- Construct a mathematical model for the relationship between the two variables in the population.
- Randomly select a sample from the population and use this sample to fit the model.
- Determine if the model adequately expresses the relationship between the
variables based on the random sample. That is, determine whether or not the
assumptions made to construct the model are reasonably satisfied for the random
sample that is used for the fit. This step is absolutely critical to the success
of the process, but the methods involved in the step are beyond the scope of
this course. Part of the difficulty with this step is that it requires us to
determine whether the model is wrong or the data is wrong if there are problems
with the assumptions. This is not a straight-forward process that can be
automated. Since these methods are beyond the scope of the course, we will not
discuss them, but you should always be aware that
*Best Professional Practice*requires that this step be performed. No statistical predictions are valid without this step. - If the model is not adequate, then the model must be modified and refit, which returns us to the previous step.
- Once we have constructed a model whose assumptions are reasonable for the data of interest, we can use it to make predictions and to make inferences about the relationship between the variables.

For this course we assume that the relationship between the two variables of
interest is linear, which can be expressed by the population model,

This model describes the values of the three variables for a randomly selected individual in the population.

The basic assumptions that we make about this model and the population are that
all of the relationship between *X* and *Y* can be represented by the linear
equation,

and that individual deviations from this relationship are unrelated to the values of

- The errors are statistically independent.
- The distribution of the error variables in the population is approximately normal, .

Since the behavior of the errors does not involve the *X* variables, then the
linear function,
, must contain all of the relationship
between *X* and *Y*. Methods to identify appropriate prediction models and to
verify their conditions are beyond the scope of this course. Nevertheless,
reports that make inferences based on such prediction models should include
a description of the results of these steps.

Now suppose we have a random sample of size *n* from the population and
use least squares regression to fit this linear model,

where

The differences between the actual and predicted Y-values, denoted by ,
are referred to as the model residuals and represent estimates of the actual
errors. Verification of model assumptions involves checking to see if the
behavior of these residuals is related to *X* and to compare their distribution
to the normal distribution. This is accomplished by plotting residuals versus *X*
(or the fitted values) to check for curvature and homogeneity of variance and by
a quantile-quantile plot of the residuals to check for normality. How to do this in **R**
is shown below. Let `Y.lm` represent the result of fitting a linear regression
model to predict *Y* from *X*.

Y.lm = lm(Y ~ X) Y.res = residuals(Y.lm) #check for homogeneity of variance plot(Y.res ~ X, pch=19,ylab="Residuals") #check for normality qqnorm(Y.res,pch=19)

If the model assumptions are reasonable, then we can ask whether or not *X* is useful for
predicting *Y*. This question can be answered by a test of the hypotheses:

To test these hypotheses we need to consider three sums of squared deviations:

These are referred to as

where

Therefore, under the null hypothesis,

and

This implies that we can test those hypotheses using a 1-sided F-test with test statistic

The p-value for this test is the area to the right of

The test statistic and p-value are reported in

summary(Y.lm)

**Example**. The file,

http://www.utdallas.edu/~ammann/stat3355scripts/mileage.csv

contains weights, engine displacements, and gas mileages for 60 cars. Suppose we wish to examine
how mileage depends on weight.

Cars = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/mileage.csv", header=TRUE,sep=",",row.names=1) Mileage.lm = lm(Mileage ~ Weight,data=Cars) #fit model #check for curvature and homogeneity of variance plot(residuals(Mileage.lm) ~ Weight,data=Cars,pch=19,ylab="Residuals") abline(h=0,col="red") title("Residuals vs Weight") #check for normality of residuals qqnorm(residuals(Mileage.lm),pch=19,main="Quantile-Quantile Plot of Residuals") #check assumptions qqline(residuals(Mileage.lm)) #print summary summary(Mileage.lm)The residual plots look reasonable. The summary reports that the F-statistic is 148.3 with

Now consider using the linear regression model for prediction. The predicted Y-value is an estimate of two population quantities.

- The expected response,
. This represents the population
mean value of
*Y*for all individuals with . - The value of
*Y*for an individual in the population with ,

We can describe the accuracy of these predicted values by constructing confidence intervals for
them. In both cases, these confidence intervals have the form,

where is the appropriate value from the

Specifically, a confidence interval for the mean response at is given by

where

and is the sample variance of the

where

It is important to remember that these intervals can be calculated for any set of data. However,
they are * confidence* or

These confidence intervals can be obtained in **R** using the `predict()` function.
By default, this returns just the predicted values for the data when its argument is a linear
model object. Confidence and prediction intervals require additional arguments.

predict(Y.lm,interval="confidence") #confidence interval for mean responses in the data predict(Y.lm,interval="prediction") #confidence interval for individual responses in the dataThe value returned in each case is a matrix with columns,

We also can obtain confidence intervals for predictions of new observations not in the original data. This requires construction of a new data frame that contains a component with the same name as the name of the predictor variable in the original data. This is illustrated using the Mileage data. Suppose we wish to obtain confidence intervals for a car whose weight is 2500.

W = data.frame(Weight=2500) #obtain 95% confidence interval for mean mileage of all cars whose weight is 2500 W.conf = predict(Mileage.lm,newdata=W,interval="confidence") print(round(W.conf,2)) #obtain 95% confidence interval for mileage of this car W.pred = predict(Mileage.lm,newdata=W,interval="prediction") print(round(W.pred,2))Note that the prediction interval is much wider than the confidence interval.

Now suppose we wish to plot the data and superimpose the regression line along with curves representing the confidence intervals and prediction intervals for this data. We need to obtain predictions for a grid of values that cover the range of weights in the data. This can be accomplished as follows.

wgt.range = range(Cars$Weight) W = data.frame(Weight=seq(wgt.range[1],wgt.range[2],length=250)) W.conf = predict(Mileage.lm,newdata=W,interval="confidence") W.pred = predict(Mileage.lm,newdata=W,interval="prediction") Y.lim = range(W.pred) #needed to ensure plot will contain all of prediction curves pcol = c("red","DarkGreen","blue") plot(Mileage ~ Weight,data=Cars,pch=19,ylim=Y.lim) title("Mileage vs Weight\nwith 95% Prediction and Confidence Intervals") abline(coef(Mileage.lm),col=pcol[1]) lines(W$Weight,W.conf[,"lwr"],col=pcol[2]) lines(W$Weight,W.conf[,"upr"],col=pcol[2]) lines(W$Weight,W.pred[,"lwr"],col=pcol[3]) lines(W$Weight,W.pred[,"upr"],col=pcol[3]) legend(wgt.range[1],Y.lim[1],legend=c("Fit","Confidence","Prediction"), lty=1,col=pcol,yjust=0,cex=.8)

It is interesting to compare these intervals with what could be done if we only had the sample of
60 gas mileages for these cars. In that case we only can estimate the overall mean mileage and the
estimate of the mileage for all cars would be the same. However, the standard errors would
be different. The standard error for the mean is and the standard error for an
individual is . We can obtain these in **R** by fitting a null model that only includes
the intercept.

M0.lm = lm(Mileage ~ 1, data=Cars) #1 indicates intercept only model W0.conf = predict(M0.lm,newdata=W,interval="confidence") W0.pred = predict(M0.lm,newdata=W,interval="prediction") #add these to the previous plot pcol = c("red","DarkGreen","blue") plty = 2:4 plot(Mileage ~ Weight,data=Cars,pch=19,ylim=Y.lim) title("Mileage vs Weight\nwith 95% Prediction and Confidence Intervals") abline(coef(Mileage.lm),col=pcol[1]) lines(W$Weight,W.conf[,"lwr"],col=pcol[2]) lines(W$Weight,W.conf[,"upr"],col=pcol[2]) lines(W$Weight,W.pred[,"lwr"],col=pcol[3]) lines(W$Weight,W.pred[,"upr"],col=pcol[3]) abline(h=W0.conf[1,"fit"],col=pcol[1],lty=plty[1]) abline(h=W0.conf[1,c("lwr","upr")],col=pcol[2],lty=plty[2]) abline(h=W0.pred[1,c("lwr","upr")],col=pcol[3],lty=plty[3]) abline(v=mean(Cars$Weight),col="gray50",lty=2) legend(wgt.range[1],Y.lim[1],legend=c("Fit","Confidence","Prediction"), lty=1,col=pcol,yjust=0,cex=.8,title="Full Model") legend(2400,Y.lim[1],legend=c("Fit","Confidence","Prediction"), lty=plty,col=pcol,yjust=0,cex=.8,title="No X")

2015-12-03