
Inference for Regression

In the first part of the course we discussed how to fit a least squares regression line from data, and how to use that line to make predictions. In this section we will continue that discussion with the goal of deriving confidence intervals for these predictions. As is the case with every statistical method, estimates of population characteristics are only as good as the statistical models that are the bases for the estimation process.

The situation we are addressing here is one in which we are interested in two characteristics, or variables, associated with a population, and we would like to use one of these variables to predict the value of the other for some individuals in the population. The process we must follow to accomplish this is outlined here.

  1. Construct a mathematical model for the relationship between the two variables in the population.
  2. Randomly select a sample from the population and use this sample to fit the model.
  3. Determine whether the model adequately expresses the relationship between the variables based on the random sample. That is, determine whether or not the assumptions made to construct the model are reasonably satisfied for the random sample that is used for the fit. This step is absolutely critical to the success of the process, but the methods involved in this step are beyond the scope of this course. Part of the difficulty with this step is that when problems with the assumptions appear, we must determine whether the fault lies with the model or with the data. This is not a straightforward process that can be automated. Since these methods are beyond the scope of the course, we will not discuss them, but you should always be aware that Best Professional Practice requires that this step be performed. No statistical predictions are valid without it.
  4. If the model is not adequate, then the model must be modified and refit, which returns us to the previous step.
  5. Once we have constructed a model whose assumptions are reasonable for the data of interest, we can use it to make predictions and to make inferences about the relationship between the variables.

For this course we assume that the relationship between the two variables of interest is linear, which can be expressed by the population model,

\begin{displaymath}
Y = \alpha + \beta X + \epsilon.
\end{displaymath}

This model describes the values of the three variables $X,Y,\epsilon$ for a randomly selected individual in the population. Y is the variable whose value we wish to predict and X is the variable that is used to make this prediction. The variable $\epsilon$ is referred to as error. It does not necessarily represent measurement error, but instead represents the cumulative effect of all unmeasured variables on the response variable Y, in addition to any measurement errors that may have occurred.

The basic assumptions that we make about this model and the population are that all of the relationship between X and Y can be represented by the linear equation,

\begin{displaymath}
Y = \alpha + \beta X,
\end{displaymath}

and that individual deviations from this relationship are unrelated to the values of X and to the values and errors of other individuals in the population. For the purpose of statistical estimation and inference, we also assume that the distribution of the errors can be approximated by a normal distribution with mean 0 and some s.d. $\sigma$. Specifically, we assume that

  1. The errors are statistically independent.
  2. The distribution of the error variables in the population is approximately normal, $N(0,\sigma^2)$.
In particular, the error variance, $\sigma^2$, is assumed to be the same for all X. This assumption is referred to as homogeneity of variance.

Since the behavior of the errors does not involve the X variable, the linear function, $Y = \alpha + \beta X$, must contain all of the relationship between X and Y. Methods to identify appropriate prediction models and to verify their conditions are beyond the scope of this course. Nevertheless, reports that make inferences based on such prediction models should include a description of the results of these steps.
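
To make the population model and its assumptions concrete, here is a minimal simulation sketch in R; the values chosen for $\alpha$, $\beta$, $\sigma$, and the sample size are hypothetical and used only for illustration.

set.seed(1) #for reproducibility
n = 50
X = runif(n,0,10)
eps = rnorm(n,mean=0,sd=2) #errors: independent, N(0,sigma^2) with sigma=2
Y = 3 + 1.5*X + eps #population model with alpha=3, beta=1.5
sim.lm = lm(Y ~ X) #least squares fit
coef(sim.lm) #intercept and slope should be close to 3 and 1.5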

Now suppose we have a random sample of size n from the population and use least squares regression to fit this linear model,

\begin{displaymath}
Y_i = a + bX_i + e_i,\ \ \widehat{Y}_i = a + bX_i,\ \ 1\le i\le n,
\end{displaymath}

where

\begin{displaymath}
b = r\frac{s_y}{s_x},\ \ a = \overline{Y} - b\overline{X}.
\end{displaymath}
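
As a quick check, these coefficients can be computed directly from the sample statistics and compared with the fit returned by lm(). This sketch assumes numeric vectors X and Y, such as the simulated data above.

b = cor(X,Y)*sd(Y)/sd(X) #slope from the correlation and standard deviations
a = mean(Y) - b*mean(X) #intercept
c(a,b)
coef(lm(Y ~ X)) #should agree with a and b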

The differences between the actual and predicted Y-values, denoted by $e_i$, are referred to as the model residuals and represent estimates of the actual errors. Verification of model assumptions involves checking whether the behavior of these residuals is related to X and comparing their distribution to the normal distribution. This is accomplished by plotting residuals versus X (or the fitted values) to check for curvature and homogeneity of variance and by a quantile-quantile plot of the residuals to check for normality. How to do this in R is shown below. Let Y.lm represent the result of fitting a linear regression model to predict Y from X.

Y.lm = lm(Y ~ X)
Y.res = residuals(Y.lm)
#check for homogeneity of variance
plot(Y.res ~ X, pch=19,ylab="Residuals")
#check for normality
qqnorm(Y.res,pch=19)

If the model assumptions are reasonable, then we can ask whether or not X is useful for predicting Y. This question can be answered by a test of the hypotheses:

\begin{eqnarray*}
H_0:\ \beta &=& 0\\
H_A:\ \beta &\ne& 0.
\end{eqnarray*}

To test these hypotheses we need to consider three sums of squared deviations:

\begin{eqnarray*}
TSS &=& \sum(Y_i-\overline{Y})^2\\
SSE &=& \sum(Y_i - \widehat{Y}_i)^2\\
SSR &=& \sum(\widehat{Y}_i-\overline{Y})^2
\end{eqnarray*}

These are referred to as the total sum of squares, error sum of squares, and regression sum of squares, respectively. It can be shown that TSS = SSR + SSE and that SSR and SSE are statistically independent. Also,

\begin{eqnarray*}
E(TSS) &=& (n-1)\sigma^2\\
E(SSE) &=& (n-2)\sigma^2\\
E(SSR) &=& \sigma^2 + \beta^2SS_X
\end{eqnarray*}

where

\begin{displaymath}
SS_X = \sum(X_i-\overline{X})^2 = (n-1)s_X^2.
\end{displaymath}

Therefore, under the null hypothesis,

\begin{displaymath}
s_e^2 = SSE/(n-2)
\end{displaymath}

and SSR are independent estimates of $\sigma^2$. However, under the alternative hypothesis, SSR is an estimate of

\begin{displaymath}
\sigma^2 + \beta^2SS_X > \sigma^2.
\end{displaymath}

This implies that we can test those hypotheses using a 1-sided F-test with test statistic

\begin{displaymath}
F = \frac{SSR}{s_e^2}.
\end{displaymath}

The p-value for this test is the area to the right of F in the F-distribution with d.f. (1, n-2). In R this is obtained by

1 - pf(F,1,n-2)

The test statistic and p-value are reported in R by the summary of a linear model fit,
summary(Y.lm)
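
For illustration, the sums of squares, the F statistic, and the p-value can also be computed directly from the fitted model. This is a minimal sketch that assumes the Y.lm fit from above (so the vector Y is also available).

n = length(residuals(Y.lm))
Yhat = fitted(Y.lm)
TSS = sum((Y - mean(Y))^2)
SSE = sum((Y - Yhat)^2)
SSR = sum((Yhat - mean(Y))^2)
all.equal(TSS, SSR + SSE) #verifies the decomposition TSS = SSR + SSE
se2 = SSE/(n-2) #estimate of sigma^2
Fstat = SSR/se2
1 - pf(Fstat,1,n-2) #p-value, matches summary(Y.lm)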

Example. The file,
http://www.utdallas.edu/~ammann/stat3355scripts/mileage.csv
contains weights, engine displacements, and gas mileages for 60 cars. Suppose we wish to examine how mileage depends on weight.

Cars = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/mileage.csv",
          header=TRUE,sep=",",row.names=1)
Mileage.lm = lm(Mileage ~ Weight,data=Cars) #fit model
#check for curvature and homogeneity of variance
plot(residuals(Mileage.lm) ~ Weight,data=Cars,pch=19,ylab="Residuals")
abline(h=0,col="red")
title("Residuals vs Weight")
#check for normality of residuals
qqnorm(residuals(Mileage.lm),pch=19,main="Quantile-Quantile Plot of Residuals") #check assumptions
qqline(residuals(Mileage.lm))
#print summary
summary(Mileage.lm)
The residual plots look reasonable. The summary reports that the F-statistic is 148.3 with 1, 58 d.f. and the p-value is essentially 0. So we reject the null hypothesis that the slope of the linear relationship is 0. R-squared is 0.7189 so about 72% of the variability in gas mileage can be explained by this linear relationship and 28% is due to other factors.
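
The R-squared and F-statistic quoted above can also be extracted from the summary object; the component names below are part of the value returned by summary() for a linear model fit.

Mileage.sum = summary(Mileage.lm)
Mileage.sum$r.squared #proportion of variability in Mileage explained by Weight
Mileage.sum$fstatistic #F statistic with its numerator and denominator d.f.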

Now consider using the linear regression model for prediction. The predicted Y-value is an estimate of two population quantities.

  1. The expected response, $E(Y_0) = \alpha + \beta X_0$. This represents the population mean value of Y for all individuals with $X=X_0$.
  2. The value of Y for an individual in the population with $X=X_0$,

    \begin{displaymath}
Y_0 = \alpha + \beta X_0 + \epsilon_0.
\end{displaymath}

Since we have no information about $\epsilon_0$ for an individual, our best estimate of it is 0, so our estimates of these two quantities are the same.

We can describe the accuracy of these predicted values by constructing confidence intervals for them. In both cases, these confidence intervals have the form,

\begin{displaymath}
\widehat{Y} \pm t_{n-2}se,
\end{displaymath}

where $t_{n-2}$ is the appropriate value from the t-distribution with n-2 d.f. corresponding to the required level of confidence and se is the standard error for the prediction. However, the standard errors for these confidence intervals are different depending on whether the estimate is for the mean response or for the response of a particular individual. Note that the uncertainty about the mean response at a particular value of X is due entirely to the uncertainty about the regression coefficients, but the uncertainty about the response for a particular individual also must include the uncertainty about the error for that individual in addition to the uncertainty about the regression coefficients.

Specifically, a confidence interval for the mean response at $X=X_0$ is given by

\begin{displaymath}
\widehat{Y} \pm t_{n-2}s_m,
\end{displaymath}

where

\begin{displaymath}
s_m^2 = s_e^2 \left[\frac{1}{n} + \frac{(X_0 - \overline{X})^2}{(n-1)s_x^2}\right],
\end{displaymath}

and $s_x^2$ is the sample variance of the X's. The confidence interval for an individual response (referred to as a prediction interval) at $X=X_0$ is

\begin{displaymath}
\widehat{Y} \pm t_{n-2}s_p,
\end{displaymath}

where

\begin{displaymath}
s_p^2 = s_e^2 + s_m^2.
\end{displaymath}

It is important to remember that these intervals can be calculated for any set of data. However, they are valid confidence or prediction intervals only if the underlying assumptions are satisfied. If not, these intervals are meaningless.

These confidence intervals can be obtained in R using the predict() function. By default, this returns just the predicted values for the data when its argument is a linear model object. Confidence and prediction intervals require additional arguments.

predict(Y.lm,interval="confidence") #confidence intervals for the mean responses in the data
predict(Y.lm,interval="prediction") #prediction intervals for individual responses in the data
The value returned in each case is a matrix with columns "fit", "lwr", and "upr".

We also can obtain confidence intervals for predictions of new observations not in the original data. This requires construction of a new data frame that contains a component with the same name as the name of the predictor variable in the original data. This is illustrated using the Mileage data. Suppose we wish to obtain confidence intervals for a car whose weight is 2500.

W = data.frame(Weight=2500)
#obtain 95% confidence interval for mean mileage of all cars whose weight is 2500
W.conf = predict(Mileage.lm,newdata=W,interval="confidence")
print(round(W.conf,2))
#obtain 95% confidence interval for mileage of this car
W.pred = predict(Mileage.lm,newdata=W,interval="prediction")
print(round(W.pred,2))
Note that the prediction interval is much wider than the confidence interval.
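
These intervals can also be verified directly from the formulas for $s_m$ and $s_p$ given above. The sketch below recomputes them by hand for the Mileage fit at Weight = 2500; it assumes Cars and Mileage.lm from this example.

X0 = 2500
n = nrow(Cars)
se2 = sum(residuals(Mileage.lm)^2)/(n-2) #s_e^2
sm2 = se2*(1/n + (X0 - mean(Cars$Weight))^2/((n-1)*var(Cars$Weight))) #s_m^2
sp2 = se2 + sm2 #s_p^2
Yhat0 = sum(coef(Mileage.lm)*c(1,X0)) #predicted mileage at Weight=2500
tcrit = qt(0.975,n-2) #95% confidence
round(Yhat0 + c(-1,1)*tcrit*sqrt(sm2),2) #matches W.conf
round(Yhat0 + c(-1,1)*tcrit*sqrt(sp2),2) #matches W.pred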

Now suppose we wish to plot the data and superimpose the regression line along with curves representing the confidence intervals and prediction intervals for this data. We need to obtain predictions for a grid of values that cover the range of weights in the data. This can be accomplished as follows.

wgt.range = range(Cars$Weight)
W = data.frame(Weight=seq(wgt.range[1],wgt.range[2],length=250))
W.conf = predict(Mileage.lm,newdata=W,interval="confidence")
W.pred = predict(Mileage.lm,newdata=W,interval="prediction")
Y.lim = range(W.pred) #needed to ensure plot will contain all of prediction curves
pcol = c("red","DarkGreen","blue")
plot(Mileage ~ Weight,data=Cars,pch=19,ylim=Y.lim)
title("Mileage vs Weight\nwith 95% Prediction and Confidence Intervals")
abline(coef(Mileage.lm),col=pcol[1])
lines(W$Weight,W.conf[,"lwr"],col=pcol[2])
lines(W$Weight,W.conf[,"upr"],col=pcol[2])
lines(W$Weight,W.pred[,"lwr"],col=pcol[3])
lines(W$Weight,W.pred[,"upr"],col=pcol[3])
legend(wgt.range[1],Y.lim[1],legend=c("Fit","Confidence","Prediction"),
      lty=1,col=pcol,yjust=0,cex=.8)

[Figure: Mileage vs Weight with 95% prediction and confidence intervals]

It is interesting to compare these intervals with what could be done if we only had the sample of 60 gas mileages for these cars. In that case we could only estimate the overall mean mileage, so the predicted mileage would be the same for every car regardless of its weight. However, the two standard errors would still differ: the standard error for the mean is $s_y/\sqrt{n}$, while the standard error for an individual is $s_y$. We can obtain these intervals in R by fitting a null model that includes only the intercept.

M0.lm = lm(Mileage ~ 1, data=Cars) #1 indicates intercept only model
W0.conf = predict(M0.lm,newdata=W,interval="confidence")
W0.pred = predict(M0.lm,newdata=W,interval="prediction")
#add these to the previous plot
pcol = c("red","DarkGreen","blue")
plty = 2:4
plot(Mileage ~ Weight,data=Cars,pch=19,ylim=Y.lim)
title("Mileage vs Weight\nwith 95% Prediction and Confidence Intervals")
abline(coef(Mileage.lm),col=pcol[1])
lines(W$Weight,W.conf[,"lwr"],col=pcol[2])
lines(W$Weight,W.conf[,"upr"],col=pcol[2])
lines(W$Weight,W.pred[,"lwr"],col=pcol[3])
lines(W$Weight,W.pred[,"upr"],col=pcol[3])
abline(h=W0.conf[1,"fit"],col=pcol[1],lty=plty[1])
abline(h=W0.conf[1,c("lwr","upr")],col=pcol[2],lty=plty[2])
abline(h=W0.pred[1,c("lwr","upr")],col=pcol[3],lty=plty[3])
abline(v=mean(Cars$Weight),col="gray50",lty=2)
legend(wgt.range[1],Y.lim[1],legend=c("Fit","Confidence","Prediction"),
      lty=1,col=pcol,yjust=0,cex=.8,title="Full Model")
legend(2400,Y.lim[1],legend=c("Fit","Confidence","Prediction"),
      lty=plty,col=pcol,yjust=0,cex=.8,title="No X")

[Figure: Mileage vs Weight with 95% prediction and confidence intervals for the full model and the intercept-only model]


Larry Ammann
2014-12-08