The automobile dataset given above includes both Weight and Mileage of 60 automobiles. In addition to describing location and dispersion for each variable separately, we also may be interested in what kind of relationship exists between these variables. The following figure represents a scatterplot of these variables with the respective means superimposed. This shows that for a high percentage of cars, those with above average Weight tend to have below average Mileage, and those with below average Weight have above average Mileage. This is an example of a decreasing relationship, and most of the data points in the plot fall in the upper left/lower right quadrants. In an increasing relationship, most of the points will fall in the lower left/upper right quadrants.

We can derive a measure of association for two variables by considering the
deviations of the data values from their respective means. Note that the
product of deviations for a data point in the lower left or upper right quadrants
is positive and the product of deviations for a data point in the upper left or lower
right quadrants is negative. Therefore, most of these products for variables with a
strong increasing relationship will be positive, and most of these products for
variables with a strong decreasing relationship will be negative. This implies that
the sum of these products will be a large positive number for variables that have
a strong increasing relationship, and the sum will be a large negative number for
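variables that have a strong decreasing relationship. This is the motivation for using

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right),$$

where $s_X$ and $s_Y$ are the standard deviations of the two variables,
as a measure of association between two variables. This quantity is called the
*correlation coefficient*.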

If the correlation coefficient is close to 1, then the variables have a strong
increasing relationship and if the correlation coefficient is close to -1,
then the variables have a strong decreasing relationship. If the correlation is
exactly 1 or -1, then the data must fall exactly on a straight line. The
correlation coefficient is limited in that it is only valid for *linear*
relationships. A correlation coefficient close to 0 indicates that there is no
*linear* relationship. There may be a strong relationship in this case,
just not linear. Furthermore, the correlation may understate the strength of
the relationship even when *r* is large, if the relationship is
non-linear.
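
The ideas above can be checked numerically. The following is a minimal sketch, not the course's own script: it assumes the built-in `mtcars` data (columns `wt` and `mpg`) as a stand-in for the automobile data, and the variable names are illustrative. It builds the correlation from the products of deviations, then shows an exact but non-linear relationship whose correlation is 0.

```r
# Illustration only: mtcars (built into R) stands in for the course's fuel data.
x <- mtcars$wt    # weight
y <- mtcars$mpg   # mileage
n <- length(x)

# Products of deviations from the respective means
dev.prod <- (x - mean(x)) * (y - mean(y))

# Proportion of points with a negative product
# (upper left or lower right quadrants)
prop.negative <- mean(dev.prod < 0)
print(prop.negative)

# Correlation: sum of the products, standardized by (n - 1) * sd(x) * sd(y)
r.manual <- sum(dev.prod) / ((n - 1) * sd(x) * sd(y))
r.builtin <- cor(x, y)
print(c(manual = r.manual, builtin = r.builtin))

# An exact but non-linear relationship can still have correlation 0
u <- -10:10
v <- u^2
r.curved <- cor(u, v)
print(r.curved)
```

The last part is a reminder to look at a scatterplot before interpreting a correlation near 0 as "no relationship".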

The correlation coefficient between Weight and Mileage is -0.848. This is a fairly large negative number, and so there is a fairly strong linear, decreasing relationship between Weight and Mileage. This is confirmed by the scatterplot. Since these variables are so strongly related, we can ask how well we can predict Mileage just by knowing the Weight of a vehicle. To answer this question, we first define a measure of distance between a dataset and a line.

Suppose we have measured two variables for each individual in a sample, denoted
by $(X_1,Y_1),\ldots,(X_n,Y_n)$, and we wish to predict the value of
*Y* given the value of *X* for a particular individual using
a straight line for the prediction. A reasonable approach would be to use the
line that comes closest to the data for this prediction. Let *Y=a+bX*
denote the equation of a prediction line, and let
$\hat{Y}_i = a + bX_i$ denote the
predicted value of *Y* for $X_i$. The difference between an actual and
predicted *Y*-value represents the error of prediction for that data
point. We define the *distance* between a prediction line and a point in
the dataset to be the square of the prediction error for that observation. The
total distance between the actual and predicted *Y*-values is then the
sum of the squared errors, which is the variance of the prediction errors
multiplied by $n-1$ when the errors average to zero (as they do for the
closest line). Since the predicted values, and hence the errors, depend on
the slope and intercept of the prediction line, we can express this total
distance by

$$D(a,b) = \sum_{i=1}^{n}\left(Y_i-\hat{Y}_i\right)^2 = \sum_{i=1}^{n}\left(Y_i - a - bX_i\right)^2.$$

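Our goal now is to find the line that is closest to the data using this definition of distance. This line has slope and intercept that minimize $D(a,b)$. We can use differential calculus to find the minimum.

$$\frac{\partial D}{\partial a} = -2\sum_{i=1}^{n}\left(Y_i - a - bX_i\right), \qquad \frac{\partial D}{\partial b} = -2\sum_{i=1}^{n}X_i\left(Y_i - a - bX_i\right).$$

Setting these equal to 0 gives the system of equations

$$na + b\sum_{i=1}^{n}X_i = \sum_{i=1}^{n}Y_i, \qquad a\sum_{i=1}^{n}X_i + b\sum_{i=1}^{n}X_i^2 = \sum_{i=1}^{n}X_iY_i.$$

Therefore,

$$a = \bar{Y} - b\bar{X},$$

and, after substituting for $a$ in the second equation and solving for $b$,

$$b = \frac{\sum_{i=1}^{n}X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n}X_i^2 - n\bar{X}^2}.$$

It can be shown that the numerator equals $(n-1)\,r\,s_Xs_Y$ and the denominator equals $(n-1)\,s_X^2$. Hence,

$$b = r\,\frac{s_Y}{s_X}.$$

The prediction line with this slope and intercept is referred to as the *least squares regression line*.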
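
The closed-form slope and intercept can be compared with what `lm()` returns. This is a sketch under stated assumptions: the built-in `mtcars` data stands in for the automobile data, and the helper `D` is a hypothetical name for the total squared prediction error described above.

```r
# Illustration only: mtcars (built into R) stands in for the course's fuel data.
x <- mtcars$wt
y <- mtcars$mpg

# Slope and intercept from the closed-form least squares solution
r <- cor(x, y)
b <- r * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# Total squared prediction error for a line with intercept a0 and slope b0
D <- function(a0, b0) sum((y - a0 - b0 * x)^2)

# The same line fitted by R's linear model function
fit <- lm(mpg ~ wt, data = mtcars)
print(c(intercept = a, slope = b))
print(coef(fit))

# Moving away from (a, b) in either direction increases the distance
print(c(best = D(a, b), shifted = D(a + 0.1, b), tilted = D(a, b + 0.1)))
```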

The next question that can be asked related to this prediction problem is how
well does the prediction line predict? We can't answer that question completely
yet because the full answer requires inference tools that we have not yet
covered, but we can give a descriptive answer to this question. The distance
measure, *D(a,b)*, evaluated at the least squares regression line, is
proportional to the variance of the prediction errors.
One way of describing how well the prediction line performs is to compare it to
the best prediction we could obtain without using the *X* values to
predict. In that case, our predictor would be a single number. We have already
seen that the closest single number to a dataset is the mean of the data, so in
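this case, the best predictor based only on the *Y* values is
$\bar{Y}$. This corresponds to a horizontal line with intercept
$\bar{Y}$ and slope 0, and so the distance between this line and the data is
$D(\bar{Y},0) = \sum_{i=1}^{n}(Y_i-\bar{Y})^2$. This quantity represents the error variance for the best
predictor that does not make use of the *X* values, and so the
difference,

$$D(\bar{Y},0) - D(a,b),$$

represents the reduction in error variance (improvement in prediction) that results from use of the *X* values. If we express this reduction as a proportion of $D(\bar{Y},0)$,

$$R^2 = \frac{D(\bar{Y},0) - D(a,b)}{D(\bar{Y},0)},$$

then this is the proportion of the error variance that can be removed if we use the least squares regression line to predict as opposed to simply using the mean of the *Y* values. This quantity is called R-squared, and for the least squares regression line it equals $r^2$, the square of the correlation coefficient.

R-squared also can be interpreted as the proportion of variability in the *Y* variable that is accounted for by its linear relationship with *X*.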

In the automobile example, the correlation between Weight and Mileage was
*r = -0.848*, and so $r^2 = 0.719$. If we use the regression line to
predict Mileage based on Weight, we can remove 71.9% of the variance of the
Mileage data. Another way of expressing this
is to ask: Why don't all cars have the same mileage? Part of the answer to that
question is that cars don't all weigh the same and there is a fairly strong
linear relationship between weight and mileage that accounts for 71.9% of the
variability in mileage. This leaves 28.1% of this variability that is related
to other factors, including the possibility of a non-linear relationship
between Mileage and Weight.
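
The reduction-in-error interpretation of R-squared can be verified directly. The sketch below assumes the built-in `mtcars` data as a stand-in for the automobile data. The names `D.mean` and `D.line` are illustrative labels for the distances from the data to the mean line and to the regression line.

```r
# Illustration only: mtcars (built into R) stands in for the course's fuel data.
x <- mtcars$wt
y <- mtcars$mpg
fit <- lm(y ~ x)

# Distance from the data to the horizontal line at mean(y)
D.mean <- sum((y - mean(y))^2)

# Distance from the data to the least squares regression line
D.line <- sum(residuals(fit)^2)

# Proportion of the error removed by using x to predict y
r.squared <- (D.mean - D.line) / D.mean

# Three routes to the same number
print(c(ratio = r.squared,
        cor.squared = cor(x, y)^2,
        lm.summary = summary(fit)$r.squared))

# The value quoted in the text for the automobile data
r2.text <- round((-0.848)^2, 3)
print(r2.text)
```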

To help judge the adequacy of a linear regression fit, we can plot the residuals vs the predictor variable *X*. The residuals are the prediction errors, $e_i = Y_i - \hat{Y}_i$, $i = 1,\ldots,n$. If a linear fit is reasonable, then the residuals should have no discernible relationship with *X* and should be essentially noise. This plot for a linear fit to predict Mileage based on Weight is shown below.

This shows that the residuals are still related to Weight, so a linear fit is not adequate. Note that removal of the linear component of the relationship between weight and mileage, as represented by the residuals from a linear fit, does a better job of revealing this non-linearity than a scatterplot of these variables. This will be discussed in greater detail later.
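
One reason a residual plot is useful is that least squares residuals always average to zero and have zero correlation with the predictor, so any pattern that remains must be non-linear. The sketch below illustrates this with the built-in `mtcars` data as a stand-in for the automobile data. The `lowess()` smooth is an added visual aid, not part of the original analysis.

```r
# Illustration only: mtcars (built into R) stands in for the course's fuel data.
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Both of these are zero up to rounding error for any least squares fit
res.mean <- mean(res)
res.cor <- cor(res, mtcars$wt)
print(c(mean = res.mean, cor.with.predictor = res.cor))

# A smooth curve through the residuals helps reveal leftover curvature
smooth <- lowess(mtcars$wt, res)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(mtcars$wt, res, pch = 19, xlab = "wt", ylab = "Residuals")
  abline(h = 0, col = "red")
  lines(smooth, col = "blue")
}
```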

Now suppose we only wish to consider cars whose engine displacements are less than 225. We can define a logical expression that represents such cars and use that to subset the fuel data frame:

    ndx = fuel.frame$Disp < 225
    fuel1 = fuel.frame[ndx,]

Then we can use the subsetted data frame, fuel1, to plot the data, fit the regression line, and plot the residuals:

    # Scatterplot with the least squares regression line
    plot(Mileage ~ Weight, data=fuel1, pch=19)
    title("Scatterplot of Weight vs Mileage")
    Disp.lm = lm(Mileage ~ Weight, data=fuel1)
    Disp.coef = coef(Disp.lm)
    abline(Disp.coef, col="red")

    # Residuals vs the predictor variable
    plot(residuals(Disp.lm) ~ Weight, data=fuel1, pch=19, ylab="Residuals")
    abline(h=0, col="red")
    title("Residuals vs Weight\nData = fuel1")

The ideal situation is that the only thing left after we remove the linear relationship from the response variable, Mileage, is noise.

    # qqnorm plot
    qqnorm(residuals(Disp.lm), pch=19)
    qqline(residuals(Disp.lm), col="red")

It is important to remember that correlation is a mathematical concept that
says nothing about causation. The presence of a strong correlation between
two variables indicates that there *may* be a causal relationship,
but does not prove that one exists, nor does it indicate the direction of any
causality. **Read pages 266-7 in the textbook for a more thorough
discussion of this and related issues**.

The **R** code to generate the graphics in this section can be found
at:

http://www.utdallas.edu/~ammann/stat3355scripts/NumericGraphics.r

2013-12-17