The automobile dataset given above includes both Weight and Mileage for 60 automobiles. In addition to describing location and dispersion for each variable separately, we may also be interested in what kind of relationship exists between these variables. The following figure shows a scatterplot of these variables with their respective means superimposed. For a high percentage of cars, those with above-average Weight have below-average Mileage, and those with below-average Weight have above-average Mileage. This is an example of a decreasing relationship: most of the data points fall in the upper-left/lower-right quadrants. In an increasing relationship, most of the points fall in the lower-left/upper-right quadrants.
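This quadrant behavior can be sketched with made-up data; the numbers and coefficients below are hypothetical stand-ins for Weight and Mileage, not the automobile dataset itself:

```r
# Synthetic sketch (hypothetical numbers, not the 60-car dataset): for a
# decreasing relationship, most points land in the upper-left or
# lower-right quadrant formed by the two mean lines.
set.seed(2)
x <- runif(60, 1700, 3900)                # stand-in for Weight
y <- 40 - 0.006 * x + rnorm(60, sd = 1)   # stand-in for Mileage
upper.left  <- x < mean(x) & y > mean(y)
lower.right <- x > mean(x) & y < mean(y)
mean(upper.left | lower.right)  # proportion in the two "decreasing" quadrants
plot(x, y, pch = 19)
abline(v = mean(x), h = mean(y), col = "blue")  # means superimposed
```

For data with a strong decreasing relationship, this proportion is close to 1.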
We can derive a measure of association for two variables by considering the
deviations of the data values from their respective means. Note that the
product of deviations for a data point in the lower left or upper right quadrants
is positive and the product of deviations for a data point in the upper left or lower
right quadrants is negative. Therefore, most of these products for variables with a
strong increasing relationship will be positive, and most of these products for
variables with a strong decreasing relationship will be negative. This implies that
the sum of these products will be a large positive number for variables that have
a strong increasing relationship, and the sum will be a large negative number for
variables that have a strong decreasing relationship. This is the motivation for using the sum of products of deviations as the basis for a measure of association. Standardizing this sum gives the sample correlation coefficient, r = [sum of (Xi - Xbar)(Yi - Ybar)] / [(n - 1)*sX*sY], where sX and sY are the sample standard deviations of the two variables.
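As a small numerical check (toy numbers, not the automobile data), the products of deviations for a decreasing dataset are indeed all negative here, and standardizing their sum reproduces R's built-in cor():

```r
# Toy data with a decreasing relationship
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 9, 7, 6, 4, 2)
dx <- x - mean(x)
dy <- y - mean(y)
prods <- dx * dy
prods        # every product is negative for these data
sum(prods)   # a large negative number: strong decreasing relationship
# standardizing the sum gives the sample correlation coefficient
n <- length(x)
r <- sum(prods) / ((n - 1) * sd(x) * sd(y))
r            # matches cor(x, y)
```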
If the correlation coefficient is close to 1, then the variables have a strong increasing relationship, and if it is close to -1, then the variables have a strong decreasing relationship. If the correlation is exactly 1 or -1, then the data fall exactly on a straight line. The correlation coefficient is limited in that it only measures linear association. A correlation coefficient close to 0 indicates that there is little or no linear relationship; there may still be a strong relationship in this case, just not a linear one. Furthermore, even when |r| is large, the correlation may understate the strength of the relationship if it is non-linear.
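The warning about non-linear relationships can be illustrated with a constructed example: below, y is an exact function of x, yet the correlation is essentially zero because the relationship is quadratic rather than linear.

```r
# y depends exactly on x, but the relationship is quadratic, not linear
x <- seq(-3, 3, length.out = 61)
y <- x^2
cor(x, y)   # essentially 0: no *linear* relationship
```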
The correlation coefficient between Weight and Mileage is -0.848. This is a fairly large negative number, so there is a fairly strong linear, decreasing relationship between Weight and Mileage. This is confirmed by the scatterplot. Since these variables are so strongly related, we can ask how well we can predict Mileage just by knowing the Weight of a vehicle. To answer this question, we first define a measure of distance between a dataset and a line.
Suppose we have measured two variables for each individual in a sample, denoted (X1, Y1), ..., (Xn, Yn), and we wish to predict the value of Y given the value of X for a particular individual using a straight line for the prediction. A reasonable approach would be to use the line that comes closest to the data for this prediction. Let Y = a + bX denote the equation of a prediction line, and let Yhat_i = a + b*Xi denote the predicted value of Y for Xi. The difference between an actual and predicted Y-value, Yi - Yhat_i, represents the error of prediction for that data point. We define the distance between a prediction line and a point in the dataset to be the square of the prediction error for that observation. The total distance between the actual and predicted Y-values is then the sum of the squared errors which, when the errors average to zero, is the variance of the prediction errors multiplied by n - 1. Since the predicted values, and hence the errors, depend on the slope and intercept of the prediction line, we can express this total distance as a function of the slope and intercept:

D(a, b) = (Y1 - a - b*X1)^2 + ... + (Yn - a - b*Xn)^2.

The least squares regression line is the line whose slope and intercept minimize D(a, b); its slope is b = r*(sY/sX) and its intercept is a = Ybar - b*Xbar.
The next question that can be asked about this prediction problem is: how well does the prediction line predict? We can't answer that question completely yet, because the full answer requires inference tools that we have not yet covered, but we can give a descriptive answer. The distance measure D(a, b) is the sum of the squared prediction errors, and so is proportional to the variance of those errors.
One way of describing how well the prediction line performs is to compare it to the best prediction we could obtain without using the X values to predict. In that case, our predictor would be a single number. We have already seen that the closest single number to a dataset is the mean of the data, so in this case, the best predictor based only on the Y values is Ybar. This corresponds to a horizontal line with intercept Ybar and slope 0, and so the distance between this line and the data is D(Ybar, 0) = (Y1 - Ybar)^2 + ... + (Yn - Ybar)^2. This quantity represents the total squared error for the best predictor that does not make use of the X values, and so the proportion of this error removed by using the least squares line (that is, D(a, b) evaluated at the least squares slope and intercept) is 1 - D(a, b)/D(Ybar, 0), which equals r^2, the square of the correlation coefficient.
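These ideas can be checked numerically with made-up data (the variables below are hypothetical stand-ins, not the automobile dataset): the least squares line from lm() minimizes D(a, b), the mean alone gives D(Ybar, 0), and one minus their ratio equals r squared.

```r
# Hedged sketch with synthetic data; the coefficients are invented.
set.seed(1)
x <- runif(40, 2000, 4000)               # hypothetical "Weight"
y <- 45 - 0.008 * x + rnorm(40, sd = 2)  # hypothetical "Mileage"
D <- function(a, b) sum((y - (a + b * x))^2)  # total squared prediction error
fit <- lm(y ~ x)                         # least squares: minimizes D(a, b)
cf  <- coef(fit)
SSE <- D(cf[[1]], cf[[2]])               # error using the regression line
SST <- D(mean(y), 0)                     # error using Ybar alone
1 - SSE / SST                            # proportion of error removed...
cor(x, y)^2                              # ...equals r^2
```

Perturbing the fitted slope or intercept makes D(a, b) larger, which is one way to see that lm() has found the minimizing line.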
In the automobile example, the correlation between Weight and Mileage was r = -0.848, and so r^2 = (-0.848)^2 = 0.719. If we use the regression line to predict Mileage based on Weight, we remove 71.9% of the variance of the Mileage data. Another way of expressing this is to ask: why don't all cars have the same mileage? Part of the answer is that cars don't all weigh the same, and there is a fairly strong linear relationship between Weight and Mileage that accounts for 71.9% of the variability in Mileage. This leaves 28.1% of this variability related to other factors, including the possibility of a non-linear relationship between Mileage and Weight.
To help judge the adequacy of a linear regression fit, we can plot the residuals versus the predictor variable X. The residuals are the prediction errors, ei = Yi - Yhat_i, i = 1, ..., n. If a linear fit is reasonable, then the residuals should have no discernible relationship with X and should be essentially noise. This plot for a linear fit to predict Mileage based on Weight is shown below.
This shows that the residuals are still related to Weight, so a linear fit is not adequate. Note that removing the linear component of the relationship between Weight and Mileage, as represented by the residuals from a linear fit, does a better job of revealing this non-linearity than a scatterplot of these variables. This will be discussed in greater detail later.
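The point that residuals reveal non-linearity better than the scatterplot can be sketched with a constructed example: the scatterplot of y versus x below looks strongly linear (r near 1), but the residuals from the linear fit isolate the curvature.

```r
# Constructed example: a mostly linear relationship with a small quadratic bend
x <- seq(-1, 1, length.out = 101)
y <- x + 0.3 * x^2
cor(x, y)                   # close to 1: the scatterplot looks linear
fit <- lm(y ~ x)
r.lin <- residuals(fit)
cor(x, r.lin)               # ~0: the linear component has been removed
cor(x^2, r.lin)             # ~1: the residuals expose the curvature
plot(x, r.lin, pch = 19)    # clear U-shaped pattern in the residual plot
abline(h = 0, col = "red")
```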
Now suppose we only wish to consider cars whose engine displacements are no more than 225. We can define a logical expression that represents such cars and use that to subset the fuel data frame:
ndx = fuel.frame$Disp < 225
fuel1 = fuel.frame[ndx, ]

Then we can use the fuel1 data frame to plot Mileage versus Weight and to fit a linear regression model.
plot(Mileage ~ Weight, data = fuel1, pch = 19)
title("Scatterplot of Mileage vs Weight")
Disp.lm = lm(Mileage ~ Weight, data = fuel1)
Disp.coef = coef(Disp.lm)
abline(Disp.coef, col = "red")
plot(residuals(Disp.lm) ~ Weight, data = fuel1, pch = 19, ylab = "Residuals")
abline(h = 0, col = "red")
title("Residuals vs Weight\nData = fuel1")
The ideal situation is that the only thing left after we remove the linear relationship from the response variable, Mileage, is noise.
# qqnorm plot of the residuals
qqnorm(residuals(Disp.lm), pch = 19)
qqline(residuals(Disp.lm), col = "red")
It is important to remember that correlation is a mathematical concept that says nothing about causation. The presence of a strong correlation between two variables indicates that there may be a causal relationship, but does not prove that one exists, nor does it indicate the direction of any causality.
The R code to generate the graphics in this section can be found
An example using the crabs data can be found at: