
Measures of Association

The Cars dataset given above includes both Weight and Mileage of 60 automobiles. In addition to describing location and dispersion for each variable separately, we also may be interested in what kind of relationship exists between these variables. The following figure represents a scatterplot of these variables with the respective means superimposed. This shows that for a high percentage of cars, those with above average Weight tend to have below average Mileage, and those with below average Weight have above average Mileage. This is an example of a decreasing relationship, and most of the data points in the plot fall in the upper left/lower right quadrants. In an increasing relationship, most of the points will fall in the lower left/upper right quadrants.

Image stat3355num6

Image stat3355num7

We can derive a measure of association for two variables by considering the deviations of the data values from their respective means. Note that the product of deviations for a data point in the lower left or upper right quadrants is positive and the product of deviations for a data point in the upper left or lower right quadrants is negative. Therefore, most of these products for variables with a strong increasing relationship will be positive, and most of these products for variables with a strong decreasing relationship will be negative. This implies that the sum of these products will be a large positive number for variables that have a strong increasing relationship, and the sum will be a large negative number for variables that have a strong decreasing relationship. This is the motivation for using

$\displaystyle r = \frac{\frac{1}{N}\sum_{i=1}^N(X_i-\mu_x)(Y_i-\mu_y)}{\sigma_x \sigma_y}$

as a measure of association between two variables. This quantity is called the correlation coefficient. The denominator of r is a scale factor that makes the correlation coefficient dimensionless and ensures that $ 0\le \vert r\vert \le 1$ . Note that this is defined in terms of the population parameters $ \mu_x,\sigma_x,\mu_y,\sigma_y$ . The equivalent expression for a sample is

$\displaystyle r = \frac{\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})(Y_i-\overline{Y})}{s_xs_y}.$
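The sample formula can be checked numerically. The sketch below is only an illustration: it uses R's built-in mtcars data (columns wt and mpg) as a stand-in, which is an assumption, since the Cars file used in this section is not reproduced here. It also computes the share of negative deviation products that motivates the formula.

```r
# Stand-in data (assumption): R's built-in mtcars, not the Cars file from the text.
x <- mtcars$wt    # weight, in units of 1000 lb
y <- mtcars$mpg   # miles per gallon
n <- length(x)

# Products of deviations from the respective means.
# For a decreasing relationship most of these are negative.
dev_prod <- (x - mean(x)) * (y - mean(y))
prop_negative <- mean(dev_prod < 0)

# Sample correlation: sum of the products scaled by (n - 1) * s_x * s_y.
r_manual <- sum(dev_prod) / ((n - 1) * sd(x) * sd(y))

# Built-in function for comparison.
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
print(prop_negative)
```

The two values of r agree up to rounding error, and for these stand-in data the correlation is about -0.87.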

If the correlation coefficient is close to 1, then the variables have a strong increasing relationship, and if the correlation coefficient is close to -1, then the variables have a strong decreasing relationship. If the correlation is exactly 1 or -1, then the data must fall exactly on a straight line. The correlation coefficient is limited in that it only measures the strength of linear relationships. A correlation coefficient close to 0 indicates that there is no linear relationship. There may still be a strong relationship in this case, just not a linear one. Furthermore, if the relationship is non-linear, the correlation may understate its strength even when r is large.
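A small made-up example illustrates both extremes. The data below are hypothetical and chosen only for illustration: one variable is an exact linear function of x, and the other is an exact but non-linear (quadratic) function of x on a symmetric range.

```r
# Hypothetical data, symmetric about 0.
x <- -5:5

# Exact decreasing straight line: correlation is -1.
y_line <- 3 - 2 * x
r_line <- cor(x, y_line)

# Exact quadratic relationship: y is completely determined by x,
# yet there is no linear component, so the correlation is 0.
y_quad <- x^2
r_quad <- cor(x, y_quad)

print(c(line = r_line, quadratic = r_quad))
```

The quadratic case shows why a correlation near 0 rules out only a linear relationship, not a relationship of any kind.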

The correlation coefficient between Weight and Mileage is -0.848. This is a fairly large negative number, and so there is a fairly strong linear, decreasing relationship between Weight and Mileage. This is confirmed by the scatterplot. Since these variables are so strongly related, we can ask how well we can predict Mileage just by knowing the Weight of a vehicle. To answer this question, we first define a measure of distance between a dataset and a line.

Suppose we have measured two variables for each individual in a sample, denoted by $ \{(X_1,Y_1),\cdots,(X_n,Y_n)\}$ , and we wish to predict the value of Y given the value of X for a particular individual using a straight line for the prediction. A reasonable approach would be to use the line that comes closest to the data for this prediction. Let $ Y=a+bX$ denote the equation of a prediction line, and let $ \hat{Y}_i=a+bX_i$ denote the predicted value of Y for $ X_i$ . The difference between an actual and predicted Y-value represents the error of prediction for that data point. We define the distance between a prediction line and a point in the dataset to be the square of the prediction error for that observation. The total distance between the actual and predicted Y-values is then the sum of the squared errors, which is $ n$ times the mean squared prediction error. Since the predicted values, and hence the errors, depend on the slope and intercept of the prediction line, we can express this total distance by

$\displaystyle D(a,b) = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - a - bX_i)^2.$
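As a concrete illustration, the distance can be written as a short R function and evaluated for different candidate lines. The function name D mirrors the notation above, and the five data points are hypothetical values made up for this sketch.

```r
# Total squared distance between the line Y = a + b*X and the data.
D <- function(a, b, x, y) {
  sum((y - a - b * x)^2)
}

# Hypothetical data lying close to the line Y = 2X.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Two candidate prediction lines.
d_close <- D(0, 2, x, y)    # line Y = 2X
d_far   <- D(1, 1.5, x, y)  # line Y = 1 + 1.5X

print(c(close = d_close, far = d_far))
```

The line Y = 2X gives a much smaller total distance (0.11) than Y = 1 + 1.5X (3.86), so by this definition it is closer to the data.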

Our goal now is to find the line that is closest to the data using this definition of distance. This line has slope and intercept that minimize $ D(a,b)$ . We can use differential calculus to find the minimum.

$\displaystyle \frac{\partial}{\partial a}D(a,b) = -2\sum_{i=1}^n (Y_i - a - bX_i),$

$\displaystyle \frac{\partial}{\partial b}D(a,b) = -2\sum_{i=1}^n X_i(Y_i - a - bX_i).$

Setting these equal to 0 gives the system of equations

$\displaystyle 0 = \sum_{i=1}^n (Y_i - a - bX_i) = n(\overline{Y} - b\overline{X} - a),$

$\displaystyle 0 = \sum_{i=1}^n X_iY_i - na\overline{X} - b\sum_{i=1}^n X_i^2.$


The first equation gives

$\displaystyle a = \overline{Y} - b\overline{X},$

and, after substituting for $ a$ in the second equation and solving for $ b$ ,

$\displaystyle b = \frac{\sum_{i=1}^n X_iY_i - n\overline{X}\overline{Y}}{\sum_{i=1}^n X_i^2 - n\overline{X}^2}.$

It can be shown that the numerator equals $ (n-1)rs_xs_y$ and the denominator equals $ (n-1)s_x^2$ . Hence,

$\displaystyle b = r\frac{s_y}{s_x},\ \ a = \overline{Y}-b\overline{X}.$
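The two identities used to simplify the slope follow from expanding the sums of centered products. A brief sketch of that step, using only the symbols already defined above:

```latex
\begin{align*}
\sum_{i=1}^n (X_i-\overline{X})(Y_i-\overline{Y})
  &= \sum_{i=1}^n X_iY_i - \overline{Y}\sum_{i=1}^n X_i
     - \overline{X}\sum_{i=1}^n Y_i + n\overline{X}\,\overline{Y} \\
  &= \sum_{i=1}^n X_iY_i - n\overline{X}\,\overline{Y}, \\
\sum_{i=1}^n (X_i-\overline{X})^2
  &= \sum_{i=1}^n X_i^2 - n\overline{X}^2 .
\end{align*}
```

By the sample definition of r, the first left-hand side equals $ (n-1)rs_xs_y$ , and by the definition of the sample variance, the second equals $ (n-1)s_x^2$ .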

The prediction line, referred to as the least squares regression line, is then

$\displaystyle \hat{Y} = a + bX.$
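The closed-form slope and intercept can be compared with the output of R's lm() function. This is a minimal sketch that assumes the built-in mtcars data (wt and mpg) as a stand-in for the Cars data used in this section.

```r
# Stand-in data (assumption): built-in mtcars rather than the Cars file.
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least squares coefficients.
b <- cor(x, y) * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# The same fit obtained from lm().
fit <- lm(y ~ x)
print(c(intercept = a, slope = b))
print(coef(fit))

# The two equations set to 0 in the derivation hold at the solution:
# the errors sum to 0, and so do the errors weighted by X.
e <- y - (a + b * x)
print(c(sum_e = sum(e), sum_xe = sum(x * e)))
```

Both approaches give the same line, and the two sums are 0 up to rounding error, which is exactly the system of equations solved above.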

Image stat3355num8

To help judge the adequacy of a linear regression fit, we can plot the residuals vs the predictor variable $ X$ . The residuals are the prediction errors, $ e_i = Y_i - \hat{Y}_i$ , $ 1\le i\le n$ . If a linear fit is reasonable, then the residuals should have no discernible relationship with $ X$ and should be essentially noise. This plot for a linear fit to predict Mileage based on Weight is shown below.

Image stat3355num10

This shows that the residuals are still related to Weight, so a linear fit is not adequate. Note that removal of the linear component of the relationship between weight and mileage, as represented by the residuals from a linear fit, does a better job of revealing this non-linearity than a scatterplot of these variables. This will be discussed in greater detail later.
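The way residuals expose curvature can be sketched with simulated data. Everything in the example below is hypothetical: the data are generated from a mildly curved decreasing relationship with a little noise, so the correlation between x and y is still very strong, yet the residuals from the straight-line fit retain a clear pattern.

```r
set.seed(1)  # fixed seed so the simulated data are reproducible

# Hypothetical data: decreasing, slightly curved, plus small noise.
x <- 1:20
y <- 40 - 3 * x + 0.05 * x^2 + rnorm(length(x), sd = 0.3)

fit <- lm(y ~ x)
e <- residuals(fit)

# Strong linear correlation between x and y ...
r_xy <- cor(x, y)

# ... and the residuals are uncorrelated with x by construction,
r_ex <- cor(e, x)

# ... but they still track the curvature left over after the linear fit.
r_ecurve <- cor(e, (x - mean(x))^2)

print(c(r_xy = r_xy, r_ex = r_ex, r_ecurve = r_ecurve))

# Residual plot (drawn only in an interactive session).
if (interactive()) plot(x, e, pch = 20, ylab = "Residuals")
```

Least squares residuals always have zero correlation with the predictor, so the check for adequacy is whether any other pattern, such as the U shape here, remains in the residual plot.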

Now suppose we only wish to consider cars that are not Vans. We can define a logical expression that represents such cars and use that to subset the Cars data frame:

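# Read the Cars data; the first column supplies the row names
Cars = read.table("",header=TRUE,sep=",",row.names=1)
# Logical index that is TRUE for every car that is not a Van
ndx = Cars$Type != "Van"
Cars1 = Cars[ndx,]

Then we can use the Cars1 data frame to plot Mileage versus Weight and to fit a linear regression model.

# Scatterplot with Mileage on the vertical axis and Weight on the horizontal axis
plot(Mileage ~ Weight,data=Cars1,pch=20)
title("Scatterplot of Mileage vs Weight")
# Least squares fit of Mileage on Weight
Disp.lm = lm(Mileage~Weight,data=Cars1)
Disp.coef = coef(Disp.lm)
# Residuals from the fit plotted against the predictor
plot(residuals(Disp.lm) ~ Weight,data=Cars1,pch=20,ylab="Residuals")
title("Residuals vs Weight\nData = Cars1")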

It is important to remember that correlation is a mathematical concept that says nothing about causation. The presence of a strong correlation between two variables indicates that there may be a causal relationship, but does not prove that one exists, nor does it indicate the direction of any causality.

The next question related to this prediction problem is: how well does the prediction line predict? We can't answer that question completely yet because the full answer requires inference tools that we have not yet covered, but we can give a descriptive answer. The distance measure, $ D(a,b)$ , is the sum of squared prediction errors, which is proportional to the variance of the prediction errors. One way of describing how well the prediction line performs is to compare it to the best prediction we could obtain without using the X values to predict. In that case, our predictor would be a single number. We have already seen that the closest single number to a dataset is the mean of the data, so in this case, the best predictor based only on the Y values is $ \overline{Y}$ . This corresponds to a horizontal line with intercept $ \overline{Y}$ , and so the distance between this line and the data is $ D(\overline{Y},0)$ . This quantity is proportional to the error variance for the best predictor that does not make use of the X values, and so the difference,

$\displaystyle D(\overline{Y},0) - D(a,b),$

represents the reduction in error variance (improvement in prediction) that results from use of the X values to predict. If we express this as a percent,

$\displaystyle 100\frac{D(\overline{Y},0) - D(a,b)}{D(\overline{Y},0)},$

then this is the percent of the error variance that can be removed if we use the least squares regression line to predict as opposed to simply using the mean of the Y's. It can be shown that this quantity is equal to the square of the correlation coefficient,

$\displaystyle r^2 = \frac{D(\overline{Y},0) - D(a,b)}{D(\overline{Y},0)}.$

R-squared also can be interpreted as the proportion of variability in the Y-variable that can be explained by the presence of a linear relationship between X and Y.
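The identity between the proportional reduction in squared error and the squared correlation can be verified numerically. The sketch below again assumes the built-in mtcars data as a stand-in, and the helper function D is simply the distance defined earlier written in R.

```r
# Total squared distance between the line Y = a + b*X and the data.
D <- function(a, b, x, y) sum((y - a - b * x)^2)

# Stand-in data (assumption): built-in mtcars rather than the Cars file.
x <- mtcars$wt
y <- mtcars$mpg

# Least squares coefficients.
fit <- lm(y ~ x)
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])

D_mean <- D(mean(y), 0, x, y)  # best predictor that ignores X
D_line <- D(a, b, x, y)        # least squares regression line

# Proportion of squared error removed by using X.
prop_removed <- (D_mean - D_line) / D_mean

print(c(prop_removed = prop_removed,
        r_squared    = cor(x, y)^2,
        lm_r_squared = summary(fit)$r.squared))
```

All three numbers agree, and for these stand-in data about 75% of the variability in mpg is accounted for by the linear relationship with weight.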

The file,
contains weight, city mileage, and highway mileage. A plot of each pair of variables in this data set can be displayed and the corresponding correlation coefficients obtained as follows:

MPG = read.table("",header=TRUE,sep=",",row.names=1)
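The commands that draw the pairwise plots and compute the correlations do not appear above. In base R this is commonly done with pairs() and cor(); whether the original code used exactly these calls is an assumption. The sketch below applies them to three columns of the built-in mtcars data, used here only as a stand-in for the MPG data frame.

```r
# Stand-in data frame (assumption): three numeric columns of mtcars.
dat <- mtcars[, c("wt", "mpg", "hp")]

# Scatterplot of every pair of variables (drawn only in an interactive session).
if (interactive()) pairs(dat, pch = 20)

# Matrix of correlation coefficients for every pair of variables.
cor_mat <- cor(dat)
print(round(cor_mat, 4))
```

With the MPG data frame read as above, the same two calls would be applied to MPG in place of dat.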

The correlation between Weight and MPG.highway is -0.8033, and so r-squared is 0.6453. This implies that the relationship between these variables is decreasing and that 64.53% of the variability in MPG.highway can be explained by the presence of a linear relationship between these variables. In other words, if we use the regression line to predict MPG.highway based on Weight, we remove 64.53% of the variability in MPG.highway. Another way of expressing this is to ask: Why don't all cars have the same mileage? Part of the answer to that question is that cars don't all weigh the same and there is a fairly strong linear relationship between weight and highway mileage that accounts for 64.53% of the variability in mileage. This leaves 35.47% of this variability that is related to other factors, including the possibility of a non-linear relationship between these variables. This reduction in variability can be seen in the following plot. The first plot at upper left is a histogram of the deviations of MPG.highway about its mean. These represent the residuals when we use $ \overline{Y}$ to predict highway mileage. The plot below it is a histogram of the residuals when we use the least squares regression line to predict highway mileage based on weight. The second column of histograms compares the residuals about the mean to the regression residuals when city mileage is used to predict highway mileage.

Image MPG

The R code to generate the graphics in this section can be found at:

An example using the crabs data can be found at:
