
Multicollinearity and Principal Components

Multicollinearity in regression refers to a situation in which a subset of the predictor variables is nearly linearly dependent. This implies that the data matrix is nearly rank-deficient, which creates two problems in regression. The first, and more serious, problem is that near rank-deficiency produces a numerically unstable least squares solution. Suppose, for example, that the first three variables in a data matrix are very close to linearly dependent. That is, there exists a non-zero vector $c$ such that

\begin{displaymath}
c_1X_1 + c_2X_2 + c_3X_3 = \epsilon,
\end{displaymath}

with

\begin{displaymath}
\Vert\epsilon\Vert \approx 0.
\end{displaymath}

This implies that the third diagonal element of $R$ in the QRD of the data matrix will be approximately 0. The backsolve algorithm used to obtain the least squares estimates of the regression coefficients involves division by these diagonal elements, so a near-zero diagonal element makes the solution numerically unstable.
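As a small illustration (a sketch using simulated data, not part of the example below), the following code builds a matrix whose third column is nearly a linear combination of the first two and examines the diagonal of R from its QRD.

set.seed(1)
X1 = rnorm(100)
X2 = rnorm(100)
X3 = 2*X1 - X2 + rnorm(100,sd=1e-6) # nearly a linear combination of X1 and X2
X = cbind(X1,X2,X3)
R = qr.R(qr(X))
print(abs(diag(R))) # third diagonal element is nearly 0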

Another problem occurs even if the multicollinearity is not severe enough to cause numerical instability. Suppose the first two variables are highly correlated. This implies that if we know the value of the first variable, we can obtain a fairly accurate estimate of the value of the second variable. However, it is still possible that the second variable would be selected by a variable selection algorithm in spite of this strong correlation. Recall that in the forward stepwise algorithm, after a variable is selected for the model, the remaining variables are projected onto the orthogonal complement of the range of the variables in the current model. That corresponds to fitting regression models to predict each of the remaining variables from the variables in the current model and then extracting the residuals from those fits. This removes all of the linear relationships the remaining variables have with the variables in the current model. However, some of those residuals may have sufficiently strong linear relationships with the response variable to justify adding the second variable even though it is highly correlated with the first variable selected. The problem occurs when we attempt to interpret the model coefficients.

Ordinarily we interpret regression coefficients separately as slopes. Suppose, for example, the coefficient for the first variable is 1.5 and the correlation between the first and second variables is 0.9. Then we would say that the response variable increases by an average of 1.5 for every unit increase in the first variable, with the other variables held fixed. We could interpret the coefficient of the second variable similarly, but the strong correlation between the two variables implies that the data would not contain observations with high values for the first variable and low values for the second, or vice versa, so one variable cannot change while the other is held fixed. Therefore, we cannot interpret the coefficients of those variables separately.

Example. The sleep data set contains body weight, brain weight, sleep time, and lifespan of 54 mammals. The following code reads this data and constructs a pairs plot with homo sapiens colored red.

Sleep = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/Sleep.data",
  header=TRUE,row.names=1)
n = dim(Sleep)[1]
sleep.color = rep("black",n)
hndx = match("homo sapiens",dimnames(Sleep)[[1]])
sleep.color[hndx] = "red"
par(oma=c(0,0,2,0))
pairs(Sleep,pch=19,col=sleep.color)
mtext("Pairs plot of sleep data",outer=TRUE,line=0.5,cex=1.2)

The compressed small values and stretched-out large values for all of these variables except Sleep indicate that those variables have exponential relationships, so we create a new data frame that contains log-transformed values.

library(MASS)
logSleep = Sleep
logSleep$BodyWgt = log(Sleep$BodyWgt)
logSleep$BrainWgt = log(Sleep$BrainWgt)
logSleep$LifeSpan = log(Sleep$LifeSpan)
pairs(logSleep,pch=19,col=sleep.color)
mtext("Pairs plot of logSleep",outer=TRUE,line=0.5,cex=1.2)
# that looks reasonable so fit linear model to predict LifeSpan
LifeSpan.lm = lm(LifeSpan ~ .,data=logSleep)
plot(LifeSpan.lm,main="Full model: data=logSleep",pch=19,col=sleep.color)
LifeSpan.step = step(LifeSpan.lm,k=log(n)) #backward stepwise selection with BIC
print(summary(LifeSpan.step))
plot(LifeSpan.step,main="Reduced model: data=logSleep",pch=19,col=sleep.color)
#forward stepwise requires the scope argument and starts with the null model
LifeSpan.lm0 = lm(LifeSpan ~ 1,data=logSleep)
LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward",
   scope= ~BodyWgt+BrainWgt+Sleep,k=log(n))
#note that the upper formula cannot use . as a shortcut
print(summary(LifeSpan.forward))

R usage. When performing forward stepwise selection, the scope argument must be included. If you wish to start with the intercept only model as above, then that model is used as the object argument and the scope argument only needs to give the formula for the upper model. If the upper model contains more than a few variables, this formula can be constructed using paste and as.formula as shown below.

ndx = match("LifeSpan",names(logSleep)) #get index for response variable
#don't include response in formula!
upper.form = as.formula(paste("~",paste(names(logSleep)[-ndx],collapse="+")))
LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward",
   scope= upper.form,k=log(n))

In this case the model selected by forward stepwise is the same as the model obtained by backward selection, but that may not always be the case. We use forward stepwise when we wish only to add important variables, and we use backward stepwise when we want to start with the full model and then remove unimportant variables.

The correlation between log(BodyWgt) and log(BrainWgt) is 0.9587, so 91.9% of the variability in log(BodyWgt) can be explained by its linear relationship with log(BrainWgt). The correlation between the residuals of that fit and log(LifeSpan) is -0.1765. This correlation is referred to as a partial correlation, and here it is sufficiently strong to overcome the BIC penalty when log(BodyWgt) is added to a model that already contains log(BrainWgt).
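These quantities can be computed directly from the logSleep data frame constructed above (a quick check, not part of the original script):

r = cor(logSleep$BodyWgt,logSleep$BrainWgt) # correlation of the two log-transformed weights
r^2 # proportion of variability explained
body.res = residuals(lm(BodyWgt ~ BrainWgt,data=logSleep)) # remove the linear relationship with BrainWgt
cor(body.res,logSleep$LifeSpan) # partial correlation with LifeSpan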

The correlations among these three variables are all positive,

         BodyWgt BrainWgt LifeSpan
BodyWgt   1.0000   0.9587   0.7070
BrainWgt  0.9587   1.0000   0.7899
LifeSpan  0.7070   0.7899   1.0000

However, the coefficient for BodyWgt is negative. This illustrates that correlation coefficients do not tell the whole story about relationships among more than two variables.
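Both the correlation matrix above and the sign of the BodyWgt coefficient can be checked from the objects created earlier (assuming LifeSpan.step was fit as shown above):

round(cor(logSleep[,c("BodyWgt","BrainWgt","LifeSpan")]),4) # pairwise correlations
coef(LifeSpan.step) # the BodyWgt coefficient is negative despite the positive correlations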

The residual plots showed that the two brown bats were high-residual/high-leverage observations. Bats have highly developed brain functions for processing their high-frequency sonar signals, and their bodies must be extremely light to enable flight, so it is reasonable to remove them from the data. The Echidna is a monotreme, so it will be removed as well.

bndx = grep("brown bat",dimnames(logSleep)[[1]])
bndx = c(bndx,match("Echidna",dimnames(logSleep)[[1]]))
logSleep1 = logSleep[-bndx,]
n1 = dim(logSleep1)[[1]]
sleep.color1 = sleep.color[-bndx]
LifeSpan1.lm = lm(LifeSpan ~ .,data=logSleep1)
plot(LifeSpan1.lm,main="Full model: data=logSleep with outliers removed",
    col=sleep.color1,pch=19)
LifeSpan1.step = step(LifeSpan1.lm,k=log(n1))
plot(LifeSpan1.step,main="Reduced model: data=logSleep with outliers removed",
    col=sleep.color1,pch=19)
print(summary(LifeSpan1.step))

If the multicollinearity is severe, we can consider principal components regression. This involves obtaining the SVD of the centered data matrix and using the matrix of right singular vectors (eigenvectors of the covariance matrix) to rotate the centered data matrix. Note that these new variables are uncorrelated and are referred to as principal components. If any of the singular values are close to 0, we can reduce dimensions by removing the corresponding principal components and then use the remaining PCs as predictor variables in regression to predict the response variable. This is equivalent to using the Moore-Penrose generalized inverse of $X^TX$ to obtain the regression estimates instead of the ordinary inverse.
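To sketch the algebra behind this equivalence, write the SVD of the centered data matrix as

\begin{displaymath}
X = UDV^T,
\end{displaymath}

so the principal components are the columns of $Z = XV = UD$. If the last few singular values are near 0, drop the corresponding columns to obtain $Z_k = U_kD_k$ and regress the (centered) response $y$ on these components. Rotating the resulting slope coefficients back to the original variables gives

\begin{displaymath}
\hat{\beta}_{PC} = V_kD_k^{-1}U_k^Ty,
\end{displaymath}

which is exactly what the Moore-Penrose generalized inverse of $X^TX$ produces when the dropped singular values are treated as 0.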

Example. The MASS library contains a data set named Boston. This data set contains 506 observations on 14 variables related to housing prices, one of which is categorical. The last variable, medv, gives the median value of owner-occupied homes and will be treated as the response variable. The script below performs some initial pre-processing to remove categorical and nearly categorical variables, applies a log-transformation to crim, and rescales some variables. An initial fit showed that rm has a quadratic relationship with the response, so $rm^2$ was added to the data matrix. The initial fit also revealed some outliers, which were then removed. The final fit was obtained and backward stepwise selection was performed using the BIC selection criterion.

Next, principal components regression was applied to this data. The details of this process are shown here, but most of it is done internally by the function prcomp. The data is centered by subtracting column means, and then the SVD of the centered data matrix is obtained. The matrix of right singular vectors is used to rotate the centered data. Note that the s.d.'s of the principal components are the singular values divided by $\sqrt{n-1}$, where $n$ is the sample size.

library(MASS)
Boston1 = Boston[,-c(2,4,12,14)] #remove zn, chas, black, medv
Boston1[,1] = log(Boston[,1])
Boston1[,"tax"] = Boston1[,"tax"]/20 #rescale to rate per 200,000
Boston1[,"age"] = Boston1[,"age"]/10 #rescale age
Boston1[,"nox"] = Boston1[,"nox"]*10 #rescale nox
medianValue = Boston[,"medv"]
# histogram of variable 1 indicates a log-transform is needed (applied above)
#fit linear model
MV.lm = lm(medianValue ~ .,data = Boston1)
Boston2 = cbind(Boston1,rm2=Boston1[,"rm"]^2) # add quadratic term for rm
MV2.lm = lm(medianValue ~ .,data = Boston2)
bndx = c(365:373) #remove outliers
Boston2a = Boston2[-bndx,]
medianValue2a = medianValue[-bndx]
n = dim(Boston2a)[1]
p = dim(Boston2a)[2]
MV2a.lm = lm(medianValue2a ~ .,data = Boston2a)
summary(MV2a.lm)
b.all = coef(MV2a.lm)[-1]
MV2a.step = stepAIC(MV2a.lm,k=log(n)) # bic backward stepwise
summary(MV2a.step)
b.step = rep(0,p)
names(b.step) = names(Boston2a)
b = coef(MV2a.step)[-1]
b.step[names(b)] = b
### PC regression
Bc = scale(Boston2a,scale=FALSE) # center only
Bc.svd = svd(Bc)
d = Bc.svd$d
V = Bc.svd$v
Bc.cvar = cumsum(d^2)/sum(d^2)
plot(Bc.cvar,type="b",ylab="",xlab="# of Principal Components",ylim=c(0.5,1))
title("Proportion of Total Variance Explained vs Number of PCs")
Bcr = Bc %*% V #rotated data
print(d/sqrt(n-1)) #s.d.'s of PCs
Bc.pc = prcomp(Boston2a) # prcomp gives results equivalent to the computation above
###
#fit principal components regression with all PCs
Bcr1 = as.data.frame(Bcr)
names(Bcr1) = paste("PC",seq(dim(Bcr1)[2]),sep="")
MVpc1.lm = lm(medianValue2a ~ .,data=Bcr1)
MVpc1.step = stepAIC(MVpc1.lm,k=log(n)) # bic backward selection
summary(MVpc1.step)
V1names = names(coef(MVpc1.step)[-1])
vndx = match(V1names,names(Bcr1))
V1 = V[,vndx] #obtain columns selected by stepAIC
b.pcr = as.vector(V1 %*% coef(MVpc1.step)[-1]) #back rotate PCR coefficients
names(b.pcr) = names(Boston2a)
rbind(b.all,b.step,b.pcr) #compare coefficients
sum(residuals(MV2a.lm)^2) # SSE for full model
sum(residuals(MV2a.step)^2) # SSE for reduced model
sum(residuals(MVpc1.step)^2) # SSE for reduced PCR model
# sum of squared differences between models
sum((fitted(MVpc1.step)-fitted(MV2a.lm))^2)
sum((fitted(MVpc1.step)-fitted(MV2a.step))^2)


Larry Ammann
2017-11-01