Multicollinearity in regression refers to a situation in which a subset of the predictor variables
is nearly linearly dependent. This implies that the data matrix is nearly rank-deficient. That
creates two problems in regression. The first, and more serious, problem is that near
rank-deficiency produces numerically unstable least squares solution. Suppose for example that the
first three variables in a data matrix are very close to linearly dependent. That is, there
exists a non-zero vector *c* such that

with

This implies that the 3rd diagonal of

Another problem occurs even if the multicollinearity is not severe enough to cause numerical instability. Suppose the first two variables are highly correlated. This implies that if we know the value of the first variable, we can obtain a fairly accurate estimate of the value of the second variable. However, it still is possible that the second variable would be selected by a variable selection algorithm in spite of this strong correlation. Recall that in the forward stepwise algorithm after a variable is selected for the model, the remaining variables are projected onto the orthogonal complement of the range of variables in the current model. That corresponds to fitting regression models to predict each of the remaining variables from variables in the current model and then extracting the residuals from those fits. This removes all of the linear relationships the remaining variables have with variables in the current model. However, some of those residuals may have sufficiently strong linear relationships with the response variable to justify addition of the second variable even though it is highly correlated with the first variable selected. The problem occurs when we attempt to interpret the model coefficients.

Ordinarily we intrepret regression coefficients separately as slopes. Suppose for example the
coefficient for the first variable is *1.5* and the correlation between the first and second
variables is *0.9*. Then we would say that the response variable increases an average of
*1.5* for every unit increase in the first variable. We could interpret the second variable
similarly, but the strong correlation between the variables implies that the data would not have
observations with high values for the first variable and low values for the second, and vice versa.
Therefore, we cannot interpret coefficients of those variables separately.

**Example**. The sleep data set contains body weight, brain weight, sleep time, and lifespan
of 54 mammals. The following code reads this data and constructs a pairs plot with
*homo sapiens* colored red.

Sleep = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/Sleep.data", header=TRUE,row.names=1) n = dim(Sleep)[1] sleep.color = rep("black",n) hndx = match("homo sapiens",dimnames(Sleep)[[1]]) sleep.color[hndx] = "red" par(oma=c(0,0,2,0)) pairs(Sleep,pch=19,col=sleep.color) mtext("Pairs plot of sleep data",outer=TRUE,line=0.5,cex=1.2)

The compressed small values and stretched out high values for all these variables except Sleep indicate that those variables have exponential relationships. So we can create a new dataframe that contains log-transformed values.

library(MASS) logSleep = Sleep logSleep$BodyWgt = log(Sleep$BodyWgt) logSleep$BrainWgt = log(Sleep$BrainWgt) logSleep$LifeSpan = log(Sleep$LifeSpan) pairs(logSleep,pch=19,col=sleep.color) mtext("Pairs plot of logSleep",outer=TRUE,line=0.5,cex=1.2) # that looks reasonable so fit linear model to predict LifeSpan LifeSpan.lm = lm(LifeSpan ~ .,data=logSleep) plot(LifeSpan.lm,main="Full model: data=logSleep",pch=19,col=sleep.color) LifeSpan.step = step(LifeSpan.lm,k=log(n)) #backward stepwise selection with BIC print(summary(LifeSpan.step)) plot(LifeSpan.step,main="Reduced model: data=logSleep",pch=19,col=sleep.color) #forward stepwise requires the scope argument and starts with the null model LifeSpan.lm0 = lm(LifeSpan ~ 1,data=logSleep) LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward", scope= ~BodyWgt+BrainWgt+Sleep,k=log(n)) #note that the upper formula cannot use . as a shortcut print(summary(LifeSpan.forward))

**R usage**. When performing forward stepwise selection, the scope argument must be included.
If you wish to start with the intercept only model as above, then that model is used as the
*object* argument and the scope argument only needs to give the formula for the upper model.
If the upper model contains more than a few variables, this formula can be constructed using
*paste* and *as.formula* as shown below.

ndx = match("LifeSpan",names(logSleep)) #get index for response variable #don't include response in formula! upper.form = as.formula(paste("~",paste(names(logSleep)[-ndx],collapse="+"))) LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward", scope= upper.form,k=log(n))

In this case the model selected by forward stepwise is the same as the model obtained by backward selection, but that may not always be the case. We use forward stepwise when we wish to only add important variables and we use backward stepwise when we want to start with the full model and then remove unimportant variables.

The correlation between *log(BodyWgt)* and *log(BrainWgt)* is *0.9587*, so
91.9% of the variability in *log(BodyWgt)* can be explained by a linear relationship between
the two variables. The correlation between the residuals of that fit and *log(LifeSpan)* is
*-0.1765*. This correlation is referred to as a partial correlation and is sufficiently
strong to overcome the BIC penalty when this variable is added to *log(BrainWgt)*

The correlations among these three variables are all positive,

BodyWgt | BrainWgt | LifeSpan | |

BodyWgt | 1.0000 | 0.9587 | 0.7070 |

BrainWgt | 0.9587 | 1.0000 | 0.7899 |

LifeSpan | 0.7070 | 0.7899 | 1.0000 |

However, the coefficient for *BodyWgt* in negative. This illustrates that correlation
coefficients do not tell the whole story about relationships among more than two variables.

The residual plots showed that the two brown bats were high residual/high leverage observations. Bats have highly developed brain functions for processing their high frequency sonar signals, and their bodies must be extremely light to enable flight. So it would be reasonable to remove them from the data. Also, the Echidna is a marsupial, so it will be removed as well.

bndx = grep("brown bat",dimnames(logSleep)[[1]]) bndx = c(bndx,match("Echidna",dimnames(logSleep)[[1]])) logSleep1 = logSleep[-bndx,] n1 = dim(logSleep1)[[1]] sleep.color1 = sleep.color[-bndx] LifeSpan1.lm = lm(LifeSpan ~ .,data=logSleep1) plot(LifeSpan1.lm,main="Full model: data=logSleep with outliers removed", col=sleep.color1,pch=19) LifeSpan1.step = step(LifeSpan1.lm,k=log(n1)) plot(LifeSpan1.step,main="Reduced model: data=logSleep with brown bats removed", col=sleep.color1,pch=19) print(summary(LifeSpan1.step))

If the multicollinearity is severe, we can consider principal components regression. This involves obtaining the SVD of the centered data matrix and using the matrix of right singular vectors (eigenvectors of the covariance matrix) to rotate the centered data matrix. Note that these new variables are uncorrelated and are referred to as principal components. If any of the singular values are close to 0, we can reduce dimensions by removing the corresponding principal components and then use the remaining PCs as predictor variables in regression to predict the response variable. This is equivalent to using the Moore-Penrose generalized inverse of to obtain the regression estimates instead of the ordinary inverse.

**Example**. The **MASS** library contains a data set named *Boston*. This data
contains 506 observations on 14 variables related to housing prices, one of which is categorical.
The last variable, *medv* gives the median value of owner-occupied homes and will be treated
as the response variable. The script below performs some initial pre-processing of variables to
remove categorical and nearly categorical variables, applies a log-transformation to *crim*,
and rescales some variables. An initial fit showed that *rm* has a quadratic relationship
with the response, so was added to the data matrix. The initial fit also revealed
some outliers which were then removed. The final fit was obtained and backward stepwise selection
was performed using the BIC selection criterion.

Next, principal components regression was applied to this data. The details of this process are
given here, but most of that is done within the function *prcomp*. The data is centered by
subracting column means, and then the SVD of the centered data matrix is obtained. The matrix of
right singular vectors is used to rotate the centered data. Note that the s.d.'s of the principal
components are the singular values divided by the square root of the sample size minus one.

library(MASS) Boston1 = Boston[,-c(2,4,12,14)] #remove chas,zn,black,medv Boston1[,1] = log(Boston[,1]) Boston1[,"tax"] = Boston1[,"tax"]/20 #rescale to rate per 200,000 Boston1[,"age"] = Boston1[,"age"]/10 #rescale age Boston1[,"nox"] = Boston1[,"nox"]*10 #rescale nox medianValue = Boston[,"medv"] # histograms of variable 1 indicates log-transform needed #fit linear model MV.lm = lm(medianValue ~ .,data = Boston1) Boston2 = cbind(Boston1,rm2=Boston1[,"rm"]^2) MV2.lm = lm(medianValue ~ .,data = Boston2) bndx = c(365:373) #remove outliers Boston2a = Boston2[-bndx,] medianValue2a = medianValue[-bndx] n = dim(Boston2a)[1] p = dim(Boston2a)[2] MV2a.lm = lm(medianValue2a ~ .,data = Boston2a) summary(MV2a.lm) b.all = coef(MV2a.lm)[-1] MV2a.step = stepAIC(MV2a.lm,k=log(n)) # bic backward stepwise summary(MV2a.step) b.step = rep(0,p) names(b.step) = names(Boston2a) b = coef(MV2a.step)[-1] b.step[names(b)] = b ### PC regression Bc = scale(Boston2a,scale=FALSE) # center only Bc.svd = svd(Bc) d = Bc.svd$d V = Bc.svd$v Bc.cvar = cumsum(d^2)/sum(d^2) plot(Bc.cvar,type="b",ylab="",xlab="# of Principal Components",ylim=c(0.5,1)) title("% of Total Variance Explained vs Number of PCs") Bcr = Bc %*% V #rotated data print(d/sqrt(n-1)) #s.d.'s of PCs Bc.pc = prcomp(Boston2a) #principal components is equivalent to previous ### #fit principal components regression with all PCs Bcr1 = as.data.frame(Bcr) names(Bcr1) = paste("PC",seq(dim(Bcr1)[2]),sep="") MVpc1.lm = lm(medianValue2a ~ .,data=Bcr1) MVpc1.step = stepAIC(MVpc1.lm,k=log(n)) # bic backward selection summary(MVpc1.step) V1names = names(coef(MVpc1.step)[-1]) vndx = match(V1names,names(Bcr1)) V1 = V[,vndx] #obtain columns selected by stepAIC b.pcr = as.vector(V1 %*% coef(MVpc1.step)[-1]) #back rotate PCR coefficients names(b.pcr) = names(Boston2a) rbind(b.all,b.step,b.pcr) #compare coefficients sum(residuals(MV2a.lm)^2) # SSE for full model sum(residuals(MV2a.step)^2) # SSE for reduced model sum(residuals(MVpc1.step)^2) # SSE for reduced PCR model # sum of squared differences between models sum((fitted(MVpc1.step)-fitted(MV2a.lm))^2) sum((fitted(MVpc1.step)-fitted(MV2a.step))^2)

2017-11-01