Multicollinearity in regression refers to a situation in which a subset of the predictor variables
is nearly linearly dependent. This implies that the data matrix is nearly rank-deficient. That
creates two problems in regression. The first, and more serious, problem is that near
rank-deficiency produces numerically unstable least squares solution. Suppose for example that the
first three variables in a data matrix are very close to linearly dependent. That is, there
exists a non-zero vector c such that
Another problem occurs even if the multicollinearity is not severe enough to cause numerical instability. Suppose the first two variables are highly correlated. This implies that if we know the value of the first variable, we can obtain a fairly accurate estimate of the value of the second variable. However, it still is possible that the second variable would be selected by a variable selection algorithm in spite of this strong correlation. Recall that in the forward stepwise algorithm after a variable is selected for the model, the remaining variables are projected onto the orthogonal complement of the range of variables in the current model. That corresponds to fitting regression models to predict each of the remaining variables from variables in the current model and then extracting the residuals from those fits. This removes all of the linear relationships the remaining variables have with variables in the current model. However, some of those residuals may have sufficiently strong linear relationships with the response variable to justify addition of the second variable even though it is highly correlated with the first variable selected. The problem occurs when we attempt to interpret the model coefficients.
Ordinarily we intrepret regression coefficients separately as slopes. Suppose for example the coefficient for the first variable is 1.5 and the correlation between the first and second variables is 0.9. Then we would say that the response variable increases an average of 1.5 for every unit increase in the first variable. We could interpret the second variable similarly, but the strong correlation between the variables implies that the data would not have observations with high values for the first variable and low values for the second, and vice versa. Therefore, we cannot interpret coefficients of those variables separately.
Example. The sleep data set contains body weight, brain weight, sleep time, and lifespan of 54 mammals. The following code reads this data and constructs a pairs plot with homo sapiens colored red.
Sleep = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/Sleep.data", header=TRUE,row.names=1) n = dim(Sleep) sleep.color = rep("black",n) hndx = match("homo sapiens",dimnames(Sleep)[]) sleep.color[hndx] = "red" par(oma=c(0,0,2,0)) pairs(Sleep,pch=19,col=sleep.color) mtext("Pairs plot of sleep data",outer=TRUE,line=0.5,cex=1.2)
The compressed small values and stretched out high values for all these variables except Sleep indicate that those variables have exponential relationships. So we can create a new dataframe that contains log-transformed values.
library(MASS) logSleep = Sleep logSleep$BodyWgt = log(Sleep$BodyWgt) logSleep$BrainWgt = log(Sleep$BrainWgt) logSleep$LifeSpan = log(Sleep$LifeSpan) pairs(logSleep,pch=19,col=sleep.color) mtext("Pairs plot of logSleep",outer=TRUE,line=0.5,cex=1.2) # that looks reasonable so fit linear model to predict LifeSpan LifeSpan.lm = lm(LifeSpan ~ .,data=logSleep) plot(LifeSpan.lm,main="Full model: data=logSleep",pch=19,col=sleep.color) LifeSpan.step = step(LifeSpan.lm,k=log(n)) #backward stepwise selection with BIC print(summary(LifeSpan.step)) plot(LifeSpan.step,main="Reduced model: data=logSleep",pch=19,col=sleep.color) #forward stepwise requires the scope argument and starts with the null model LifeSpan.lm0 = lm(LifeSpan ~ 1,data=logSleep) LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward", scope= ~BodyWgt+BrainWgt+Sleep,k=log(n)) #note that the upper formula cannot use . as a shortcut print(summary(LifeSpan.forward))
R usage. When performing forward stepwise selection, the scope argument must be included. If you wish to start with the intercept only model as above, then that model is used as the object argument and the scope argument only needs to give the formula for the upper model. If the upper model contains more than a few variables, this formula can be constructed using paste and as.formula as shown below.
ndx = match("LifeSpan",names(logSleep)) #get index for response variable #don't include response in formula! upper.form = as.formula(paste("~",paste(names(logSleep)[-ndx],collapse="+"))) LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward", scope= upper.form,k=log(n))
In this case the model selected by forward stepwise is the same as the model obtained by backward selection, but that may not always be the case. We use forward stepwise when we wish to only add important variables and we use backward stepwise when we want to start with the full model and then remove unimportant variables.
The correlation between log(BodyWgt) and log(BrainWgt) is 0.9587, so 91.9% of the variability in log(BodyWgt) can be explained by a linear relationship between the two variables. The correlation between the residuals of that fit and log(LifeSpan) is -0.1765. This correlation is referred to as a partial correlation and is sufficiently strong to overcome the BIC penalty when this variable is added to log(BrainWgt)
The correlations among these three variables are all positive,
However, the coefficient for BodyWgt in negative. This illustrates that correlation coefficients do not tell the whole story about relationships among more than two variables.
The residual plots showed that the two brown bats were high residual/high leverage observations. Bats have highly developed brain functions for processing their high frequency sonar signals, and their bodies must be extremely light to enable flight. So it would be reasonable to remove them from the data. Also, the Echidna is a marsupial, so it will be removed as well.
bndx = grep("brown bat",dimnames(logSleep)[]) bndx = c(bndx,match("Echidna",dimnames(logSleep)[])) logSleep1 = logSleep[-bndx,] n1 = dim(logSleep1)[] sleep.color1 = sleep.color[-bndx] LifeSpan1.lm = lm(LifeSpan ~ .,data=logSleep1) plot(LifeSpan1.lm,main="Full model: data=logSleep with outliers removed", col=sleep.color1,pch=19) LifeSpan1.step = step(LifeSpan1.lm,k=log(n1)) plot(LifeSpan1.step,main="Reduced model: data=logSleep with brown bats removed", col=sleep.color1,pch=19) print(summary(LifeSpan1.step))
If the multicollinearity is severe, we can consider principal components regression. This involves obtaining the SVD of the centered data matrix and using the matrix of right singular vectors (eigenvectors of the covariance matrix) to rotate the centered data matrix. Note that these new variables are uncorrelated and are referred to as principal components. If any of the singular values are close to 0, we can reduce dimensions by removing the corresponding principal components and then use the remaining PCs as predictor variables in regression to predict the response variable. This is equivalent to using the Moore-Penrose generalized inverse of to obtain the regression estimates instead of the ordinary inverse.
Example. The MASS library contains a data set named Boston. This data contains 506 observations on 14 variables related to housing prices, one of which is categorical. The last variable, medv gives the median value of owner-occupied homes and will be treated as the response variable. The script below performs some initial pre-processing of variables to remove categorical and nearly categorical variables, applies a log-transformation to crim, and rescales some variables. An initial fit showed that rm has a quadratic relationship with the response, so was added to the data matrix. The initial fit also revealed some outliers which were then removed. The final fit was obtained and backward stepwise selection was performed using the BIC selection criterion.
Next, principal components regression was applied to this data. The details of this process are given here, but most of that is done within the function prcomp. The data is centered by subracting column means, and then the SVD of the centered data matrix is obtained. The matrix of right singular vectors is used to rotate the centered data. Note that the s.d.'s of the principal components are the singular values divided by the square root of the sample size minus one.
library(MASS) Boston1 = Boston[,-c(2,4,12,14)] #remove chas,zn,black,medv Boston1[,1] = log(Boston[,1]) Boston1[,"tax"] = Boston1[,"tax"]/20 #rescale to rate per 200,000 Boston1[,"age"] = Boston1[,"age"]/10 #rescale age Boston1[,"nox"] = Boston1[,"nox"]*10 #rescale nox medianValue = Boston[,"medv"] # histograms of variable 1 indicates log-transform needed #fit linear model MV.lm = lm(medianValue ~ .,data = Boston1) Boston2 = cbind(Boston1,rm2=Boston1[,"rm"]^2) MV2.lm = lm(medianValue ~ .,data = Boston2) bndx = c(365:373) #remove outliers Boston2a = Boston2[-bndx,] medianValue2a = medianValue[-bndx] n = dim(Boston2a) p = dim(Boston2a) MV2a.lm = lm(medianValue2a ~ .,data = Boston2a) summary(MV2a.lm) b.all = coef(MV2a.lm)[-1] MV2a.step = stepAIC(MV2a.lm,k=log(n)) # bic backward stepwise summary(MV2a.step) b.step = rep(0,p) names(b.step) = names(Boston2a) b = coef(MV2a.step)[-1] b.step[names(b)] = b ### PC regression Bc = scale(Boston2a,scale=FALSE) # center only Bc.svd = svd(Bc) d = Bc.svd$d V = Bc.svd$v Bc.cvar = cumsum(d^2)/sum(d^2) plot(Bc.cvar,type="b",ylab="",xlab="# of Principal Components",ylim=c(0.5,1)) title("% of Total Variance Explained vs Number of PCs") Bcr = Bc %*% V #rotated data print(d/sqrt(n-1)) #s.d.'s of PCs Bc.pc = prcomp(Boston2a) #principal components is equivalent to previous ### #fit principal components regression with all PCs Bcr1 = as.data.frame(Bcr) names(Bcr1) = paste("PC",seq(dim(Bcr1)),sep="") MVpc1.lm = lm(medianValue2a ~ .,data=Bcr1) MVpc1.step = stepAIC(MVpc1.lm,k=log(n)) # bic backward selection summary(MVpc1.step) V1names = names(coef(MVpc1.step)[-1]) vndx = match(V1names,names(Bcr1)) V1 = V[,vndx] #obtain columns selected by stepAIC b.pcr = as.vector(V1 %*% coef(MVpc1.step)[-1]) #back rotate PCR coefficients names(b.pcr) = names(Boston2a) rbind(b.all,b.step,b.pcr) #compare coefficients sum(residuals(MV2a.lm)^2) # SSE for full model sum(residuals(MV2a.step)^2) # SSE for reduced model sum(residuals(MVpc1.step)^2) # SSE for reduced PCR model # sum of squared differences between models sum((fitted(MVpc1.step)-fitted(MV2a.lm))^2) sum((fitted(MVpc1.step)-fitted(MV2a.step))^2)