
Variable selection

Typically, in studies that have large numbers of predictor variables available, many of those variables are unrelated to the response in the population of interest. The problem then is to identify the important variables and fit a reduced model to predict responses. This problem is referred to as variable selection, and it is discussed here in the context of prediction of new observations rather than fitting an existing data set. Variable selection represents an attempt to find an optimal balance between precision and bias. As we have seen, adding a variable to a model always reduces the residual sum of squares unless the new variable is an exact linear combination of the predictor variables already in the model or it has 0 correlation with the residuals of the current model; even if a variable is generated randomly, the probability of either of those happening is essentially 0. The effect of including weakly related variables in a model is increased bias when the model is used to predict responses for new observations not in the data set used for fitting. We refer to such situations as over-fitting.
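A small simulation sketch of this point, using only base R (all object names are illustrative): a pure-noise predictor always lowers the residual sum of squares of the fitted data, but it typically worsens prediction for new observations.

set.seed(1)
n  <- 50
x1 <- rnorm(n)
y  <- 2 + 1.5 * x1 + rnorm(n)            # true model uses x1 only
z  <- rnorm(n)                           # pure-noise predictor

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + z)
c(RSS1 = sum(resid(fit1)^2), RSS2 = sum(resid(fit2)^2))    # RSS2 is never larger

## Prediction error for new observations generated from the same model
new.dat <- data.frame(x1 = rnorm(1000), z = rnorm(1000))
y.new   <- 2 + 1.5 * new.dat$x1 + rnorm(1000)
c(PE1 = mean((y.new - predict(fit1, new.dat))^2),
  PE2 = mean((y.new - predict(fit2, new.dat))^2))          # PE2 is typically larger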

If the number of potential predictor variables is small, then models with each possible subset of predictors could be fit and compared. Obviously the residual sum of squares or $R^2$ should not be the basis for comparison of models, since criteria based on those measures would always select the largest model. Two basic approaches to this problem are used most often: penalized likelihood methods and shrinkage methods.

Penalized likelihood methods subtract from the maximized likelihood function a quantity that is a function of the number of variables in the model. These likelihood penalties are designed to adjust for the increase in bias that would occur if a noise variable were added to the model. One of the earliest such methods is Mallows' Cp statistic, defined by

\begin{displaymath}
Cp_k = \frac{RSS_k}{s^2_e} + 2k,
\end{displaymath}

where $RSS_k$ is the residual sum of squares, $k$ is the number of variables in the model, and $s^2_e$ is a low-bias estimate of residual error variance that does not depend on $k$. Typically, the low-bias estimate of residual error variance is obtained from the largest possible model. (Note that Mallows' original definition used $2k-n$ instead of $2k$.) Since $s^2_e$ is constant with respect to $k$, this is equivalent to
\begin{equation}
Cp^*_k = RSS_k + 2ks^2_e.
\end{equation}
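As a rough sketch, $Cp^*_k$ can be computed directly from a fitted model, with $s^2_e$ taken from the largest available model. The built-in mtcars data and the helper name cp.star are illustrative choices, not from the notes.

## Low-bias estimate of error variance from the largest model
full <- lm(mpg ~ ., data = mtcars)
s2e  <- summary(full)$sigma^2

cp.star <- function(fit, s2e) {
  k <- length(coef(fit)) - 1             # number of predictor variables
  sum(resid(fit)^2) + 2 * k * s2e
}

cp.star(lm(mpg ~ wt,      data = mtcars), s2e)
cp.star(lm(mpg ~ wt + hp, data = mtcars), s2e)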

An information-based penalty was developed by Akaike and is referred to as Akaike's Information Criterion (AIC). A similar measure introduced by Schwarz is referred to as the Bayes Information Criterion (BIC). These are defined for linear regression by

\begin{eqnarray*}
AIC_k &=& n\log(RSS_k/n) + 2k,\\
BIC_k &=& n\log(RSS_k/n) + k\log(n).
\end{eqnarray*}

Addition of a variable decreases RSS but increases the penalty. The best model is the one with the smallest value of the criterion. Note that the dimension penalties in these criteria do not include precision associated with the model being evaluated, nor do they include model bias.
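The following sketch computes both criteria directly from these definitions; the helper ic, counting $k$ as the number of predictor variables, and the use of the built-in mtcars data are illustrative choices. (R's extractAIC() uses the same form but counts all estimated coefficients, so its values differ only by a constant.)

ic <- function(fit) {
  n   <- length(resid(fit))
  rss <- sum(resid(fit)^2)
  k   <- length(coef(fit)) - 1           # number of predictor variables
  c(AIC = n * log(rss / n) + 2 * k,
    BIC = n * log(rss / n) + k * log(n))
}

ic(lm(mpg ~ wt,      data = mtcars))
ic(lm(mpg ~ wt + hp, data = mtcars))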

When there are more than a few potential predictor variables, it is most efficient to use a forward stepwise approach to the selection of variables. The variable most strongly correlated with the response is selected initially and a linear model is fit. At each step, the variable selected from those remaining is the one most strongly correlated with the residuals of the current fit. This continues until all variables have been added to the model or a predefined stopping criterion has been satisfied. The selection criterion (Cp, AIC, or BIC) is evaluated at each step and the model selected is the one with the minimum value of the criterion.

This process is performed in R with the step() function, which uses AIC as the default selection criterion; the direction argument selects forward, backward, or bidirectional search. Steps are terminated when no remaining single addition (or deletion) produces a lower AIC than the current model. The BIC criterion is obtained with the argument k=log(n), where n is the sample size. This function is implemented by updating the QR decomposition and so has computational complexity of the same order of magnitude as the complexity of obtaining the QRD using all predictors.
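A minimal illustration of this usage, with the built-in mtcars data standing in for a real problem:

full <- lm(mpg ~ ., data = mtcars)        # largest candidate model
null <- lm(mpg ~ 1, data = mtcars)        # intercept-only starting model

## Forward selection using the default AIC criterion
fwd.aic <- step(null, scope = formula(full), direction = "forward")

## The same search using BIC: supply k = log(n)
n <- nrow(mtcars)
fwd.bic <- step(null, scope = formula(full), direction = "forward", k = log(n))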

Shrinkage methods add to the fitting criterion a penalty that is proportional to a norm of the coefficient vector. Under standard linear model assumptions, maximizing the likelihood is equivalent to minimizing the sum of squared residuals, and so the goal becomes minimizing

\begin{displaymath}
\sum e_i^2 + c\Vert\beta\Vert,
\end{displaymath}

where $c>0$ is a tuning parameter. The idea here is that larger models have larger values for the norm of the coefficients, so the reduction in RSS associated with a larger model must be high enough to offset the increased norm of its coefficients. Coefficients that are shrunk essentially to 0 by these methods are removed from the model. If the (squared) 2-norm is used here, the method is referred to as ridge regression, but ridge regression does not ordinarily set coefficients exactly to 0 and so does not reduce the number of variables. Use of the 1-norm almost always results in removal of weak variables; this method is referred to as the lasso.
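As a sketch of the two penalties, the contributed glmnet package (not discussed in these notes, which use lars below) fits the 2-norm penalty with alpha = 0 and the 1-norm penalty with alpha = 1; the data and object names are illustrative.

library(glmnet)
x <- as.matrix(mtcars[, -1])              # candidate predictors
y <- mtcars$mpg

ridge <- glmnet(x, y, alpha = 0)          # 2-norm penalty: coefficients shrink but rarely reach 0
lasso <- glmnet(x, y, alpha = 1)          # 1-norm penalty: weak coefficients are set to 0

## The tuning parameter (lambda here, c in the text) is usually chosen by cross-validation
cv <- cv.glmnet(x, y, alpha = 1)
coef(cv, s = "lambda.min")                # sparse coefficient vector at the selected penalty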

Recall that the reduction in RSS when a variable $a$ is added to a model is given by

\begin{displaymath}
RSS_k - RSS_{k+1} = u^2,
\end{displaymath}

where

\begin{displaymath}
u = \frac{d^TY}{\Vert d\Vert},\ \ d = (I-QQ^T)a.
\end{displaymath}
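This identity can be checked numerically with the orthonormal factor Q from the QR decomposition of the current model matrix; the mtcars data and the particular variables below are illustrative.

X <- model.matrix(~ wt + hp, data = mtcars)   # current model matrix (with intercept)
Y <- mtcars$mpg
a <- mtcars$disp                              # candidate variable

Q <- qr.Q(qr(X))
d <- a - Q %*% crossprod(Q, a)                # d = (I - QQ^T) a
u <- sum(d * Y) / sqrt(sum(d^2))              # u = d^T Y / ||d||

rss.k  <- sum(resid(lm(Y ~ X - 1))^2)
rss.k1 <- sum(resid(lm(Y ~ X + a - 1))^2)
c(u^2, rss.k - rss.k1)                        # the two quantities agree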

Note that $d$ is the projection of $a$ onto the orthogonal complement of range$(X)$. This projection removes from $a$ all of its correlation with the variables already in the model, so $u$ reflects the partial correlation between $a$ and $Y$ given those variables. In practice, taking the full step implied by this projection may cause the stepwise process to follow a sub-optimal path. An alternative algorithm can be defined by taking only a very small step in the direction of the projection onto the orthogonal complement of range$(X)$. This algorithm is referred to as forward stagewise regression. An efficient implementation of forward stagewise regression, referred to as Least Angle Regression (LARS), is available in the contributed package lars. The authors of that package show that LARS is closely related to lasso variable selection, and the package includes options for forward stagewise, LARS, and lasso fits. Mallows' Cp statistic is used for model selection along the fitted path. The lars package is not distributed with R; it must be downloaded and installed from CRAN, for example via the Package Installer menu item or with install.packages("lars").
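A minimal sketch of the lars interface, assuming the package has been installed as described; the mtcars data and object names are illustrative.

library(lars)
x <- as.matrix(mtcars[, -1])             # candidate predictors
y <- mtcars$mpg

fit <- lars(x, y, type = "lasso")        # type can also be "lar" or "forward.stagewise"
summary(fit)                             # Df, RSS, and Cp at each step of the path
plot(fit)                                # coefficient paths as variables enter

best <- which.min(fit$Cp)                # step with the smallest Cp
coef(fit)[best, ]                        # coefficients of the selected model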

Big Data problems contain large numbers of potential predictor variables, so variable selection is an integral step in the analysis of such data. For example, identification of genetic biomarkers with genomic data may lead to new drugs and treatments as well as better understanding of disease mechanisms. However, genomic data sets often contain tens of thousands of variables in the form of gene expressions, and the problem is further exacerbated by the much smaller sample sizes typically used for such studies. Classical methods for variable selection almost always lead to over-fitting in this setting, selecting variables that may appear useful for the sample in the study but only add bias to predictions of responses for new individuals from the population. For these reasons, variable selection for Big Data remains an important topic for future research.


Larry Ammann
2017-11-01