Typically, in studies that have large numbers of predictor variables available, many of those
variables may be unrelated to the response among individuals in the population of interest. The
problem then is to identify the important variables and fit a reduced model to predict responses.
This problem is referred to as variable selection, and it is discussed here in the context of
predicting new observations rather than fitting an existing data set. Variable selection
represents an attempt to find an optimal balance between precision and bias. As we have seen,
adding a variable to a model always reduces the residual sum of squares unless the new variable
is an exact linear combination of the predictor variables already in the model or it has 0
correlation with the residuals of the current model. Even if a variable is generated randomly,
the probability of either exception occurring is essentially 0. The effect of including such
weakly related variables in a model is increased bias when the model is used to predict responses
for new observations not in the data set used for fitting. We refer to such situations as
**over-fitting**.
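
As a small illustration of this point, the following R sketch adds a randomly generated predictor to a simple model. The data are simulated, and the seed, sample sizes, coefficients, and the helper name `rss` are arbitrary choices made only for this example.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)          # simulated response; only x is related to y
noise <- rnorm(n)                  # generated independently of y

fit_small <- lm(y ~ x)
fit_big   <- lm(y ~ x + noise)

# Residual sum of squares on the fitting data
rss <- function(fit) sum(resid(fit)^2)
c(small = rss(fit_small), big = rss(fit_big))   # the larger model is never worse here

# Squared prediction error on fresh data from the same simulated population
new   <- data.frame(x = rnorm(1000), noise = rnorm(1000))
y_new <- 1 + 2 * new$x + rnorm(1000)
c(small = mean((y_new - predict(fit_small, new))^2),
  big   = mean((y_new - predict(fit_big, new))^2))
```

The first comparison shows the in-sample improvement that comes from the noise variable; the second lets the two models be compared on observations that were not used for fitting.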

If the number of potential predictor variables is small, then models with each possible subset of predictors could be fit and compared. The residual sum of squares or R-squared should not be the basis for comparing models, since criteria based on those measures would always select the largest model. Two basic approaches to this problem are used most often: penalized likelihood methods and shrinkage methods.

Penalized likelihood methods subtract from the maximized likelihood function a quantity that is a
function of the number of variables in the model. These likelihood penalties are designed to adjust
for the increase in bias that would occur if a noise variable is added to the model. One of the
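earliest such methods is Mallows' *Cp* statistic, defined by

$$C_p = \frac{RSS_p}{\hat\sigma^2} - n + 2p,$$

where $RSS_p$ is the residual sum of squares, $p$ is the number of variables in the model, $n$ is the sample size, and $\hat\sigma^2$ is a low-bias estimate of residual error variance that does not depend on $p$. Typically, the low-bias estimate of residual error variance is obtained from the largest possible model. (Note that Mallows' definition used the number of fitted parameters, including the intercept, instead of the number of variables.) Since $n$ is constant with respect to $p$, this is equivalent to

$$C_p = \frac{RSS_p}{\hat\sigma^2} + 2p. \tag{1}$$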

An information-based penalty was developed by Akaike and is referred to as Akaike's Information Criterion (AIC). A similar measure introduced by Schwarz is referred to as the Bayesian Information Criterion (BIC). Up to additive constants that do not depend on the model, these are defined for linear regression by

$$AIC = n \log\left(\frac{RSS_p}{n}\right) + 2p, \qquad BIC = n \log\left(\frac{RSS_p}{n}\right) + p \log n.$$

Addition of a variable decreases RSS but increases the penalty. The best model is the one with the smallest value of the criterion. Note that the dimension penalties in these criteria do not include precision associated with the model being evaluated, nor do they include model bias.
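
To make the criteria concrete, here is a minimal R sketch that computes them from a fitted linear model. It assumes the commonly used linear-regression forms, Cp = RSS/sigma2 + 2p, AIC = n log(RSS/n) + 2p, and BIC = n log(RSS/n) + p log(n), with p the number of predictor variables. The helper name `criteria` is hypothetical, and the built-in `mtcars` data set is used only for illustration.

```r
# Selection criteria for a fitted lm; sigma2 is a low-bias variance estimate
# taken from a large reference model.
# Assumes the model has an intercept and no aliased (NA) coefficients.
criteria <- function(fit, sigma2) {
  n   <- length(resid(fit))
  p   <- length(coef(fit)) - 1        # predictor variables, intercept excluded
  rss <- sum(resid(fit)^2)
  c(Cp  = rss / sigma2 + 2 * p,
    AIC = n * log(rss / n) + 2 * p,
    BIC = n * log(rss / n) + log(n) * p)
}

full   <- lm(mpg ~ ., data = mtcars)  # largest model, used for the variance estimate
small  <- lm(mpg ~ wt + hp, data = mtcars)
sigma2 <- summary(full)$sigma^2

rbind(small = criteria(small, sigma2),
      full  = criteria(full, sigma2))
```

R's own `extractAIC()` counts the intercept as a parameter, so its values differ from the AIC column above by a constant that does not affect which model is ranked best.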

When there are more than a few potential predictor variables, it is most efficient to use a forward stepwise approach to the selection of variables. The variable most strongly correlated with the response is selected initially and a linear model is fit. At each subsequent step, the variable selected from those remaining is the one most strongly correlated with the residuals of the current fit. This continues until all variables have been added to the model or a predefined stopping criterion has been satisfied. The selection criterion (Cp, AIC, or BIC) is evaluated at each step, and the model selected is the one with the minimum value of the criterion.
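
The procedure just described can be sketched directly in R. This is an illustrative implementation rather than the algorithm used by any particular R function: the name `forward_by_residual_cor` is hypothetical, AIC is computed as n log(RSS/n) + 2p (one common form for linear regression), and the built-in `mtcars` data serve only as an example.

```r
# Forward selection: at each step add the candidate most correlated (in absolute
# value) with the residuals of the current fit, and record AIC along the path.
forward_by_residual_cor <- function(y, X) {
  n         <- length(y)
  remaining <- colnames(X)
  selected  <- character(0)
  path      <- data.frame()
  fit       <- lm(y ~ 1)                       # start from the intercept-only model
  while (length(remaining) > 0) {
    r    <- resid(fit)
    cors <- abs(cor(X[, remaining, drop = FALSE], r))
    best <- remaining[which.max(cors)]
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
    fit <- lm(y ~ ., data = data.frame(y = y, X[, selected, drop = FALSE]))
    rss <- sum(resid(fit)^2)
    path <- rbind(path, data.frame(step  = length(selected),
                                   added = best,
                                   AIC   = n * log(rss / n) + 2 * length(selected)))
  }
  path
}

path <- forward_by_residual_cor(mtcars$mpg, as.matrix(mtcars[, -1]))
path
path[which.min(path$AIC), ]    # the selected model uses the variables added up to this step
```

Here all variables are added and the criterion is minimized afterwards; a stopping rule could instead end the loop as soon as the criterion stops decreasing.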

This process is performed in **R** with the *step()* function. Forward stepwise regression is
requested with `direction="forward"` together with a `scope` argument that names the candidate
variables; if `scope` is omitted, the function performs backward elimination instead. AIC is the
default selection criterion. Steps are terminated when adding any of the remaining variables
would give an AIC no lower than that of the current model. The BIC criterion is chosen with the
argument `k=log(n)` where *n* is the sample size. This
function is implemented by updating the QR decomposition and so has computational complexity that
is the same order of magnitude as the complexity of obtaining the QRD using all predictors.
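
A short usage sketch of *step()* follows, with the built-in `mtcars` data standing in for a real study. The direction and scope are set explicitly so that the search runs forward from the intercept-only model; the object names are arbitrary.

```r
null <- lm(mpg ~ 1, data = mtcars)     # intercept-only starting model
full <- lm(mpg ~ ., data = mtcars)     # largest model: all candidate predictors

# Forward selection with the AIC penalty (k = 2 is the default)
fwd_aic <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Same search with the BIC penalty, k = log(n)
fwd_bic <- step(null, scope = formula(full), direction = "forward", trace = 0,
                k = log(nrow(mtcars)))

formula(fwd_aic)
formula(fwd_bic)
```

Because log(n) exceeds 2 for this sample size, the BIC search stops no later than the AIC search and returns a model that is no larger.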

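Shrinkage methods subtract from the likelihood function a penalty that is proportional to a norm
of the coefficients. Under standard linear model assumptions, maximizing the likelihood is
equivalent to minimizing the sum of squared residuals, and so the goal is to minimize

$$RSS(\beta) + \lambda \lVert \beta \rVert,$$

where

$$RSS(\beta) = \lVert y - X\beta \rVert_2^2,$$

$\lambda \ge 0$ controls the amount of shrinkage, and $\lVert \beta \rVert$ is the chosen norm of the coefficient vector.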

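Recall that reduction in RSS when a variable *a* is added to a model is given by

$$RSS - RSS_a = \frac{(\tilde a^T r)^2}{\tilde a^T \tilde a},$$

where

$$r = (I - H)y, \qquad \tilde a = (I - H)a, \qquad H = X(X^T X)^{-1}X^T.$$

Note that

$$\tilde a^T r = a^T (I - H) y = a^T r,$$

so the reduction is 0 exactly when *a* is orthogonal to the residuals of the current model.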
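
As an illustration of shrinkage, the sketch below takes the penalty to be the squared Euclidean norm of the coefficients (ridge regression). That choice is an assumption made for this example, since it gives a closed-form solution; other norms lead to other methods. The helper name `ridge_coef`, the penalty values, and the use of the built-in `mtcars` data are likewise illustrative.

```r
# Ridge coefficients on standardized predictors:
# minimizes RSS(beta) + lambda * sum(beta^2).
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                     # center and scale each predictor
  yc <- y - mean(y)                  # center the response so no intercept is needed
  solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc))
}

X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

lambdas <- c(0, 1, 10, 100)
norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
data.frame(lambda = lambdas, coef_norm = norms)   # the norm shrinks as lambda grows
```

With `lambda = 0` this reproduces ordinary least squares on the standardized predictors; larger values pull all coefficients toward 0 without removing any variable outright.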

Big Data problems contain large numbers of potential predictor variables, and so variable selection is an integral step in the analysis of such data. For example, identification of genetic biomarkers with genomic data may lead to new drugs and treatments as well as better understanding of disease mechanisms. However, genomic data sets often contain tens of thousands of variables in the form of gene expressions. Further exacerbating that problem are the much smaller sample sizes typically used for such studies. Classical methods for variable selection almost always lead to over-fitting the data by selecting variables that appear useful for the sample in the study but only add bias to the prediction of responses for new individuals from the population. For these reasons, variable selection for Big Data remains an important topic for future research.

2017-11-01