Typically, in studies with large numbers of potential predictor variables, many of those variables are unrelated to the response in the population of interest. The problem is then to identify the important variables and fit a reduced model to predict responses; this problem is referred to as variable selection. It will be discussed here in the context of predicting new observations rather than fitting an existing data set.

Variable selection represents an attempt to find an optimal balance between precision and bias. As we have seen, adding a variable to a model always reduces the residual sum of squares unless the new variable is an exact linear combination of the predictors already in the model or has zero correlation with the residuals of the current fit. Even for a randomly generated variable, the probability of either of those events is essentially 0. The cost of including such weakly related variables is increased bias when the model is used to predict responses for new observations not in the data set used for fitting. We refer to such situations as over-fitting.
If the number of potential predictor variables is small, then models with each possible subset of predictors could be fit and compared. Obviously the residual sum of squares or $R^2$ should not be the basis for comparison, since criteria based on those measures would always select the largest model. There are two basic approaches to this problem that are used most often: penalized likelihood methods and shrinkage methods.
Penalized likelihood methods subtract from the maximized log-likelihood a quantity that is a function of the number of variables in the model. These penalties are designed to adjust for the apparent improvement in fit that occurs whenever a variable, even a noise variable, is added to the model. One of the earliest such methods is Mallows' $C_p$ statistic, defined by
$$
C_p = \frac{RSS_p}{\hat{\sigma}^2} - n + 2p,
$$
where $RSS_p$ is the residual sum of squares of a candidate model with $p$ coefficients and $\hat{\sigma}^2$ is the error variance estimate from the full model. A model with little bias has $C_p$ close to $p$.
An information-based penalty was developed by Akaike and is referred to as Akaike's Information Criterion (AIC). A similar measure introduced by Schwarz is referred to as the Bayes Information Criterion (BIC). For a linear regression model with $p$ coefficients these are defined, up to additive constants, by
$$
AIC = n\log(RSS_p/n) + 2p, \qquad BIC = n\log(RSS_p/n) + p\log n.
$$
In each case smaller values are better. Since $\log n > 2$ for $n > 7$, BIC penalizes model size more heavily than AIC and tends to select smaller models.
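As a concrete illustration of these criteria, here is a minimal sketch in Python/NumPy (the text works in R; the helper names `fit_ls` and `selection_criteria` are hypothetical) that evaluates $C_p$, AIC, and BIC for a submodel specified by a list of column indices:

```python
import numpy as np

def fit_ls(X, y):
    """Least-squares fit; returns the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def selection_criteria(X, y, cols, sigma2_full):
    """Cp, AIC, and BIC for the submodel using the given columns.
    sigma2_full is the error-variance estimate from the full model."""
    n = len(y)
    Xs = X[:, cols]
    p = Xs.shape[1]
    rss = fit_ls(Xs, y)
    cp = rss / sigma2_full - n + 2 * p          # Mallows' Cp
    aic = n * np.log(rss / n) + 2 * p           # AIC (up to constants)
    bic = n * np.log(rss / n) + p * np.log(n)   # BIC (up to constants)
    return cp, aic, bic
```

A quick sanity check: when $\hat{\sigma}^2$ is taken from the full model, $C_p$ for that same full model equals exactly $p$.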
When there are more than a few potential predictor variables, it is most efficient to use a forward stepwise approach to variable selection. The variable most strongly correlated with the response is selected first and a linear model is fit. At each subsequent step, the variable selected from those remaining is the one most strongly correlated with the residuals of the current fit. This continues until all variables have been added to the model or a predefined stopping criterion is satisfied. The selection criterion ($C_p$, AIC, or BIC) is evaluated at each step, and the model with the minimum value of the criterion is selected.
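The forward stepwise search described above can be sketched as follows. This is an illustrative greedy implementation in Python/NumPy, not the algorithm used by R's `step()`; the parameter count excludes the intercept, which is common to all candidate models and so does not affect the comparison:

```python
import numpy as np

def forward_stepwise_aic(X, y):
    """Greedy forward selection: at each step add the remaining column
    most correlated with the current residuals; among the models visited,
    return the one with minimum AIC."""
    n, m = X.shape
    selected, remaining = [], list(range(m))
    resid = y - y.mean()
    best_cols, best_aic = [], np.inf
    while remaining:
        # variable most strongly correlated with the current residuals
        corrs = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in remaining]
        j = remaining.pop(int(np.argmax(corrs)))
        selected.append(j)
        # refit with an intercept plus the selected columns
        Xs = np.column_stack([np.ones(n)] + [X[:, k] for k in selected])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        rss = float(resid @ resid)
        aic = n * np.log(rss / n) + 2 * len(selected)
        if aic < best_aic:
            best_aic, best_cols = aic, selected.copy()
    return sorted(best_cols), best_aic
```

On simulated data where only a couple of columns carry signal, the AIC-minimizing model reliably contains those columns, though it may also admit a weak noise variable or two.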
This process is performed in R with the step() function, which uses AIC as its default selection criterion; forward selection is requested with the argument direction = "forward" together with a scope formula listing the candidate variables. Stepping terminates when no addition of a remaining variable lowers the AIC of the current model. The BIC criterion is obtained with the argument k = log(n), where n is the sample size. The function is implemented by updating the QR decomposition, so its computational complexity is of the same order of magnitude as that of computing the QR decomposition with all predictors.
Shrinkage methods add to the negative log-likelihood a penalty that is proportional to a norm of the coefficient vector. Under standard linear model assumptions the negative log-likelihood is, up to constants, proportional to the sum of squared residuals, and so the goal is to minimize
$$
\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 \;+\; \lambda \sum_{j=1}^{m} |\beta_j|^q,
$$
where $\lambda \ge 0$ controls the amount of shrinkage. Taking $q = 2$ gives ridge regression, while $q = 1$ gives the lasso, which can shrink coefficients exactly to zero and therefore performs variable selection.
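For $q = 2$ the penalized criterion has a closed-form minimizer, $\hat\beta = (X'X + \lambda I)^{-1}X'y$, which makes ridge regression easy to sketch. The following is an illustrative Python/NumPy version (it assumes the columns of $X$ and the response are centered, so no intercept is penalized):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: minimize ||y - X b||^2 + lam * ||b||^2.
    Closed form: b = (X'X + lam I)^{-1} X'y.
    Assumes X and y are centered (no intercept term)."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
```

At $\lambda = 0$ this reproduces ordinary least squares, and the norm of the coefficient vector decreases monotonically as $\lambda$ grows, which is the shrinkage effect.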
Recall that the reduction in RSS when a variable $a$ is added to a model is given by
$$
\Delta RSS = \frac{(\tilde{a}'y)^2}{\tilde{a}'\tilde{a}},
$$
where $\tilde{a}$ is the component of $a$ orthogonal to the columns already in the model. Note that $\tilde{a}'y = \tilde{a}'r$, where $r$ is the current residual vector, which connects this quantity to the correlation-with-residuals rule used in forward stepwise selection.
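This identity is easy to verify numerically. The sketch below (Python/NumPy, with hypothetical helper names) computes the RSS reduction directly from the orthogonalized variable and compares it with the drop in RSS from refitting:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def rss_reduction(X, a, y):
    """Reduction in RSS from adding column a to the model X, computed
    from the component of a orthogonal to the columns of X."""
    beta, *_ = np.linalg.lstsq(X, a, rcond=None)
    a_perp = a - X @ beta                       # orthogonalize a against X
    return (a_perp @ y) ** 2 / (a_perp @ a_perp)
```

On random data, `rss(X, y) - rss([X a], y)` agrees with `rss_reduction(X, a, y)` to machine precision.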
Big Data problems contain large numbers of potential predictor variables, and so variable selection is an integral step in the analysis of such data. For example, identification of genetic biomarkers in genomic data may lead to new drugs and treatments as well as a better understanding of disease mechanisms. However, genomic data sets often contain tens of thousands of variables in the form of gene expressions, and the problem is exacerbated by the much smaller sample sizes typical of such studies. Classical methods for variable selection almost always over-fit such data, selecting variables that appear useful in the study sample but only add bias when predicting responses for new individuals from the population. For these reasons, variable selection for Big Data remains an important topic for future research.