If any of the predictor variables are factors (categorical variables), then R adds columns of dummy variables according to how those factors appear in the model formula before the linear model is fit. If all predictors are factors, then the linear model corresponds to analysis of variance. In such cases the fit should be obtained using the aov() function instead of lm(). There is no computational difference, but the summary and plot methods for aov objects differ from those for lm objects.
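As a minimal sketch of this point (using simulated data, since no data set has been introduced yet), the following fits the same one-factor model with both functions; the fitted coefficients are identical, but summary() reports an ANOVA table for the aov object and a coefficient table for the lm object.

```r
# One factor predictor: aov() and lm() fit the same model,
# but their summary methods display different output.
set.seed(1)
group = factor(rep(c("A", "B", "C"), each = 10))
y = rnorm(30, mean = rep(c(5, 6, 7), each = 10))
y.aov = aov(y ~ group)
y.lm  = lm(y ~ group)
all.equal(coef(y.aov), coef(y.lm))  # TRUE: identical coefficient estimates
summary(y.aov)  # ANOVA table (sums of squares, F-test)
summary(y.lm)   # coefficient table (estimates, t-tests)
```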
It is important to understand how dummy variables are coded in R in order to interpret the coefficient estimates. By default, dummy variables are created using treatment contrasts. In this case the first level of a factor is treated as the control group. Coefficient estimates are reported for each of the other levels, and these represent estimates of the mean deviations from the control group. Note that this coding only affects how coefficients are interpreted; it has no effect on the p-values of hypothesis tests.
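The coding can be inspected directly with contrasts() and model.matrix(). A small sketch with a hypothetical three-level factor:

```r
# Treatment contrasts: the first level ("A") is the baseline and is
# absorbed into the intercept; the other levels each get a 0/1 column.
f = factor(c("A", "B", "C"))
contrasts(f)
#   B C
# A 0 0
# B 1 0
# C 0 1
model.matrix(~ f)  # design matrix: (Intercept), fB, fC columns
```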
If the model includes a factor variable and a numeric variable, then we may wish to include an interaction term. Interaction between a factor and a numeric variable allows for different slopes in the relationship between the response and the numeric variable. For example, the crabs data set includes measurements of several physical attributes of crabs from two different species and from both sexes. For purposes of illustration, the Species and Sex factors will be combined into a single factor with four levels: BF, BM, OF, OM.
crabs = read.table("stat6341scripts/crabs.csv",header=TRUE,sep=",")
SpSex = paste(crabs$Species,crabs$Sex,sep="")
SpSex = factor(SpSex)
#fit model to predict CL based on FL and SpSex with interaction
CL.lm = lm(CL ~ SpSex*FL,data=crabs)
plot(CL.lm) #looks ok
summary(CL.lm)
The interaction terms are shown in the matrix of coefficients by a : that separates the name of the numeric variable and the levels of the factor. Note that the first level, BF, does not appear in that list. Since the default treatment contrasts were used for coding, the estimates for (Intercept) and FL represent the estimated intercept and slope for the BF group. The estimate for SpSexBM represents the estimated difference between the intercepts for BF and BM. The estimate for SpSexBM:FL represents the estimated difference between the slopes for BF and BM. A partial F-test for significance of the interaction term in this model can be obtained by fitting an additive model and using the anova() function.
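The same fit can be reproduced with the crabs data shipped in the MASS package, whose column names (sp, sex) differ from the CSV version above; this sketch shows how a group's intercept and slope are recovered by adding the relevant contrast terms to the baseline estimates.

```r
# Sketch using MASS::crabs (columns sp and sex rather than Species and Sex).
library(MASS)
SpSex = factor(paste(crabs$sp, crabs$sex, sep = ""))  # levels BF, BM, OF, OM
CL.lm = lm(CL ~ SpSex*FL, data = crabs)
b = coef(CL.lm)
# BF (baseline) line: intercept b["(Intercept)"], slope b["FL"]
# BM line: intercept b["(Intercept)"] + b["SpSexBM"]
#          slope     b["FL"] + b["SpSexBM:FL"]
bm.intercept = unname(b["(Intercept)"] + b["SpSexBM"])
bm.slope     = unname(b["FL"] + b["SpSexBM:FL"])
```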
CL.lm1 = lm(CL ~ SpSex+FL,data=crabs)
anova(CL.lm1,CL.lm)
The additive model assumes that the slopes for the levels of the factor are the same but the intercepts may differ. In this case the interaction term is highly significant. This indicates that the slopes in the linear relationships between the numeric predictor and the response variable differ significantly among the levels of the factor. If a factor*numeric interaction term is not significant, then we may wish to perform a partial F-test to determine whether or not there is a difference among the intercepts. This can be done by fitting a model without the factor and then using the anova() function.
CL.lm0 = lm(CL ~ FL,data=crabs)
anova(CL.lm0,CL.lm1,CL.lm)

Note that we should only consider this test if the interaction term is not significant.
The three models in this example represent nested models applied to the same data and the anova() function reports the results of the corresponding partial F-tests. A similar situation occurs if we are considering polynomial regression models. Suppose for example X is a numeric predictor and we wish to compare linear, quadratic, and cubic models.
X2 = X^2
X3 = X^3
Y.lm0 = lm(Y ~ 1) #intercept only
Y.lm1 = lm(Y ~ X) #linear
Y.lm2 = lm(Y ~ X + X2) #quadratic
Y.lm3 = lm(Y ~ X + X2 + X3) #cubic
anova(Y.lm3,Y.lm2,Y.lm1,Y.lm0)

Of course, we also should examine diagnostic plots of the final model.
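A self-contained sketch of the same nested comparison (using simulated data, since X and Y are not defined above): the poly() function with raw = TRUE builds the power terms inside the formula, avoiding the hand-made X2 and X3 variables while fitting the same models.

```r
# Simulated data with a true quadratic relationship (hypothetical example).
set.seed(1)
X = runif(50)
Y = 1 + 2*X - X^2 + rnorm(50, sd = 0.1)
Y.lm0 = lm(Y ~ 1)                      # intercept only
Y.lm1 = lm(Y ~ X)                      # linear
Y.lm2 = lm(Y ~ poly(X, 2, raw = TRUE)) # quadratic, same fit as X + X^2
Y.lm3 = lm(Y ~ poly(X, 3, raw = TRUE)) # cubic, same fit as X + X^2 + X^3
anova(Y.lm0, Y.lm1, Y.lm2, Y.lm3)      # sequential partial F-tests
```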