If any of the predictor variables are factors (categorical variables), then **R** adds columns
of dummy variables according to how those factors appear in the model formula before the linear
model is fit. If all predictors are factors, then the linear model corresponds to analysis of
variance. In such cases the fit should be obtained using the *aov()* function instead of
*lm()*. There is no difference computationally, but summary and plot methods for *aov*
objects differ from those methods for *lm* objects.
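As a minimal sketch of this distinction, consider the built-in *warpbreaks* data, in which both predictors (*wool* and *tension*) are factors. The two functions produce the same underlying fit but report it differently:

```r
# warpbreaks: breaks (numeric response), wool and tension (both factors)
breaks.lm  = lm(breaks ~ wool + tension, data=warpbreaks)
breaks.aov = aov(breaks ~ wool + tension, data=warpbreaks)
coef(breaks.lm)      # coefficient estimates under treatment contrasts
coef(breaks.aov)     # identical estimates
summary(breaks.lm)   # coefficient table with t-tests
summary(breaks.aov)  # ANOVA table with F-tests
```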

It is important to understand how dummy variables are coded in **R** in order to understand
how to interpret coefficient estimates. By default, dummy variables are created using treatment
contrasts. In this case the first level of a factor is treated as if it is the control group.
Coefficient estimates are reported for each of the other levels and these represent estimates of
the mean deviations from the control group. Note that this coding only affects how coefficients
are interpreted; it has no effect on p-values of hypothesis tests.
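As a small sketch of the default coding (using a made-up three-level factor *grp* and response *y* for illustration), *model.matrix()* shows the dummy variables that treatment contrasts produce, and the coefficient estimates are the baseline mean plus deviations from it:

```r
grp = factor(c("A","A","B","B","C","C"))
contrasts(grp)       # treatment contrasts: level A is the baseline
model.matrix(~ grp)  # columns grpB and grpC are the dummy variables
y = c(1,3,4,6,9,11)  # group means: A = 2, B = 5, C = 10
coef(lm(y ~ grp))    # intercept = 2, grpB = 5 - 2 = 3, grpC = 10 - 2 = 8
```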

If the model includes a factor variable and a numeric variable, then we may wish to include an
interaction term. Interaction between a factor and a numeric variable allows for different slopes
in the relationship between the response and the numeric variable. For example, the crabs data set
includes measurements of several physical attributes of crabs from two different species and from
both sexes. For purposes of illustration, the Species and Sex factors will be combined into a
single factor with four levels: *BM,BF,OM,OF*.

crabs = read.table("stat6341scripts/crabs.csv",header=TRUE,sep=",")
SpSex = paste(crabs$Species,crabs$Sex,sep="")
SpSex = factor(SpSex)
#fit model to predict CL based on FL and SpSex with interaction
CL.lm = lm(CL ~ SpSex*FL,data=crabs)
plot(CL.lm) #looks ok
summary(CL.lm)

The interaction terms appear in the coefficient table with a colon (:) separating the name of
the numeric variable from the levels of the factor. Note that the first level, *BF*, does
not appear in that list. Since the default treatment contrasts were used for coding, the estimates
for *(Intercept)* and *FL* represent the estimated intercept and slope for the
*BF* group. The estimate for *SpSexBM* represents the estimated difference between
the intercepts for *BF* and *BM*. The estimate for *SpSexBM:FL* represents the
estimated difference between the slopes for *BF* and *BM*. A partial F-test for
significance of the interaction term in this model can be obtained by fitting an additive model
and using the *anova()* function.

CL.lm1 = lm(CL ~ SpSex+FL,data=crabs)
anova(CL.lm1,CL.lm)

The additive model assumes that the slopes for the levels of the factor are the same but the
intercepts may differ. In this case the interaction term is highly significant. This indicates that
the slopes in the linear relationships between the numeric predictor and the response variable
differ significantly among the levels of the factor. If a factor*numeric interaction term is not
significant, then we may wish to perform a partial F-test to determine whether or not there is a
difference among the intercepts. This can be done by fitting a model without the factor and then
using the *anova()* function.

CL.lm0 = lm(CL ~ FL,data=crabs)
anova(CL.lm0,CL.lm1,CL.lm)

Note that we should only consider this test if the interaction term is not significant.

The three models in this example represent nested models applied to the same data and the
*anova()* function reports the results of the corresponding partial F-tests. A similar
situation occurs if we are considering polynomial regression models. Suppose for example *X*
is a numeric predictor and we wish to compare linear, quadratic, and cubic models.

X2 = X^2
X3 = X^3
Y.lm0 = lm(Y ~ 1) #intercept only
Y.lm1 = lm(Y ~ X) #linear
Y.lm2 = lm(Y ~ X + X2) #quadratic
Y.lm3 = lm(Y ~ X + X2 + X3) #cubic
anova(Y.lm3,Y.lm2,Y.lm1,Y.lm0)

Of course, we should also examine diagnostic plots of the final model.

2017-11-01