
## Homework 3

Due date: Oct. 26, 2017

1. Use data in the file
http://www.utdallas.edu/~ammann/stat6341scripts/cars.csv
The first column is not data; it should be used for row names. The column labelled origin is coded as follows: 1=US, 2=Europe, 3=Asia. Convert that column to a factor with levels US, Europe, Asia. The goal here is to construct a model that predicts mpg from displacement, horsepower, weight, acceleration, and origin.
[a] Obtain the least squares model and check assumptions.
[b] Use forward stepwise regression with the BIC criterion to select the most important variables for prediction.
[c] Summarize the properties of the final model and interpret its coefficients.
[d] There are several diesel fueled cars in this data set. Obtain hat values for these cars and discuss their influence on the least squares fit.
[e] Create a new data frame with the diesel cars removed and repeat parts [a,b] for this reduced data set. How do the coefficients differ between the final model with all cars and the final model after removing diesel cars?
Note: see R functions step(), influence.measures(), grep().
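As a rough sketch of the data preparation and the diagnostics named in the note, the following uses a small simulated data frame in place of cars.csv. The values, the car names, and the assumption that diesel cars can be found by the pattern "diesel" in the row names are all made up for illustration. With the real file, the data frame would presumably come from read.csv() with row.names=1.

```r
# Simulated stand-in for cars.csv; all values and row names are hypothetical.
set.seed(1)
n = 30
toy = data.frame(
  weight = runif(n, 1800, 4500),
  horsepower = runif(n, 50, 200),
  origin = rep(1:3, length.out = n)
)
toy$mpg = 45 - 0.006*toy$weight - 0.03*toy$horsepower + rnorm(n, sd = 2)
rownames(toy) = paste("car", 1:n)
rownames(toy)[c(5, 17)] = c("car 5 (diesel)", "car 17 diesel")

# Recode the numeric origin column as a factor with descriptive labels.
toy$origin = factor(toy$origin, levels = 1:3, labels = c("US", "Europe", "Asia"))

toy.lm = lm(mpg ~ weight + horsepower + origin, data = toy)

# Locate rows whose names contain "diesel" and extract their hat values.
diesel = grep("diesel", rownames(toy))
h = hatvalues(toy.lm)
h[diesel]
mean(h)    # the average hat value is p/n, with p the number of coefficients

# Full set of influence diagnostics; infl$is.inf flags unusual cases.
infl = influence.measures(toy.lm)

# Reduced data frame with the diesel rows removed.
toy.nodiesel = toy[-diesel, ]
```

Comparing the diesel hat values with the average p/n gives a quick sense of whether those cars sit in unusual positions in predictor space.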

2. Use data in the file
http://www.utdallas.edu/~ammann/stat6341scripts/Smoking.data
This dataset gives cigarette consumption and the rates of several types of cancer. The goal is to determine if there is a relationship between cigarette consumption and cancer rate. The variables are:
```
STATE: state
CIG: cigarette consumption
BLAD: bladder cancer
LUNG: lung cancer
KID: kidney cancer
LEUK: leukemia
```
Note that STATE should be used as row names, not as a variable. However, there is a built-in data set in R, state.region, that categorizes states into four regions, Northeast, South, North Central, West. Use this variable as an additional factor. Since the Smoking data includes DC but state.region does not, assign the region for DC to be South since both Maryland and Virginia are included in that region. This can be done by a lookup table. Note that state.region is a factor, so to add an entry for DC, we first must convert state.region to an ordinary character vector and then combine that vector with the region for DC. Then this new vector must be converted to a factor when it is added to the Smoking data frame.
```
Region = c(as.vector(state.region),"South")
names(Region) = c(state.abb,"DC")
Smoking$Region = factor(Region[dimnames(Smoking)[[1]]])
```
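The lookup-table idea can be checked in isolation with a few abbreviations. This is only a sketch: the vector ids is a hypothetical stand-in for the row names of the Smoking data frame. It also shows one optional design choice, passing levels= so that the factor keeps the level order of state.region rather than the alphabetical order that factor() would otherwise produce.

```r
# Lookup table: region names indexed by state abbreviation, with DC added.
Region = c(as.vector(state.region), "South")
names(Region) = c(state.abb, "DC")

# Hypothetical row names, in a different order from state.abb.
ids = c("DC", "CA", "TX", "NY")
Region[ids]

# factor() on a character vector sorts its levels alphabetically; passing
# levels= keeps the original ordering of state.region instead.
f = factor(Region[ids], levels = levels(state.region))
levels(f)
```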
[a] Fit models to predict bladder cancer rate based on cigarette consumption and Region. Consider three models: CIG only, CIG+Region, CIG*Region. Use 5% level of significance for partial-F tests to determine which model to use.
[b] Check assumptions of the final regression model.
[c] Construct a plot of bladder cancer rate vs CIG. Superimpose lines representing fitted values. If Region is in the model, then use different colors for the different regions and include a legend.
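One possible way to organize the partial-F comparisons in part [a] is with anova() on the three nested fits. The sketch below runs on simulated data, not the Smoking file; the data frame sim and its values are assumptions made so that the code runs on its own.

```r
# Simulated stand-in for the Smoking data; all values are hypothetical.
set.seed(2)
n = 40
sim = data.frame(
  CIG = runif(n, 15, 40),
  Region = factor(rep(c("Northeast", "South", "North Central", "West"),
                      length.out = n))
)
sim$BLAD = 2 + 0.08*sim$CIG + rnorm(n, sd = 0.5)

# Three nested models: one common line, parallel lines, separate lines.
m1 = lm(BLAD ~ CIG, data = sim)
m2 = lm(BLAD ~ CIG + Region, data = sim)
m3 = lm(BLAD ~ CIG * Region, data = sim)

# Each row tests the terms added relative to the previous model.
# With several models, anova() uses the residual mean square of the
# largest model (m3) as the denominator of every F statistic.
tests = anova(m1, m2, m3)
tests
```

To test m2 against m1 with the residual mean square of m2 as the denominator, call anova(m1, m2) on that pair alone.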

3. The file
http://www.utdallas.edu/~ammann/stat6341scripts/DiabetesFull.csv
contains data from a diabetes study. The response variable Y is the last column of this data set. The other variables are potential predictor variables. The goal here is to compare the mean square prediction error of several potential least squares models.
Model 1: use all variables to predict Y.
Model 2: select variables using backward stepwise selection with AIC criterion.
Model 3: select variables using backward stepwise selection with BIC criterion.
Model 4: select variables using forward stepwise selection with AIC criterion. Start the selection process with the intercept-only model.
Model 5: select variables using forward stepwise selection with BIC criterion. Start the selection process with the intercept-only model.
Estimate mean squared prediction error for each model as follows. Treat the first 300 observations as training data and fit the models using just the training data. Use the remaining 142 observations as test data. Obtain predicted values from each model for the test data and then obtain mean squared prediction errors for each model. Discuss the results.
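The train/test procedure described above can be sketched as follows. The data are simulated and much smaller than the diabetes file, and the helper mspe() is a hypothetical name introduced only for this illustration; only two of the five models are shown.

```r
# Simulated stand-in for DiabetesFull.csv; sizes and values are hypothetical.
set.seed(3)
n = 100
X = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
X$Y = 1 + 2*X$x1 - X$x2 + rnorm(n)

# First 70 rows are training data, the remaining 30 are test data.
train = X[1:70, ]
test = X[71:n, ]

# Fit on the training data only.
full.lm = lm(Y ~ ., data = train)
back.lm = step(full.lm, trace = 0)   # backward selection, AIC by default

# Mean squared prediction error of a fitted model on new data.
mspe = function(fit, newdata) mean((newdata$Y - predict(fit, newdata))^2)

results = c(full = mspe(full.lm, test), backAIC = mspe(back.lm, test))
results
```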
Note. Backward stepwise variable selection is performed in R with the function step(full.lm), where full.lm is the model with all potential predictors included. The code below illustrates how to perform forward stepwise selection. In this code, X is a data frame that contains all of the predictor variables.
```
Y0.lm = lm(Y ~ 1, data=X)
Yall.lm = lm(Y ~ ., data=X)
Ystep.lm = step(Y0.lm,direction="forward",scope=list(lower=Y0.lm, upper=Yall.lm))
```
The default value for argument k in this function is 2, which corresponds to the AIC selection criterion. To use BIC, include the argument k=log(n), where n is the number of observations used to fit the model. Note that the argument data=X must be used for the intercept-only model even though the predictor variables are not used in that model. Both models in the scope argument must use the same data frame.
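For BIC, a forward selection call might look like the sketch below. It uses simulated data in which Y is stored as a column of X, an assumption made so that the example is self-contained; the name n.obs is likewise hypothetical.

```r
# Simulated data; Y is stored in X here so the example runs on its own.
set.seed(4)
n.obs = 80
X = data.frame(x1 = rnorm(n.obs), x2 = rnorm(n.obs),
               x3 = rnorm(n.obs), x4 = rnorm(n.obs))
X$Y = 3 + 1.5*X$x1 + 0.8*X$x3 + rnorm(n.obs)

Y0.lm = lm(Y ~ 1, data = X)
Yall.lm = lm(Y ~ ., data = X)

# k = log(n.obs) replaces the AIC penalty of 2 per parameter with the
# BIC penalty; trace = 0 suppresses the step-by-step printout.
Ybic.lm = step(Y0.lm, direction = "forward",
               scope = list(lower = Y0.lm, upper = Yall.lm),
               k = log(n.obs), trace = 0)
formula(Ybic.lm)
```

When the models are fit to a training subset, n should be the number of training rows, not the size of the full data set.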

ammann
2017-12-10