**Due date**: Oct. 26, 2017

- Use data in the file

http://www.utdallas.edu/~ammann/stat6341scripts/cars.csv

The first column is not data; it should be used for row names. The column labelled *origin* is coded as follows: *1=US, 2=Europe, 3=Asia*. That column should be converted to a factor with levels *US, Europe, Asia*. The goal here is to construct a model to predict **mpg** based on *displacement, horsepower, weight, acceleration, origin*.

**[a]** Obtain the least squares model and check assumptions.

**[b]** Use forward stepwise regression with the BIC criterion to select the most important variables for prediction.

**[c]** Summarize the properties of the final model and interpret its coefficients.

**[d]** There are several diesel-fueled cars in this data set. Obtain hat values for these cars and discuss their influence on the least squares fit.

**[e]** Create a new data frame with the diesel cars removed and repeat parts **[a,b]** for this reduced data set. How do the coefficients differ between the final model with all cars and the final model after removing diesel cars?

**Note**: see **R** functions `step()`, `influence.measures()`, `grep()`.

- Use data in the file

http://www.utdallas.edu/~ammann/stat6341scripts/Smoking.data

This dataset gives cigarette consumption and the rates of several types of cancer. The goal is to determine if there is a relationship between cigarette consumption and cancer rate. The variables are:

STATE: state

CIG: cigarette consumption

BLAD: bladder cancer

LUNG: lung cancer

KID: kidney cancer

LEUK: leukemia

Note that STATE should be used as row names, not as a variable. However, there is a built-in data set in **R**, `state.region`, that categorizes states into four regions, *Northeast, South, North Central, West*. Use this variable as an additional factor. Since the Smoking data includes DC but `state.region` does not, assign the region for DC to be South since both Maryland and Virginia are included in that region. This can be done by a lookup table. Note that `state.region` is a factor, so to add an entry for DC, we first must convert `state.region` to an ordinary character vector and then combine that vector with the region for DC. Then this new vector must be converted to a factor when it is added to the Smoking data frame.

    Region = c(as.vector(state.region),"South")
    names(Region) = c(state.abb,"DC")
    Smoking$Region = factor(Region[dimnames(Smoking)[[1]]])
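To see how the named vector acts as a lookup table, the following minimal sketch applies the same construction to a handful of row names. The vector `ids` is a hypothetical stand-in for `dimnames(Smoking)[[1]]`, and it assumes the row names are two-letter state abbreviations, as the use of `state.abb` above implies.

```r
# Build the lookup table as in the assignment text.
Region <- c(as.vector(state.region), "South")
names(Region) <- c(state.abb, "DC")

# Hypothetical row names standing in for dimnames(Smoking)[[1]].
ids <- c("TX", "DC", "NY", "CA")

# Indexing a named vector by name returns the matching regions, in order.
region_factor <- factor(Region[ids])
print(region_factor)
print(levels(region_factor))
```

Because the regions pass through a character vector, `factor()` orders the levels alphabetically rather than in the original `state.region` order, which determines the baseline region in a fitted model.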

**[a]** Fit models to predict **bladder** cancer rate based on cigarette consumption and Region. Consider three models: CIG only, `CIG+Region`, `CIG*Region`. Use 5% level of significance for partial-F tests to determine which model to use.

**[b]** Check assumptions of the final regression model.

**[c]** Construct a plot of **bladder** cancer rate vs **CIG**. Superimpose lines representing fitted values. If *Region* is in the model, then use different colors for the different regions and include a legend.

- The file

http://www.utdallas.edu/~ammann/stat6341scripts/DiabetesFull.csv

contains data from a diabetes study. The response variable Y is the last column of this data set. The other variables are potential predictor variables. The goal here is to compare the mean square prediction error of several potential least squares models.

Model 1: use all variables to predict Y.

Model 2: select variables using backward stepwise selection with the AIC criterion.

Model 3: select variables using backward stepwise selection with the BIC criterion.

Model 4: select variables using forward stepwise selection with the AIC criterion. Start the selection process with the intercept-only model.

Model 5: select variables using forward stepwise selection with the BIC criterion. Start the selection process with the intercept-only model.

Estimate mean squared prediction error for each model as follows. Treat the first 300 observations as training data and fit the models using just the training data. Use the remaining 142 observations as test data. Obtain predicted values from each model for the test data and then obtain mean squared prediction errors for each model. Discuss the results.

**Note**. Backward stepwise variable selection is performed in **R** with the function `step(full.lm)`, where `full.lm` is the model with all potential predictors included. The code below illustrates how to perform forward stepwise selection. In this code `X` is a data frame that contains all of the predictor variables.

    Y0.lm = lm(Y ~ 1, data=X)
    Yall.lm = lm(Y ~ ., data=X)
    Ystep.lm = step(Y0.lm,direction="forward",scope=list(lower=Y0.lm, upper=Yall.lm))

The default value for argument `k` in this function is 2, which corresponds to the AIC selection criterion. To use BIC, the argument `k=log(n)` must be included, where `n` is the number of observations used to fit the model. Note that the argument `data=X` must be used for the intercept-only model even though the predictor variables are not used in that model. Both models in the scope argument must use the same data frame.
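As a rough illustration of how the `k=log(n)` argument and the train/test split fit together, here is a minimal sketch on simulated data. The data frame, the variable names `x1` to `x5`, and the helper `mspe()` are all hypothetical; only the 300/142 split and the `step()` calls follow the text above.

```r
set.seed(1)

# Simulated stand-in for the diabetes data (hypothetical, not the real file):
# 442 rows, five predictors, only x1 and x2 truly affect Y.
n_total <- 442
X <- data.frame(matrix(rnorm(n_total * 5), ncol = 5))
names(X) <- paste0("x", 1:5)
X$Y <- 3 * X$x1 - 2 * X$x2 + rnorm(n_total)

# First 300 rows for training, remaining 142 rows for testing.
train <- X[1:300, ]
test  <- X[301:n_total, ]
n <- nrow(train)

# Both scope models are fitted on the same data frame.
Y0.lm   <- lm(Y ~ 1, data = train)
Yall.lm <- lm(Y ~ ., data = train)

# Forward selection; k = log(n) switches the penalty from AIC to BIC.
Yfwd.lm <- step(Y0.lm, direction = "forward",
                scope = list(lower = Y0.lm, upper = Yall.lm),
                k = log(n), trace = 0)

# Backward selection with the same BIC penalty, starting from the full model.
Yback.lm <- step(Yall.lm, direction = "backward", k = log(n), trace = 0)

# Hypothetical helper: mean squared prediction error on held-out rows.
mspe <- function(fit, newdata) {
  mean((newdata$Y - predict(fit, newdata = newdata))^2)
}

print(c(full     = mspe(Yall.lm, test),
        forward  = mspe(Yfwd.lm, test),
        backward = mspe(Yback.lm, test)))
```

Setting `trace = 0` only suppresses the step-by-step printout. Leaving it at the default shows which variable enters or leaves at each step, which can help when discussing the results.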

2017-12-10