
Homework 4

1. Data for this problem is in the file
www.utdallas.edu/~ammann/stat6341scripts/Tax.csv
This data contains property tax amounts for a sample of houses along with related physical attributes of the houses. The problem is to understand how taxes are determined from the other variables.
a) Fit a regression model to predict Taxes based on the other variables in this dataset, check assumptions, and make any transformations if needed. Summarize the model and include diagnostic plots to show that assumptions have been verified.
b) Use BIC to reduce the model to just the important predictor variables and provide a summary of the reduced model.
c) Identify any observations that have studentized residuals in the reduced model with absolute value greater than 2. For those observations compare the actual tax to 95% prediction intervals for their taxes and interpret.
d) Let p denote the number of parameters in the reduced model (including the intercept) and let n denote the number of observations. We consider an observation to be influential if

\begin{displaymath}
{\rm dffits} > 2\sqrt{(p+1)/(n-p-1)}
\end{displaymath}

Are any of the observations in c) influential by this definition? Note that such observations would have high residuals and high influence. Remove those observations, refit the model, and then reduce this model using BIC. How does this model differ from the model in part b?
e) Use the model from part d) to obtain 95% prediction intervals for the taxes of the observations that were removed. How do these prediction intervals compare to the ones obtained in part c? How do the actual taxes of the removed observations compare to the new prediction intervals for their taxes?
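
The cutoff in part d) can be sketched in a few lines of Python. This is only an illustration: the function names `dffits_cutoff` and `flag_influential` are hypothetical, the dffits values are assumed to come from whatever regression software you use, and comparing absolute values is an assumption made here because dffits can be negative (drop `abs()` to follow the displayed inequality literally).

```python
import math


def dffits_cutoff(n, p):
    """Return 2*sqrt((p+1)/(n-p-1)), the cutoff stated in part d).

    n is the number of observations and p the number of model
    parameters, including the intercept.
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the cutoff to be defined")
    return 2.0 * math.sqrt((p + 1) / (n - p - 1))


def flag_influential(dffits, n, p):
    """Return the positions of observations whose dffits exceeds the cutoff.

    Assumption: magnitudes are compared, so large negative dffits
    values are flagged as well.
    """
    cutoff = dffits_cutoff(n, p)
    return [i for i, d in enumerate(dffits) if abs(d) > cutoff]
```

Intersecting the flagged positions with the observations found in part c) gives the candidates for removal.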

2. Use data in
http://www.utdallas.edu/~ammann/stat6341scripts/Temperature1.data
This file contains average January minimum temperatures in degrees Fahrenheit from 1931-1960 for 51 U.S. cities. The Pacific coast cities Los Angeles, San Francisco, Portland, and Seattle were removed since their winter temperatures are controlled mainly by Pacific Ocean currents.
a) Construct an informative plot of temperature versus latitude.
b) Fit a model to predict January minimum temperature based on latitude and longitude. Interpret the coefficients of this model.
c) Are the model assumptions reasonable?
d) Is longitude an important predictor? Use a 5% level of significance. If it is not significant, refit the model with just latitude.
e) The latitude of Richardson is 33.0 with a longitude of 96.75. Use your regression model in d) to predict the January minimum temperature for Richardson and obtain a 90% prediction interval for this temperature. Richardson's actual January minimum temperature is 34. How does that compare to temperatures in the prediction interval?
f) How does Richardson's actual January minimum temperature compare to a 90% confidence interval for the mean temperature of all cities at the same latitude?
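
As a reminder of why the intervals in parts e) and f) differ, here is a sketch for the case where the model in d) reduces to latitude alone. The notation is introduced here and is not part of the assignment: x_0 is Richardson's latitude, \hat{y}_0 the fitted value at x_0, \bar{x} the mean latitude, S_{xx} = \sum (x_i - \bar{x})^2, and s^2 the residual mean square on n-2 degrees of freedom. A 90% confidence interval for the mean temperature at latitude x_0 is

\begin{displaymath}
\hat{y}_0 \pm t_{0.95,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
\end{displaymath}

while a 90% prediction interval for a single city at that latitude is

\begin{displaymath}
\hat{y}_0 \pm t_{0.95,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
\end{displaymath}

The extra 1 under the square root accounts for the variability of an individual city around the mean, so the prediction interval is always the wider of the two. If longitude stays in the model, the same comparison holds with the matrix form of the standard error.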

3. The file
http://www.utdallas.edu/~ammann/stat6341scripts/OgleSMCV.csv
contains stellar magnitudes (luminosity) and log(period) for a family of variable stars called Cepheid variables in the Small Magellanic Cloud. The first column of this file gives IDs for the Cepheids and so can be used as row names. These variable stars are important to astronomers because the periods of their variability (logPeriod) are directly related to their luminosity. This enables astronomers to estimate the distances of these stars from their periods. Two types of Cepheids are contained in this data set, FU and FO, and these types have slightly different period-luminosity relationships. Note: stellar magnitudes are reversed in the sense that a higher value for magnitude corresponds to a dimmer star. Also, BV = B-V and VI = V-I, so those variables should be ignored.
a) Fit a model to predict MV based on I,V,B,logPeriod,Type that includes all two-way interactions between Type and the other variables. Summarize this model and include diagnostic plots to check assumptions.
b) Define as high-residual outliers stars with studentized residuals greater than 3 in absolute value, and define as high-leverage outliers stars with

\begin{displaymath}
{\rm dffits} > 2\sqrt{(p+1)/(n-p-1)}
\end{displaymath}

where p is the number of parameters in the model. Remove both high-residual outliers and high-leverage outliers, refit the full model, then reduce the model using BIC. How does this model compare to the original full model?
c) High-residual stars and high-leverage stars may have been misclassified as FO or FU by the automated photometry software used by this study. For each of those stars use the reduced model to obtain predicted MV based on their values for I,V,B,logPeriod but with Type = FO for all of them. Then repeat but with Type = FU. Obtain the prediction errors using Type = FO and the prediction errors using Type = FU. Reclassify these stars according to which type gives the smaller prediction error. Report the results as a table showing the original type and reclassified type of each of these stars. Summarize these results in a two-way frequency table that gives counts of stars according to their original classification and their new classification.
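
The reclassification logic in part c) can be sketched as follows. This is a minimal illustration, not a prescribed solution: `reclassify`, `two_way_table`, and the `predict` callable are hypothetical names, `predict(star, type_label)` is assumed to wrap the reduced model from part b), and each star is assumed to be a dictionary holding its ID, original Type, observed MV, and predictor values. Prediction error is taken here to be the absolute difference between observed and predicted MV.

```python
from collections import Counter


def reclassify(stars, predict):
    """Return (id, original_type, new_type) for each flagged star.

    stars   : list of dicts with at least the keys "id", "Type", "MV"
    predict : hypothetical callable, predict(star, type_label) -> predicted MV
    """
    rows = []
    for star in stars:
        # Absolute prediction error under each candidate type.
        errors = {t: abs(star["MV"] - predict(star, t)) for t in ("FO", "FU")}
        # Smaller error wins; an exact tie goes to "FO" (first key).
        new_type = min(errors, key=errors.get)
        rows.append((star["id"], star["Type"], new_type))
    return rows


def two_way_table(rows):
    """Count stars by (original type, reclassified type)."""
    return Counter((orig, new) for _, orig, new in rows)
```

The list returned by `reclassify` is the per-star table, and `two_way_table` gives the four cell counts of the frequency table.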


ammann
2017-12-10