Due date: Oct. 1, 2015. Don't just give answers to these problems. Consider them as if they were job assignments given to you by your supervisor and add some brief discussion and interpretation of the results. Create a document that contains your answers and graphics and email it to me with the subject line: Stat 3355 homework 1. If you use Word or some other document program, please save a copy as a PDF file and send that rather than the original Word document.

- Use the data contained in the file

http://www.utdallas.edu/~ammann/stat3355scripts/Smoking.txt

- Find the means and standard deviations for each variable.
- Which states are more than 2 sd's above the mean for cigarette consumption? for bladder cancer? for lung cancer?
- Which states are in the top 10% of cigarette consumption? of bladder cancer? of lung cancer? (see documentation for the **R** function *quantile()*)
- Plot cigarette consumption versus lung cancer and add an informative title.
- Repeat for bladder cancer.
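
The summary and threshold steps above can be sketched on made-up numbers. This is only an illustration: the vector `cig` and its labels `AA` to `JJ` are hypothetical and are not values from Smoking.txt.

```r
# Hypothetical toy data: NOT values from Smoking.txt
cig <- c(AA = 18, BB = 20, CC = 22, DD = 21, EE = 19,
         FF = 20, GG = 23, HH = 17, II = 20, JJ = 40)

m <- mean(cig)   # sample mean
s <- sd(cig)     # sample standard deviation (n - 1 denominator)

# Names of the observations more than 2 sd's above the mean
high <- names(cig)[cig > m + 2 * s]

# Names of the observations above the 90th percentile (top 10%)
top10 <- names(cig)[cig > quantile(cig, 0.90)]

print(high)
print(top10)
```

With a data frame read from the file, the same expressions would apply to one column at a time, with `rownames()` of the data frame taking the place of `names()`.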

- Use the data contained in the file

http://www.utdallas.edu/~ammann/stat3355scripts/Sleep.data

A description of this data is given in

http://www.utdallas.edu/~ammann/stat3355scripts/Sleep.txt

The `Species` column should be used as row names.

- Construct histograms of each variable.
- The strong asymmetry for all variables except `Sleep` indicates that a *log* transformation is appropriate for those variables. Construct a new data frame that contains `Sleep`, replaces *BodyWgt, BrainWgt, LifeSpan* by their log-transformed values, and then construct histograms of each variable in this new data frame.
- Plot `LifeSpan` vs `BrainWgt` with `LifeSpan` on the y-axis. Repeat using these variables after applying a log-transformation to both variables. Superimpose lines corresponding to the respective means of the variables for each plot.
- What proportion of species are within 2 s.d.'s of mean `LifeSpan`? What proportion are within 2 s.d.'s of mean `BrainWgt`? Answer these for the original variables and for the log-transformed variables.
- Obtain the correlation between *LifeSpan* and *BrainWgt*. Repeat for *log(LifeSpan)* and *log(BrainWgt)*. Interpret these correlations.
- Obtain the least squares regression line to predict *LifeSpan* based on *BrainWgt*. Repeat to predict *log(LifeSpan)* based on *log(BrainWgt)*. Predict *LifeSpan* of Homo sapiens based on each of these regression lines. Which would you expect to have the better overall accuracy? Which prediction is closer to the actual *LifeSpan* of Homo sapiens?
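
The transformation, proportion, correlation, and regression steps above can be sketched as follows. This is a minimal sketch on a made-up data frame: `toy`, its row names `sp1` to `sp6`, the helper `within2sd()`, and the brain weight of 100 used for prediction are all hypothetical and are not taken from Sleep.data.

```r
# Hypothetical toy data: NOT values from Sleep.data
toy <- data.frame(
  BrainWgt = c(0.5, 2, 10, 50, 200, 1000),
  LifeSpan = c(3, 5, 10, 20, 30, 60),
  row.names = c("sp1", "sp2", "sp3", "sp4", "sp5", "sp6")
)

# Log-transformed copy of the whole data frame
logtoy <- log(toy)

# Hypothetical helper: proportion of values within 2 sd's of their mean
within2sd <- function(x) mean(abs(x - mean(x)) <= 2 * sd(x))
within2sd(toy$LifeSpan)
within2sd(logtoy$LifeSpan)

# Correlations on the original and on the log scale
cor(toy$LifeSpan, toy$BrainWgt)
cor(logtoy$LifeSpan, logtoy$BrainWgt)

# Least squares fits on each scale
fit     <- lm(LifeSpan ~ BrainWgt, data = toy)
fit.log <- lm(log(LifeSpan) ~ log(BrainWgt), data = toy)

# Predictions for a hypothetical new brain weight; the log-scale
# prediction is back-transformed with exp() to the LifeSpan scale
new      <- data.frame(BrainWgt = 100)
pred     <- predict(fit, newdata = new)
pred.log <- exp(predict(fit.log, newdata = new))
```

Writing the log model with `log()` inside the formula lets `predict()` accept new data on the original `BrainWgt` scale, so only the response needs to be back-transformed.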

- Use the data contained in the file

http://www.utdallas.edu/~ammann/stat3355scripts/HappyPlanet.csv

This data comes from the *Happy Planet Index*, http://www.happyplanetindex.org

Note that one of the countries is `Cote d'Ivoire`, which requires use of the `quote=` argument in `read.table()`: `quote="\""`
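
To see why the `quote=` argument matters, here is a small sketch that reads inline text instead of the real file. The two data rows and the `Value` column are hypothetical, and the header row and comma separator are assumptions about the layout of HappyPlanet.csv.

```r
# Hypothetical two-row stand-in for HappyPlanet.csv
txt <- "Country,Value
Cote d'Ivoire,1
France,2"

# By default read.table() treats both " and ' as quote characters, so the
# apostrophe in Cote d'Ivoire would open a quoted string. Restricting the
# quote set to the double quote keeps the apostrophe as ordinary text.
dat <- read.table(text = txt, header = TRUE, sep = ",", quote = "\"")

print(dat)
```

Note that `read.csv()` already uses `quote = "\""` as its default, so it is an alternative when the file is comma-separated.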

- Obtain the quartiles of LifeExpectancy.
- Construct a histogram of GDP. Obtain the mean and s.d. of GDP. How many countries are within 2 s.d.'s of the mean GDP?
- Since GDP is heavily skewed, construct a new variable called **logGDP** which is the logarithm of GDP. Answer the previous two items for this variable. Are the quartiles of **logGDP** the same as the logarithm of the quartiles of GDP? What about the mean?
- The SubRegion variable in this data set represents both region and sub-region. Region is the first character and sub-region is the second character. If this data has been read into a data frame named *HappyPlanet*, then region can be extracted using the **substring()** function. We want these numeric codes to be treated as categories, not numbers, so we can convert the result of this operation to a factor.

      Region = substring(HappyPlanet[,"SubRegion"],1,1)
      Region = factor(Region)

  Plot LifeExpectancy vs logGDP, use different colors for different regions, include an informative title, and include a legend that indicates which color corresponds to which region.
- Find the correlation between **LifeExpectancy, logGDP** and interpret.
- Repeat the previous two items for **HappyLifeYears**.
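
The region-colored plot and the correlation can be sketched on made-up data. This is one possible approach, not the required one: the data frame `toy`, its `SubRegion` codes such as `1a`, and the color choices in `cols` are all hypothetical and are not values from HappyPlanet.csv.

```r
# Hypothetical toy data: NOT values from HappyPlanet.csv
toy <- data.frame(
  SubRegion      = c("1a", "1b", "2a", "2b", "3a", "3b"),
  GDP            = c(30000, 25000, 1200, 900, 8000, 6000),
  LifeExpectancy = c(80, 78, 52, 50, 70, 68)
)
toy$logGDP <- log(toy$GDP)

# Region is the first character of SubRegion, treated as a category
Region <- factor(substring(toy[, "SubRegion"], 1, 1))

# One color per region level; indexing by a factor uses its integer codes
cols <- c("red", "blue", "darkgreen")
plot(toy$logGDP, toy$LifeExpectancy, col = cols[Region], pch = 19,
     xlab = "log(GDP)", ylab = "Life expectancy",
     main = "Life expectancy vs log(GDP) by region (toy data)")
legend("topleft", legend = levels(Region), col = cols, pch = 19,
       title = "Region")

# Correlation between the two plotted variables
r <- cor(toy$LifeExpectancy, toy$logGDP)
print(r)
```

Because `legend()` receives `levels(Region)` and `cols` in the same order as the factor codes, each legend entry matches the color used for that region's points.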

2015-09-22