The file
http://www.UTDallas.edu/~ammann/SmokeCancer.csv
contains data related to smoking and cancer rates by state for 2010.
(Advanced) Now create a plot of LungCancerRate vs CigAdultRate, but color the points differently by region, and include a legend that shows which region each color is assigned to. This raises several problems that must be overcome. Region can be obtained from the R data set named state, which includes an object named state.region. The first problem is that DC is not included, so we need to create a vector that combines DC with the values of state.region. This introduces a second problem because state.region is not an ordinary vector, but instead is a factor.
A factor is how R handles categorical variables. state.region represents a vector of 50 strings, each of which is one of 4 unique values: Northeast, South, North Central, West. These unique values are referred to as the levels of the factor. Internally, R stores the values of a factor using the level index number instead of the string. For example, the region for Alabama is South, so what is stored internally is the index 2, corresponding to the position of Alabama's region within the levels of this factor. Ordinary printing of a factor is equivalent to:
levels(state.region)[state.region]
The difference between an ordinary vector and a factor can be illustrated by comparing the results of the following:
c(state.region,"South")
c(as.vector(state.region),"South")
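The following short sketch makes the comparison concrete. It uses only state.region from base R; the object names labels_via_lookup, mixed, and clean are introduced here purely for illustration and are not part of the course script.

```r
# state.region ships with base R (datasets package), one entry per state.
levels(state.region)           # the four level labels
as.integer(state.region)[1:5]  # internal level indices for the first 5 states

# Printing shows labels because each stored index is looked up in the levels.
labels_via_lookup <- levels(state.region)[state.region]

# Combining the factor directly with a string exposes the internal codes...
mixed <- c(state.region, "South")
# ...while converting to an ordinary vector first keeps the region names.
clean <- c(as.vector(state.region), "South")

head(mixed)  # level indices coerced to character strings
head(clean)  # region names
```

Comparing head(mixed) with head(clean) shows why the conversion to an ordinary vector has to come before DC's region is appended.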
For our problem we first must convert state.region to an ordinary character vector, then add the region
for DC. Since both Maryland and Virginia have region defined as South, we will use that for DC as
well. This can be done by
Region = c(as.character(state.region),"South")
Next we need to put that vector in the same order as the order used for the rows of the data frame that contains the smoke-cancer data. Note that the Smoke data set is ordered by the full state names, not their 2-letter abbreviations. Finally, we need to create a vector of colors in which states in each region are plotted using a unique color for that region. One way of doing these last two steps is by a table lookup.
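A minimal sketch of the table-lookup idea is given below; it is not necessarily the approach used in the course script. It makes several assumptions: smoke_states is a hypothetical stand-in for the state-name column of the data frame (the real column name and the spelling used for DC are not given here), the data frame is assumed to be named Smoke, and the colors are arbitrary choices.

```r
# Lookup table 1: full state name -> region (DC assigned to South).
Region <- c(as.character(state.region), "South")
names(Region) <- c(state.name, "District of Columbia")

# Hypothetical stand-in for the state-name column of the smoke-cancer data
# frame, assumed to be sorted by full state name with DC included.
smoke_states <- sort(names(Region))

# Indexing by name reorders Region to match the data frame's row order.
Region_ordered <- Region[smoke_states]

# Lookup table 2: region -> color (arbitrary color choices).
region_cols <- c("Northeast" = "blue", "South" = "red",
                 "North Central" = "darkgreen", "West" = "orange")
point_cols <- region_cols[Region_ordered]

# Plot only if a data frame named Smoke (an assumed name) has been read in.
if (exists("Smoke")) {
  plot(Smoke$CigAdultRate, Smoke$LungCancerRate, col = point_cols, pch = 19,
       xlab = "CigAdultRate", ylab = "LungCancerRate")
  legend("topleft", legend = names(region_cols), col = region_cols, pch = 19)
}
```

Both steps rely on the same mechanism: indexing a named vector by a character vector returns the matching entries in the order requested, so the names act as lookup keys.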
A script for this example is contained in the file:
http://www.UTDallas.edu/~ammann/stat3355scripts/stat3355example1c.r