

Some of the functions used in this section are described below.

read.table(). Unless the data set for a project is small, it is most convenient to read the data into R from a tabular data file in which each row corresponds to an individual and each column contains a measurement associated with that individual. These files must be plain text (not files created by a word processor such as Word). If the data come from a database or spreadsheet, the simplest way to have R read the data is to have the database or spreadsheet export the data to a comma-separated values (csv) file. An example is given by the file crabs.csv.

The first argument is the name of the data file. This must be a string that contains the full path to the file if it is not in the current working directory, or it may be an internet address if the file is on a remote server.
The first row of the crabs.csv file contains names for the columns. This row is referred to as a header and requires use of the argument header=TRUE.
The values in each row are separated by a comma. The default separator is white space, so the argument sep="," is needed for the crabs data file.
read.table() stops with an error message if it finds that the rows don't all contain the same number of values. This can occur, for example, if a csv file was created from an Excel file that had some extraneous blank cells. Otherwise, read.table() returns a data frame, which is assigned here to the name Crabs.
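
As a minimal sketch of such a call, the following writes a tiny stand-in csv file to a temporary location and reads it back. The file contents and the column name Width are made up for illustration and are not the real crabs data.

```r
# Write a tiny stand-in csv file (made-up values, not the real crabs data)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Species,Gender,Width",
             "B,M,8.1",
             "B,F,9.0",
             "O,M,10.4",
             "O,F,11.2"),
           csv_file)

# header = TRUE: the first row holds the column names
# sep = ",":     values in each row are separated by commas
Crabs <- read.table(csv_file, header = TRUE, sep = ",")
str(Crabs)   # compact summary of the resulting data frame
```

For a real project the first argument would be the path or internet address of the actual data file instead of a temporary file.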

Note that the first two columns, named Species and Gender, respectively, contain strings, not numeric values. In versions of R before 4.0.0, read.table() assumes such columns are categorical variables and automatically converts each of them to a factor; in R 4.0.0 and later this conversion requires the argument stringsAsFactors=TRUE. The unique values of a factor are referred to as its levels. The levels of Species are B and O (for blue and orange), and the levels of Gender are M and F.
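
A minimal sketch of factors and their levels, using a small made-up data frame rather than the crabs file. Here stringsAsFactors = TRUE is passed explicitly so that the conversion happens regardless of the R version in use.

```r
# Small made-up data frame standing in for the first two crabs columns
Crabs <- data.frame(Species = c("B", "B", "O", "O"),
                    Gender  = c("M", "F", "M", "F"),
                    stringsAsFactors = TRUE)

is.factor(Crabs$Species)   # TRUE: the strings were converted to a factor
levels(Crabs$Species)      # unique values of Species
levels(Crabs$Gender)       # levels are stored in sorted order by default
table(Crabs$Species)       # number of rows at each level
```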

A particular column of a data frame can be accessed by the name of the data frame, followed by a dollar sign, followed by the name of the column. So, for example,

refers to the column with that name. You could obtain a histogram of that column by
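
For instance, with a small made-up data frame and a hypothetical numeric column named Width (the real column names may differ), a sketch of both steps is:

```r
# Made-up data frame; Width is a hypothetical numeric column
Crabs <- data.frame(Species = c("B", "B", "O", "O", "O"),
                    Width   = c(8.1, 9.0, 10.4, 11.2, 12.5))

Crabs$Width   # the column, returned as an ordinary numeric vector

# Histogram of that column; hist() also returns the bin counts invisibly
h <- hist(Crabs$Width, main = "Histogram of Width", xlab = "Width")
```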

Smoke-Cancer data

The file
contains data related to smoking and cancer rates by state for 2010.

  1. Import this data into R using the first column as row names. This requires adding the argument row.names=1 within read.table().
  2. Create a new data frame that contains the following variables:
    CigSalesRate = FY2010 Sales per 100,000 population
  3. Create all pairwise plots of the variables in this data frame. Add an informative main title and note on the plot that the data includes all states and D.C. Use filled circles for the plot character.
  4. Create a new plot of LungCancerRate vs CigSalesRate with informative title. Note on the plot that CigSalesRate is cigarette sales per 100,000 population.
  5. Repeat this plot but now use red for Texas, black for others, and add the text TX next to the point corresponding to Texas in this plot.
  6. Repeat the previous plot but use CigYouthRate instead of CigSalesRate.
  7. Repeat again but use CigAdultRate instead of CigYouthRate.
  8. Once you are happy with how these plots look, save them in a pdf document.
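
The plotting functions these steps call for can be sketched on a small made-up data frame. The numbers, the five-state subset, and the name Smoke for the data frame are assumptions for illustration only; the variable names follow the exercise.

```r
# Made-up numbers for illustration only; not the real smoke-cancer data
Smoke <- data.frame(CigSalesRate   = c(5.1, 4.2, 6.3, 3.8, 4.9),
                    CigAdultRate   = c(21, 17, 24, 15, 18),
                    LungCancerRate = c(70, 55, 82, 48, 60),
                    row.names = c("Alabama", "Colorado", "Kentucky",
                                  "Texas", "Vermont"))

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)                      # send all following plots to a pdf file

# All pairwise scatterplots, filled circles (pch = 19), with a main title
pairs(Smoke, pch = 19, main = "Smoking and lung cancer rates")

# Single scatterplot; sub places a note under the x-axis label
is_tx <- rownames(Smoke) == "Texas"
plot(LungCancerRate ~ CigSalesRate, data = Smoke, pch = 19,
     col = ifelse(is_tx, "red", "black"),
     main = "Lung cancer rate vs cigarette sales",
     sub = "CigSalesRate = cigarette sales per 100,000 population")

# Label the Texas point; pos = 4 puts the text to the right of the point
text(Smoke$CigSalesRate[is_tx], Smoke$LungCancerRate[is_tx], "TX", pos = 4)

dev.off()                          # close the file so it is written to disk
```

Each high-level plotting call made while the pdf device is open starts a new page in the same file.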

(Advanced) Now create a plot of LungCancerRate vs CigAdultRate, but color the points differently by region, and include a legend that shows which region each color is assigned to. This results in several problems that must be overcome. Region can be obtained from the R data set named state, which includes an object named state.region. The first problem is that DC is not included, so we need to create a vector that combines DC with the values of state.region. This adds an additional problem because state.region is not an ordinary vector, but instead is a factor.

A factor is how R handles categorical variables. state.region represents a vector of 50 strings, each of which is one of 4 unique values: Northeast, South, North Central, West. These unique values are referred to as the levels of the factor. Internally, R stores the values of a factor using the level index number instead of the string. For example, the region for Alabama is South, so what is stored internally is the index 2, corresponding to the position of Alabama's region within the levels of this factor. Ordinary printing of a factor is equivalent to:
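
As a minimal sketch of this idea, indexing the levels by the stored codes reproduces the strings that are shown when the factor is printed. The variable names codes and labels are introduced here for illustration.

```r
levels(state.region)               # the four unique values, in level order

codes <- as.integer(state.region)  # the internally stored level indices
codes[1]                           # index stored for Alabama (first state)

# Looking up each stored index in the levels recovers the region names
labels <- levels(state.region)[codes]
labels[1:5]
```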

The difference between an ordinary vector and a factor can be illustrated by comparing the result of the following:
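
One possible comparison, offered as a sketch, converts state.region to an ordinary character vector and applies the same functions to both objects. The name region_chr is introduced here for illustration.

```r
region_chr <- as.character(state.region)  # ordinary character vector

class(state.region)    # "factor"
class(region_chr)      # "character"

# summary() counts the states at each level of a factor ...
region_counts <- summary(state.region)
region_counts

# ... but only reports length, class and mode for a character vector
summary(region_chr)

# Converting to integer gives level indices for the factor,
# which have no counterpart in the character vector
as.integer(state.region)[1:3]
```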

For our problem we first must convert state.region to an ordinary character vector, then add the region for DC. Since both Maryland and Virginia have region defined as South, we will use that for DC as well. This can be done by

Region = c(as.character(state.region),"South")

Next we need to put that vector in the same order as the order used for the rows of the data frame that contains the smoke-cancer data. Note that the Smoke data set is ordered by the full state names, not their 2-letter abbreviations. Finally, we need to create a vector of colors in which states in each region are plotted using a unique color for that region. One way of doing these last two steps is by a table lookup.
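
A sketch of the table lookup, under stated assumptions: the vector smoke_rows is a hypothetical stand-in for the row names of the smoke-cancer data frame, DC is assumed to appear there as "District of Columbia", and the colors are arbitrary choices.

```r
# Region for the 50 states plus DC, named by full state name
Region <- c(as.character(state.region), "South")
names(Region) <- c(state.name, "District of Columbia")

# Hypothetical stand-in for the row names of the smoke-cancer data frame
smoke_rows <- c("Texas", "District of Columbia", "Ohio", "Maine", "Utah")

# Lookup 1: put Region in the same order as the data frame's rows
row_region <- Region[smoke_rows]

# Lookup 2: one color per region, indexed by region name
region_colors <- c(Northeast = "blue", South = "red",
                   "North Central" = "darkgreen", West = "orange")
point_colors <- region_colors[row_region]
unname(point_colors)
```

The resulting vector can be passed as the col argument of plot(), and a call such as legend("topleft", legend = names(region_colors), col = region_colors, pch = 19) adds the matching key.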

A script for this example is contained in the file:

Crabs data

Some of the graphical tools available in R are illustrated in the script file

Larry Ammann