The **mean** usually refers to the arithmetic mean or average. This is
just the sum of the measurements divided by the number of measurements. We make
a notational distinction between the mean of a population and the mean of a
sample. The general rule is that Greek letters are used for population
characteristics and Latin letters are used for sample characteristics.
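Therefore,

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

denotes the (arithmetic) mean of a population of size $N$, and

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

denotes the mean of a sample of size $n$.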
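As a small check of the definition (sum of the measurements divided by the number of measurements), the following sketch computes a sample mean by hand and compares it with R's built-in *mean* function. The vector `x` is hypothetical data invented for this illustration.

```r
# Hypothetical measurements, invented for illustration only
x = c(2, 4, 4, 7, 13)

# Mean from the definition: sum of the measurements / number of measurements
mean.by.hand = sum(x) / length(x)

# Built-in equivalent
mean.builtin = mean(x)

mean.by.hand
mean.builtin
```

Both values should agree, since *mean* applies the same definition to a numeric vector.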

The mean has both advantages and disadvantages as a measure of location. It is a natural measure of location for data that have a well-defined middle of high concentration, with the frequency decreasing more or less evenly as we move away from the middle in either direction. The mean is not as useful when the data are heavily skewed. This is illustrated in the following two histograms. The first is the histogram of savings ratio with its mean superimposed, and the second is the histogram of disposable income.

Another disadvantage of this measure is that it is very sensitive to the presence of a relatively
small number of extreme observations. For example, the file

http://www.utdallas.edu/~ammann/stat3355scripts/fuel0.csv

gives some quantities associated with 60 automobiles. The four plots given below are
histograms of **Weight** with the mean of Weight superimposed. The second, third, and fourth
plots are histograms of Weight with the values 10000, 25000, and 70000, respectively, added to the
dataset. The *blue* line is the original mean and the *red* lines are the means of the modified
data.
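The effect of one extreme value on the mean can be sketched numerically. The vector `weight` below is hypothetical and is not taken from fuel0.csv; only the three added values (10000, 25000, 70000) come from the text.

```r
# Hypothetical weights, invented for illustration (NOT the fuel0.csv data)
weight = c(2500, 2700, 3000, 3200, 3600)

# Extreme values named in the text
extreme = c(10000, 25000, 70000)

# Mean of the original data
mean.original = mean(weight)

# Mean after adding each extreme value, one at a time
means.modified = sapply(extreme, function(v) mean(c(weight, v)))
names(means.modified) = as.character(extreme)

mean.original
round(means.modified, 1)
```

A single added observation is enough to pull the mean well away from the bulk of the data, and the shift grows with the size of the added value.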

An alternative measure of location is the **median**. This measure is
defined to be a number such that half of the measurements are below this
number and half are above. The advantage of this measure is that it is not
sensitive to the presence of a few outliers. Also, it gives an intuitive
description of location regardless of the shape of the histogram. The median
is obtained by first ordering the data values from smallest to largest.
If the number of observations *n* is odd, then the median is the
ordered value in position *(n+1)/2*. If *n* is even, then
the median is half-way between the ordered values in positions *n/2* and *n/2 + 1*.
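The rule for odd and even *n* can be sketched as a short function and compared with R's built-in *median*. The function name `median.by.hand` and both data vectors are hypothetical, introduced only for this illustration.

```r
# Median from the definition (hypothetical helper, for illustration only)
median.by.hand = function(x) {
  x.sorted = sort(x)          # order the values from smallest to largest
  n = length(x.sorted)
  if (n %% 2 == 1) {
    # n odd: the ordered value in position (n+1)/2
    x.sorted[(n + 1) / 2]
  } else {
    # n even: half-way between the ordered values in positions n/2 and n/2 + 1
    (x.sorted[n / 2] + x.sorted[n / 2 + 1]) / 2
  }
}

# Hypothetical data with odd and even numbers of observations
odd.data = c(7, 1, 5, 3, 9)
even.data = c(7, 1, 5, 3, 9, 11)

median.by.hand(odd.data)
median(odd.data)
median.by.hand(even.data)
median(even.data)
```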

The plots below are identical to the previous plots except that the median is superimposed in black on each histogram. Note that the location of the median is much more stable than that of the mean. For that reason the median is used to describe the middle of data such as real estate prices and wages.
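The stability of the median can also be seen numerically. This sketch uses a hypothetical `weight` vector (not the fuel0.csv data) and adds each of the extreme values from the text in turn, recording both the mean and the median of the modified data.

```r
# Hypothetical weights, invented for illustration (NOT the fuel0.csv data)
weight = c(2500, 2700, 3000, 3200, 3600)
extreme = c(10000, 25000, 70000)

# Mean and median after adding each extreme value, one at a time
location = sapply(extreme, function(v) {
  modified = c(weight, v)
  c(mean = mean(modified), median = median(modified))
})
colnames(location) = as.character(extreme)

# Original values, for comparison
c(mean = mean(weight), median = median(weight))
round(location, 1)
```

In this example the median is the same for all three modified datasets, because it depends only on the middle ordered values and not on how far out the added value lies.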

The **mode** is simply the most frequently occurring measurement or
category. It is not used much except for some very specialized applications.
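Base R has no function that returns the statistical mode directly (the built-in *mode* function reports the storage type of an object instead), so one common approach is to tabulate the data and pick the most frequent value. The vector `colors` below is hypothetical data invented for this sketch.

```r
# Hypothetical categorical data, invented for illustration only
colors = c("red", "blue", "red", "green", "blue", "red")

# Frequency of each category
freq = table(colors)

# Category (or categories, if tied) with the highest frequency
mode.value = names(freq)[as.vector(freq) == max(freq)]

freq
mode.value
```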

**R notes:**

There is a dataset named *state.x77* in **R** that is a matrix with
50 rows and 8 columns. We can obtain the means for each column using the
function *colMeans*:

    state.means = colMeans(state.x77)

This function is a shortcut for:

    state.means = apply(state.x77,2,mean)

There also is a vector named *state.region*, a factor that gives the region of each state. We can use it to extract the rows of *state.x77* that belong to each region:

    NorthEast.x77 = state.x77[state.region == "Northeast",]
    South.x77 = state.x77[state.region == "South",]
    NorthCentral.x77 = state.x77[state.region == "North Central",]
    West.x77 = state.x77[state.region == "West",]

Suppose we wanted to build a matrix that contains the means for each variable within each region so that rows correspond to region and columns correspond to variables. We could accomplish that as follows.

    # construct blank matrix with dimnames
    Region.means = matrix(0,4,dim(state.x77)[2],
        dimnames=list(levels(state.region),dimnames(state.x77)[[2]]))
    Region.means["Northeast",] = colMeans(NorthEast.x77)
    Region.means["South",] = colMeans(South.x77)
    Region.means["North Central",] = colMeans(NorthCentral.x77)
    Region.means["West",] = colMeans(West.x77)
    Region.means
    round(Region.means,2)
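A more compact way to build the same kind of region-by-variable matrix, offered here as an alternative sketch rather than as the method used in these notes, combines *apply* with *tapply*. The name `Region.means.alt` is hypothetical.

```r
# For each column of state.x77, compute the mean within each region.
# tapply returns one mean per level of state.region, and apply binds
# these results together as the columns of a matrix.
Region.means.alt = apply(state.x77, 2, function(v) tapply(v, state.region, mean))

dim(Region.means.alt)
round(Region.means.alt, 2)
```

This avoids creating a separate matrix for each region, at the cost of being somewhat less explicit about what happens at each step.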

Now suppose we wanted to categorize states by region and by whether or not they are above average in Illiteracy.

    table(state.region,state.x77[,"Illiteracy"] > state.means["Illiteracy"])

We can make this frequency table look better by giving more informative names to the Illiteracy columns.

    Region = state.region #give state.region a better name
    # create logical vector that indicates above average or not
    Illiteracy = state.x77[,"Illiteracy"] > state.means["Illiteracy"]
    # assign the name Region.Illiteracy to freq table
    Region.Illiteracy = table(Region,Illiteracy)
    # change col names of this table
    dimnames(Region.Illiteracy)[[2]] = c("Below Average","Above Average")
    Region.Illiteracy

Another way to do this that gives access to R's object-oriented behavior is to convert the Illiteracy vector to a factor.

    Region = state.region #give state.region a better name
    # create logical factor that indicates above average or not
    Illiteracy = factor(state.x77[,"Illiteracy"] > state.means["Illiteracy"])
    # factor function automatically sorts the levels, so in this case
    # levels are FALSE, TRUE
    levels(Illiteracy)
    # assign new names for these levels
    levels(Illiteracy) = c("Below Average","Above Average")
    # assign the name Region.Illiteracy to freq table
    Region.Illiteracy = table(Region,Illiteracy)
    # now we don't need to change col names of this table
    Region.Illiteracy
    # Plot income vs Illiteracy as a factor instead of a numeric variable
    plot(state.x77[,"Income"] ~ Illiteracy,ylab="Income",col="cyan")
    # add a horizontal line at the overall mean income
    abline(h=state.means["Income"])
    # add title and sub-title
    title("Per Capita Income vs Illiteracy")
    title(sub="Horizontal line is at overall mean income")

Note that *state.x77* is a matrix, not a data frame.

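    is.data.frame(state.x77)
    # make a data frame from this matrix
    State77 = data.frame(state.x77)
    # compare the following two plot commands:
    # here Illiteracy is the numeric column of State77 (scatterplot)
    plot(Income ~ Illiteracy, data=State77)
    # here Illiteracy is the factor created above (boxplots)
    plot(State77$Income ~ Illiteracy,ylab="Income")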
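The practical difference between the matrix and the data frame shows up in how columns are accessed. The sketch below assumes only the built-in *state.x77* dataset. Note that *data.frame* converts column names to syntactically valid R names, so a name containing a space, such as "Life Exp", becomes "Life.Exp".

```r
# state.x77 is a matrix: columns are selected with [ , "name"]
is.matrix(state.x77)
income.from.matrix = state.x77[, "Income"]

# After conversion, columns can also be selected with $
State77 = data.frame(state.x77)
income.from.df = State77$Income

# Column names before and after conversion
colnames(state.x77)
names(State77)

# The values are the same either way
all.equal(as.numeric(income.from.matrix), as.numeric(income.from.df))
```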

2019-04-17