
Measures of Location

We used a histogram to describe the distribution of savings rate and per capita disposable income. Now suppose instead we would like to know where the middle of each of these distributions is located. This requires that we first define what we mean by the middle of a dataset. There are three such measures in common use: the mean, median, and mode.

The mean usually refers to the arithmetic mean or average. This is just the sum of the measurements divided by the number of measurements. We make a notational distinction between the mean of a population and the mean of a sample. The general rule is that Greek letters are used for population characteristics and Latin letters are used for sample characteristics. Therefore,

$\displaystyle \mu = \frac{1}{N}\sum_{i=1}^N X_i,$

denotes the (arithmetic) mean of a population of N observations, and

$\displaystyle \overline{X} = \frac{1}{n}\sum_{i=1}^n X_i,$

denotes the mean of a sample of size n selected from a population. The mean can be thought of as the center of gravity of the data values. That is, the histogram of the data would balance at the location defined by the mean. We can express this property mathematically by noting that the mean is the value of c that solves the equation

$\displaystyle \sum_{i=1}^n(X_i - c) = 0.$
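The balance property can be checked numerically. The following is a minimal sketch using a small made-up vector (the values are illustrative only, not taken from the course data):

```r
# Hypothetical data values, chosen only for illustration
x = c(2, 4, 4, 7, 13)
# arithmetic mean computed directly from the definition
xbar = sum(x) / length(x)
xbar
# the built-in function gives the same value
mean(x)
# deviations from the mean sum to zero (up to floating-point rounding)
sum(x - xbar)
# any other candidate center, here c = 5, leaves a nonzero sum
sum(x - 5)
```

Because the deviations cancel exactly only at the mean, the mean is the unique solution c of the equation above.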

This property of the mean has advantages and disadvantages. The mean is a natural measure of location for data that have a well-defined middle of high concentration with the frequency decreasing more or less evenly as we move away from the middle in either direction. The mean is not as useful when the data are heavily skewed. This is illustrated in the following two histograms. The first is the histogram of savings rate with its mean superimposed, and the second is the histogram of disposable income.

Image stat3355num1

Image stat3355num2

Another disadvantage of the mean is that it is very sensitive to the presence of even a few extreme observations. For example, the file
http://www.utdallas.edu/~ammann/stat3355scripts/fuel0.csv
gives some quantities associated with 60 automobiles. The four plots given below are histograms of Weight with the mean of Weight superimposed. The first uses the original data; the second, third, and fourth are histograms of Weight with the values 10000, 25000, and 70000, respectively, added to the dataset. The blue line is the original mean and the red lines are the means of the modified data.

Image stat3355num3
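The effect of a single extreme value on the mean can be reproduced with a few lines of R. This sketch uses a short vector of hypothetical weights as a stand-in for the Weight variable; these are not the values in fuel0.csv, and only the added values 10000, 25000, and 70000 come from the text.

```r
# Hypothetical weights (in pounds); NOT the values in fuel0.csv
Weight = c(2100, 2400, 2650, 2800, 2950, 3100, 3300, 3600)
# the three extreme values used in the plots
extreme = c(10000, 25000, 70000)
# mean of the original data
mean(Weight)
# mean after adding each extreme value, one at a time
shifted = sapply(extreme, function(v) mean(c(Weight, v)))
shifted
```

Each added value drags the mean further to the right, even though only one observation out of nine has changed.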

An alternative measure of location is the median. This measure is defined to be a number such that half of the measurements are below this number and half are above. The advantage of this measure is that it is not sensitive to the presence of a few outliers. Also, it gives an intuitive description of location regardless of the shape of the histogram. The median is obtained by first ordering the data values from smallest to largest. If the number of observations n is odd, then the median is the ordered value in position (n+1)/2. If n is even, then the median is half-way between the ordered values in positions n/2 and n/2 + 1.
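The ordering rule can be written out directly in R. In this sketch, my.median is a hypothetical helper name introduced only for illustration, and the data vectors are made up; the built-in median function is shown alongside for comparison.

```r
# my.median is a hypothetical helper that follows the ordering rule above
my.median = function(x) {
  s = sort(x)          # order values from smallest to largest
  n = length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]     # odd n: the middle ordered value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even n: half-way between the two middle values
  }
}
# made-up data with odd and even numbers of observations
odd.x = c(7, 1, 9, 3, 5)
even.x = c(7, 1, 9, 3, 5, 11)
my.median(odd.x)
my.median(even.x)
# the built-in function follows the same rule
median(odd.x)
median(even.x)
```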

The plots below are identical to the previous plots except that the median is superimposed in black on each histogram. Note that the location of the median is much more stable than that of the mean. For that reason the median is used to describe the middle of data such as real estate prices and wages.

Image stat3355num4
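The stability of the median can be seen numerically as well. This sketch again uses hypothetical weights, not the values in fuel0.csv, and tabulates the mean and median after each of the three extreme values from the text is added.

```r
# Hypothetical weights (in pounds); NOT the values in fuel0.csv
Weight = c(2100, 2400, 2650, 2800, 2950, 3100, 3300, 3600)
extreme = c(10000, 25000, 70000)
# one row per added value: mean and median of the modified data
location = t(sapply(extreme, function(v) {
  w = c(Weight, v)
  c(added = v, mean = mean(w), median = median(w))
}))
location
# median of the original data, for comparison
median(Weight)
```

The median moves by at most one position in the ordered data no matter how large the added value is, while the mean grows with it.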

The mode is simply the most frequently occurring measurement or category. It is not used much except for some very specialized applications.
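Base R has no function that returns the statistical mode, but it can be obtained from a frequency table. The following is a minimal sketch with made-up categorical data; note that the built-in function named mode does something unrelated.

```r
# Hypothetical categorical data
color = c("red", "blue", "red", "green", "blue", "red")
# frequency of each category
counts = table(color)
counts
# category with the largest count; if there are ties, only the first is returned
stat.mode = names(counts)[which.max(counts)]
stat.mode
# R's built-in mode() reports how an object is stored, not its most frequent value
mode(color)
```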

R notes:

There is a dataset named state.x77 in R that is a matrix with 50 rows and 8 columns. We can obtain the means for each column using the function colMeans:

state.means = colMeans(state.x77)
This function is a shortcut for:
state.means = apply(state.x77,2,mean)
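The equivalence of the two forms can be confirmed with all.equal. The same apply pattern also gives column medians, for which base R has no dedicated shortcut; the name state.medians is introduced here only for illustration.

```r
state.means = colMeans(state.x77)
# same result from apply over columns (second argument 2 means columns)
all.equal(state.means, apply(state.x77, 2, mean))
# column medians via the same apply pattern
state.medians = apply(state.x77, 2, median)
# means and medians side by side
round(cbind(mean = state.means, median = state.medians), 2)
```

Variables whose mean and median differ noticeably are candidates for a skewed distribution.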
There also is a vector named state.region giving the geographic region (Northeast, South, North Central, West) for each state. We can use this to extract data for states belonging to a particular region as follows.
NorthEast.x77 = state.x77[state.region == "Northeast",]
South.x77 = state.x77[state.region == "South",]
NorthCentral.x77 = state.x77[state.region == "North Central",]
West.x77 = state.x77[state.region == "West",]
Suppose we wanted to build a matrix that contains the means for each variable within each region so that rows correspond to region and columns correspond to variables. We could accomplish that as follows.
#construct blank matrix with dimnames
Region.means = matrix(0,4,dim(state.x77)[2],
              dimnames=list(levels(state.region),dimnames(state.x77)[[2]]))
Region.means["Northeast",] = colMeans(NorthEast.x77)
Region.means["South",] = colMeans(South.x77)
Region.means["North Central",] = colMeans(NorthCentral.x77)
Region.means["West",] = colMeans(West.x77)
Region.means
round(Region.means,2)
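A more compact way to build the same matrix is to loop over the region levels with sapply. This is one possible alternative, not the only one; the names Region.means2 and Region.medians are introduced here for illustration.

```r
# one column of means per region, then transpose so rows are regions
Region.means2 = t(sapply(levels(state.region),
                  function(r) colMeans(state.x77[state.region == r, ])))
round(Region.means2, 2)
# the same pattern with median gives the region medians
Region.medians = t(sapply(levels(state.region),
                   function(r) apply(state.x77[state.region == r, ], 2, median)))
round(Region.medians, 2)
```

This avoids creating a separate matrix for each region and extends to any other summary function.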

Now suppose we wanted to categorize states by region and by whether or not they are above average in Illiteracy.

table(state.region,state.x77[,"Illiteracy"] > state.means["Illiteracy"])
We can make this frequency table look better by giving more informative names to the Illiteracy columns.
Region = state.region #give state.region a better name
# create logical vector that indicates above average or not
Illiteracy = state.x77[,"Illiteracy"] > state.means["Illiteracy"]
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# change col names of this table
dimnames(Region.Illiteracy)[[2]] = c("Below Average","Above Average")
Region.Illiteracy
Another way to do this that gives access to R's object-oriented behavior is to convert the Illiteracy vector to a factor.
Region = state.region #give state.region a better name
# create logical factor that indicates above average or not
Illiteracy = factor(state.x77[,"Illiteracy"] > state.means["Illiteracy"])
# factor function automatically sorts the levels, so for a logical vector
# the levels are FALSE, TRUE
levels(Illiteracy)
# assign new names for these levels
levels(Illiteracy) = c("Below Average","Above Average")
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# now we don't need to change col names of this table
Region.Illiteracy
# Plot income vs Illiteracy as a factor instead of a numeric variable
plot(state.x77[,"Income"] ~ Illiteracy,ylab="Income",col="cyan")
# add a horizontal line at the overall mean income
abline(h=state.means["Income"])
# add title and sub-title
title("Per Capita Income vs Illiteracy")
title(sub="Horizontal line is at overall mean income")
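When the right-hand side of the formula is a factor, plot draws one boxplot per level, and the line inside each box is the group median. The corresponding numerical summaries can be obtained with tapply. This sketch rebuilds the factor so that it runs on its own; the variable name Income is introduced for convenience.

```r
state.means = colMeans(state.x77)
# factor indicating above average or not, with labels assigned directly
Illiteracy = factor(state.x77[,"Illiteracy"] > state.means["Illiteracy"],
                    levels = c(FALSE, TRUE),
                    labels = c("Below Average", "Above Average"))
Income = state.x77[,"Income"]
# mean and median income within each Illiteracy group
income.means = tapply(Income, Illiteracy, mean)
income.medians = tapply(Income, Illiteracy, median)
income.means
income.medians
```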

Note that state.x77 is a matrix, not a data frame.

is.data.frame(state.x77)
# make a data frame from this matrix
State77 = data.frame(state.x77)
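# compare the following two plot commands:
# here Illiteracy is the numeric column of State77, so this is a scatterplot
plot(Income ~ Illiteracy, data=State77)
# here Illiteracy is the factor created above, so this gives boxplots
plot(State77$Income ~ Illiteracy,ylab="Income")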
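One side effect of the conversion is worth checking: by default data.frame changes column names that are not syntactically valid R names. The sketch below shows the renamed columns and one way to keep the originals; State77b is a name introduced here for illustration.

```r
# original column names of the matrix
colnames(state.x77)
State77 = data.frame(state.x77)
# "Life Exp" and "HS Grad" become "Life.Exp" and "HS.Grad"
names(State77)
# check.names = FALSE keeps the original names, which then need backticks
State77b = data.frame(state.x77, check.names = FALSE)
names(State77b)
mean(State77$Life.Exp)
mean(State77b$`Life Exp`)
```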


ammann
2019-04-17