Next: Measures of Dispersion Up: Numerical summaries of data Previous: Numerical summaries of data

Measures of Location

We used histograms to describe the distributions of savings rate and per capita disposable income. Now suppose we would like to know where the middle of each of these distributions is located. This requires that we first define what we mean by the middle of a dataset. Three such measures are in common use: the mean, the median, and the mode.

The mean usually refers to the arithmetic mean or average. This is just the sum of the measurements divided by the number of measurements. We make a notational distinction between the mean of a population and the mean of a sample. The general rule is that Greek letters are used for population characteristics and Latin letters are used for sample characteristics. Therefore,

\begin{displaymath}
\mu = \frac{1}{N}\sum_{i=1}^N X_i,
\end{displaymath}

denotes the (arithmetic) mean of a population of N observations, and

\begin{displaymath}
\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i,
\end{displaymath}

denotes the mean of a sample of size n selected from a population. The mean can be thought of as the center of gravity of the data values. That is, the histogram of the data would balance at the location defined by the mean. We can express this property mathematically by noting that the mean is the value of c that solves the equation

\begin{displaymath}
\sum_{i=1}^n(X_i - c) = 0.
\end{displaymath}
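
The balancing property can be checked numerically. The following is a minimal sketch in R; the sample values are arbitrary and chosen only for illustration.

```r
# Arbitrary illustrative sample (not part of the original notes)
x <- c(2560, 2345, 1845, 2260, 2440)
n <- length(x)

# Arithmetic mean computed directly from the definition
xbar <- sum(x) / n
all.equal(xbar, mean(x))   # agrees with R's built-in mean()

# Deviations from the mean sum to zero (up to rounding error)
sum(x - xbar)

# For any other value of c the deviations do not balance;
# here c = xbar + 100, so the sum is -100 * n
sum(x - (xbar + 100))
```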

This property of the mean has advantages and disadvantages. The mean is a natural measure of location for data that have a well-defined middle of high concentration, with the frequency decreasing more or less evenly as we move away from the middle in either direction. The mean is not as useful when the data are heavily skewed. This is illustrated in the following two histograms. The first is the histogram of savings ratio with its mean superimposed, and the second is the histogram of disposable income.

[Figures stat3355num1 and stat3355num2: histogram of savings ratio with its mean superimposed; histogram of disposable income]

Another disadvantage of the mean is that it is very sensitive to the presence of even a few extreme observations. For example, the following data give some quantities associated with 60 automobiles.

                              Weight Disp. Mileage     Fuel    Type
Eagle Summit 4                  2560    97      33 3.030303   Small
Ford Escort 4                   2345   114      33 3.030303   Small
Ford Festiva 4                  1845    81      37 2.702703   Small
Honda Civic 4                   2260    91      32 3.125000   Small
Mazda Protege 4                 2440   113      32 3.125000   Small
Mercury Tracer 4                2285    97      26 3.846154   Small
Nissan Sentra 4                 2275    97      33 3.030303   Small
Pontiac LeMans 4                2350    98      28 3.571429   Small
Subaru Loyale 4                 2295   109      25 4.000000   Small
Subaru Justy 3                  1900    73      34 2.941176   Small
Toyota Corolla 4                2390    97      29 3.448276   Small
Toyota Tercel 4                 2075    89      35 2.857143   Small
Volkswagen Jetta 4              2330   109      26 3.846154   Small
Chevrolet Camaro V8             3320   305      20 5.000000  Sporty
Dodge Daytona                   2885   153      27 3.703704  Sporty
Ford Mustang V8                 3310   302      19 5.263158  Sporty
Ford Probe                      2695   133      30 3.333333  Sporty
Honda Civic CRX Si 4            2170    97      33 3.030303  Sporty
Honda Prelude Si 4WS 4          2710   125      27 3.703704  Sporty
Nissan 240SX 4                  2775   146      24 4.166667  Sporty
Plymouth Laser                  2840   107      26 3.846154  Sporty
Subaru XT 4                     2485   109      28 3.571429  Sporty
Audi 80 4                       2670   121      27 3.703704 Compact
Buick Skylark 4                 2640   151      23 4.347826 Compact
Chevrolet Beretta 4             2655   133      26 3.846154 Compact
Chrysler Le Baron V6            3065   181      25 4.000000 Compact
Ford Tempo 4                    2750   141      24 4.166667 Compact
Honda Accord 4                  2920   132      26 3.846154 Compact
Mazda 626 4                     2780   133      24 4.166667 Compact
Mitsubishi Galant 4             2745   122      25 4.000000 Compact
Mitsubishi Sigma V6             3110   181      21 4.761905 Compact
Nissan Stanza 4                 2920   146      21 4.761905 Compact
Oldsmobile Calais 4             2645   151      23 4.347826 Compact
Peugeot 405 4                   2575   116      24 4.166667 Compact
Subaru Legacy 4                 2935   135      23 4.347826 Compact
Toyota Camry 4                  2920   122      27 3.703704 Compact
Volvo 240 4                     2985   141      23 4.347826 Compact
Acura Legend V6                 3265   163      20 5.000000  Medium
Buick Century 4                 2880   151      21 4.761905  Medium
Chrysler Le Baron Coupe         2975   153      22 4.545455  Medium
Chrysler New Yorker V6          3450   202      22 4.545455  Medium
Eagle Premier V6                3145   180      22 4.545455  Medium
Ford Taurus V6                  3190   182      22 4.545455  Medium
Ford Thunderbird V6             3610   232      23 4.347826  Medium
Hyundai Sonata 4                2885   143      23 4.347826  Medium
Mazda 929 V6                    3480   180      21 4.761905  Medium
Nissan Maxima V6                3200   180      22 4.545455  Medium
Oldsmobile Cutlass Ciera 4      2765   151      21 4.761905  Medium
Oldsmobile Cutlass Supreme V6   3220   189      21 4.761905  Medium
Toyota Cressida 6               3480   180      23 4.347826  Medium
Buick Le Sabre V6               3325   231      23 4.347826   Large
Chevrolet Caprice V8            3855   305      18 5.555556   Large
Ford LTD Crown Victoria V8      3850   302      20 5.000000   Large
Chevrolet Lumina APV V6         3195   151      18 5.555556     Van
Dodge Grand Caravan V6          3735   202      18 5.555556     Van
Ford Aerostar V6                3665   182      18 5.555556     Van
Mazda MPV V6                    3735   181      19 5.263158     Van
Mitsubishi Wagon 4              3415   143      20 5.000000     Van
Nissan Axxess 4                 3185   146      20 5.000000     Van
Nissan Van 4                    3690   146      19 5.263158     Van

The four plots given below are histograms of Weight with the mean of Weight superimposed. The first plot uses the original data. The second, third, and fourth plots are histograms of Weight with the values 10000, 25000, and 70000, respectively, added to the dataset. The blue line is the original mean and the red lines are the means of the modified data.

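[Figure stat3355num3: histograms of Weight with the mean superimposed, for the original data and with 10000, 25000, and 70000 added]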
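
The same effect can be seen numerically. The sketch below uses only the Weight values of the 13 Small cars from the table, for brevity, so its numbers will not match the plots, which are based on all 60 cars.

```r
# Weight of the 13 Small cars from the table (a subset, for brevity)
weight <- c(2560, 2345, 1845, 2260, 2440, 2285, 2275, 2350, 2295,
            1900, 2390, 2075, 2330)

# The three extreme values added in the plots
extremes <- c(10000, 25000, 70000)

# Mean of the data after appending each extreme value in turn
shifted <- sapply(extremes, function(v) mean(c(weight, v)))
names(shifted) <- extremes

mean(weight)   # mean of the original subset
shifted        # means after adding one extreme observation
```

A single added observation is enough to pull the mean well above every one of the original weights.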

An alternative measure of location is the median. This measure is defined to be a number such that half of the measurements are below this number and half are above. The advantage of this measure is that it is not sensitive to the presence of a few outliers. Also, it gives an intuitive description of location regardless of the shape of the histogram. The median is obtained by first ordering the data values from smallest to largest. If the number of observations n is odd, then the median is the ordered value in position (n+1)/2. If n is even, then the median is halfway between the ordered values in positions n/2 and n/2 + 1, that is, their average.
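
The rule above can be written out directly. This is a minimal sketch; my_median is a hypothetical helper name used only for illustration, and in practice R's built-in median() does the same job.

```r
# Median from the definition: sort, then take the middle ordered value(s).
# my_median is a hypothetical helper, not a base R function.
my_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: value in position (n+1)/2
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even n: average of positions n/2 and n/2 + 1
  }
}

odd  <- c(33, 37, 32, 26, 28)       # illustrative values, n = 5
even <- c(33, 37, 32, 26, 28, 25)   # illustrative values, n = 6

my_median(odd)  == median(odd)      # compare with the built-in function
my_median(even) == median(even)
```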

The plots below are identical to the previous plots except that the median is superimposed in black on each histogram. Note that the location of the median is much more stable than that of the mean. For that reason the median is commonly used to describe the middle of data such as real estate prices and wages.

[Figure stat3355num4: the same histograms of Weight with the median superimposed in black]
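
A rough numerical counterpart to these plots is sketched below. As an assumption made for brevity, it uses only the Weight values of the 13 Small cars from the table rather than all 60 cars.

```r
# Weight of the 13 Small cars from the table (a subset, for brevity)
weight <- c(2560, 2345, 1845, 2260, 2440, 2285, 2275, 2350, 2295,
            1900, 2390, 2075, 2330)
extremes <- c(10000, 25000, 70000)

# Median after appending each extreme value in turn
medians <- sapply(extremes, function(v) median(c(weight, v)))

median(weight)   # median of the original subset
medians          # median after adding one extreme observation
```

Adding one large value changes n from 13 to 14, so the median moves only from the 7th ordered value to the midpoint of the 7th and 8th. How large the added value is makes no difference.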

The mode is simply the most frequently occurring measurement or category. It is not used much except for some very specialized applications.
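
Base R has no function for the statistical mode (its mode() function reports the storage type of an object), but the mode is easy to obtain from a frequency table. The sketch below rebuilds the Type column of the automobile table from its category counts.

```r
# Type column of the automobile table, rebuilt from the category counts
type <- rep(c("Small", "Sporty", "Compact", "Medium", "Large", "Van"),
            times = c(13, 9, 15, 13, 3, 7))

# Frequency of each category
counts <- table(type)
counts

# The mode is the category (or categories, if tied) with the largest count
modal <- names(counts)[counts == max(counts)]
modal
```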

R notes:

There is a dataset named state.x77 in R that is a matrix with 50 rows and 8 columns. We can obtain the means for each column using the function colMeans:

state.means = colMeans(state.x77)

This function is a shortcut for:

state.means = apply(state.x77,2,mean)

There is also a factor named state.region giving the geographic region (Northeast, South, North Central, West) for each state. We can use this to extract data for states belonging to a particular region as follows.

NorthEast.x77 = state.x77[state.region == "Northeast",]
South.x77 = state.x77[state.region == "South",]
NorthCentral.x77 = state.x77[state.region == "North Central",]
West.x77 = state.x77[state.region == "West",]

Suppose we wanted to build a matrix that contains the means for each variable within each region so that rows correspond to regions and columns correspond to variables. We could accomplish that as follows.

#construct blank matrix with dimnames
Region.means = matrix(0,4,dim(state.x77)[2],
              dimnames=list(levels(state.region),dimnames(state.x77)[[2]]))
Region.means["Northeast",] = colMeans(NorthEast.x77)
Region.means["South",] = colMeans(South.x77)
Region.means["North Central",] = colMeans(NorthCentral.x77)
Region.means["West",] = colMeans(West.x77)
Region.means
round(Region.means,2)
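
The same matrix can be built in one step, without extracting each region separately. This is an alternative sketch, not the approach used above; the name Region.means2 is introduced here only for illustration.

```r
# For each column of state.x77, compute the mean within each region.
# tapply() returns one mean per level of state.region, and apply()
# collects these into a matrix with regions as rows and variables as columns.
Region.means2 <- apply(state.x77, 2,
                       function(v) tapply(v, state.region, mean))
round(Region.means2, 2)
```

This version does not need to be edited if the number of regions or variables changes.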

Now suppose we wanted to categorize states by region and by whether or not they are above average in Illiteracy.

table(state.region,state.x77[,"Illiteracy"] > state.means["Illiteracy"])

We can make this frequency table look better by giving more informative names to the Illiteracy columns.

Region = state.region #give state.region a better name
# create logical vector that indicates above average or not
Illiteracy = state.x77[,"Illiteracy"] > state.means["Illiteracy"]
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# change col names of this table
dimnames(Region.Illiteracy)[[2]] = c("Below Average","Above Average")
Region.Illiteracy

Another way to do this that gives access to R's object-oriented behavior is to convert the Illiteracy vector to a factor.

Region = state.region #give state.region a better name
# create logical factor that indicates above average or not
Illiteracy = factor(state.x77[,"Illiteracy"] > state.means["Illiteracy"])
# factor function automatically orders the levels alphabetically, so in this case
# levels are FALSE, TRUE
levels(Illiteracy)
# assign new names for these levels
levels(Illiteracy) = c("Below Average","Above Average")
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# now we don't need to change col names of this table
Region.Illiteracy
# Plot income vs Illiteracy as a factor instead of a numeric variable
plot(state.x77[,"Income"] ~ Illiteracy,ylab="Income",col=c("cyan"))
# add a horizontal line at the overall mean income
abline(h=state.means["Income"])
# add title and sub-title
title("Per Capita Income vs Illiteracy")
title(sub="Horizontal line is at overall mean income")

Note that state.x77 is a matrix, not a data frame.

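is.data.frame(state.x77)   # FALSE: state.x77 is a matrix
# make a data frame from this matrix
State77 = data.frame(state.x77)
# compare the following two plot commands:
# here Illiteracy is the numeric column of State77, so this is a scatterplot
plot(Income ~ Illiteracy, data=State77)
# here Illiteracy is the factor created above, so this gives side-by-side boxplots
plot(State77$Income ~ Illiteracy,ylab="Income")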


Larry Ammann
2013-12-17