The mean usually refers to the arithmetic mean or average. This is
just the sum of the measurements divided by the number of measurements. We make
a notational distinction between the mean of a population and the mean of a
sample. The general rule is that Greek letters are used for population
characteristics and Latin letters are used for sample characteristics.
Therefore, the mean of a population of N measurements is denoted by μ and the
mean of a sample of n measurements by x̄:
μ = (x_1 + x_2 + ... + x_N)/N,    x̄ = (x_1 + x_2 + ... + x_n)/n.
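As a small illustration of the definition, the following R sketch computes a sample mean as the sum of the measurements divided by their number and compares it with the built-in mean function. The vector x holds made-up values chosen only for this example.

```r
# Hypothetical sample of n = 5 measurements (illustration only)
x <- c(2, 4, 4, 5, 10)

# Sample mean from the definition: sum of the measurements over their count
n <- length(x)
by.hand <- sum(x) / n

# The built-in function should give the same value
built.in <- mean(x)

by.hand    # 5
built.in   # 5
```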
The mean has advantages and disadvantages as a measure of location. It is a natural measure for data that have a well-defined middle of high concentration, with the frequency decreasing more or less evenly as we move away from the middle in either direction. The mean is not as useful when the data are heavily skewed. This is illustrated in the following two histograms. The first is the histogram of savings ratio with its mean superimposed, and the second is the histogram of disposable income.
Another disadvantage of this measure is that it is very sensitive to the presence of relatively few extreme observations. For example, the following data give some quantities associated with 60 automobiles.
Car                           Weight Disp. Mileage     Fuel  Type
Eagle Summit 4                  2560    97      33 3.030303  Small
Ford Escort 4                   2345   114      33 3.030303  Small
Ford Festiva 4                  1845    81      37 2.702703  Small
Honda Civic 4                   2260    91      32 3.125000  Small
Mazda Protege 4                 2440   113      32 3.125000  Small
Mercury Tracer 4                2285    97      26 3.846154  Small
Nissan Sentra 4                 2275    97      33 3.030303  Small
Pontiac LeMans 4                2350    98      28 3.571429  Small
Subaru Loyale 4                 2295   109      25 4.000000  Small
Subaru Justy 3                  1900    73      34 2.941176  Small
Toyota Corolla 4                2390    97      29 3.448276  Small
Toyota Tercel 4                 2075    89      35 2.857143  Small
Volkswagen Jetta 4              2330   109      26 3.846154  Small
Chevrolet Camaro V8             3320   305      20 5.000000  Sporty
Dodge Daytona                   2885   153      27 3.703704  Sporty
Ford Mustang V8                 3310   302      19 5.263158  Sporty
Ford Probe                      2695   133      30 3.333333  Sporty
Honda Civic CRX Si 4            2170    97      33 3.030303  Sporty
Honda Prelude Si 4WS 4          2710   125      27 3.703704  Sporty
Nissan 240SX 4                  2775   146      24 4.166667  Sporty
Plymouth Laser                  2840   107      26 3.846154  Sporty
Subaru XT 4                     2485   109      28 3.571429  Sporty
Audi 80 4                       2670   121      27 3.703704  Compact
Buick Skylark 4                 2640   151      23 4.347826  Compact
Chevrolet Beretta 4             2655   133      26 3.846154  Compact
Chrysler Le Baron V6            3065   181      25 4.000000  Compact
Ford Tempo 4                    2750   141      24 4.166667  Compact
Honda Accord 4                  2920   132      26 3.846154  Compact
Mazda 626 4                     2780   133      24 4.166667  Compact
Mitsubishi Galant 4             2745   122      25 4.000000  Compact
Mitsubishi Sigma V6             3110   181      21 4.761905  Compact
Nissan Stanza 4                 2920   146      21 4.761905  Compact
Oldsmobile Calais 4             2645   151      23 4.347826  Compact
Peugeot 405 4                   2575   116      24 4.166667  Compact
Subaru Legacy 4                 2935   135      23 4.347826  Compact
Toyota Camry 4                  2920   122      27 3.703704  Compact
Volvo 240 4                     2985   141      23 4.347826  Compact
Acura Legend V6                 3265   163      20 5.000000  Medium
Buick Century 4                 2880   151      21 4.761905  Medium
Chrysler Le Baron Coupe         2975   153      22 4.545455  Medium
Chrysler New Yorker V6          3450   202      22 4.545455  Medium
Eagle Premier V6                3145   180      22 4.545455  Medium
Ford Taurus V6                  3190   182      22 4.545455  Medium
Ford Thunderbird V6             3610   232      23 4.347826  Medium
Hyundai Sonata 4                2885   143      23 4.347826  Medium
Mazda 929 V6                    3480   180      21 4.761905  Medium
Nissan Maxima V6                3200   180      22 4.545455  Medium
Oldsmobile Cutlass Ciera 4      2765   151      21 4.761905  Medium
Oldsmobile Cutlass Supreme V6   3220   189      21 4.761905  Medium
Toyota Cressida 6               3480   180      23 4.347826  Medium
Buick Le Sabre V6               3325   231      23 4.347826  Large
Chevrolet Caprice V8            3855   305      18 5.555556  Large
Ford LTD Crown Victoria V8      3850   302      20 5.000000  Large
Chevrolet Lumina APV V6         3195   151      18 5.555556  Van
Dodge Grand Caravan V6          3735   202      18 5.555556  Van
Ford Aerostar V6                3665   182      18 5.555556  Van
Mazda MPV V6                    3735   181      19 5.263158  Van
Mitsubishi Wagon 4              3415   143      20 5.000000  Van
Nissan Axxess 4                 3185   146      20 5.000000  Van
Nissan Van 4                    3690   146      19 5.263158  Van
The 4 plots given below represent histograms of Weight with the mean
of Weight superimposed. The second, third, and fourth plots are histograms of
Weight with the values 10000, 25000, and 70000, respectively, added to the
dataset. The blue line is the original mean and the red lines are the
means of the modified data.
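The effect described above can be reproduced numerically. The following R sketch types in the Weight column of the 60 automobiles by hand (the object names weight, extremes, and modified.means are chosen for this illustration) and recomputes the mean after adding each extreme value in turn.

```r
# Weight column of the 60 automobiles listed above, in table order
weight <- c(2560, 2345, 1845, 2260, 2440, 2285, 2275, 2350, 2295, 1900,
            2390, 2075, 2330,                                    # Small
            3320, 2885, 3310, 2695, 2170, 2710, 2775, 2840, 2485, # Sporty
            2670, 2640, 2655, 3065, 2750, 2920, 2780, 2745, 3110, 2920,
            2645, 2575, 2935, 2920, 2985,                        # Compact
            3265, 2880, 2975, 3450, 3145, 3190, 3610, 2885, 3480, 3200,
            2765, 3220, 3480,                                    # Medium
            3325, 3855, 3850,                                    # Large
            3195, 3735, 3665, 3735, 3415, 3185, 3690)            # Van

# Mean of the original data
original.mean <- mean(weight)

# Mean after adding one extreme observation at a time
extremes <- c(10000, 25000, 70000)
modified.means <- sapply(extremes, function(v) mean(c(weight, v)))
names(modified.means) <- extremes

original.mean
modified.means

# How far each single added value pulls the mean
modified.means - original.mean
```

A single added observation v moves the mean of n values by (v - mean)/(n + 1), so the shift grows without bound as v grows.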
An alternative measure of location is the median. This measure is defined to be a number such that half of the measurements are below this number and half are above. The advantage of this measure is that it is not sensitive to the presence of a few outliers. Also, it gives an intuitive description of location regardless of the shape of the histogram. The median is obtained by first ordering the data values from smallest to largest. If the number of observations n is odd, then the median is the ordered value in position (n+1)/2. If n is even, then the median is halfway between the ordered values in positions n/2 and n/2 + 1, that is, their average.
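The rule for odd and even n can be written out directly. The function median.by.position below is a hypothetical helper written for this illustration, and the sample vectors are made up; the built-in median function is the usual tool and should agree with it.

```r
# Median from the definition: sort, then pick the middle position(s)
median.by.position <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # n odd: the middle ordered value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # n even: average of the two middle values
  }
}

odd.sample  <- c(7, 1, 5, 3, 9)      # ordered: 1 3 5 7 9
even.sample <- c(7, 1, 5, 3)         # ordered: 1 3 5 7

median.by.position(odd.sample)       # 5
median.by.position(even.sample)      # 4

# Replacing the largest value by an extreme one leaves the median unchanged
with.outlier <- c(7, 1, 5, 3, 9000)
median.by.position(with.outlier)     # 5
```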
The plots below are identical to the previous plots except that the median is superimposed in black on each histogram. Note that the location of the median is much more stable than that of the mean. For that reason, the median is used to describe the middle of data such as real estate prices and wages.
The mode is simply the most frequently occurring measurement or category. It is not used much except for some very specialized applications.
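R has no built-in function for the statistical mode (the function named mode reports how an object is stored), but it can be read off a frequency table. The sketch below rebuilds the Type column of the automobile data from its category counts; the object names type, counts, and sample.mode are chosen for this illustration.

```r
# Type column of the automobile table, rebuilt from the category counts
# (13 Small, 9 Sporty, 15 Compact, 13 Medium, 3 Large, 7 Van)
type <- rep(c("Small", "Sporty", "Compact", "Medium", "Large", "Van"),
            times = c(13, 9, 15, 13, 3, 7))

# Frequency of each category
counts <- table(type)

# Keep every category that attains the maximum count, in case of ties
sample.mode <- names(counts)[counts == max(counts)]

counts
sample.mode   # "Compact"
```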
R notes:
There is a dataset named state.x77 in R that is a matrix with 50 rows and 8 columns. We can obtain the means for each column using the function colMeans:
state.means = colMeans(state.x77)
This function is a shortcut for:
state.means = apply(state.x77,2,mean)
There is also a factor named state.region giving the geographic region (Northeast, South, North Central, West) for each state. We can use this to extract data for states belonging to a particular region as follows.
NorthEast.x77 = state.x77[state.region == "Northeast",]
South.x77 = state.x77[state.region == "South",]
NorthCentral.x77 = state.x77[state.region == "North Central",]
West.x77 = state.x77[state.region == "West",]
Suppose we wanted to build a matrix that contains the means for each variable within each region so that rows correspond to region and columns correspond to variables. We could accomplish that as follows.
#construct blank matrix with dimnames
Region.means = matrix(0,4,dim(state.x77)[2],
dimnames=list(levels(state.region),dimnames(state.x77)[[2]]))
Region.means["Northeast",] = colMeans(NorthEast.x77)
Region.means["South",] = colMeans(South.x77)
Region.means["North Central",] = colMeans(NorthCentral.x77)
Region.means["West",] = colMeans(West.x77)
Region.means
round(Region.means,2)
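The same matrix of regional means can also be built in one step, without creating a separate matrix per region. This is only an alternative sketch, not a replacement for the approach above; the name by.region is chosen for this illustration. It applies tapply to each column of state.x77, grouping by state.region.

```r
# For each column of state.x77, compute the mean within each region.
# tapply returns one value per level of state.region, so apply stacks
# the results into a 4 x 8 matrix: rows are regions, columns are variables.
by.region <- apply(state.x77, 2, function(v) tapply(v, state.region, mean))

round(by.region, 2)
```

Because the grouping comes from the factor itself, this version does not need the region names typed out by hand.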
Now suppose we wanted to categorize states by region and by whether or not they are above average in Illiteracy.
table(state.region,state.x77[,"Illiteracy"] > state.means["Illiteracy"])
We can make this frequency table look better by giving more informative names to the Illiteracy columns.
Region = state.region #give state.region a better name
# create logical vector that indicates above average or not
Illiteracy = state.x77[,"Illiteracy"] > state.means["Illiteracy"]
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# change col names of this table
dimnames(Region.Illiteracy)[[2]] = c("Below Average","Above Average")
Region.Illiteracy
Another way to do this, which gives access to R's object-oriented behavior, is to convert the Illiteracy vector
to a factor.
Region = state.region #give state.region a better name
# create logical factor that indicates above average or not
Illiteracy = factor(state.x77[,"Illiteracy"] > state.means["Illiteracy"])
# factor function automatically orders the levels alphabetically, so in this case
# levels are FALSE, TRUE
levels(Illiteracy)
# assign new names for these levels
levels(Illiteracy) = c("Below Average","Above Average")
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# now we don't need to change col names of this table
Region.Illiteracy
# Plot income vs Illiteracy as a factor instead of a numeric variable
plot(state.x77[,"Income"] ~ Illiteracy,ylab="Income",col="cyan")
# add a horizontal line at the overall mean income
abline(h=state.means["Income"])
# add title and sub-title
title("Per Capita Income vs Illiteracy")
title(sub="Horizontal line is at overall mean income")
Note that state.x77 is a matrix, not a data frame.
is.data.frame(state.x77)
# make a data frame from this matrix
State77 = data.frame(state.x77)
# compare the following two plot commands:
plot(Income ~ Illiteracy, data=State77)
plot(State77$Income ~ Illiteracy,ylab="Income")
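The practical difference between the matrix and the data frame can be checked directly. The sketch below shows two consequences: the $ operator works on the data frame but not on the matrix, and data.frame, with its default settings, rewrites column names that contain spaces. The names dollar.fails and changed are chosen for this illustration.

```r
State77 <- data.frame(state.x77)

is.matrix(state.x77)       # TRUE
is.data.frame(State77)     # TRUE

# $ is not defined for a matrix, so this attempt raises an error
dollar.fails <- inherits(try(state.x77$Income, silent = TRUE), "try-error")
dollar.fails               # TRUE

# Column names containing a space are made syntactically valid
changed <- setdiff(colnames(state.x77), names(State77))
changed                    # "Life Exp" "HS Grad"
names(State77)

# Both objects hold the same numbers
mean(State77$Income) == mean(state.x77[, "Income"])
```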