The **mean** usually refers to the arithmetic mean or average. This is
just the sum of the measurements divided by the number of measurements. We make
a notational distinction between the mean of a population and the mean of a
sample. The general rule is that Greek letters are used for population
characteristics and Latin letters are used for sample characteristics.
Therefore,

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

denotes the (arithmetic) mean of a population of $N$ measurements, and

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

denotes the mean of a sample of size $n$.
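As a small illustration of the definition (sum of the measurements divided by their number), the sketch below computes a sample mean by hand and compares it with R's built-in `mean`. The five values are made up for the example and do not come from any dataset in these notes.

```r
# Hypothetical sample of n = 5 measurements (made-up values)
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Definition: sum of the measurements divided by the number of measurements
xbar.manual <- sum(x) / n

# Built-in equivalent
xbar.builtin <- mean(x)

print(c(manual = xbar.manual, builtin = xbar.builtin))
```

Both approaches should agree; the manual version is shown only to make the definition concrete.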

This property of the mean has advantages and disadvantages. The mean is a natural measure of location for data that have a well-defined middle of high concentration, with the frequency decreasing more or less evenly as we move away from the middle in either direction. The mean is not as useful when the data are heavily skewed. This is illustrated in the following two histograms. The first is the histogram of savings ratio with its mean superimposed, and the second is the histogram of disposable income.
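Histograms of this kind can be sketched in R as follows. This is only a rough sketch: it assumes that R's built-in `LifeCycleSavings` data frame (columns `sr` for savings ratio and `dpi` for disposable income) is a reasonable stand-in, which may not be the data behind the figures described above.

```r
# Assumption: LifeCycleSavings stands in for the data used in the figures.
# sr = savings ratio, dpi = per-capita disposable income.
sr  <- LifeCycleSavings$sr
dpi <- LifeCycleSavings$dpi

op <- par(mfrow = c(1, 2))            # two panels side by side

hist(sr, main = "Savings ratio", xlab = "sr")
abline(v = mean(sr), col = "red", lwd = 2)    # mean superimposed

hist(dpi, main = "Disposable income", xlab = "dpi")
abline(v = mean(dpi), col = "red", lwd = 2)   # mean superimposed

par(op)                               # restore previous plot settings
```

In the skewed panel the vertical line should sit away from where most of the observations are concentrated, which is the behavior the paragraph describes.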

Another disadvantage of this measure is that it is very sensitive to the presence of relatively few extreme observations. For example, the following data give some quantities associated with 60 automobiles.

| Car | Weight | Disp. | Mileage | Fuel | Type |
|---|---|---|---|---|---|
| Eagle Summit 4 | 2560 | 97 | 33 | 3.030303 | Small |
| Ford Escort 4 | 2345 | 114 | 33 | 3.030303 | Small |
| Ford Festiva 4 | 1845 | 81 | 37 | 2.702703 | Small |
| Honda Civic 4 | 2260 | 91 | 32 | 3.125000 | Small |
| Mazda Protege 4 | 2440 | 113 | 32 | 3.125000 | Small |
| Mercury Tracer 4 | 2285 | 97 | 26 | 3.846154 | Small |
| Nissan Sentra 4 | 2275 | 97 | 33 | 3.030303 | Small |
| Pontiac LeMans 4 | 2350 | 98 | 28 | 3.571429 | Small |
| Subaru Loyale 4 | 2295 | 109 | 25 | 4.000000 | Small |
| Subaru Justy 3 | 1900 | 73 | 34 | 2.941176 | Small |
| Toyota Corolla 4 | 2390 | 97 | 29 | 3.448276 | Small |
| Toyota Tercel 4 | 2075 | 89 | 35 | 2.857143 | Small |
| Volkswagen Jetta 4 | 2330 | 109 | 26 | 3.846154 | Small |
| Chevrolet Camaro V8 | 3320 | 305 | 20 | 5.000000 | Sporty |
| Dodge Daytona | 2885 | 153 | 27 | 3.703704 | Sporty |
| Ford Mustang V8 | 3310 | 302 | 19 | 5.263158 | Sporty |
| Ford Probe | 2695 | 133 | 30 | 3.333333 | Sporty |
| Honda Civic CRX Si 4 | 2170 | 97 | 33 | 3.030303 | Sporty |
| Honda Prelude Si 4WS 4 | 2710 | 125 | 27 | 3.703704 | Sporty |
| Nissan 240SX 4 | 2775 | 146 | 24 | 4.166667 | Sporty |
| Plymouth Laser | 2840 | 107 | 26 | 3.846154 | Sporty |
| Subaru XT 4 | 2485 | 109 | 28 | 3.571429 | Sporty |
| Audi 80 4 | 2670 | 121 | 27 | 3.703704 | Compact |
| Buick Skylark 4 | 2640 | 151 | 23 | 4.347826 | Compact |
| Chevrolet Beretta 4 | 2655 | 133 | 26 | 3.846154 | Compact |
| Chrysler Le Baron V6 | 3065 | 181 | 25 | 4.000000 | Compact |
| Ford Tempo 4 | 2750 | 141 | 24 | 4.166667 | Compact |
| Honda Accord 4 | 2920 | 132 | 26 | 3.846154 | Compact |
| Mazda 626 4 | 2780 | 133 | 24 | 4.166667 | Compact |
| Mitsubishi Galant 4 | 2745 | 122 | 25 | 4.000000 | Compact |
| Mitsubishi Sigma V6 | 3110 | 181 | 21 | 4.761905 | Compact |
| Nissan Stanza 4 | 2920 | 146 | 21 | 4.761905 | Compact |
| Oldsmobile Calais 4 | 2645 | 151 | 23 | 4.347826 | Compact |
| Peugeot 405 4 | 2575 | 116 | 24 | 4.166667 | Compact |
| Subaru Legacy 4 | 2935 | 135 | 23 | 4.347826 | Compact |
| Toyota Camry 4 | 2920 | 122 | 27 | 3.703704 | Compact |
| Volvo 240 4 | 2985 | 141 | 23 | 4.347826 | Compact |
| Acura Legend V6 | 3265 | 163 | 20 | 5.000000 | Medium |
| Buick Century 4 | 2880 | 151 | 21 | 4.761905 | Medium |
| Chrysler Le Baron Coupe | 2975 | 153 | 22 | 4.545455 | Medium |
| Chrysler New Yorker V6 | 3450 | 202 | 22 | 4.545455 | Medium |
| Eagle Premier V6 | 3145 | 180 | 22 | 4.545455 | Medium |
| Ford Taurus V6 | 3190 | 182 | 22 | 4.545455 | Medium |
| Ford Thunderbird V6 | 3610 | 232 | 23 | 4.347826 | Medium |
| Hyundai Sonata 4 | 2885 | 143 | 23 | 4.347826 | Medium |
| Mazda 929 V6 | 3480 | 180 | 21 | 4.761905 | Medium |
| Nissan Maxima V6 | 3200 | 180 | 22 | 4.545455 | Medium |
| Oldsmobile Cutlass Ciera 4 | 2765 | 151 | 21 | 4.761905 | Medium |
| Oldsmobile Cutlass Supreme V6 | 3220 | 189 | 21 | 4.761905 | Medium |
| Toyota Cressida 6 | 3480 | 180 | 23 | 4.347826 | Medium |
| Buick Le Sabre V6 | 3325 | 231 | 23 | 4.347826 | Large |
| Chevrolet Caprice V8 | 3855 | 305 | 18 | 5.555556 | Large |
| Ford LTD Crown Victoria V8 | 3850 | 302 | 20 | 5.000000 | Large |
| Chevrolet Lumina APV V6 | 3195 | 151 | 18 | 5.555556 | Van |
| Dodge Grand Caravan V6 | 3735 | 202 | 18 | 5.555556 | Van |
| Ford Aerostar V6 | 3665 | 182 | 18 | 5.555556 | Van |
| Mazda MPV V6 | 3735 | 181 | 19 | 5.263158 | Van |
| Mitsubishi Wagon 4 | 3415 | 143 | 20 | 5.000000 | Van |
| Nissan Axxess 4 | 3185 | 146 | 20 | 5.000000 | Van |
| Nissan Van 4 | 3690 | 146 | 19 | 5.263158 | Van |

The 4 plots given below represent histograms of

An alternative measure of location is the **median**. This measure is
defined to be a number such that half of the measurements are below this
number and half are above. The advantage of this measure is that it is not
sensitive to the presence of a few outliers. Also, it gives an intuitive
description of location regardless of the shape of the histogram. The median
is obtained by first ordering the data values from smallest to largest.
If the number of observations *n* is odd, then the median is the
ordered value in position *(n+1)/2*. If *n* is even, then
the median is half-way between the *n/2* and *n/2 + 1*
ordered values.
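The ordering rule just described can be sketched directly in R. The function name `median.by.position` and the two small samples are hypothetical, introduced only for this illustration; in practice the built-in `median` does the same job.

```r
# Median by the position rule: sort, then pick the middle value (n odd)
# or the midpoint of the two middle values (n even).
# 'median.by.position' is a hypothetical helper, not a base R function.
median.by.position <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                      # ordered value in position (n+1)/2
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2       # half-way between positions n/2 and n/2 + 1
  }
}

odd.sample  <- c(7, 1, 5, 3, 9)         # n = 5, made-up values
even.sample <- c(7, 1, 5, 3, 9, 11)     # n = 6, made-up values

print(median.by.position(odd.sample))   # middle of 1 3 5 7 9
print(median.by.position(even.sample))  # midpoint of 5 and 7
print(median(odd.sample))               # built-in, for comparison
print(median(even.sample))
```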

The plots below are identical to the previous plots except that the median is superimposed in black on each histogram. Note that the location of the median is much more stable than that of the mean. For that reason the median is used to describe the middle of data such as real estate prices and wages.
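The stability of the median can also be seen numerically. The sketch below uses made-up wage figures (in thousands, purely hypothetical) and adds a single extreme observation to see how each measure responds.

```r
# Hypothetical wages in thousands (made-up values)
wages <- c(30, 32, 35, 38, 40)

# Same data with one extreme observation added
wages.outlier <- c(wages, 500)

summary.table <- rbind(
  original     = c(mean = mean(wages),         median = median(wages)),
  with.outlier = c(mean = mean(wages.outlier), median = median(wages.outlier))
)
print(summary.table)
```

With these values the mean moves from 35 to 112.5 while the median only moves from 35 to 36.5.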

The **mode** is simply the most frequently occurring measurement or
category. It is not used much except for some very specialized applications.
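Base R has no built-in function for the statistical mode (the function `mode` reports an object's storage mode instead), so one possible sketch is to tabulate the values and pick the most frequent. The category labels below are made up for the example.

```r
# Hypothetical categorical measurements (made-up values)
car.type <- c("Small", "Van", "Compact", "Small", "Compact", "Compact")

# Frequency of each category
counts <- table(car.type)

# Keep every category that attains the maximum count (there may be ties)
modes <- names(counts)[counts == max(counts)]

print(counts)
print(modes)
```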

**R notes:**

There is a dataset named *state.x77* in **R** that is a matrix with
50 rows and 8 columns. We can obtain the means for each column using the
function *colMeans*:

```r
state.means = colMeans(state.x77)
```

This function is a shortcut for:

```r
state.means = apply(state.x77,2,mean)
```

There also is a vector named

```r
NorthEast.x77 = state.x77[state.region == "Northeast",]
South.x77 = state.x77[state.region == "South",]
NorthCentral.x77 = state.x77[state.region == "North Central",]
West.x77 = state.x77[state.region == "West",]
```

Suppose we wanted to build a matrix that contains the means for each variable within each region so that rows correspond to region and columns correspond to variables. We could accomplish that as follows.

```r
# construct blank matrix with dimnames
Region.means = matrix(0,4,dim(state.x77)[2],
    dimnames=list(levels(state.region),dimnames(state.x77)[[2]]))
Region.means["Northeast",] = colMeans(NorthEast.x77)
Region.means["South",] = colMeans(South.x77)
Region.means["North Central",] = colMeans(NorthCentral.x77)
Region.means["West",] = colMeans(West.x77)
Region.means
round(Region.means,2)
```
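The same region-by-variable matrix of means can probably be built in one step by combining `apply` with `tapply`. This is an alternative sketch, not the approach used in these notes; the name `Region.means.alt` is introduced here only for illustration.

```r
# For each column of state.x77, compute the mean within each region.
# tapply() returns one mean per level of state.region, so apply()
# assembles a matrix with regions as rows and variables as columns.
Region.means.alt <- apply(state.x77, 2,
                          function(v) tapply(v, state.region, mean))

print(round(Region.means.alt, 2))
```

This avoids creating a separate sub-matrix for each region, at the cost of being a little less explicit about what is happening.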

Now suppose we wanted to categorize states by region and by whether or not they are above average in Illiteracy.

```r
table(state.region,state.x77[,"Illiteracy"] > state.means["Illiteracy"])
```

We can make this frequency table look better by giving more informative names to the Illiteracy columns.

```r
Region = state.region # give state.region a better name
# create logical vector that indicates above average or not
Illiteracy = state.x77[,"Illiteracy"] > state.means["Illiteracy"]
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# change col names of this table
dimnames(Region.Illiteracy)[[2]] = c("Below Average","Above Average")
Region.Illiteracy
```

Another way to do this, one that gives access to R's object-oriented behavior, is to convert the Illiteracy vector to a factor.

```r
Region = state.region # give state.region a better name
# create logical factor that indicates above average or not
Illiteracy = factor(state.x77[,"Illiteracy"] > state.means["Illiteracy"])
# factor function automatically orders the levels alphabetically, so in this case
# levels are FALSE, TRUE
levels(Illiteracy)
# assign new names for these levels
levels(Illiteracy) = c("Below Average","Above Average")
# assign the name Region.Illiteracy to freq table
Region.Illiteracy = table(Region,Illiteracy)
# now we don't need to change col names of this table
Region.Illiteracy
# Plot income vs Illiteracy as a factor instead of a numeric variable
plot(state.x77[,"Income"] ~ Illiteracy,ylab="Income",col=c("cyan"))
# add a horizontal line at the overall mean income
abline(h=state.means["Income"])
# add title and sub-title
title("Per Capita Income vs Illiteracy")
title(sub="Horizontal line is at overall mean income")
```

Note that *state.x77* is a matrix, not a data frame.

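```r
is.data.frame(state.x77)
# make a data frame from this matrix
State77 = data.frame(state.x77)
# compare the following two plot commands:
# here Illiteracy is the numeric column of State77
plot(Income ~ Illiteracy, data=State77)
# here Illiteracy is the factor created earlier in the workspace
plot(State77$Income ~ Illiteracy,ylab="Income")
```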

2014-12-08