It is possible to have two very different datasets with the same means and medians. For that reason, measures of the middle are useful but limited. Another important attribute of a dataset is its dispersion or variability about its middle. The most useful measures of dispersion are the range, percentiles, and the standard deviation. The range is the difference between the largest and the smallest data values. Therefore, the more spread out the data values are, the larger the range will be. However, if a few observations are relatively far from the middle but the rest are relatively close to the middle, the range can give a distorted measure of dispersion.
Percentiles are positional measures for a dataset that enable one to determine the relative standing of a single measurement within the dataset. In particular, the pth percentile is defined to be a number such that p% of the observations are less than or equal to that number and (100 − p)% are greater than that number. So, for example, an observation that is at the 75th percentile is less than only 25% of the data. In practice, we often cannot satisfy the definition exactly. However, the steps outlined below at least satisfy the spirit of the definition.
The 50th percentile is the median and partitions the data into a lower half (below the median) and an upper half (above the median). The 25th, 50th, and 75th percentiles are referred to as quartiles. They partition the data into 4 groups with 25% of the values below the 25th percentile (lower quartile), 25% between the lower quartile and the median, 25% between the median and the 75th percentile (upper quartile), and 25% above the upper quartile. The difference between the upper and lower quartiles is referred to as the inter-quartile range (IQR). This is the range of the middle 50% of the data.
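In R, percentiles and quartiles can be computed with the built-in quantile() and IQR() functions. A minimal sketch, using a small made-up vector for illustration:

```r
# A small illustrative dataset (values chosen for this example)
x <- c(2, 4, 7, 8, 10, 12, 15, 18, 21, 25)
# The quartiles: 25th, 50th (median), and 75th percentiles
quantile(x, probs = c(0.25, 0.50, 0.75))
# The inter-quartile range: upper quartile minus lower quartile
IQR(x)
# IQR(x) agrees with the difference of the quartiles
unname(quantile(x, 0.75) - quantile(x, 0.25))
```

Note that quantile() supports several interpolation rules (its type argument), so different software may report slightly different quartiles for the same data.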
The third measure of dispersion we will consider here is associated with the concept of distance between a number and a set of data. Suppose we are interested in a particular dataset and would like to summarize the information in that data with a single value that represents the closest number to the data. To accomplish this requires that we first define a measure of distance between a number and a dataset. One such measure can be defined as the total distance between the number and the values in the dataset. That is, the distance between a number c and a set of data values, x1, x2, ..., xn, would be

D(c) = |x1 − c| + |x2 − c| + ... + |xn − c|.
It can be shown that the value that minimizes D(c) is the median. However, this measure of distance is not widely used for several reasons, one of which is that this minimization problem does not always have a unique solution.
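This minimization can be checked numerically with a grid search. A small sketch (the data vector and grid are illustrative assumptions, not part of the notes):

```r
# Total absolute distance between a number c and the data
x <- c(1, 3, 4, 7, 20)
D <- function(c) sum(abs(x - c))
# Evaluate D over a fine grid and locate its minimizer
grid <- seq(min(x), max(x), by = 0.01)
best <- grid[which.min(sapply(grid, D))]
best        # essentially the median of x
median(x)
```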
An alternative measure of distance between a number and a set of data that is widely used and does have a unique solution is defined by

D(c) = (x1 − c)² + (x2 − c)² + ... + (xn − c)².
That is, the distance between a number c and the data is the sum of the squared distances between c and each data value. We can take as our single number summary the value of c that is closest to the dataset, i.e., the value of c which minimizes D(c). It can be shown that the value that minimizes this distance is the mean, x̄. This is accomplished by differentiating D(c) with respect to c and setting the derivative equal to 0:

D′(c) = −2[(x1 − c) + (x2 − c) + ... + (xn − c)] = 0.
As we have already seen, the solution to this equation is c = (x1 + x2 + ... + xn)/n = x̄. The graphic below gives a histogram of the Weight data with the distance function D(c) superimposed. This graph shows that the minimum distance occurs at the mean of Weight.
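The same kind of numerical check confirms that the mean minimizes the squared-distance function; again the data vector below is just an illustrative assumption:

```r
# Sum of squared distances between a number c and the data
x <- c(1, 3, 4, 7, 20)
D2 <- function(c) sum((x - c)^2)
# Grid search for the minimizer
grid <- seq(min(x), max(x), by = 0.01)
best <- grid[which.min(sapply(grid, D2))]
best      # essentially mean(x)
mean(x)
```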
The mean is the closest single number to the data when we define distance by the square of the deviation between the number and a data value. The average squared distance between the data and the mean is referred to as the variance of the data. We make a notational distinction and a minor arithmetic distinction between variance defined for populations and variance defined for samples. We use

σ² = [(x1 − μ)² + ... + (xN − μ)²]/N

for population variances, and

s² = [(x1 − x̄)² + ... + (xn − x̄)²]/(n − 1)

for sample variances. Note that the unit of measure for the variance is the square of the unit of measure for the data. For that reason (and others), the square root of the variance, called the standard deviation, is more commonly used as a measure of dispersion,

σ = √(σ²) for populations and s = √(s²) for samples.
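These formulas correspond directly to R's var() and sd() functions, which use the sample divisor n − 1. A quick check on a made-up vector:

```r
x <- c(2, 4, 7, 8, 10, 12, 15, 18, 21, 25)
n <- length(x)
# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum((x - mean(x))^2) / (n - 1)
all.equal(s2, var(x))         # matches R's var()
all.equal(sqrt(s2), sd(x))    # sd is the square root of the variance
```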
Note that datasets in which the values tend to be far away from the middle have a large variance (and hence large standard deviation), and datasets in which the values cluster closely around the middle have small variance. Unfortunately, it is also the case that a data set with one value very far from the middle and the rest very close to the middle also will have a large variance.
The standard deviation of a dataset can be interpreted by Chebychev's Theorem:
for any k > 1, the proportion of observations within the interval x̄ ± k·s is at least 1 − 1/k². For example, the mean of the Mileage data is 24.583 and the standard deviation is 4.79. Therefore, at least 75% of the cars in this dataset have mileages between 24.583 − 2(4.79) = 15.003 and 24.583 + 2(4.79) = 34.163. Chebychev's theorem is very conservative since it is applicable to every dataset. The actual number of cars whose Mileage falls in the interval (15.003, 34.163) is 58, corresponding to 96.7%. Nevertheless, knowing just the mean and standard deviation of a dataset allows us to obtain a rough picture of the distribution of the data values. Note that the smaller the standard deviation, the smaller is the interval that is guaranteed to contain at least 75% of the observations. Conversely, the larger the standard deviation, the more likely it is that an observation will not be close to the mean. From the point of view of a manufacturer, reduction in variability of some product characteristic would correspond to an increase of consistency of the product. From the point of view of a financial manager, variability of a portfolio's return is referred to as volatility.
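Chebychev's guarantee can be verified on any numeric vector. The sketch below uses simulated values as a stand-in, since the Mileage data itself is not loaded here:

```r
set.seed(1)                             # reproducibility of this illustration
x <- rnorm(60, mean = 24.6, sd = 4.8)   # stand-in for the Mileage data
k <- 2
lo <- mean(x) - k * sd(x)
hi <- mean(x) + k * sd(x)
prop <- mean(x > lo & x < hi)           # proportion within k s.d.'s of the mean
prop                                    # Chebychev guarantees at least 1 - 1/k^2 = 0.75
```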
Note that Chebychev's Theorem applies to all data and therefore must
be conservative. In many situations the actual percentages contained within
these intervals are much higher than the minimums specified by this theorem.
If the shape of the data histogram is known, then better results can be
given. In particular, if it is known that the data histogram is approximately
bell-shaped, then we can say
x̄ ± s contains approximately 68%,
x̄ ± 2s contains approximately 95%,
x̄ ± 3s contains essentially all
of the data values. This set of results is called the empirical rule. Later in the course we will study the bell-shaped curve (known as the normal distribution) in more detail.
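The empirical rule can be illustrated with simulated bell-shaped data:

```r
set.seed(2)
x <- rnorm(1000)                 # approximately bell-shaped data
m <- mean(x)
s <- sd(x)
mean(abs(x - m) < s)             # about 0.68
mean(abs(x - m) < 2 * s)         # about 0.95
mean(abs(x - m) < 3 * s)         # essentially 1
```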
The relative position of an observation in a dataset can be represented by its distance from the mean expressed in terms of the s.d. That is,

z = (x − x̄)/s,

and is referred to as the z-score of the observation. Positive z-scores are above the mean, negative z-scores are below the mean. Z-scores greater than 2 are more than 2 s.d.'s above the mean. From Chebychev's theorem, at least 75% of observations in any dataset will have z-scores between −2 and 2.
Since z-scores are dimensionless, we can compare the relative positions of observations from different populations or samples by comparing their respective z-scores. For example, directly comparing the heights of a husband and wife would not be appropriate since males tend to be taller than females. However, if we knew the means and s.d.'s of males and females, then we could compare their z-scores. This comparison would be more meaningful than a direct comparison of their heights.
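A hypothetical numeric version of this comparison (the heights, means, and s.d.'s below are invented for illustration, not real population statistics):

```r
# Hypothetical values for illustration only
husband <- 73; male_mean <- 69.5; male_sd <- 3.0
wife    <- 68; female_mean <- 64.0; female_sd <- 2.8
# z-scores measure each height relative to its own population
z_h <- (husband - male_mean) / male_sd
z_w <- (wife - female_mean) / female_sd
c(husband = z_h, wife = z_w)   # the larger z-score is relatively taller
```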
If the data histogram is approximately bell-shaped, then essentially all values should be within 3 s.d.'s of the mean, which is an interval of width 6 s.d.'s. A small number of observations that are unusually large or small can greatly inflate the s.d. Such observations are referred to as outliers. Identification of outliers is important, but this can be difficult since they will distort the mean and the s.d. For that reason, we can't simply use x̄ or s for this purpose. We instead make use of some relationships between quartiles and the s.d. of bell-shaped data. In particular, if the data histogram is approximately bell-shaped, then IQR ≈ 1.35s. This relationship can be used to define a robust estimate of the s.d., IQR/1.35, which is then used to identify outliers. Observations that are more than 1.5(IQR) from the nearest quartile are considered to be outliers. Boxplots in R are constructed so that the box edges are at the quartiles, the median is marked by a line within the box, and the box is extended by whiskers indicating the range of observations that are no more than 1.5(IQR) from the nearest quartile. Any observations falling outside this range are plotted with a circle. For example, the following plot shows boxplots of mileage for each automobile type.
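The 1.5(IQR) rule can be applied directly and compared with what R's boxplot machinery flags. The data vector below is an illustrative assumption; note that boxplot.stats() computes its quartiles via hinges, so its fences can differ slightly from those based on quantile():

```r
# Illustrative data with one artificial outlier (60)
x <- c(2, 4, 7, 8, 10, 12, 15, 18, 21, 60)
q <- quantile(x, c(0.25, 0.75))            # lower and upper quartiles
iqr <- unname(diff(q))                     # inter-quartile range
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
x[x < fences[1] | x > fences[2]]           # observations flagged as outliers
boxplot.stats(x)$out                       # boxplot's closely related hinge-based rule
```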
Note that this plot shows how the quantitative variable Mileage and the categorical variable Type are related.
R Notes. The data set BirthwtSmoke.csv is used to illustrate Chebychev's Theorem and the empirical rule. This is a csv file that contains two columns: BirthWeight gives the weight of babies born to 1226 mothers and Smoker indicates whether or not the mother was a smoker.
# import data into R
BW = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv", header=TRUE, sep=",")
# note that Smoker is automatically converted to a factor
# obtain mean and s.d. for all babies
allBirthWeights = BW[,"BirthWeight"]
meanAllWeights = mean(allBirthWeights)
sdAllWeights = sd(allBirthWeights)
# construct histogram of all weights
hist(allBirthWeights, main="Histogram of Birth Weights\nAll Mothers included", col="cyan")
# now report application of Chebychev's Theorem
# print line that gives the interval +- 2 s.d.'s from mean using paste function
cheb.int = meanAllWeights + 2*c(-1,1)*sdAllWeights
cat("At least 3/4 of birth weights are in the interval\n")
cat(paste("[", round(cheb.int[1],1), ", ", round(cheb.int[2],1), "]", sep=""), "\n")
cat("Since histogram is approximately bell-shaped,\n")
cat("we can say that approximately 95% will be in this interval.\n")
# now count how many are in the interval
allprop = mean(allBirthWeights > cheb.int[1] & allBirthWeights < cheb.int[2])
cat(paste("Actual proportion in this interval is", round(allprop,3)), "\n")
Next repeat this separately for mothers who smoke and mothers who don't smoke.
# extract weights for mothers who smoked
smokeBirthWeights = allBirthWeights[BW$Smoker == "Yes"]
meanSmokeWeights = mean(smokeBirthWeights)
sdSmokeWeights = sd(smokeBirthWeights)
# construct histogram of smoker weights
hist(smokeBirthWeights, main="Histogram of Birth Weights: Smoking Mothers", col="cyan")
# now report application of Chebychev's Theorem
# print line that gives the interval +- 2 s.d.'s from mean using paste function
cheb.int = meanSmokeWeights + 2*c(-1,1)*sdSmokeWeights
cat("At least 3/4 of birth weights from mothers who smoked are in the interval\n")
cat(paste("[", round(cheb.int[1],1), ", ", round(cheb.int[2],1), "]", sep=""), "\n")
cat("Since histogram is approximately bell-shaped,\n")
cat("we can say that approximately 95% will be in this interval.\n")
# now count how many are in the interval
smokeprop = mean(smokeBirthWeights > cheb.int[1] & smokeBirthWeights < cheb.int[2])
cat(paste("Actual proportion in this interval is", round(smokeprop,3)), "\n")
# extract weights for mothers who did not smoke
nonSmokeBirthWeights = allBirthWeights[BW$Smoker == "No"]
meannonSmokeWeights = mean(nonSmokeBirthWeights)
sdnonSmokeWeights = sd(nonSmokeBirthWeights)
# construct histogram of non-smoker weights
hist(nonSmokeBirthWeights, main="Histogram of Birth Weights: Non-smoking Mothers", col="cyan")
# now report application of Chebychev's Theorem
# print line that gives the interval +- 2 s.d.'s from mean using paste function
cheb.int = meannonSmokeWeights + 2*c(-1,1)*sdnonSmokeWeights
cat("\nAt least 3/4 of birth weights from mothers who did not smoke are in the interval\n")
cat(paste("[", round(cheb.int[1],1), ", ", round(cheb.int[2],1), "]", sep=""), "\n")
cat("Since histogram is approximately bell-shaped,\n")
cat("we can say that approximately 95% will be in this interval.\n")
# now count how many are in the interval
nonsmokeprop = mean(nonSmokeBirthWeights > cheb.int[1] & nonSmokeBirthWeights < cheb.int[2])
cat(paste("Actual proportion in this interval is", round(nonsmokeprop,3)), "\n")
# now create graphic with both histograms aligned vertically
# use same x-axis limits to make them comparable
png("WeightHists.png", width=600, height=960)
par(mfrow=c(2,1), oma=c(1,0,0,0))
Smoke.tab = table(BW$Smoker)
hist(smokeBirthWeights, main="", col="cyan", xlab="Birth weight", xlim=range(allBirthWeights))
title(sub=paste("Smoking Mothers: n =", Smoke.tab["Yes"]))
mtext("Histogram of Birth Weights", outer=TRUE, cex=1.2, font=2, line=-2)
hist(nonSmokeBirthWeights, main="", col="cyan", xlab="Birth weight", xlim=range(allBirthWeights))
title(sub=paste("Non-smoking Mothers: n =", Smoke.tab["No"]))
graphics.off()
A more effective way to visualize the differences in birth weights between mothers who smoke and those who do not is to use boxplots. These can be obtained through the plot() function, which is what is referred to in R as a generic function. For these data, what we would like to show is how birth weights depend on the smoking status of the mothers. We can do this using the formula interface of plot() as follows.
plot(BirthWeight ~ Smoker, data=BW)

The first argument is the formula, which can be read as: BirthWeight depends on Smoker. The data=BW argument tells R that the names used in the formula are variables in a data frame named BW. In this case the response variable BirthWeight is a numeric variable and the independent variable Smoker is a factor. For this type of formula plot() generates separate boxplots for each level of the factor. The box contains the middle 50% of the responses for a group (lower quartile to upper quartile) and the line within the box represents the group median. The dashed lines and whiskers represent a robust estimate of a 95% coverage interval derived from the median and inter-quartile range instead of the mean and s.d. Now let's create a stand-alone script that makes this plot look better by adding color, a title, and group sizes.
BW = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv", header=TRUE, sep=",")
bw.col = c("SkyBlue","orange")
png("BirthWeightBox.png", width=600, height=600)
plot(BirthWeight ~ Smoker, data=BW, col=bw.col, ylab="Birth Weight")
title("Birth Weights vs Smoking Status of Mothers")
Smoke.tab = table(BW$Smoker)
axis(side=1, at=seq(2), labels=paste("n=", Smoke.tab, sep=""), tick=FALSE, line=1)
graphics.off()