It is possible to have two very different datasets with the same means and medians. For that reason, measures of the middle are useful but limited. Another important attribute of a dataset is its dispersion or variability about its middle. The most useful measures of dispersion are the range, percentiles, and the standard deviation. The range is the difference between the largest and the smallest data values. Therefore, the more spread out the data values are, the larger the range will be. However, if a few observations are relatively far from the middle but the rest are relatively close to the middle, the range can give a distorted measure of dispersion.
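As a small illustration of how a few extreme observations inflate the range, consider the following sketch (the data values here are hypothetical):

```r
# hypothetical data: most values close together, one far from the middle
x <- c(10, 11, 11, 12, 12, 13, 13, 14, 14, 40)
range(x)        # smallest and largest values: 10 and 40
diff(range(x))  # the range as a single number: 30
```

Without the single extreme value 40, the range would be only 4; with it, the range is 30, even though most of the data are tightly clustered.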
Percentiles are positional measures for a dataset that enable one to determine the relative standing of a single measurement within the dataset. In particular, the $p$th percentile is defined to be a number such that $p\%$ of the observations are less than or equal to that number and $(100-p)\%$ are greater than that number. So, for example, an observation that is at the 75th percentile is less than only 25% of the data. In practice, we often cannot satisfy the definition exactly. However, the steps outlined below at least satisfy the spirit of the definition.
The 50th percentile is the median and partitions the data into a lower half (below the median) and an upper half (above the median). The 25th, 50th, and 75th percentiles are referred to as quartiles. They partition the data into 4 groups, with 25% of the values below the 25th percentile (lower quartile), 25% between the lower quartile and the median, 25% between the median and the 75th percentile (upper quartile), and 25% above the upper quartile. The difference between the upper and lower quartiles is referred to as the inter-quartile range (IQR). This is the range of the middle 50% of the data.
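In R, the quartiles and inter-quartile range can be obtained with the quantile() and IQR() functions. A minimal sketch with a hypothetical sample (quantile() has several interpolation methods; the values below come from its default):

```r
# hypothetical sample of 10 values
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 12)
quantile(x, probs = c(0.25, 0.50, 0.75))  # lower quartile, median, upper quartile
IQR(x)                                    # upper quartile minus lower quartile: 4.5 here
```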
The third measure of dispersion we will consider here is associated with the
concept of distance between a number and a set of data. Suppose we are
interested in a particular dataset and would like to summarize the information
in that data with a single value that represents the closest number to
the data. To accomplish this requires that we first define a measure of
distance between a number and a dataset. One such measure can be defined as the
total distance between the number and the values in the dataset. That
is, the distance between a number $c$ and a set of data values $x_1, x_2, \ldots, x_n$ would be
$$d_1(c) = \sum_{i=1}^{n} |x_i - c|.$$
A number that minimizes this total absolute distance is the median, but the minimizer is not necessarily unique.
An alternative measure of distance between a number and a set of data that is widely used and does have a unique solution is defined by
$$d_2(c) = \sum_{i=1}^{n} (x_i - c)^2.$$
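These two distance measures can be compared numerically. The sketch below, with a small hypothetical sample, minimizes each measure over a grid of candidate values:

```r
# numerical check with a small hypothetical sample
x <- c(1, 2, 3, 7, 20)
d1 <- function(m) sum(abs(x - m))   # total absolute distance to the data
d2 <- function(m) sum((x - m)^2)    # total squared distance to the data
grid <- seq(0, 25, by = 0.01)
c.abs <- grid[which.min(sapply(grid, d1))]  # minimizer of d1
c.sq  <- grid[which.min(sapply(grid, d2))]  # minimizer of d2
c(c.abs, median(x))   # d1 is minimized at the median
c(c.sq, mean(x))      # d2 is minimized at the mean
```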
The mean is the closest single number to the data when we define distance by
the square of the deviation between the number and a data value. The
average squared deviation between the data and the mean is referred to as the variance of the data. We make a notational distinction and a minor arithmetic distinction between variance defined for populations and variance defined for samples. We use
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
for the population variance and
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
for the sample variance. In each case the standard deviation is the square root of the variance.
The standard deviation of a dataset can be interpreted by Chebychev's Theorem:
for any $k > 1$, the proportion of observations within the interval $\bar{x} \pm ks$ is at least $1 - 1/k^2$. For example, the mean of the Mileage data is 24.583 and the standard deviation is 4.79. Therefore, at least 75% of the cars in this dataset have mileages between $24.583 - 2(4.79) = 15.003$ and $24.583 + 2(4.79) = 34.163$. Chebychev's theorem is very conservative since it is applicable to every dataset. The actual number of cars whose Mileage falls in the interval (15.003, 34.163) is 58, corresponding to 96.7%. Nevertheless, knowing just the mean and standard deviation of a dataset allows us to obtain a rough picture of the distribution of the data values. Note that the smaller the standard deviation, the smaller is the interval that is guaranteed to contain at least 75% of the observations. Conversely, the larger the standard deviation, the more likely it is that an observation will not be close to the mean. From the point of view of a manufacturer, a reduction in the variability of some product characteristic corresponds to an increase in the consistency of the product. From the point of view of a financial manager, the variability of a portfolio's return is referred to as its volatility.
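The Mileage interval above can be reproduced directly in R from the mean and s.d. reported in the text:

```r
# mean and s.d. of the Mileage data as given in the text
m <- 24.583
s <- 4.79
k <- 2
m + k * c(-1, 1) * s   # interval (15.003, 34.163)
1 - 1/k^2              # Chebychev lower bound for k = 2: 0.75
```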
Note that Chebychev's Theorem applies to all data and therefore must
be conservative. In many situations the actual percentages contained within
these intervals are much higher than the minimums specified by this theorem.
If the shape of the data histogram is known, then better results can be
given. In particular, if it is known that the data histogram is approximately
bell-shaped, then we can say that
$\bar{x} \pm s$ contains approximately 68%,
$\bar{x} \pm 2s$ contains approximately 95%, and
$\bar{x} \pm 3s$ contains essentially all
of the data values. This set of results is called the empirical rule. Later in the course we will study the bell-shaped curve (known as the normal distribution) in more detail.
The relative position of an observation in a data set can be represented by
its distance from the mean expressed in terms of the s.d. That is, the z-score of an observation $x$ is
$$z = \frac{x - \bar{x}}{s}.$$
Since z-scores are dimensionless, we can compare the relative positions of observations from different populations or samples by comparing their respective z-scores. For example, directly comparing the heights of a husband and wife would not be appropriate since males tend to be taller than females. However, if we knew the means and s.d.'s of the heights of males and females, then we could compare their z-scores. This comparison would be more meaningful than a direct comparison of their heights.
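The husband-wife comparison can be sketched in R; all of the heights, means, and s.d.'s below are hypothetical numbers chosen for illustration:

```r
# hypothetical summary statistics for male and female heights (inches)
husband <- 70; male.mean <- 69.5;  male.sd <- 3
wife    <- 66; female.mean <- 64;  female.sd <- 2.5
z.husband <- (husband - male.mean) / male.sd    # z-score within the male population
z.wife    <- (wife - female.mean) / female.sd   # z-score within the female population
c(z.husband, z.wife)
# relative to their own populations, the wife is taller than the husband
```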
If the data histogram is approximately bell-shaped, then essentially all values should be within 3 s.d.'s of the mean, which is an interval of width 6 s.d.'s. A small number of observations that are unusually large or small can greatly inflate the s.d. Such observations are referred to as outliers. Identification of outliers is important, but this can be difficult since they will distort the mean and the s.d. For that reason, we can't simply use $\bar{x}$ or $s$ for this purpose. We instead make use of some relationships between the quartiles and the s.d. of bell-shaped data. In particular, if the data histogram is approximately bell-shaped, then $IQR \approx 1.35s$. This relationship can be used to define a robust estimate of the s.d., $IQR/1.35$, which is then used to identify outliers. Observations that are more than 1.5(IQR) from the nearest quartile are considered to be outliers. Boxplots in R are constructed so that the box edges are at the quartiles, the median is marked by a line within the box, and the box is extended by whiskers indicating the range of observations that are no more than 1.5(IQR) from the nearest quartile. Any observations falling outside this range are plotted with a circle. For example, the following plot shows boxplots of mileage for each automobile type.
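The quartile-based outlier rule can be sketched directly; the sample below is hypothetical, and in this example R's boxplot.stats() flags the same point (its quartile computation can differ slightly from quantile() in general):

```r
# hypothetical sample with one suspiciously large value
x <- c(10, 12, 12, 13, 13, 14, 14, 15, 15, 16, 30)
q <- quantile(x, c(0.25, 0.75))        # lower and upper quartiles
iqr <- q[2] - q[1]                     # inter-quartile range
s.robust <- iqr / 1.35                 # robust estimate of the s.d.
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
x[x < fences[1] | x > fences[2]]       # flagged as an outlier: 30
boxplot.stats(x)$out                   # boxplot's outlier rule flags the same point here
```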
R Notes. The data set BirthwtSmoke.csv is used to illustrate Chebychev's Theorem and the empirical rule. This is a csv file that contains two columns: BirthWeight gives the weight of babies born to 1226 mothers and Smoker indicates whether or not the mother was a smoker.
# import data into R
BW = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv",
    header=TRUE, sep=",")
# convert Smoker to a factor
# (R versions before 4.0 converted character columns automatically)
BW$Smoker = factor(BW$Smoker)
# obtain mean and s.d. for all babies
allBirthWeights = BW[,"BirthWeight"]
meanAllWeights = mean(allBirthWeights)
sdAllWeights = sd(allBirthWeights)
# construct histogram of all weights
hist(allBirthWeights, main="Histogram of Birth Weights\nAll Mothers included", col="cyan")
# now report application of Chebychev's Theorem
# print line that gives the interval +- 2 s.d.'s from mean using paste function
cheb.int = meanAllWeights + 2*c(-1,1)*sdAllWeights
cat("At least 3/4 of birth weights are in the interval\n")
cat(paste("[", round(cheb.int[1],1), ", ", round(cheb.int[2],1), "]", sep=""), "\n")
cat("Since histogram is approximately bell-shaped,\n")
cat("we can say that approximately 95% will be in this interval.\n")
# now compute the proportion actually in the interval
allprop = mean(allBirthWeights > cheb.int[1] & allBirthWeights < cheb.int[2])
cat(paste("Actual proportion in this interval is", round(allprop,3)), "\n")
Next repeat this separately for mothers who smoke and mothers who don't smoke.
# extract weights for mothers who smoked
smokeBirthWeights = allBirthWeights[BW$Smoker == "Yes"]
meanSmokeWeights = mean(smokeBirthWeights)
sdSmokeWeights = sd(smokeBirthWeights)
# construct histogram of smoker weights
hist(smokeBirthWeights, main="Histogram of Birth Weights: Smoking Mothers", col="cyan")
# now report application of Chebychev's Theorem
# print line that gives the interval +- 2 s.d.'s from mean using paste function
cheb.int = meanSmokeWeights + 2*c(-1,1)*sdSmokeWeights
cat("At least 3/4 of birth weights from mothers who smoked are in the interval\n")
cat(paste("[", round(cheb.int[1],1), ", ", round(cheb.int[2],1), "]", sep=""), "\n")
cat("Since histogram is approximately bell-shaped,\n")
cat("we can say that approximately 95% will be in this interval.\n")
# now compute the proportion actually in the interval
smokeprop = mean(smokeBirthWeights > cheb.int[1] & smokeBirthWeights < cheb.int[2])
cat(paste("Actual proportion in this interval is", round(smokeprop,3)), "\n")
# extract weights for mothers who did not smoke
nonSmokeBirthWeights = allBirthWeights[BW$Smoker == "No"]
meannonSmokeWeights = mean(nonSmokeBirthWeights)
sdnonSmokeWeights = sd(nonSmokeBirthWeights)
# construct histogram of non-smoker weights
hist(nonSmokeBirthWeights, main="Histogram of Birth Weights: Non-smoking Mothers", col="cyan")
# now report application of Chebychev's Theorem
cheb.int = meannonSmokeWeights + 2*c(-1,1)*sdnonSmokeWeights
cat("\nAt least 3/4 of birth weights from mothers who did not smoke are in the interval\n")
cat(paste("[", round(cheb.int[1],1), ", ", round(cheb.int[2],1), "]", sep=""), "\n")
cat("Since histogram is approximately bell-shaped,\n")
cat("we can say that approximately 95% will be in this interval.\n")
# now compute the proportion actually in the interval
nonsmokeprop = mean(nonSmokeBirthWeights > cheb.int[1] & nonSmokeBirthWeights < cheb.int[2])
cat(paste("Actual proportion in this interval is", round(nonsmokeprop,3)), "\n")
# now create graphic with both histograms aligned vertically
# use same x-axis limits to make them comparable
png("WeightHists.png", width=600, height=960)
par(mfrow=c(2,1), oma=c(1,0,0,0))
Smoke.tab = table(BW$Smoker)
hist(smokeBirthWeights, main="", col="cyan", xlab="Birth weight",
    xlim=range(allBirthWeights))
title(sub=paste("Smoking Mothers: n =", Smoke.tab["Yes"]))
mtext("Histogram of Birth Weights", outer=TRUE, cex=1.2, font=2, line=-2)
hist(nonSmokeBirthWeights, main="", col="cyan", xlab="Birth weight",
    xlim=range(allBirthWeights))
title(sub=paste("Non-smoking Mothers: n =", Smoke.tab["No"]))
graphics.off()
A more effective way to visualize the differences in birth weights between mothers who smoke and those who do not is to use boxplots. These can be obtained through the plot() function, which is what is referred to in R as a generic function. For these data, what we would like to show is how birth weights depend on the smoking status of the mothers. We can do this using the formula interface of plot() as follows.
plot(BirthWeight ~ Smoker, data=BW)
The first argument is the formula, which can be read as: BirthWeight depends on Smoker. The data=BW argument tells R that the names used in the formula are variables in a data frame named BW. In this case the response variable BirthWeight is numeric and the independent variable Smoker is a factor. For this type of formula, plot() generates separate boxplots for each level of the factor. The box contains the middle 50% of the responses for a group (lower quartile to upper quartile) and the line within the box represents the group median. The dashed lines and whiskers represent a robust estimate of a 95% coverage interval derived from the median and inter-quartile range instead of the mean and s.d. Now let's create a stand-alone script that makes this plot look better by adding color, a title, and group sizes.
BW = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv",
    header=TRUE, sep=",")
BW$Smoker = factor(BW$Smoker) # ensure Smoker is a factor (not automatic in R >= 4.0)
bw.col = c("SkyBlue","orange")
png("BirthWeightBox.png", width=600, height=600)
plot(BirthWeight ~ Smoker, data=BW, col=bw.col, ylab="Birth Weight")
title("Birth Weights vs Smoking Status of Mothers")
Smoke.tab = table(BW$Smoker)
axis(side=1, at=seq(2), labels=paste("n=", Smoke.tab, sep=""), tick=FALSE, line=1)
graphics.off()