It is possible to have two very different datasets with the same means and
medians. For that reason, measures of the middle are useful but limited.
Another important attribute of a dataset is its dispersion or variability
about its middle. The most useful measures of dispersion are the
**range**, **percentiles**, and the **standard deviation**.
The **range** is the difference between the largest and the smallest
data values. Therefore, the more spread out the data values are, the larger
the range will be. However, if a few observations are relatively far from
the middle but the rest are relatively close to the middle, the range can
give a distorted measure of dispersion.
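
A small sketch of this effect in **R**, using two made-up samples (the values are hypothetical and chosen only so that the samples differ in a single observation):

```r
# Two hypothetical samples: identical except for one extreme value
x <- c(48, 49, 50, 50, 51, 52)
y <- c(48, 49, 50, 50, 51, 90)

# range() returns c(min, max); diff() turns that into largest minus smallest
range_x <- diff(range(x))
range_y <- diff(range(y))

cat("Range of x:", range_x, "\n")
cat("Range of y:", range_y, "\n")
```

Five of the six values are the same in both samples, yet the single extreme value in `y` determines its range.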

**Percentiles** are positional measures for a dataset that enable one to
determine the relative standing of a single measurement within the dataset.
In particular, the $p^{th}$ percentile is defined to be a number such that $p\%$
of the observations are less than or equal to that number and $(100-p)\%$
are greater than that number. So, for example, an observation that
is at the $75^{th}$ percentile
is less than only *25%* of the
data. In practice, we often cannot satisfy the definition exactly. However, the
steps outlined below at least satisfy the spirit of the definition.

- Order the data values from smallest to largest; include ties.
- Determine the position, $k.ddd = 1 + p(n-1)/100$, where $k$ is the integer part and $.ddd$ is the fractional part.

- The $p^{th}$ percentile is located between the $k^{th}$ and the $(k+1)^{th}$
ordered value. Use the fractional part of the position,
*.ddd*, as an interpolation factor between these values. If *k = 0*, then take the smallest observation as the percentile, and if *k = n*, then take the largest observation as the percentile.

For example, if *n = 75* and we wish to find the $35^{th}$ percentile, then the position is $1 + 35(74)/100 = 26.9$. The percentile is then located between the $26^{th}$ and $27^{th}$ ordered values. Suppose that these are 57.8 and 61.3, respectively. Then the percentile would be $57.8 + 0.9(61.3 - 57.8) = 60.95$.
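
The steps above can be sketched as a short **R** function. This is only an illustration under a stated assumption: the position is taken to be $1 + p(n-1)/100$, which is the rule used by the default method of **quantile()** (type 7). The function name `percentile` and the simulated data are hypothetical.

```r
# Percentile by the order, locate, and interpolate steps.
# Assumption: position k.ddd = 1 + p(n - 1)/100 (same rule as quantile(), type 7).
percentile <- function(x, p) {
  xs <- sort(x)                      # step 1: order the values, ties included
  n <- length(xs)
  pos <- 1 + p * (n - 1) / 100       # step 2: position k.ddd
  k <- floor(pos)                    # integer part of the position
  frac <- pos - k                    # fractional part, the interpolation factor
  if (k < 1) return(xs[1])           # guard: below the first position
  if (k >= n) return(xs[n])          # at the last position: largest value
  xs[k] + frac * (xs[k + 1] - xs[k]) # step 3: interpolate between neighbors
}

set.seed(1)                          # simulated data, for illustration only
x <- round(rnorm(75, mean = 60, sd = 10), 1)
p35 <- percentile(x, 35)
cat("35th percentile by the steps:", p35, "\n")
cat("35th percentile, quantile(): ", quantile(x, 0.35, names = FALSE), "\n")
```

Under this assumption the two printed values should agree. Other position rules are available through the `type` argument of **quantile()**.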

The $50^{th}$ percentile is the median and partitions the data into a lower
half (below median) and upper half (above median). The $25^{th}$, $50^{th}$, and $75^{th}$
percentiles are referred to as *quartiles*. They partition the
data into 4 groups with 25% of the values below the $25^{th}$ percentile (lower
quartile), 25% between the lower quartile and the median, 25% between the
median and the $75^{th}$ percentile (upper quartile), and 25% above the upper
quartile. The difference between the upper and lower quartiles is referred to as
the *inter-quartile range*. This is the range of the middle 50% of the
data.
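
As a brief illustration, the quartiles and inter-quartile range of a small made-up dataset can be obtained in **R** with **quantile()** and **IQR()**. The data values are hypothetical, and the default interpolation method of **quantile()** is assumed.

```r
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18)   # hypothetical data, n = 11

# lower quartile, median, upper quartile
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
iqr <- q[3] - q[1]                            # upper quartile minus lower quartile

cat("Quartiles:", q, "\n")
cat("Inter-quartile range:", iqr, " (IQR() gives", IQR(x), ")\n")
```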

The third measure of dispersion we will consider here is associated with the
concept of distance between a number and a set of data. Suppose we are
interested in a particular dataset and would like to summarize the information
in that data with a single value that represents the *closest* number to
the data. To accomplish this requires that we first define a measure of
distance between a number and a dataset. One such measure can be defined as the
*total distance between the number and the values in the dataset*. That
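is, the distance between a number **c** and a set of data values,
$x_1, \ldots, x_n$, would be

$$D_1(c) = \sum_{i=1}^{n} |x_i - c|.$$

It can be shown that the value that minimizes $D_1(c)$ is the median. However, this minimizer can fail to be unique: when $n$ is even, every value between the two middle observations gives the same total distance.

An alternative measure of distance between a number and a set of data that
is widely used and does have a unique solution is defined by

$$D_2(c) = \sum_{i=1}^{n} (x_i - c)^2.$$

That is, the distance between a number **c** and the data is the sum of the squared deviations between **c** and each data value. Setting the derivative of $D_2(c)$ with respect to $c$ equal to zero gives the equation $\sum_{i=1}^{n} (x_i - c) = 0$.

As we have already seen, the solution to this equation is $c = \bar{x}$. The graphic below gives a histogram of the Weight data with the distance function $D_2(c)$ superimposed.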
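
The two notions of distance can be explored numerically. The sketch below uses a small made-up dataset and two hypothetical helper functions, `abs_dist` (sum of absolute deviations) and `sq_dist` (sum of squared deviations). It minimizes the squared distance with **optimize()** and evaluates the absolute distance at several points between the two middle observations.

```r
x <- c(3, 5, 6, 9, 14, 20)                        # hypothetical data, n = 6

abs_dist <- function(cval) sum(abs(x - cval))     # total absolute distance
sq_dist  <- function(cval) sum((x - cval)^2)      # total squared distance

# Numerical minimizer of the squared distance, to compare with the mean
best_sq <- optimize(sq_dist, interval = range(x))$minimum
cat("Minimizer of squared distance:", round(best_sq, 3),
    " mean:", mean(x), "\n")

# n is even, so every value between the two middle observations (6 and 9)
# gives the same total absolute distance
cat("abs_dist at 6, 7.5, 9:", abs_dist(6), abs_dist(7.5), abs_dist(9), "\n")
cat("abs_dist at the mean:  ", abs_dist(mean(x)), "\n")
```

The squared distance has a single minimizer at the mean, while the absolute distance is flat across the whole interval between the two middle values.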

The mean is the closest single number to the data when we define distance by
the square of the deviation between the number and a data value. The
*average distance* between the data and the mean is referred to as the
**variance** of the data. We make a notational distinction and a minor
arithmetic distinction between variance defined for populations and variance
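defined for samples. We use

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

for population variances, and

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

for sample variances. Note that the unit of measure for the variance is the square of the unit of measure for the data. For that reason (and others), the square root of the variance, called the **standard deviation** (s.d.), is more commonly used as a measure of dispersion: $\sigma = \sqrt{\sigma^2}$ for populations and $s = \sqrt{s^2}$ for samples.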
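
The arithmetic distinction can be checked directly in **R**. The sketch below assumes the usual definitions (divisor equal to the number of values for a population, one less than that for a sample) and uses made-up data. Note that **var()** and **sd()** use the sample divisor.

```r
x <- c(4, 7, 8, 10, 11)                        # hypothetical data
n <- length(x)

ss <- sum((x - mean(x))^2)                     # sum of squared deviations from the mean
pop_var  <- ss / n                             # treats x as a whole population
samp_var <- ss / (n - 1)                       # treats x as a sample

cat("Population variance:", pop_var, "\n")
cat("Sample variance:    ", samp_var, " (var() gives", var(x), ")\n")
cat("Sample s.d.:        ", sqrt(samp_var), " (sd() gives", sd(x), ")\n")
```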

Note that datasets in which the values tend to be far away from the middle have a large variance (and hence large standard deviation), and datasets in which the values cluster closely around the middle have small variance. Unfortunately, a dataset with one value very far from the middle and the rest very close to the middle will also have a large variance.

The standard deviation of a dataset can be interpreted by **Chebychev's
Theorem**:

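for any $k > 1$, the proportion of observations within the interval $\bar{x} \pm ks$ is at least $1 - 1/k^2$. For example, taking $k = 2$ shows that at least $1 - 1/4 = 75\%$ of the observations in any dataset lie within 2 s.d.'s of the mean, and taking $k = 3$ raises this minimum to $8/9 \approx 89\%$.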

Note that **Chebychev's Theorem** applies to all data and therefore must
be conservative. In many situations the actual percentages contained within
these intervals are much higher than the minimums specified by this theorem.
If the shape of the data histogram is known, then better results can be
given. In particular, if it is known that the data histogram is approximately
bell-shaped, then we can say

$\bar{x} \pm s$ contains approximately 68%,

$\bar{x} \pm 2s$ contains approximately 95%,

$\bar{x} \pm 3s$ contains essentially all

of the data values. This set of results is called the **empirical rule**.
Later in the course we will study the bell-shaped curve (known as the normal
distribution) in more detail.
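
To see how conservative Chebychev's Theorem can be, the sketch below compares its minimum proportions with the actual proportions for two simulated datasets, one bell-shaped and one strongly skewed. The data are simulated for illustration only, and `prop_within` is a hypothetical helper.

```r
set.seed(3355)                                   # simulated data, illustration only
bell   <- rnorm(10000, mean = 50, sd = 10)       # approximately bell-shaped
skewed <- rexp(10000, rate = 1/10)               # strongly right-skewed

# proportion of values within k s.d.'s of the mean
prop_within <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

k <- 1:3                                         # for k = 1 the Chebychev minimum is 0
result <- data.frame(
  k = k,
  chebychev_min = 1 - 1 / k^2,
  bell = sapply(k, function(kk) prop_within(bell, kk)),
  skewed = sapply(k, function(kk) prop_within(skewed, kk))
)
print(round(result, 3))
```

Both datasets must meet the Chebychev minimums, but only the bell-shaped one should follow the empirical rule closely.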

The relative position of an observation in a data set can be represented by
its distance from the mean expressed in terms of the s.d. That is,

$$z = \frac{x - \bar{x}}{s}$$

is referred to as the *z-score* of the observation. Positive z-scores are above the mean, negative z-scores are below the mean. Z-scores greater than 2 are more than 2 s.d.'s above the mean. From Chebychev's theorem, at least 75% of observations in any dataset will have z-scores between -2 and 2.

Because z-scores are dimensionless, we can compare the relative positions of observations from different populations or samples by comparing their respective z-scores. For example, directly comparing the heights of a husband and wife would not be appropriate since males tend to be taller than females. However, if we knew the means and s.d.'s of males and females, then we could compare their z-scores. This comparison would be more meaningful than a direct comparison of their heights.
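
A minimal sketch of such a comparison follows. Every number in it (the two heights and the means and s.d.'s for each group) is hypothetical and chosen only to illustrate the calculation, and `z_score` is a hypothetical helper.

```r
# All numbers below are hypothetical, chosen only to illustrate the comparison
z_score <- function(x, mean, sd) (x - mean) / sd

husband_z <- z_score(72, mean = 69, sd = 3)      # height in inches, male mean and s.d.
wife_z    <- z_score(68, mean = 64, sd = 2.5)    # height in inches, female mean and s.d.

cat("Husband z-score:", husband_z, "\n")
cat("Wife z-score:   ", wife_z, "\n")
```

With these made-up values the husband is taller in inches, but the wife has the larger z-score, so she is taller relative to her own group.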

If the data histogram is approximately bell-shaped, then essentially all values
should be within 3 s.d.'s of the mean, which is an interval of width 6 s.d.'s.
A small number of observations that are unusually large or small can greatly
inflate the s.d. Such observations are referred to as outliers. Identification
of outliers is important, but this can be difficult since they will distort the
mean and the s.d. For that reason, we can't simply use
$\bar{x} \pm 2s$ or
$\bar{x} \pm 3s$ for this purpose.
We instead make use of some relationships between quartiles and the s.d. of
bell-shaped data. In particular, if the data histogram is approximately
bell-shaped, then
$\text{IQR} \approx 1.35s$. This relationship can be used to
define a robust estimate of the s.d. which is then used to identify outliers.
Observations that are more than
$1.5(\text{IQR}) \approx 2s$ from the nearest quartile are
considered to be outliers. Boxplots in **R** are constructed so that the box edges
are at the quartiles, the median is marked by a line within the box, and
the box is extended by whiskers indicating the range of observations that are
no more than 1.5(IQR) from the nearest quartile. Any observations falling
outside this range are plotted with a circle. For example, the following plot
shows boxplots of mileage for each automobile type.

Note that this plot shows how the quantitative variable *Mileage* and the categorical
variable *Type* are related.
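
The outlier rule described above can be sketched directly. The data below are simulated, with two extreme values planted on purpose, and the quartiles are taken from **quantile()** with its default method. **boxplot()** itself uses hinges from **fivenum()**, which can differ slightly from these quartiles.

```r
set.seed(42)                                     # simulated data, illustration only
x <- c(rnorm(200, mean = 25, sd = 4), 90, -40)   # two planted extreme values

q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

robust_sd <- iqr / 1.35                          # robust s.d. estimate for bell-shaped data
lower <- q[1] - 1.5 * iqr                        # more than 1.5(IQR) below lower quartile
upper <- q[2] + 1.5 * iqr                        # more than 1.5(IQR) above upper quartile
outliers <- x[x < lower | x > upper]

cat("Ordinary s.d.:", round(sd(x), 2), "  robust s.d.:", round(robust_sd, 2), "\n")
cat("Flagged as outliers:", round(sort(outliers), 1), "\n")
```

The ordinary s.d. is inflated by the two planted values, while the IQR-based estimate stays close to the s.d. used to simulate the bulk of the data.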

**R Notes**. The data set

http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv

is used to illustrate Chebychev's Theorem and the empirical rule. This is a csv file that contains
two columns: *BirthWeight* gives weight of babies born to 1226 mothers and *Smoker*
indicates whether or not the mother was a smoker.

    # import data into R
    # stringsAsFactors=TRUE makes Smoker a factor (since R 4.0 this is no longer the default)
    BW = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv",
                    header=TRUE, sep=",", stringsAsFactors=TRUE)
    # obtain mean and s.d. for all babies
    allBirthWeights = BW[,"BirthWeight"]
    meanAllWeights = mean(allBirthWeights)
    sdAllWeights = sd(allBirthWeights)
    # construct histogram of all weights
    hist(allBirthWeights, main="Histogram of Birth Weights\nAll Mothers included", col="cyan")
    # now report application of Chebychev's Theorem
    # print line that gives the interval +- 2 s.d.'s from mean using paste function
    cheb.int = meanAllWeights + 2*c(-1,1)*sdAllWeights
    cat("At least 3/4 of birth weights are in the interval\n")
    cat(paste("[",round(cheb.int[1],1),", ", round(cheb.int[2],1),"]",sep=""),"\n")
    cat("Since histogram is approximately bell-shaped,\n")
    cat("we can say that approximately 95% will be in this interval.\n")
    # now count how many are in the interval
    allprop = mean(allBirthWeights > cheb.int[1] & allBirthWeights < cheb.int[2])
    cat(paste("Actual proportion in this interval is",round(allprop,3)),"\n")

Next repeat this separately for mothers who smoke and mothers who don't smoke.

    # extract weights for mothers who smoked
    smokeBirthWeights = allBirthWeights[BW$Smoker == "Yes"]
    meanSmokeWeights = mean(smokeBirthWeights)
    sdSmokeWeights = sd(smokeBirthWeights)
    # construct histogram of smoke weights
    hist(smokeBirthWeights, main="Histogram of Birth Weights: Smoking Mothers", col="cyan")
    # now report application of Chebychev's Theorem
    # print line that gives the interval +- 2 s.d.'s from mean using paste function
    cheb.int = meanSmokeWeights + 2*c(-1,1)*sdSmokeWeights
    cat("At least 3/4 of birth weights from mothers who smoked are in the interval\n")
    cat(paste("[",round(cheb.int[1],1),", ", round(cheb.int[2],1),"]",sep=""),"\n")
    cat("Since histogram is approximately bell-shaped,\n")
    cat("we can say that approximately 95% will be in this interval.\n")
    # now count how many are in the interval
    smokeprop = mean(smokeBirthWeights > cheb.int[1] & smokeBirthWeights < cheb.int[2])
    cat(paste("Actual proportion in this interval is",round(smokeprop,3)),"\n")

    # extract weights for mothers who did not smoke
    nonSmokeBirthWeights = allBirthWeights[BW$Smoker == "No"]
    meannonSmokeWeights = mean(nonSmokeBirthWeights)
    sdnonSmokeWeights = sd(nonSmokeBirthWeights)
    # construct histogram of non smoker weights
    hist(nonSmokeBirthWeights, main="Histogram of Birth Weights: Non-smoking Mothers", col="cyan")
    # now report application of Chebychev's Theorem
    # print line that gives the interval +- 2 s.d.'s from mean using paste function
    cheb.int = meannonSmokeWeights + 2*c(-1,1)*sdnonSmokeWeights
    cat("\nAt least 3/4 of birth weights from mothers who did not smoke are in the interval\n")
    cat(paste("[",round(cheb.int[1],1),", ", round(cheb.int[2],1),"]",sep=""),"\n")
    cat("Since histogram is approximately bell-shaped,\n")
    cat("we can say that approximately 95% will be in this interval.\n")
    # now count how many are in the interval
    nonsmokeprop = mean(nonSmokeBirthWeights > cheb.int[1] & nonSmokeBirthWeights < cheb.int[2])
    cat(paste("Actual proportion in this interval is",round(nonsmokeprop,3)),"\n")

    # now create graphic with both histograms aligned vertically
    # use same x-axis limits to make them comparable
    png("WeightHists.png",width=600,height=960)
    par(mfrow=c(2,1),oma=c(1,0,0,0))
    Smoke.tab = table(BW$Smoker)
    hist(smokeBirthWeights, main="",col="cyan",xlab="Birth weight",xlim=range(allBirthWeights))
    title(sub=paste("Smoking Mothers: n =",Smoke.tab["Yes"]))
    mtext("Histogram of Birth Weights",outer=TRUE,cex=1.2,font=2,line=-2)
    hist(nonSmokeBirthWeights, main="",col="cyan",xlab="Birth weight",xlim=range(allBirthWeights))
    title(sub=paste("Non-smoking Mothers: n =",Smoke.tab["No"]))
    graphics.off()

A more effective way to visualize the differences in birth weights between mothers who smoke and
those who do not is to use boxplots. These can be obtained through the **plot()** function.
This function is what is referred to in **R** as a generic function. For this data what we would
like to show is how birth weights depend on smoking status of mothers. We can do this using the
formula interface of **plot()** as follows.

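    plot(BirthWeight ~ Smoker, data=BW)

The first argument is the formula, which can be read as: *BirthWeight* depends on *Smoker*. The **data=BW** argument tells **R** to look for these variables in the data frame *BW*. When *Smoker* is a factor, **plot()** draws a separate boxplot of *BirthWeight* for each smoking status. The code below adds colors, a title, and the sample size for each group.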

    # stringsAsFactors=TRUE makes Smoker a factor, which plot() needs to draw boxplots
    BW = read.table("http://www.utdallas.edu/~ammann/stat3355scripts/BirthwtSmoke.csv",
                    header=TRUE, sep=",", stringsAsFactors=TRUE)
    bw.col = c("SkyBlue","orange")
    png("BirthWeightBox.png",width=600,height=600)
    plot(BirthWeight ~ Smoker, data=BW, col=bw.col, ylab="Birth Weight")
    title("Birth Weights vs Smoking Status of Mothers")
    # add the sample size for each group below the axis labels
    Smoke.tab = table(BW$Smoker)
    axis(side=1, at=seq(2), labels=paste("n=",Smoke.tab, sep=""), tick=FALSE, line=1)
    graphics.off()

2018-10-18