It is possible to have two very different datasets with the same means and medians. For that reason, measures of the middle are useful but limited. Another important attribute of a dataset is its dispersion or variability about its middle. The most useful measures of dispersion are the range, percentiles, and the standard deviation. The range is the difference between the largest and the smallest data values. Therefore, the more spread out the data values are, the larger the range will be. However, if a few observations are relatively far from the middle but the rest are relatively close to the middle, the range can give a distorted measure of dispersion.
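The sensitivity of the range to a few extreme observations can be seen in a small sketch. The two datasets below are hypothetical and differ in a single value; the helper name `data_range` is also just an illustrative choice.

```python
def data_range(values):
    """Range: the largest data value minus the smallest data value."""
    return max(values) - min(values)

# Hypothetical data: identical except for one value far from the middle.
tight = [20, 21, 22, 23, 24, 25, 26]
with_extreme = [20, 21, 22, 23, 24, 25, 60]

print(data_range(tight))         # 6
print(data_range(with_extreme))  # 40
```

Six of the seven values are the same in both datasets, yet the range grows from 6 to 40 because of one observation.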
Percentiles are positional measures for a dataset that enable one to
determine the relative standing of a single measurement within the dataset.
In particular, the $p^{th}$ percentile
is defined to be a number such that $p\%$
of the observations are less than or equal to that number and $(100-p)\%$
are greater than that number. So, for example, an observation that
is at the $75^{th}$ percentile is less than only 25% of the
data. In practice, we often cannot satisfy the definition exactly. However, the
steps outlined below at least satisfy the spirit of the definition.
The $50^{th}$ percentile is the median and partitions the data into a lower
half (below median) and upper half (above median). The
$25^{th}$, $50^{th}$, and $75^{th}$
percentiles are referred to as quartiles. They partition the
data into 4 groups with 25% of the values below the $25^{th}$ percentile (lower
quartile), 25% between the lower quartile and the median, 25% between the
median and the $75^{th}$ percentile (upper quartile), and 25% above the upper
quartile. The difference between the upper and lower quartiles is referred to as
the inter-quartile range. This is the range of the middle 50% of the
data.
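As a rough illustration, the quartiles and the inter-quartile range can be computed with Python's standard `statistics` module. The data values are hypothetical, and `method="inclusive"` is only one of several common conventions for locating quartiles; it is an assumption here, not necessarily the procedure used in these notes or in the textbook.

```python
import statistics

# Hypothetical data values (nine observations), for illustration only.
data = [15, 18, 20, 22, 24, 25, 27, 30, 35]

# quantiles(..., n=4) returns the three cut points that split the
# sorted data into four groups: lower quartile, median, upper quartile.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Inter-quartile range: the range of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3)  # 20.0 24.0 27.0
print(iqr)         # 7.0
```

Other conventions (including the default `method="exclusive"`) can give slightly different quartiles for the same data, which is one reason the definition can usually be satisfied only in spirit.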
The third measure of dispersion we will consider here is associated with the
concept of distance between a number and a set of data. Suppose we are
interested in a particular dataset and would like to summarize the information
in that data with a single value that represents the closest number to
the data. To accomplish this requires that we first define a measure of
distance between a number and a dataset. One such measure can be defined as the
total distance between the number and the values in the dataset. That
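is, the distance between a number c and a set of data values,
$x_1, x_2, \ldots, x_n$, would be
$$D_1(c) = \sum_{i=1}^{n} |x_i - c|.$$
The median minimizes this distance, but the minimizing value need not be
unique: when $n$ is even, any number between the two middle values gives the
same total distance.
An alternative measure of distance between a number and a set of data that
is widely used and does have a unique solution is defined by
$$D_2(c) = \sum_{i=1}^{n} (x_i - c)^2.$$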
The mean is the closest single number to the data when we define distance by
the square of the deviation between the number and a data value. The
average distance between the data and the mean is referred to as the
variance of the data. We make a notational distinction and a minor
arithmetic distinction between variance defined for populations and variance
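defined for samples. We use
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
for the variance of a population of size $N$ with mean $\mu$, and
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
for the variance of a sample of size $n$ with mean $\bar{x}$. The standard
deviation ($\sigma$ or $s$) is the square root of the variance.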
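The two notions of distance, and the two versions of the variance, can be checked numerically. The following is a minimal sketch on hypothetical data; the function names and the grid of candidate values are illustrative choices. It searches for the number closest to the data under each distance and then divides the minimized squared distance by $n$ (population form) and by $n-1$ (sample form).

```python
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 4, 4, 5, 10]
n = len(data)

def total_abs_distance(c, values):
    """Sum of |x - c| over the data (distance based on absolute deviations)."""
    return sum(abs(x - c) for x in values)

def total_sq_distance(c, values):
    """Sum of (x - c)^2 over the data (distance based on squared deviations)."""
    return sum((x - c) ** 2 for x in values)

# Search a grid of candidate values c = 0.0, 0.1, ..., 12.0.
candidates = [i / 10 for i in range(121)]
closest_abs = min(candidates, key=lambda c: total_abs_distance(c, data))
closest_sq = min(candidates, key=lambda c: total_sq_distance(c, data))

print(closest_abs)  # 4.0, which is the median of the data
print(closest_sq)   # 5.0, which is the mean of the data

# Variance: squared distance from the mean, averaged over the data.
ss = total_sq_distance(statistics.mean(data), data)
population_variance = ss / n    # divide by n for a population
sample_variance = ss / (n - 1)  # divide by n - 1 for a sample

print(population_variance)  # 7.2
print(sample_variance)      # 9.0
```

The library functions `statistics.pvariance` and `statistics.variance` compute the population and sample forms directly and should agree with the hand calculation.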
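The standard deviation of a dataset can be interpreted by Chebychev's Theorem:
for any $k > 1$, the proportion of observations within the interval
$\bar{x} \pm ks$, where $\bar{x}$ is the mean and $s$ is the standard deviation,
is at least $1 - 1/k^2$.
For example, the mean of the Mileage data is 24.583 and the standard deviation is 4.79. Therefore, taking $k = 2$, at least 75% of the cars in this dataset have mileages between $24.583 - 2(4.79) = 15.003$ and $24.583 + 2(4.79) = 34.163$.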
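The arithmetic behind Chebychev's Theorem is simple enough to sketch in a few lines. The mean and standard deviation below are the Mileage values quoted in the text; the function names are illustrative, and the choice k = 2 is the value that corresponds to the 75% figure.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k s.d.'s of the mean."""
    if k <= 1:
        raise ValueError("Chebychev's bound is only informative for k > 1")
    return 1 - 1 / k**2

def chebyshev_interval(mean, sd, k):
    """Interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

# Mean and s.d. of the Mileage data as given in the notes.
mean, sd = 24.583, 4.79

low, high = chebyshev_interval(mean, sd, 2)
print(chebyshev_bound(2))             # 0.75
print(round(low, 3), round(high, 3))  # 15.003 34.163
```

With k = 3 the same bound gives at least 8/9, or about 89%, of the observations within three s.d.'s of the mean.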
Note that Chebychev's Theorem applies to all data and therefore must
be conservative. In many situations the actual percentages contained within
these intervals are much higher than the minimums specified by this theorem.
If the shape of the data histogram is known, then better results can be
given. In particular, if it is known that the data histogram is approximately
bell-shaped, then we can say
$\bar{x} \pm s$ contains approximately 68%,
$\bar{x} \pm 2s$ contains approximately 95%,
$\bar{x} \pm 3s$ contains essentially all
of the data values. This set of results is called the empirical rule.
Later in the course we will study the bell-shaped curve (known as the normal
distribution) in more detail.
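The empirical rule can be checked informally by simulation. This sketch assumes simulated bell-shaped data (10,000 draws from a normal distribution with an arbitrary mean of 100 and s.d. of 15, and a fixed seed so the run is repeatable); none of these settings come from the course data.

```python
import random
import statistics

# Simulated bell-shaped data; the mean, s.d., size and seed are arbitrary.
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(10_000)]

xbar = statistics.mean(sample)
s = statistics.stdev(sample)

def proportion_within(values, center, sd, k):
    """Proportion of values within k s.d.'s of the center."""
    inside = sum(center - k * sd <= x <= center + k * sd for x in values)
    return inside / len(values)

p1 = proportion_within(sample, xbar, s, 1)
p2 = proportion_within(sample, xbar, s, 2)
p3 = proportion_within(sample, xbar, s, 3)

# Expect values close to 0.68, 0.95, and nearly 1 respectively.
print(p1, p2, p3)
```

These proportions are well above the Chebychev minimums (0% for k = 1, 75% for k = 2, about 89% for k = 3), which illustrates how conservative that theorem is for bell-shaped data.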
The relative position of an observation in a data set can be represented by
its distance from the mean expressed in terms of the s.d. That is,
$$z = \frac{x - \bar{x}}{s},$$
which is called the z-score of the observation $x$.
Since z-scores are dimensionless, we can compare the relative positions of observations from different populations or samples by comparing their respective z-scores. For example, directly comparing the heights of a husband and wife would not be appropriate since males tend to be taller than females. However, if we knew the means and s.d.'s of the heights of males and females, then we could compare their z-scores. This comparison would be more meaningful than a direct comparison of their heights.
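The husband and wife comparison might look like the following sketch. All of the numbers (heights, means, and s.d.'s, in inches) are hypothetical values made up for illustration; they are not taken from any dataset in the course.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, expressed in s.d. units."""
    return (x - mean) / sd

# Hypothetical population summaries, in inches.
male_mean, male_sd = 69.0, 3.0
female_mean, female_sd = 64.0, 2.5

# Hypothetical couple.
husband_height = 72.0
wife_height = 68.0

z_husband = z_score(husband_height, male_mean, male_sd)
z_wife = z_score(wife_height, female_mean, female_sd)

print(z_husband)  # 1.0
print(z_wife)     # 1.6
```

Under these assumed values the husband is taller in inches, but the wife is taller relative to her own population, since her z-score is larger.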
If the data histogram is approximately bell-shaped, then essentially all values
should be within 3 s.d.'s of the mean, which is an interval of width 6 s.d.'s.
A small number of observations that are unusually large or small can greatly
inflate the s.d. Such observations are referred to as outliers. Identification
of outliers is important, but this can be difficult since they will distort the
mean and the s.d. For that reason, we can't simply use
$\bar{x} \pm 2s$ or $\bar{x} \pm 3s$
for this purpose.
We instead make use of some relationships between quartiles and the s.d. of
bell-shaped data. In particular, if the data histogram is approximately
bell-shaped, then $s \approx \mathrm{IQR}/1.35$,
where IQR is the inter-quartile range. This relationship can be used to
define a robust estimate of the s.d. which is then used to identify outliers.
Observations that are more than 1.5(IQR)
from the nearest quartile are
considered to be outliers. Boxplots in R are constructed so that the box edges
are at the quartiles, the median is marked by a line within the box, and
the box is extended by whiskers indicating the range of observations that are
no more than 1.5(IQR) from the nearest quartile. Any observations falling
outside this range are plotted with a circle. For example, the following plot
shows boxplots of mileage for each automobile type.
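The outlier rule used by the boxplot can be sketched directly. The data below are hypothetical, the function name is illustrative, and the quartiles are computed with the `method="inclusive"` convention of Python's `statistics.quantiles`, which is an assumption; software packages differ slightly in how they locate quartiles, so borderline cases may be classified differently elsewhere.

```python
import statistics

def iqr_outliers(values):
    """Values lying more than 1.5(IQR) from the nearest quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical data with one unusually large observation.
data = [15, 18, 20, 22, 24, 25, 27, 30, 60]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
robust_sd = (q3 - q1) / 1.35      # robust estimate of the s.d. from the IQR
usual_sd = statistics.stdev(data)  # inflated by the extreme value

print(iqr_outliers(data))  # [60]
print(robust_sd, usual_sd)
```

Here the usual s.d. is much larger than the IQR-based estimate because the single extreme value inflates it, which is why the quartile-based rule is preferred for identifying outliers.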
See sections 4.3 and 4.4 in the textbook for details and additional examples.