
Measures of Dispersion

It is possible to have two very different datasets with the same means and medians. For that reason, measures of the middle are useful but limited. Another important attribute of a dataset is its dispersion or variability about its middle. The most useful measures of dispersion are the range, percentiles, and the standard deviation. The range is the difference between the largest and the smallest data values. Therefore, the more spread out the data values are, the larger the range will be. However, if a few observations are relatively far from the middle but the rest are relatively close to the middle, the range can give a distorted measure of dispersion.

Percentiles are positional measures for a dataset that enable one to determine the relative standing of a single measurement within the dataset. In particular, the $p^{th}\ \%ile$ is defined to be a number such that $p\%$ of the observations are less than or equal to that number and $(100-p)\%$ are greater than that number. So, for example, an observation at the $75^{th}\ \%ile$ is less than only 25% of the data. In practice, we often cannot satisfy the definition exactly. However, the steps outlined below at least satisfy the spirit of the definition.

  1. Order the data values from smallest to largest; include ties.
  2. Determine the position,

    \begin{displaymath}
k.ddd = 1 + \frac{p(n-1)}{100}.
\end{displaymath}

  3. The $p^{th}\ \%ile$ is located between the $k^{th}$ and the $(k+1)^{th}$ ordered value. Use the fractional part of the position, $.ddd$, as an interpolation factor between these values. If $k = 0$, then take the smallest observation as the percentile, and if $k = n$, then take the largest observation as the percentile. For example, if n = 75 and we wish to find the $35^{th}$ percentile, then the position is $1 + 35*74/100 = 26.9$. The percentile is then located between the $26^{th}$ and $27^{th}$ ordered values. Suppose that these are 57.8 and 61.3, respectively. Then the percentile would be

    \begin{displaymath}
57.8 + .9*(61.3-57.8) = 60.95.
\end{displaymath}
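The three steps above can be sketched as a short Python function (a sketch, not the only convention; other software may use slightly different interpolation rules):

```python
def percentile(data, p):
    """p-th percentile by the rule: position k.ddd = 1 + p(n-1)/100."""
    xs = sorted(data)                 # step 1: order the values, keeping ties
    n = len(xs)
    pos = 1 + p * (n - 1) / 100      # step 2: 1-based position k.ddd
    k = int(pos)                      # integer part k
    frac = pos - k                    # fractional part .ddd
    if k >= n:                        # position at the largest value
        return xs[-1]
    # step 3: interpolate between the k-th and (k+1)-th ordered values
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])
```

For instance, with two adjacent ordered values 57.8 and 61.3 and an interpolation factor of .9, this returns 57.8 + .9*(61.3 - 57.8) = 60.95, the same arithmetic as in the example.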

Note. Quantiles are equivalent to percentiles with the percentile expressed as a proportion ( $70^{th}\ \%ile$ is the $.70$ quantile).

The $50^{th}$ percentile is the median and partitions the data into a lower half (below median) and upper half (above median). The $25^{th}$, $50^{th}$, $75^{th}$ percentiles are referred to as quartiles. They partition the data into 4 groups with 25% of the values below the $25^{th}$ percentile (lower quartile), 25% between the lower quartile and the median, 25% between the median and the $75^{th}$ percentile (upper quartile), and 25% above the upper quartile. The difference between the upper and lower quartiles is referred to as the inter-quartile range. This is the range of the middle 50% of the data.

The third measure of dispersion we will consider here is associated with the concept of distance between a number and a set of data. Suppose we are interested in a particular dataset and would like to summarize the information in that data with a single value that represents the closest number to the data. To accomplish this requires that we first define a measure of distance between a number and a dataset. One such measure can be defined as the total distance between the number and the values in the dataset. That is, the distance between a number c and a set of data values, $X_i,\ 1\le i\le n$, would be

\begin{displaymath}
D(c) = \sum_{i=1}^n \vert X_i - c\vert.
\end{displaymath}

It can be shown that the value that minimizes D(c) is the median. However, this measure of distance is not widely used for several reasons, one of which is that this minimization problem does not always have a unique solution.
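A small numerical check (with made-up data values, chosen only for illustration) shows both facts: values between the two middle observations all achieve the minimum total absolute distance, so the minimizer need not be unique:

```python
def D_abs(xs, c):
    """Total absolute distance between a number c and the data values."""
    return sum(abs(x - c) for x in xs)

xs = [1, 2, 3, 10]   # illustrative data with an even number of values
# every c between the two middle values 2 and 3 gives the same minimum
# D_abs(xs, 2) == D_abs(xs, 2.5) == D_abs(xs, 3) == 10, but D_abs(xs, 4) == 12
```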

An alternative measure of distance between a number and a set of data that is widely used and does have a unique solution is defined by,

\begin{displaymath}
D(c) = \sum_{i=1}^n (X_i-c)^2.
\end{displaymath}

That is, the distance between a number c and the data is the sum of the squared distances between c and each data value. We can take as our single number summary the value of c that is closest to the dataset, i.e., the value of c which minimizes $D(c)$. It can be shown that the value that minimizes this distance is $c = \overline{X}$. This is accomplished by differentiating D(c) with respect to c and setting the derivative equal to 0.

\begin{displaymath}
0 = \frac{\partial}{\partial c} D(c) = \sum_{i=1}^n -2(X_i-c) = -2\sum_{i=1}^n (X_i-c).
\end{displaymath}

As we have already seen, the solution to this equation is $c=\overline{X}$. The graphic below gives a histogram of the Weight data with the distance function D(c) superimposed. This graph shows that the minimum distance occurs at the mean of Weight.

[Image stat3355num5: histogram of the Weight data with the distance function D(c) superimposed]
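The minimization can also be checked numerically. The sketch below uses a few made-up data values (not the Weight data): evaluating D(c) over a grid, the smallest value occurs at a c that agrees with $\overline{X}$ to within the grid spacing.

```python
def D_sq(xs, c):
    """Sum of squared distances between a number c and the data values."""
    return sum((x - c) ** 2 for x in xs)

xs = [57.8, 61.3, 60.0, 58.5, 62.1]          # illustrative values
xbar = sum(xs) / len(xs)
grid = [50 + 0.01 * i for i in range(2001)]  # c from 50 to 70 in steps of .01
c_star = min(grid, key=lambda c: D_sq(xs, c))
# c_star agrees with xbar to within the grid spacing of .01
```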

The mean is the closest single number to the data when we define distance by the square of the deviation between the number and a data value. The average distance between the data and the mean is referred to as the variance of the data. We make a notational distinction and a minor arithmetic distinction between variance defined for populations and variance defined for samples. We use

\begin{displaymath}
\sigma^2 = \frac{1}{N}\sum_{i=1}^N(X_i-\mu)^2,
\end{displaymath}

for population variances, and

\begin{displaymath}
s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2,
\end{displaymath}

for sample variances. Note that the unit of measure for the variance is the square of the unit of measure for the data. For that reason (and others), the square root of the variance, called the standard deviation, is more commonly used as a measure of dispersion,

\begin{displaymath}
\sigma = \sqrt{\sum_{i=1}^N(X_i-\mu)^2/N},
\end{displaymath}


\begin{displaymath}
s = \sqrt{\sum_{i=1}^n(X_i-\overline{X})^2/(n-1)}.
\end{displaymath}
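The sample formulas transcribe directly (a minimal sketch; note the divisor is $n-1$, not $n$):

```python
def sample_variance(xs):
    """s^2 = sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """s = square root of the sample variance."""
    return sample_variance(xs) ** 0.5
```

For example, for the data 1, 2, 3, 4, 5 the mean is 3, the squared deviations sum to 10, so $s^2 = 10/4 = 2.5$ and $s = \sqrt{2.5} \approx 1.58$.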

Note that datasets in which the values tend to be far away from the middle have a large variance (and hence large standard deviation), and datasets in which the values cluster closely around the middle have small variance. Unfortunately, it is also the case that a dataset with one value very far from the middle and the rest very close to the middle also will have a large variance. See sections 4.1, 4.2 in the textbook for details and examples.

The standard deviation of a dataset can be interpreted by Chebychev's Theorem:

for any $k>1$, the proportion of observations within the interval $\mu \pm k\sigma$ is at least $(1-1/k^2)$.
For example, the mean of the Mileage data is 24.583 and the standard deviation is 4.79. Therefore, at least 75% of the cars in this dataset have mileages between $24.583-2*4.79=15.003$ and $24.583+2*4.79=34.163$. Chebychev's theorem is very conservative since it is applicable to every dataset. The actual number of cars whose Mileage falls in the interval (15.003,34.163) is 58, corresponding to 96.7%. Nevertheless, knowing just the mean and standard deviation of a dataset allows us to obtain a rough picture of the distribution of the data values. Note that the smaller the standard deviation, the smaller is the interval that is guaranteed to contain at least 75% of the observations. Conversely, the larger the standard deviation, the more likely it is that an observation will not be close to the mean. From the point of view of a manufacturer, reduction in variability of some product characteristic would correspond to an increase of consistency of the product. From the point of view of a financial manager, variability of a portfolio's return is referred to as volatility.

Note that Chebychev's Theorem applies to all data and therefore must be conservative. In many situations the actual percentages contained within these intervals are much higher than the minimums specified by this theorem. If the shape of the data histogram is known, then better results can be given. In particular, if it is known that the data histogram is approximately bell-shaped, then we can say
    $\mu \pm \sigma$ contains approximately 68%,
    $\mu \pm 2\sigma$ contains approximately 95%,
    $\mu \pm 3\sigma$ contains essentially all
of the data values. This set of results is called the empirical rule. Later in the course we will study the bell-shaped curve (known as the normal distribution) in more detail.
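Both statements can be checked on simulated bell-shaped data (a sketch using a synthetic sample, not a dataset from the course): Chebychev guarantees at least 75% within 2 s.d.'s, while the empirical rule predicts about 95% for bell-shaped data.

```python
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(10000)]   # synthetic bell-shaped sample
n = len(xs)
mu = sum(xs) / n
sd = (sum((x - mu) ** 2 for x in xs) / (n - 1)) ** 0.5

def within(k):
    """Proportion of observations inside mu +/- k*sd."""
    return sum(mu - k * sd <= x <= mu + k * sd for x in xs) / n
# within(2) is well above the Chebychev bound of .75 and close to the
# empirical-rule value of about .95
```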

The relative position of an observation in a data set can be represented by its distance from the mean expressed in terms of the s.d. That is,

\begin{displaymath}
z = \frac{x - \mu}{\sigma},
\end{displaymath}

and is referred to as the z-score of the observation. Positive z-scores are above the mean, negative z-scores are below the mean. Z-scores greater than 2 are more than 2 s.d.'s above the mean. From Chebychev's theorem, at least 75% of observations in any dataset will have z-scores between -2 and 2.

Since z-scores are dimensionless, we can compare the relative positions of observations from different populations or samples by comparing their respective z-scores. For example, directly comparing the heights of a husband and wife would not be appropriate since males tend to be taller than females. However, if we knew the means and s.d.'s of males and females, then we could compare their z-scores. This comparison would be more meaningful than a direct comparison of their heights.
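The height comparison can be sketched as follows. The means and s.d.'s below are hypothetical numbers chosen only for illustration, not values from the course data:

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, expressed in standard deviations."""
    return (x - mu) / sigma

# hypothetical group means and s.d.'s (inches), for illustration only
husband_z = z_score(72, 69.5, 3.0)   # about 0.83 s.d.'s above the male mean
wife_z = z_score(68, 64.5, 2.5)      # 1.40 s.d.'s above the female mean
# the wife is taller relative to her group even though 68 < 72
```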

If the data histogram is approximately bell-shaped, then essentially all values should be within 3 s.d.'s of the mean, which is an interval of width 6 s.d.'s. A small number of observations that are unusually large or small can greatly inflate the s.d. Such observations are referred to as outliers. Identification of outliers is important, but this can be difficult since they will distort the mean and the s.d. For that reason, we can't simply use $\overline{X} \pm 2s$ or $\overline{X} \pm 3s$ for this purpose. We instead make use of some relationships between quartiles and the s.d. of bell-shaped data. In particular, if the data histogram is approximately bell-shaped, then $IQR \approx 1.35s$. This relationship can be used to define a robust estimate of the s.d. which is then used to identify outliers. Observations that are more than $1.5(IQR) \approx 2s$ from the nearest quartile are considered to be outliers. Boxplots in R are constructed so that the box edges are at the quartiles, the median is marked by a line within the box, and the box is extended by whiskers indicating the range of observations that are no more than 1.5(IQR) from the nearest quartile. Any observations falling outside this range are plotted with a circle. For example, the following plot shows boxplots of mileage for each automobile type.

[Image stat3355num9: boxplots of Mileage for each automobile type]
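The 1.5(IQR) fence rule can be sketched in Python with illustrative data (Python's `statistics.quantiles` with `method="inclusive"` uses the same interpolation rule as the percentile procedure earlier in this section):

```python
import statistics

def outlier_fences(xs):
    """Lower and upper fences: quartiles minus/plus 1.5 times the IQR."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

xs = [12, 15, 16, 17, 18, 19, 20, 21, 22, 40]   # illustrative data
low, high = outlier_fences(xs)
outliers = [x for x in xs if x < low or x > high]   # only 40 falls outside
```

Here $q_1 = 16.25$ and $q_3 = 20.75$, so $IQR = 4.5$ and the fences are 9.5 and 27.5; the value 40 would be plotted as a circle beyond the upper whisker.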

See sections 4.3, 4.4 in the textbook for details and additional examples.


Larry Ammann
2014-09-16