The main importance of the normal distribution is associated with the
**Central Limit Theorem**. This theorem was originally derived as a
large-sample approximation for the binomial distribution when *n* is large
and *p* is not extreme. In this case we may approximate the binomial
distribution function by the normal distribution with mean $\mu = np$ and
standard deviation $\sigma = \sqrt{np(1-p)}$.

Suppose for example that in a very large population of voters, 48% favor
Candidate A for president, and that a sample of 500 is randomly selected from
this population. What is the probability that more than 250 in the sample will
favor Candidate A? We can model the number in the sample who favor Candidate A,
$Y$, with a binomial distribution with *n=500* and *p=0.48*. Since
*n* is large, we can approximate this distribution with a normal
distribution with mean $np = 240$
and standard deviation $\sqrt{np(1-p)} = \sqrt{124.8} \approx 11.17$.
Since the binomial is a discrete
distribution, we can improve this approximation slightly by extending the
interval of values whose probability we wish to obtain by 0.5 at each end of the
interval. For example, if we want to find $P(a \le Y \le b)$ for integers
$a \le b$, then we approximate it
by $P(a - 0.5 \le X \le b + 0.5)$, where *X* has the appropriate approximate
normal distribution. Similarly, $P(Y > 250) = P(Y \ge 251) \approx P(X \ge 250.5)$.

Therefore, from the table of areas under the normal curve, we obtain
$P(X \ge 250.5) = P\left(Z \ge \frac{250.5 - 240}{11.17}\right) = P(Z \ge 0.94) \approx 0.174$.
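
The table lookup can be checked numerically. The following is a minimal sketch, not part of the original notes, that computes the continuity-corrected normal approximation and compares it with the exact binomial tail. The helper names `normal_cdf` and `binomial_upper_tail` are illustrative choices, and only the Python standard library is assumed.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binomial_upper_tail(n, p, k):
    """Exact P(Y > k) for Y ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(k + 1, n + 1)
    )

n, p = 500, 0.48
mean = n * p                      # 240
sd = math.sqrt(n * p * (1 - p))   # about 11.17

# Continuity correction: P(Y > 250) = P(Y >= 251) is approximated by P(X >= 250.5).
approx = 1.0 - normal_cdf(250.5, mean, sd)
exact = binomial_upper_tail(n, p, 250)

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

Comparing the two printed values gives a direct sense of how much accuracy the normal approximation gives up for this *n* and *p*.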

Note that we could also express this event in terms of the sample proportion who favor Candidate A. Let $\hat{p}$ denote the sample proportion. Then the probability we obtained above could be expressed as $P(\hat{p} > 0.5)$. Since $\hat{p}$ is a linear function of the number $Y$ in the sample who favor Candidate A, namely $\hat{p} = Y/n$, we can use the normal distribution with mean $p = 0.48$ and standard deviation $11.17/500 \approx 0.0223$ to approximate the distribution of $\hat{p}$. Note that the standard deviation can be obtained directly as $\sqrt{p(1-p)/n}$.

The **Central Limit Theorem** extends this result to a sampling
situation in which a sample of size $n$ is randomly selected from a very large
population with mean $\mu$ and standard deviation $\sigma$. Let $\bar{X}$
denote the mean of this sample. We can treat the sample mean as a random
variable that is the numerical value associated with the particular sample we
obtain when we perform the sampling experiment. The *Central Limit
Theorem* states that, for large $n$, the distribution of this random variable
is approximately a normal distribution with mean $\mu$ and standard deviation
$\sigma/\sqrt{n}$.
Suppose we looked at every possible sample of size $n$ that could be obtained
from the population, and we computed the sample mean for each of these samples.
What the **CLT** implies is that the histogram of all these sample means
would be approximately a normal curve with mean $\mu$ and standard deviation
$\sigma/\sqrt{n}$. The following plots illustrate this.
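
The same idea can be explored by simulation. The sketch below is an illustration under stated assumptions rather than the procedure used for the plots in these notes: it draws repeated samples from a hypothetical skewed population (an exponential distribution with mean 1 and standard deviation 1) and compares the mean and standard deviation of the simulated sample means with $\mu$ and $\sigma/\sqrt{n}$. The sample sizes, the number of repetitions, and the seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential with mean 1 and s.d. 1.
mu, sigma = 1.0, 1.0

def sample_mean(n):
    """Mean of one random sample of size n from the population."""
    return statistics.fmean(random.expovariate(1.0 / mu) for _ in range(n))

def simulate(n, reps=5000):
    """Mean and s.d. of `reps` simulated sample means of size n."""
    means = [sample_mean(n) for _ in range(reps)]
    return statistics.fmean(means), statistics.stdev(means)

results = {n: simulate(n) for n in (5, 30, 100)}
for n, (m, s) in results.items():
    print(
        f"n={n:3d}  mean of sample means={m:.3f}  "
        f"s.d. of sample means={s:.3f}  sigma/sqrt(n)={sigma / math.sqrt(n):.3f}"
    )
```

A histogram of the simulated means for each *n* would show the shape as well as the center and spread.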

Note that there is less asymmetry in the histogram of $\bar{X}$ for the
smallest sample size than in the population histogram, but some asymmetry
still remains.
However, that asymmetry is not present in the histograms corresponding to the
larger sample sizes. Note also that the variability decreases with increasing
sample size. This theorem holds for any population distribution with finite
variance, but the more
*non-normal* the distribution, the larger *n* must be for the
distribution of $\bar{X}$ to be close to the normal distribution.
However, if the population distribution is itself a normal
distribution, then the distribution of $\bar{X}$ is exactly normal for all
sample sizes $n$.

One remaining question that will also be applicable to methods discussed later
is the problem of determining how far a data set is from normality. This is
accomplished most commonly by a *Quantile-Quantile* plot. Let *n*
denote the sample size and let $p_i$, $i = 1, \ldots, n$, be a set of
plotting positions, for example the common choice $p_i = (i - 0.5)/n$.
Then $p_i$ represents the
quantile level of the $i$-th ordered value of the data. That is, $p_i$
represents, up to
a correction factor, the proportion of the sample that is at or below the
$i$-th ordered value of the sample. Now let $z_i = \Phi^{-1}(p_i)$, where
$\Phi$ is the standard normal distribution function. Then $z_i$
represents the z-score such that the area below it equals the proportion of the
sample that is at or below the data value corresponding to the $i$-th ordered
value. The plot shows the points $(z_i, x_{(i)})$, where $x_{(i)}$ is the
$i$-th ordered value. If the data has a normal distribution, then these points
should fall approximately on
a line with slope equal to the s.d. and intercept equal to the mean. The
following plots show quantile-quantile plots for four distributions: normal,
heavy-tailed, very heavy-tailed, and asymmetric.
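
The construction of the plotted points can be sketched in a few lines. This is an illustrative example, not the code behind the plots in these notes: the function name `normal_qq_points`, the plotting positions $(i - 0.5)/n$, and the simulated data (normal with mean 50 and s.d. 10) are all assumptions made for the sketch. It uses `statistics.NormalDist` from the Python standard library for the inverse normal distribution function.

```python
import random
import statistics

def normal_qq_points(data):
    """Return (z_i, x_(i)) pairs for a normal quantile-quantile plot."""
    n = len(data)
    ordered = sorted(data)
    std_normal = statistics.NormalDist()  # mean 0, s.d. 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    z = [std_normal.inv_cdf(p) for p in probs]
    return list(zip(z, ordered))

random.seed(1)  # fixed seed so the sketch is repeatable
data = [random.gauss(50.0, 10.0) for _ in range(2000)]  # hypothetical normal data
points = normal_qq_points(data)

# Least-squares line through the points: for normal data the slope should be
# near the s.d. and the intercept near the mean.
zs = [z for z, _ in points]
xs = [x for _, x in points]
z_bar, x_bar = statistics.fmean(zs), statistics.fmean(xs)
slope = sum((z - z_bar) * (x - x_bar) for z, x in points) / sum(
    (z - z_bar) ** 2 for z in zs
)
intercept = x_bar - slope * z_bar

print(f"slope={slope:.2f}  intercept={intercept:.2f}")
```

Plotting `points` for heavy-tailed or asymmetric data would show the curvature away from a straight line that the quantile-quantile plot is designed to reveal.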

2018-02-14