
Large Sample Approximations

The main importance of the normal distribution is associated with the Central Limit Theorem. This theorem was originally derived as a large sample approximation for the binomial distribution when n is large and p is not extreme. In this case we may approximate the binomial distribution function by a normal distribution with mean $np$ and standard deviation $\sqrt{np(1-p)}$.
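As a quick check of this approximation, the following R sketch (a minimal illustration; the values n = 100 and p = 0.3 are arbitrary choices, not taken from the example below) overlays the normal density with mean $np$ and standard deviation $\sqrt{np(1-p)}$ on the binomial probabilities.

## Overlay the approximating normal density on the binomial probabilities
n <- 100; p <- 0.3
k <- 0:n
plot(k, dbinom(k, size = n, prob = p), type = "h",
     xlab = "Number of successes", ylab = "Probability")
curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
      add = TRUE, col = "red")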

Suppose, for example, that in a very large population of voters, 48% favor Candidate A for president, and that a sample of 500 is randomly selected from this population. What is the probability that more than 250 in the sample will favor Candidate A? We can model the number in the sample who favor Candidate A with a binomial distribution with n=500 and p=0.48. Since n is large, we can approximate this distribution with a normal distribution with mean $\mu = 500(.48) = 240$ and standard deviation $\sigma = \sqrt{500(.48)(.52)} = 11.2$. Since the binomial is a discrete distribution, we can improve this approximation slightly by extending the interval of values whose probability we wish to obtain by 0.5 at each end of the interval (the continuity correction). For example, if we want to find $P(N=230)$, then we approximate it by $P(229.5 < X < 230.5)$, where X has the appropriate approximating normal distribution. Similarly,

\begin{eqnarray*}
P(N < a) &\approx& P(X < a-.5)\\
P(N \le a) &\approx& P(X < a+.5)\\
P(a < N < b) &\approx& P(a+.5 < X < b-.5)\\
P(a \le N \le b) &\approx& P(a-.5 < X < b+.5)
\end{eqnarray*}

Therefore, from the table of areas under the normal curve, we obtain

\begin{eqnarray*}
P(N>250) &\approx& P(X>250.5)\\
&=& P(Z > (250.5-240)/11.2)\\
&=& P(Z > 0.94)\\
&=& 1 - 0.8264 = 0.1736.
\end{eqnarray*}
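We can verify this calculation in R (a minimal sketch using the numbers from this example), comparing the exact binomial probability with the normal approximation that uses the continuity correction.

n <- 500; p <- 0.48
mu <- n * p                                # 240
sigma <- sqrt(n * p * (1 - p))             # approximately 11.2
1 - pbinom(250, size = n, prob = p)        # exact P(N > 250)
1 - pnorm(250.5, mean = mu, sd = sigma)    # approximate P(X > 250.5)

Both results should be close to the value 0.1736 obtained above from the table.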

Note that we could also express this event in terms of the sample proportion who favor Candidate A. Let $\hat{p} = N/500$ denote the sample proportion. Then the probability we obtained above could be expressed as $P(\hat{p} > 0.5)$. Since $\hat{p}$ is a linear function of $N$, we can use the normal distribution with mean $\mu = 240/500 = 0.48$ and standard deviation $\sigma = 11.2/500 = 0.022$ to approximate the distribution of $\hat{p}$. Note that this standard deviation can be obtained directly as $\sigma = \sqrt{p(1-p)/n} = \sqrt{(.48)(.52)/500} = 0.022$.
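In R, the same probability on the proportion scale could be computed as follows (a minimal sketch; the continuity correction carries over by shifting the cutoff from 0.5 to 250.5/500).

1 - pnorm(250.5 / 500, mean = 0.48, sd = sqrt(0.48 * 0.52 / 500))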

The Central Limit Theorem extends this result to a sampling situation in which a sample of size $n$ is randomly selected from a very large population with mean $\mu$ and standard deviation $\sigma$. Let $\overline{X}$ denote the mean of this sample. We can treat the sample mean as a random variable whose value is determined by the particular sample we obtain when we perform the sampling experiment. The Central Limit Theorem states that the distribution of this random variable is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Suppose we looked at every possible sample of size $n$ that could be obtained from the population and computed the sample mean for each of these samples. The CLT implies that the histogram of all these sample means would be approximately a normal curve with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The following plots illustrate this.

[Image stat3355norm8: population histogram and histograms of sample means for increasing sample sizes]

Note that there is less asymmetry in the histogram of $\overline{X}$ with $n=10$ than in the population histogram, but some asymmetry still remains. However, that asymmetry is not present in the histograms corresponding to the larger sample sizes. Note also that the variability decreases as the sample size increases. The theorem holds for any population distribution with finite mean and variance, but the more non-normal the population distribution, the larger n must be for the distribution of $\overline{X}$ to be close to normal. However, if the population distribution is itself a normal distribution, then the conclusion of the Central Limit Theorem holds exactly for all $n\ge 1$.
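The CLT is easy to explore by simulation. The following R sketch (the Exponential(1) population and the sample size n = 30 are arbitrary choices, not the ones used in the plots above) draws many samples, computes each sample mean, and compares the histogram of the means with the normal curve the theorem predicts.

set.seed(1)
nsim <- 10000                          # number of simulated samples
n <- 30                                # sample size
mu <- 1; sigma <- 1                    # mean and s.d. of the Exponential(1) population
xbar <- replicate(nsim, mean(rexp(n, rate = 1)))
hist(xbar, freq = FALSE, main = "Sample means, n = 30")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)), add = TRUE, col = "red")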

One remaining question, which will also apply to methods discussed later, is how to determine how far a data set is from normality. This is most commonly done with a quantile-quantile (Q-Q) plot. Let n denote the sample size and let $y = ((1:n) - .5)/n$. Then $y[i]$ represents, up to a small correction, the proportion of the sample that is at or below the $i^{th}$ ordered value of the sample; in other words, the $i^{th}$ ordered value is the sample quantile at level $y[i]$. Now let $x[i] = z_{y[i]}$, the z-score such that the area below it equals the proportion of the sample that is at or below the $i^{th}$ ordered value. If the data come from a normal distribution, then a plot of the ordered values against $x$ should fall approximately on a line with slope equal to the standard deviation and intercept equal to the mean. The following plots show quantile-quantile plots for four distributions: normal, heavy-tailed, very heavy-tailed, and asymmetric.

[Image stat3355norm9: quantile-quantile plots for normal, heavy-tailed, very heavy-tailed, and asymmetric distributions]
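The construction described above can be carried out directly in R (a minimal sketch; the data here are simulated for illustration, not the samples behind the plots shown).

set.seed(2)
dat <- rnorm(50, mean = 10, sd = 2)    # simulated data for illustration
n <- length(dat)
y <- ((1:n) - 0.5) / n                 # proportion at or below each ordered value
x <- qnorm(y)                          # z-scores with lower-tail area y
plot(x, sort(dat), xlab = "Normal quantiles", ylab = "Ordered data")
abline(a = mean(dat), b = sd(dat))     # line with intercept = mean, slope = s.d.

The built-in function qqnorm produces essentially the same plot.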


Larry Ammann
2013-12-17