
Normal Distribution

The normal distribution, also known as the bell curve, has been used (and abused) as a model for such a wide variety of phenomena that some have the impression that any data that do not fit this model are in some way abnormal. That is not the case. The name normal distribution comes from the title of the paper Carl Friedrich Gauss wrote that first described the mathematical properties of the bell curve, ``On the Normal Distribution of Errors''. For this reason, the distribution is sometimes referred to as the Gaussian distribution, a name that would perhaps be less misleading. The main importance of this model comes from the central role it plays in the behavior of many statistics that are derived from large samples.

The normal distribution represents a family of distribution functions, denoted by $N(\mu,\sigma)$ and parametrized by the mean $\mu$ and the standard deviation $\sigma$. The density function for this distribution is

\begin{displaymath}
f(x;\mu,\sigma) = (2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/(2\sigma^2)\}.
\end{displaymath}
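As a quick numerical check of this formula, the following minimal sketch evaluates the density directly and compares it with Python's standard-library `statistics.NormalDist`. The function name `normal_density` and the choice of $N(50,5)$ are illustrative assumptions, not part of the original notes.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Density f(x; mu, sigma) of N(mu, sigma), written term by term as in the formula."""
    return (2 * math.pi * sigma**2) ** -0.5 * math.exp(-(x - mu) ** 2 / (2 * sigma**2))

# Compare the hand-written formula with the library density for N(50, 5).
reference = NormalDist(mu=50, sigma=5)
for x in (40, 50, 55):
    print(x, normal_density(x, 50, 5), reference.pdf(x))
```

The two columns should agree up to floating-point rounding, and the largest value occurs at $x=\mu$.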

The mean is referred to as a location parameter since it determines the location of the peak of the curve. The standard deviation is referred to as a scale parameter since it determines how spread out or concentrated the curve is. The plots below illustrate these properties. In the first plot, the means differ but the standard deviations are all the same. In the second plot, both the means and the standard deviations differ.

Image stat3355norm1 Image stat3355norm2

The probability that a continuous random variable falls within an interval is modeled by the area under the density curve over that interval. Suppose, for example, that a random variable has a $N(50,5)$ distribution and we are interested in the probability that this r.v. takes a value between 45 and 60. The problem now is to determine this area. Unfortunately (or perhaps fortunately from the point of view of students), the normal density function does not have a closed-form antiderivative. This implies that we must either use a set of tabulated values to obtain areas under the curve or use a computer routine to determine the areas.

One property satisfied by the family of normal distributions is closure under linear transformations. That is, if $X\sim N(\mu,\sigma)$ and $Y=a+bX$ with $b\neq 0$, then $Y\sim N(a+b\mu,\vert b\vert\sigma)$. We can make use of this property by noting that

\begin{displaymath}
Z = \frac{X-\mu}{\sigma} = -\frac{\mu}{\sigma} + \frac{1}{\sigma}X
\end{displaymath}

has a $N(0,1)$ distribution. This distribution is referred to as the standard normal distribution, and the value of $Z$ corresponding to $X$ is referred to as the standardized score or z-score for $X$. This property implies that the probability of any interval can be transformed into a probability involving the standard normal distribution. The interpretation of the z-score can be seen by expressing $X$ in terms of $Z$,

\begin{displaymath}
X = \mu + Z\sigma.
\end{displaymath}

This shows that the z-score represents the number of standard deviations X is from its mean.
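The two conversions, from $X$ to $Z$ and back, can be sketched in a few lines of Python. The helper names `z_score` and `from_z_score` are hypothetical and chosen only for this illustration.

```python
def z_score(x, mu, sigma):
    """Standardize: number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def from_z_score(z, mu, sigma):
    """Invert the standardization: x = mu + z * sigma."""
    return mu + z * sigma

mu, sigma = 50, 5
for x in (45, 50, 60):
    z = z_score(x, mu, sigma)
    print(f"x = {x}: z = {z:+.1f}, back to x = {from_z_score(z, mu, sigma)}")
```

A negative z-score indicates a value below the mean and a positive z-score a value above it.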

For example, if $X\sim N(50,5)$, then

\begin{eqnarray*}
P(45<X<60) &=& P(\frac{45-50}{5} < \frac{X-50}{5} < \frac{60-50}{5})\\
&=& P(-1 < Z < 2).
\end{eqnarray*}
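This equality can also be checked numerically. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` provides a `cdf` method; the variable names are illustrative.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=5)   # the original variable
Z = NormalDist()                 # standard normal, N(0, 1)

# Area between two points = difference of the areas below each point.
p_x = X.cdf(60) - X.cdf(45)
p_z = Z.cdf(2) - Z.cdf(-1)
print(round(p_x, 4), round(p_z, 4))
```

Both computations give the same area, about 0.8186.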

Image stat3355norm3

As can be seen by comparing these two plots, the areas for $P(45<X<60)$ and $P(-1<Z<2)$ are the same. Therefore, it is only necessary to tabulate areas for the standard normal distribution. The textbook contains such a table on page 789. This table gives areas under the standard normal curve below z for $z>0$. Finding areas for negative values of z from this table requires an additional property of normal distributions called symmetry:

\begin{displaymath}
P(Z < -z) = P(Z > z),\ \ P(0<Z<z) = P(-z<Z<0).
\end{displaymath}

Image stat3355norm5
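The symmetry identities can be verified numerically for any particular $z$. The following is a small sketch using `statistics.NormalDist`; the value $z=1.5$ is an arbitrary choice for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal
z = 1.5           # any positive value works here

lower_tail = Z.cdf(-z)         # P(Z < -z)
upper_tail = 1 - Z.cdf(z)      # P(Z > z)
right_half = Z.cdf(z) - 0.5    # P(0 < Z < z)
left_half = 0.5 - Z.cdf(-z)    # P(-z < Z < 0)

print(lower_tail, upper_tail)
print(right_half, left_half)
```

Each printed pair should agree up to rounding error, which is why a table restricted to $z>0$ is sufficient.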

Example. Suppose a questionnaire designed to assess employee satisfaction with working conditions is given to the employees of a large corporation, and that the scores on this questionnaire are approximately normally distributed with mean 120 and standard deviation 18.
a) Find the proportion of employees who scored below 150.
b) Find the proportion of employees who scored between 140 and 160.
c) What proportion scored above 105?
d) What proportion scored between 90 and 145?
e) 15% of employees scored below what value?
The areas for parts a) through d) are represented in the plots given below.

Image stat3355norm6

Solutions
a) First transform to $N(0,1)$.

\begin{displaymath}
z = \frac{150-120}{18} = 1.67,
\end{displaymath}


\begin{displaymath}
P(X < 150) = P(Z < 1.67).
\end{displaymath}

From the table on the inside back cover of the text, the area below 1.67 is 0.9525. Therefore,

\begin{displaymath}
P(X<150) = P(Z<1.67) = 0.9525.
\end{displaymath}

b) Transform to $N(0,1)$.

\begin{eqnarray*}
z_1 &=& \frac{140-120}{18} = 1.11\\
z_2 &=& \frac{160-120}{18} = 2.22.
\end{eqnarray*}

In this case we must subtract the area below 1.11 from the area below 2.22. From the table these areas are, respectively, .8665 and .9868. This gives

\begin{displaymath}
P(140 < X < 160) = P(1.11 < Z < 2.22) = 0.9868 - 0.8665 = 0.1203.
\end{displaymath}

c) Transform to $N(0,1)$.

\begin{displaymath}
z = \frac{105-120}{18} = -0.83.
\end{displaymath}

The symmetry property of the normal distribution implies that the area above -0.83 is the same as the area below 0.83, which we get from the table.

\begin{displaymath}
P(X>105) = P(Z> -0.83) = P(Z<0.83) = 0.7967.
\end{displaymath}

d) Transform to $N(0,1)$.

\begin{eqnarray*}
z_1 &=& \frac{90-120}{18} = -1.67\\
z_2 &=& \frac{145-120}{18} = 1.39
\end{eqnarray*}

The area we require is the difference between the area below 1.39 and the area below -1.67. By symmetry, the area below -1.67 is the same as the area above 1.67.

\begin{eqnarray*}
P(90<X<145) &=& P(Z<1.39) - P(Z< -1.67)\\
&=& 0.9177 - [1 - P(Z<1.67)]\\
&=& 0.9177 - [1 - 0.9525]\\
&=& 0.8702.
\end{eqnarray*}
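Parts a) through d) can be cross-checked with a computer routine instead of the table. The sketch below uses `statistics.NormalDist` with the mean and standard deviation given in the example; the variable names are illustrative. Because the table method rounds each z-score to two decimals, the computed values may differ from the table answers in the third or fourth decimal place.

```python
from statistics import NormalDist

scores = NormalDist(mu=120, sigma=18)

p_a = scores.cdf(150)                     # a) below 150
p_b = scores.cdf(160) - scores.cdf(140)   # b) between 140 and 160
p_c = 1 - scores.cdf(105)                 # c) above 105
p_d = scores.cdf(145) - scores.cdf(90)    # d) between 90 and 145

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"{label}) {p:.4f}")
```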

e) This problem is different from the others because we are given an area and must use it to determine the corresponding value. The first step is to determine on which side of the mean the required value is located. This is determined by two quantities: whether the area is less than 0.5 or greater than 0.5, and the direction relative to the required value occupied by the specified area. In this case, the area (15% = 0.15) is less than 0.5 and the direction is specified by ``scored below what value''. Together these imply that the required value must be less than the mean. A picture of this area is given below.

To answer this question, we first answer the corresponding question for the standard normal distribution: what z-value has an area of 0.15 below it? This z-value must be negative since the area is less than 0.5 and the direction is below (or to the left of) the required value. Since the table gives areas below z only for $z>0$, we use symmetry: the area above the corresponding positive z-value is also 0.15, so the area we must find in the table is $1 - 0.15 = 0.85$. The closest area in the table to 0.85 is 0.8508, which corresponds to a z-score of 1.04. Since the z-score for this problem is negative, the answer to this question for the standard normal distribution is $z = -1.04$. Finally, we must convert this z-score to the x-value,

\begin{displaymath}
x = \mu + z\sigma = 120 + (-1.04)(18) = 101.28.
\end{displaymath}

If you check this answer by finding the area below 101.28, you will see that the steps we just followed are the same steps we used to find areas but applied in reverse order. Also note that the value of 101.28 represents the $15^{th}$ percentile of this normal distribution. Other percentiles can be obtained similarly.

Image stat3355norm7
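Percentiles can also be obtained directly from an inverse distribution function. The sketch below assumes Python 3.8 or later and uses the `inv_cdf` method of `statistics.NormalDist`; the variable names are illustrative. Since no rounding to the nearest table entry is involved, the result is close to, but not exactly equal to, the table-based answer.

```python
from statistics import NormalDist

scores = NormalDist(mu=120, sigma=18)

z = NormalDist().inv_cdf(0.15)   # z-value with area 0.15 below it
x = scores.inv_cdf(0.15)         # 15th percentile of the scores

print(round(z, 4), round(x, 2))
print(round(120 + z * 18, 2))    # same value via x = mu + z * sigma
```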

Since z-scores represent the number of standard deviations from the mean, and since they are directly associated with percentiles, they can be used to determine the relative standing of an observation from a normally distributed population. In particular, consider the following three intervals: $\mu\pm\sigma$, $\mu\pm 2\sigma$, and $\mu\pm 3\sigma$. After converting these intervals to z-scores, they become, respectively, (-1,1), (-2,2), and (-3,3). Because of the symmetry property, the probabilities for these intervals are,

\begin{eqnarray*}
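P(\mu - \sigma <X< \mu + \sigma) &=& P(-1<Z<1) = 2P(0<Z<1) = 2(.3413) = .6826\\
P(\mu - 2\sigma <X< \mu + 2\sigma) &=& P(-2<Z<2) = 2P(0<Z<2) = 2(.4772) = .9544\\
P(\mu - 3\sigma <X< \mu + 3\sigma) &=& P(-3<Z<3) = 2P(0<Z<3) = 2(.4987) = .9974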
\end{eqnarray*}

This is the basis for the empirical rule: if a set of data has a histogram that is approximately bell-shaped, then approximately 68% of the measurements are within 1 standard deviation of the mean, approximately 95% are within 2 standard deviations of the mean, and essentially all (about 99.7%) are within 3 standard deviations of the mean.
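The three probabilities behind the empirical rule can be reproduced with a short computation. This is a minimal sketch using `statistics.NormalDist`; the helper name `within` is hypothetical. The results can differ from table-based values in the fourth decimal place because table entries are rounded.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def within(k):
    """Probability that a normal r.v. is within k standard deviations of its mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {within(k):.4f}")
```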

Suppose that in the previous example an employee scored 82 on the employee satisfaction survey. The z-score for 82 is $(82-120)/18 = -2.11$, so this score is more than 2 standard deviations below the mean. Since approximately 95% of the scores are within 2 standard deviations of the mean, this is a relatively low score. We could be more specific by determining the percentile rank for this score. From the table of normal curve areas, the area below 2.11 is 0.9826, so the area below $z = -2.11$ is $1 - 0.9826 = 0.0174$. That is, only 1.74% of those who took this questionnaire scored this low or lower.
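The percentile rank of this score can be confirmed without the table. The sketch below again relies on `statistics.NormalDist`, with illustrative variable names.

```python
from statistics import NormalDist

scores = NormalDist(mu=120, sigma=18)

z = (82 - 120) / 18      # z-score of the observed value
rank = scores.cdf(82)    # proportion scoring 82 or lower

print(round(z, 2), round(rank, 4))
```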


Larry Ammann
2013-12-17