
Estimation of a Population Mean

The results of the previous section are derived from the Central Limit Theorem. We can use similar methods to estimate the mean of a population. We will first consider this estimation problem when the population has a normal distribution, and then we will examine the extension of these methods to populations that are not necessarily normally distributed.

Recall that if the population has a normal distribution with mean $\mu$ and standard deviation $\sigma$, then the distribution of $\overline{X}$ is $N(\mu,\sigma/\sqrt{n})$. This implies that we can use $\overline{X}$ as an estimate of $\mu$. The error of estimation is then $\overline{X}-\mu$, and we can make the following probability statement about this error,

\begin{displaymath}
P(\vert\overline{X}-\mu\vert > \frac{z_{\alpha/2}\sigma}{\sqrt{n}}) = P(\vert Z\vert > z_{\alpha/2}) = \alpha,
\end{displaymath}

where $z_{\alpha/2}$ is the z-score such that the area to the right of $z_{\alpha/2}$ under the standard normal curve is $\alpha/2$. We use $\alpha/2$ so that the total area in both extremes is $\alpha$. Therefore, the probability that the error of estimation exceeds $z_{\alpha/2}\sigma/\sqrt{n}$ in absolute value is $\alpha$, so a $1-\alpha$ confidence interval for the population mean is

\begin{displaymath}
\overline{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.
\end{displaymath}
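The interval above can be computed directly when $\sigma$ is known. The following is a minimal Python sketch using only the standard library; the function name `z_confidence_interval` and the numerical inputs are illustrative assumptions, not values from the text.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Return (lower, upper) for the interval xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    # z_{alpha/2} leaves area alpha/2 to its right under the standard normal curve
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical inputs (assumed for illustration): xbar = 50, sigma = 10, n = 25
lower, upper = z_confidence_interval(50.0, 10.0, 25, alpha=0.05)
print(round(lower, 2), round(upper, 2))
```

With these assumed inputs the margin is about $1.96 \times 10/5 = 3.92$, so the interval is roughly $[46.08, 53.92]$.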

The problem here is that this confidence interval depends on $\sigma$, the population standard deviation. In most situations, both $\mu$ and $\sigma$ are unknown. Sometimes we have prior information available that gives an upper bound for $\sigma$, $\sigma\le\sigma_0$. Replacing $\sigma$ with $\sigma_0$ gives an interval that is at least as wide as the one based on $\sigma$, and so has confidence level at least $1-\alpha$,

\begin{displaymath}
\overline{X} \pm z_{\alpha/2}\frac{\sigma_0}{\sqrt{n}}.
\end{displaymath}

Situations where no such upper bound is available require that we estimate $\sigma$ with the sample standard deviation $s$. However, when $s$ is used in place of $\sigma$, the standardized sample mean no longer has a standard normal distribution. What is required is to determine the distribution of

\begin{displaymath}
t_n = \frac{\overline{X}-\mu}{s/\sqrt{n}}.
\end{displaymath}

This problem was solved in 1908 by a statistician named William Gosset while he was working for the Guinness brewery. Because of confidentiality restrictions imposed by his employer, Gosset had to publish his work under the pseudonym Student. For this reason, the distribution of $t_n$ when $X_1,\ldots,X_n$ is a random sample from a normal distribution is called Student's t distribution.

This distribution is similar to the standard normal distribution and represents an adjustment to the sampling distribution of the standardized sample mean caused by replacing the constant $\sigma$ with a random variable $s$. As the sample size increases, $s$ becomes a better estimate of $\sigma$, and so less adjustment is required. Therefore, the t-distribution depends on the sample size. This dependence is expressed by a function of the sample size called degrees of freedom, which for this problem is $n-1$. That is, the sampling distribution of $t_n$ is a t-distribution with $n-1$ degrees of freedom.

A plot that compares several t-distributions with the standard normal distribution is given below. Note that the t-distribution is symmetric and has relatively more area in the extremes and less area in the central region compared to the standard normal distribution. Also, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.

Image stat5311est3
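The heavier tails of $t_n$ can also be checked by simulation. The sketch below draws repeated samples from a normal population, computes $t_n$ for each, and estimates $P(\vert t_n\vert > 1.96)$, which would be about .05 if $t_n$ were standard normal. The function names, seed, and number of repetitions are assumptions chosen for illustration.

```python
import math
import random

def simulate_t_statistics(n, reps, mu=0.0, sigma=1.0, seed=1):
    """Simulate reps values of t_n = (xbar - mu) / (s / sqrt(n)) from normal samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # sample standard deviation with divisor n - 1
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        values.append((xbar - mu) / (s / math.sqrt(n)))
    return values

def tail_fraction(values, cutoff=1.96):
    """Fraction of simulated values whose absolute value exceeds cutoff."""
    return sum(abs(v) > cutoff for v in values) / len(values)

small = tail_fraction(simulate_t_statistics(n=5, reps=10000))
large = tail_fraction(simulate_t_statistics(n=100, reps=10000))
print(f"n = 5:   estimated P(|t_n| > 1.96) = {small:.3f}")
print(f"n = 100: estimated P(|t_n| > 1.96) = {large:.3f}")
```

The estimate for $n=5$ should be well above .05, while the estimate for $n=100$ should be close to it, in line with the convergence to the standard normal distribution described above.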

We can now make use of Gosset's result to obtain a confidence interval for $\mu$,

\begin{displaymath}
\overline{X} \pm t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},
\end{displaymath}

where $t_{n-1,\alpha/2}$ is the value from the t-distribution with $n-1$ degrees of freedom such that the area to the right of this value is $\alpha/2$. The interpretation of this interval is that it contains reasonable values for the population mean, reasonable in the sense that the probability that the interval does not contain the mean is $\alpha$.

The probability statement associated with this confidence interval,

\begin{displaymath}
P\left(\overline{X} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \overline{X} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right) = 1 - \alpha,
\end{displaymath}

appears to imply that the mean $\mu$ is the random element of this statement. However, that is incorrect; what is random is the interval itself. This is illustrated by the following graphics. The first simulates the selection of 200 random samples each of size 25 from a population that has a normal distribution and the second performs the simulation with samples of size 100. Each vertical bar represents the confidence interval associated with one of these random samples. Green bars contain the actual mean and red bars do not. Note that the increased sample size does not change the probability that an interval contains the mean. Instead, what is different about the second graphic is that the confidence intervals are shorter than the intervals based on samples of size 25.

Image stat5311est5a

Image stat5311est5b
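A simulation of this kind can be sketched in a few lines of Python. The code below is not the program that produced the graphics; it is an illustrative reconstruction that assumes a standard normal population, a fixed seed, and the table values $t_{24,.025}=2.064$ and $t_{99,.025}=1.984$.

```python
import math
import random

def coverage_simulation(n, t_crit, reps=200, mu=0.0, sigma=1.0, seed=0):
    """Return (fraction of intervals containing mu, average interval length)."""
    rng = random.Random(seed)
    hits = 0
    total_length = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        margin = t_crit * s / math.sqrt(n)
        # the interval is random; mu is a fixed constant
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
        total_length += 2 * margin
    return hits / reps, total_length / reps

# Critical values taken from a t-table for alpha = .05
cover_25, length_25 = coverage_simulation(n=25, t_crit=2.064)
cover_100, length_100 = coverage_simulation(n=100, t_crit=1.984)
print(f"n = 25:  coverage {cover_25:.3f}, average length {length_25:.3f}")
print(f"n = 100: coverage {cover_100:.3f}, average length {length_100:.3f}")
```

Both coverage fractions should be near .95, while the average length for $n=100$ should be about half of the average length for $n=25$, since the length is proportional to $1/\sqrt{n}$.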

Sample size determination. If our estimate must satisfy requirements both for the level of confidence and for the precision of the estimate, then it is necessary to have some prior information that gives a bound on $\sigma$ or an estimate of $\sigma$. Let $\sigma_0$ denote this bound or estimate, and let $e$ denote the required precision. Then the confidence interval must have the form $\overline{X}\pm e$, which implies that

\begin{displaymath}
\frac{z_{\alpha/2}\sigma_0}{\sqrt{n}} = e \Longleftrightarrow n =
\left(\frac{z_{\alpha/2}\sigma_0}{e}\right)^2.
\end{displaymath}

Example. A random sample of 22 existing home sales during the last month showed that the mean difference between list price and sales price was \$4580 with a standard deviation of \$1150. Assume that the differences between list and sales prices have approximately a normal distribution and construct a 95% confidence interval for the mean difference for all existing home sales. What would you say if the mean difference between list and sales prices for the same month last year had been \$5500? Suppose you wish to estimate this mean to within \$250 with 99% confidence. What sample size would be required if you use the standard deviation of this sample as an estimate of $\sigma$?

Solution. The confidence interval has the form

\begin{displaymath}
4580 \pm t_{21,.025}\frac{1150}{\sqrt{22}}.
\end{displaymath}

A table of t-values is given in the text on page B10. It is formatted differently from the table of normal curve areas: in the t-table, degrees of freedom are in the left-hand margin and tail areas are in the top margin. This table gives $t_{21,.025}=2.080$, and so the confidence interval is

\begin{displaymath}
4580 \pm (2.080)(1150)/\sqrt{22} \Longleftrightarrow 4580 \pm 510 \Longleftrightarrow [4070,5090].
\end{displaymath}
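The table lookup can be checked numerically. The sketch below uses only the Python standard library: it integrates the t density with Simpson's rule and finds the critical value by bisection. The function names, the number of integration steps, and the search range are assumptions made for this illustration; a statistical package would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0: Simpson's rule on [0, x] plus symmetry about 0."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(df, tail_area, lo=0.0, hi=50.0):
    """Bisection for t with P(T > t) = tail_area; assumes the answer lies in [lo, hi]."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_area:
            lo = mid   # tail still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2

# Values from the example: n = 22, sample mean 4580, sample standard deviation 1150
n, xbar, s = 22, 4580.0, 1150.0
t_value = t_critical(n - 1, 0.025)
margin = t_value * s / math.sqrt(n)
print(f"t = {t_value:.3f}, interval = [{xbar - margin:.0f}, {xbar + margin:.0f}]")
```

The computed critical value should agree with the table value 2.080 to three decimal places, and the margin should be close to 510.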

The interpretation of this interval is that it contains reasonable values for the population mean, reasonable in the sense that we are risking a 5% chance that the actual population mean is not one of these values. If the mean difference between list and sales prices for the same month last year had been \$5500, then we could say that the difference between list and sales price this year is less than last year since all of the reasonable values for this year's mean difference are lower than last year's mean. There is a risk of 5% that this conclusion is wrong.

Note that the precision of this confidence interval is 510 with 95% confidence. If we require a precision of 250 with 99% confidence, then we must use a larger sample size. If the sample standard deviation, $s=1150$, is used as an estimate of $\sigma$ for the purpose of sample size determination, then

\begin{displaymath}
n = \left(\frac{z_{\alpha/2}\sigma_0}{e}\right)^2,
\end{displaymath}

where $\alpha = .01$ (so $\alpha/2 = .005$), $\sigma_0 = 1150$, and $e=250$. Note that the last row of the t-table with degrees of freedom equal to infinity corresponds to the standard normal distribution. Therefore, we can use this row to find the required z-score. This gives

\begin{displaymath}
n = \left(\frac{(2.576)(1150)}{250}\right)^2 = 11.85^2 \approx 140.4, \quad \mbox{which rounds up to } n = 141.
\end{displaymath}

So a sample of size 141 would be required to meet these specifications. The precision actually attained by a confidence interval based on a sample of this size may not be very close to 250 if the sample standard deviation from our preliminary sample of size 22 is not a good estimate of the actual population standard deviation.
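The sample size calculation can be reproduced with a short Python sketch. The function name `required_sample_size` is an assumption made for illustration; the z-score comes from the standard library's `NormalDist` rather than from the table, and the result is rounded up because a sample size must be a whole number.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma0, e, alpha):
    """Smallest integer n with z_{alpha/2} * sigma0 / sqrt(n) <= e."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma0 / e) ** 2)

# Values from the example: sigma0 = 1150, e = 250, 99% confidence (alpha = .01)
n_required = required_sample_size(sigma0=1150, e=250, alpha=0.01)
print(n_required)  # 141
```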

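The interval above was derived for a population that has a normal distribution. By the Central Limit Theorem, however, $\overline{X}$ is approximately normally distributed for large samples even when the population is not normal, so we can apply the same methods to the problem of estimating the mean of a population that does not necessarily have a normal distribution. This leads to the same confidence interval for $\mu$,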

\begin{displaymath}
\overline{X} \pm t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}.
\end{displaymath}

The only difference is that such an interval is only approximately valid, and only if the sample size is sufficiently large for the Central Limit Theorem to be applicable. Some caution must be used here, since how large is sufficiently large depends on the distribution of the population.
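The effect of sample size on a non-normal population can be explored by simulation. The sketch below assumes an exponential population with mean 1, chosen here only as an example of a strongly skewed distribution, and uses the table values $t_{4,.025}=2.776$ and $t_{99,.025}=1.984$. The function name, seed, and number of repetitions are likewise illustrative assumptions.

```python
import math
import random

def exponential_coverage(n, t_crit, reps=5000, seed=2):
    """Fraction of t-intervals that contain the mean of an exponential(1) population."""
    rng = random.Random(seed)
    mu = 1.0  # mean of the exponential distribution with rate 1
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        margin = t_crit * s / math.sqrt(n)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

cover_small = exponential_coverage(n=5, t_crit=2.776)
cover_large = exponential_coverage(n=100, t_crit=1.984)
print(f"n = 5:   coverage {cover_small:.3f}")
print(f"n = 100: coverage {cover_large:.3f}")
```

For this skewed population the coverage with $n=5$ should fall noticeably short of the nominal .95, while the coverage with $n=100$ should be much closer to it.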


Larry Ammann
2013-12-17