

Large Sample Estimation of a Population Proportion

Suppose we randomly select n individuals from the population of voters and let N denote the number in the sample who favor a particular candidate. Then $\hat{p}=N/n$ is our estimate of $\pi$, the proportion of the population who favor this candidate. The value of this estimate depends on which individuals are selected for the sample. To understand how we can use this fact to make a statement about estimation error, consider the following thought experiment (an experiment that we don't actually perform, but can think about). Suppose we select every possible sample of size n from the population and for each sample we obtain the sample proportion who favor this candidate. These estimates will vary from 0 to 1, and the actual sampling experiment we perform, selecting a random sample of size n and obtaining its sample proportion, is equivalent to randomly selecting one proportion from the population of proportions obtained from all possible samples of size n. Although we could not perform this experiment in reality, we can perform it mathematically.

If we can determine the distribution of the population of all possible sample proportions, then we can use this distribution to make a probability statement about the estimation error. The Central Limit Theorem states that if n is large, then the distribution of N is approximately normal with mean $n\pi$ and variance $n\pi(1-\pi)$. Therefore, $\hat{p}=N/n$ has approximately a normal distribution with mean $\pi$ and variance $\pi(1-\pi)/n$ (see the plot below). This distribution is called the sampling distribution of $\hat{p}$. One of the properties of normal curves is that approximately 95% of a normally distributed population lies within 2 standard deviations of the mean. In this case that means that approximately 95% of all possible samples of size n have sample proportions that are within 2 standard deviations of their mean $\pi$.
Therefore, when we randomly select our sample proportion from the population of all possible sample proportions, the probability is approximately 0.95 that the error of estimation, the difference between the estimate and the actual proportion, will be no more than 2 standard deviations, $2\sqrt{\pi(1-\pi)/n}$. This represents a bound on the error of estimation. It is not an absolute bound, but is a reasonable bound in the sense that there is only a 5% chance that the error of estimation will exceed this bound.

Image stat5311est1
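
The thought experiment can be approximated by simulation. The following is a minimal sketch, not part of the original notes: the true proportion `pi_true`, the number of replications `reps`, and the seed are all assumed values chosen for illustration. It draws many samples of size n and records how often the estimation error stays within 2 standard deviations.

```r
set.seed(1)                       # fixed seed so the run is reproducible
n       <- 500                    # sample size
pi_true <- 0.52                   # assumed true population proportion
reps    <- 10000                  # number of simulated samples

# Number in favor in each simulated sample, and the sample proportions
N     <- rbinom(reps, size = n, prob = pi_true)
p_hat <- N / n

# Error bound: 2 standard deviations of the sampling distribution
bound <- 2 * sqrt(pi_true * (1 - pi_true) / n)

# Fraction of samples whose estimation error is within the bound
coverage <- mean(abs(p_hat - pi_true) <= bound)
coverage
```

The reported fraction should be close to 0.95, although it will not equal 0.95 exactly because the normal curve is only an approximation to the distribution of N.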

For example, suppose we randomly select 500 voters and find that 260 of these voters favor this candidate. Then our estimate of the population proportion is $\hat{p} = 260/500 = 0.520$. We are about 95% certain that the error of this estimate is no more than $2\sqrt{\pi(1-\pi)/n}$. The problem that remains to be solved is that this error bound depends on the value of $\pi$, which is unknown. There are two approaches we can take to solve this problem. The first approach is to note that the function

\begin{displaymath}
h(\pi)=\sqrt{\pi(1-\pi)},\ 0\le \pi\le 1,
\end{displaymath}

is a bounded function of $\pi$ with upper bound $h(\pi)\le 0.5$. The plot below shows how this function depends on $\pi$.

Image stat5311est2

This implies that the bound on the error of estimation is at most

\begin{displaymath}
2(.5)/\sqrt{n} = 1/\sqrt{n}.
\end{displaymath}

Therefore, we can make the following statement about the proportion of voters who favor our candidate based on the information contained in our sample: the estimated proportion who favor our candidate is 0.520 and we are about 95% certain that this estimate is no more than

\begin{displaymath}
1/\sqrt{500} = 0.045
\end{displaymath}

from the actual population proportion. Another way of stating this is that we are about 95% certain that the population proportion is within the interval $0.520\pm 0.045$, that is, between 0.475 and 0.565.
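
The arithmetic for the voter example can be reproduced in R. This is a small sketch; the variable names are chosen here for illustration.

```r
x <- 260                          # voters in the sample who favor the candidate
n <- 500                          # sample size

p_hat <- x / n                    # estimated proportion
bound <- 1 / sqrt(n)              # conservative bound, 2(0.5)/sqrt(n)

# Estimate, error bound, and the resulting interval, rounded to 3 places
round(c(estimate = p_hat, bound = bound,
        lower = p_hat - bound, upper = p_hat + bound), 3)
```

This gives 0.520, 0.045, 0.475, and 0.565, which agrees with the values stated above.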

This bound on the error of estimation of a population proportion is conservative in the sense that it does not depend on the actual population proportion. However, if $\pi$ is close to 0 or 1, then it will be too conservative because in this case, the value of $h(\pi)$ would be much smaller than the upper bound. It can be seen from the plot that if $.2\le \pi\le .8$, then $.4\le h(\pi)\le .5$, so the upper bound becomes too conservative when the population proportion is below .2 or above .8. In some situations, we may have prior information in the form of a bound on $\pi$ that allows us to place a bound on $h(\pi)$. Suppose, for example, that we wish to estimate the proportion of memory chips that do not meet specifications, and we know from past history that this proportion has never exceeded 0.15. In that case, we can say that

\begin{displaymath}
h(\pi)\le \sqrt{(.15)(.85)} = .357.
\end{displaymath}

If a sample of 400 memory chips is randomly selected from a production run, and it is found that 32 fail to meet specifications, then the estimated population proportion is $\hat{p} = .080$, and a bound on the error of estimation would be $2(.357)/\sqrt{400} = 0.036$. We could present these results as follows: The estimated proportion of memory chips that do not meet specifications is 0.080. With 95% certainty, this proportion could be as low as 0.044 or as high as 0.116.
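
The memory chip calculation can be checked the same way. In this sketch `p0` stands for the prior bound 0.15; the other names are illustrative.

```r
x  <- 32                          # chips that fail to meet specifications
n  <- 400                         # sample size
p0 <- 0.15                        # prior upper bound on the proportion

p_hat <- x / n                    # estimated proportion
h0    <- sqrt(p0 * (1 - p0))      # bound on h(pi) implied by the prior bound
bound <- 2 * h0 / sqrt(n)         # bound on the error of estimation

round(c(estimate = p_hat, h0 = h0, bound = bound,
        lower = p_hat - bound, upper = p_hat + bound), 3)
```

The rounded results are 0.080, 0.357, 0.036, 0.044, and 0.116. Note that the prior bound only helps in this way when it is at most 0.5, since $h(\pi)$ is largest at $\pi = 0.5$.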

If we do not have available any prior bounds on the population proportion, then we could use $\hat{p}$ in place of $\pi$ in the error bound. That is, the estimated bound on the error of estimation would be

\begin{displaymath}
2\sqrt{\hat{p}(1-\hat{p})/n}.
\end{displaymath}
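
As an illustration, the estimated bound can be computed for both examples and compared with the bounds obtained earlier. The function name `est_bound` is hypothetical and is introduced only for this sketch.

```r
# Estimated error bound: 2 * sqrt(p_hat * (1 - p_hat) / n)
est_bound <- function(x, n) {
  p_hat <- x / n
  2 * sqrt(p_hat * (1 - p_hat) / n)
}

bound_voters <- est_bound(260, 500)   # voter example
bound_chips  <- est_bound(32, 400)    # memory chip example

round(c(voters = bound_voters, chips = bound_chips), 3)
```

For the voters the estimated bound rounds to 0.045, essentially the same as the conservative bound because $\hat{p}$ is close to 0.5. For the memory chips it rounds to 0.027, noticeably smaller than the 0.036 obtained from the prior bound of 0.15.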

One of the interpretations of the estimate of the proportion of voters who favor our candidate is that we are 95% confident that this proportion is between 0.475 and 0.565. This interval represents a range of reasonable values for the population proportion. The confidence level of 95% is determined by the use of 2 standard deviations for the error bound and the property of normal curves that approximately 95% of a population falls within 2 standard deviations of the mean. However, this also implies that there is a 5% chance that the estimation error is greater than the stated bound, or equivalently that there is a 5% chance that the interval does not contain the population proportion.

If there are very serious consequences of reporting an error bound that turns out to be too small, then we should decide what is an acceptable risk that the error bound is too small. We can then use the appropriate number of standard deviations so that the risk is acceptably small. Suppose for example that we are willing to accept a risk of 1% that the error bound is too small or that the resulting interval of reasonable values does not include the population proportion. To accomplish this, we must find the z-score such that the area between -z and z is 0.99. To find this z-score, note that the area above z must be 0.005 and so the total area below z is 0.995. We can use the R quantile function qnorm for the normal distribution to obtain this value,

z = qnorm(.995)
This gives z=2.576 and so the 99% confidence interval is

\begin{displaymath}
(0.520\pm (2.576)\sqrt{(.52)(.48)/500}) \Longleftrightarrow (0.520\pm 0.058)
\Longleftrightarrow (0.462, 0.578).
\end{displaymath}
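
The full 99% interval can be computed in R as follows. This is a sketch using the voter data; the variable names are illustrative.

```r
p_hat <- 260 / 500                      # estimated proportion
n     <- 500                            # sample size

z      <- qnorm(0.995)                  # area between -z and z is 0.99
se     <- sqrt(p_hat * (1 - p_hat) / n) # estimated standard deviation of p_hat
margin <- z * se                        # error bound at 99% confidence

round(c(z = z, margin = margin,
        lower = p_hat - margin, upper = p_hat + margin), 3)
```

The rounded values are 2.576, 0.058, 0.462, and 0.578.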

In this case we are 99% confident that the proportion of voters who favor our candidate is somewhere within this interval. Such intervals are called confidence intervals. To summarize the discussion above, a confidence interval for a population proportion based on a random sample of size n is

\begin{displaymath}
\hat{p} \pm z\hat{\sigma},
\end{displaymath}

where z is selected so that the area between -z and z is the required level of confidence, and $\hat{\sigma}$ is

\begin{displaymath}
\hat{\sigma} = \left\{ \begin{array}{ll}
\sqrt{p_0(1-p_0)/n},& \mbox{ if a prior bound $\pi\le p_0$ is given } \\
\sqrt{\hat{p}(1-\hat{p})/n},& \mbox{ if no prior bound is given }
\end{array} \right.
\end{displaymath}
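
The summary can be collected into a small helper function. This is only a sketch: the name `prop_ci` and its arguments are hypothetical, and the prior bound `p0`, when supplied, is assumed to be at most 0.5.

```r
# Hypothetical helper: large-sample confidence interval for a proportion.
#   x    = number in the sample with the characteristic of interest
#   n    = sample size
#   conf = confidence level (area between -z and z)
#   p0   = optional prior upper bound on the proportion (assumed <= 0.5)
prop_ci <- function(x, n, conf = 0.95, p0 = NULL) {
  p_hat <- x / n
  z     <- qnorm(1 - (1 - conf) / 2)
  # Use the prior bound if one is given, otherwise the sample proportion
  p_sd      <- if (is.null(p0)) p_hat else p0
  sigma_hat <- sqrt(p_sd * (1 - p_sd) / n)
  c(estimate = p_hat,
    lower = p_hat - z * sigma_hat,
    upper = p_hat + z * sigma_hat)
}

ci_voters <- prop_ci(260, 500, conf = 0.99)            # no prior bound
ci_chips  <- prop_ci(32, 400, conf = 0.95, p0 = 0.15)  # prior bound 0.15
round(ci_voters, 3)
round(ci_chips, 3)
```

The voter interval matches the 99% interval (0.462, 0.578). The memory chip interval comes out as roughly 0.045 to 0.115, slightly narrower than the earlier 0.044 to 0.116, because `qnorm` gives z = 1.96 for 95% confidence rather than the rounded value 2.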

Confidence intervals have two interrelated properties: the level of confidence and the precision as measured by the width of the confidence interval. These properties are inversely related. If the confidence level is increased, then the interval becomes wider and so its precision decreases. The only way to increase the confidence level while maintaining or increasing precision is to use a larger sample size. The sample size can be determined by specifying the confidence level and the required precision. Suppose for example that we would like to estimate the proportion who favor our candidate to within 0.02 with 95% confidence. These goals require that the confidence interval has the form $\hat{p}\pm e$, where e denotes the required precision, 0.02. Since there is no prior bound available for the population proportion, we must use the conservative standard deviation for the confidence interval, $\hat{p}\pm z/(2\sqrt{n})$. Therefore, to attain these goals we must have

\begin{displaymath}
\frac{z}{2\sqrt{n}} = e \Longleftrightarrow n = \left( \frac{z}{2e}\right)^2,
\end{displaymath}

where z is chosen so that the area between -z and z is 0.95 and e=0.02. Using R gives
z = qnorm(.975)
e = .02
n = (z/(2*e))^2
n
[1] 2400.912
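
Since a sample size must be a whole number, the value above is rounded up, giving n = 2401. The following sketch, with illustrative values of e that are not taken from the notes apart from 0.02, shows how quickly the required sample size grows as the required precision is tightened at 95% confidence.

```r
z <- qnorm(0.975)                        # 95% confidence
e <- c(0.05, 0.03, 0.02, 0.01)           # required precisions (illustrative)

# Conservative sample size, rounded up to the next whole number
n_required <- ceiling((z / (2 * e))^2)

data.frame(e = e, n = n_required)
```

The required sample sizes are 385, 1068, 2401, and 9604. Halving e multiplies the required sample size by about 4.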

If the actual population proportion is close to 0 or 1, then this sample size will be much larger than is required for the stated goals. In such situations, if we have a prior bound on the population proportion, we can incorporate that bound to improve the sample size determination. If we would like to estimate the proportion of memory chips that do not meet specifications and we have a prior bound, $\pi\le p_0$, for the proportion, then the confidence interval will have the form

\begin{displaymath}
\hat{p}\pm z\sqrt{p_0(1-p_0)/n}.
\end{displaymath}

This gives

\begin{displaymath}
\frac{z\sqrt{p_0(1-p_0)}}{\sqrt{n}} = e \Longleftrightarrow n =
\left( \frac{z}{e}\right)^2 p_0(1-p_0).
\end{displaymath}

If we require that the estimate of this proportion be within .02 of the population proportion with 90% confidence, and we have a prior bound $\pi\le 0.15$, then z=1.645, $p_0 = 0.15$, and so the sample size, rounded up to the next whole number, would be

\begin{displaymath}
n = \left( \frac{1.645}{.02}\right)^2 (.15)(.85) = 863.
\end{displaymath}
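
Both sample size formulas can be handled by one function, since the conservative formula is the special case $p_0 = 0.5$. This is a sketch; the function name `sample_size` and its arguments are hypothetical.

```r
# Hypothetical helper: sample size needed to estimate a proportion
# to within e with the given confidence level.
#   p0 = prior upper bound on the proportion (assumed <= 0.5);
#        the default 0.5 gives the conservative sample size.
sample_size <- function(e, conf, p0 = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)         # area between -z and z equals conf
  ceiling((z / e)^2 * p0 * (1 - p0))     # round up to a whole number
}

n_chips  <- sample_size(0.02, 0.90, p0 = 0.15)  # memory chip example
n_voters <- sample_size(0.02, 0.95)             # voter example, no prior bound
c(chips = n_chips, voters = n_voters)
```

This returns 863 for the memory chips and 2401 for the voters.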


Larry Ammann
2014-12-08