Large Sample Estimation of a Population Proportion

Suppose we randomly select *n* individuals from the population of voters
and let *N* denote the number in the sample who favor a particular
candidate. Then $\hat{p} = N/n$, the sample proportion, is our estimate of the population proportion $p$. The value of this
estimate depends on the individuals who are selected for the sample. To
understand how we can make use of this fact to make a statement about
estimation error, consider the following *thought experiment* (an
experiment that we don't actually perform, but can think about). Suppose we
select every possible sample of size *n* from the population and for
each sample we obtain the sample proportion who favor this candidate. These
estimates will vary from 0 to 1 and the actual sampling experiment we perform,
selecting a random sample of size *n* and obtaining its sample
proportion, is equivalent to randomly selecting one proportion from the
population of proportions obtained from all possible samples of size
*n*. Although we could not perform this experiment in reality, we can
perform it mathematically. If we can determine the distribution of the
population of all possible sample proportions, then we can use this
distribution to make a **probability** statement about the estimation
error. The Central Limit Theorem states that if *n* is
large, then the distribution of *N* is approximately a normal
distribution with mean $np$ and variance $np(1-p)$. Therefore,
$\hat{p} = N/n$ has approximately a normal distribution with mean
$p$ and variance $p(1-p)/n$ (see the plot below). This distribution is
called the **sampling distribution** of $\hat{p}$. One of the properties
of normal curves is that approximately 95% of a normally distributed
population lies within 2 standard deviations of the mean. In this case that
means that approximately 95% of all possible samples of size *n* have
sample proportions that are within 2 standard deviations of their mean $p$.
Therefore, when we randomly select our sample proportion from the population of
all possible sample proportions, the **probability** is approximately 0.95
that the error of estimation, the difference between the estimate and the
actual proportion, will be no more than 2 standard deviations,
$2\sqrt{p(1-p)/n}$. This represents a bound on the error of estimation. It
is not an absolute bound, but is a reasonable bound in the sense that there is
only a 5% chance that the error of estimation will exceed this bound.
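
As a rough numerical check of the 95% statement, the thought experiment can be carried out exactly rather than by simulation. The sketch below assumes that the count of voters in the sample who favor the candidate follows a binomial distribution, which is reasonable when the population is much larger than the sample. The function names `binom_pmf` and `coverage` are illustrative only.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials.

    Computed on the log scale so that large n does not overflow or underflow.
    Assumes 0 < p < 1.
    """
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def coverage(n, p, z=2.0):
    """Probability that the sample proportion k/n lies within
    z standard deviations, z * sqrt(p(1-p)/n), of the true proportion p."""
    bound = z * math.sqrt(p * (1 - p) / n)
    return sum(
        binom_pmf(k, n, p)
        for k in range(n + 1)
        if abs(k / n - p) <= bound
    )


# Probability that the estimation error is within 2 standard deviations
print(coverage(500, 0.5))
print(coverage(500, 0.15))
```

Both probabilities should come out close to 0.95, although not exactly, because the normal curve is only an approximation to the discrete sampling distribution.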

For example, suppose we randomly select 500 voters and find that 260 of these
voters favor this candidate. Then our estimate of the population proportion is
$\hat{p} = 260/500 = 0.520$. We are about 95% certain that the error of this
estimate is no more than
$2\sqrt{p(1-p)/500}$. The problem that remains to
be solved is that this error bound depends on the value of $p$, which is
unknown. There are two approaches we can take to solve this problem. The first
approach is to note that the function

$$f(p) = \sqrt{p(1-p)}$$

is a bounded function of $p$ with upper bound $1/2$, attained at $p = 1/2$. The plot below shows how this function depends on $p$.

This implies that the bound on the error of estimation is at most
$2(1/2)/\sqrt{500} = 1/\sqrt{500} \approx 0.045$. Therefore, we can make the following
statement about the proportion of voters who favor our candidate based on the
information contained in our sample: *the estimated proportion who
favor our candidate is 0.520 and we are about 95% certain that this estimate
is no more than
0.045 from the actual population proportion*.
Another way of stating this is that we are about 95% certain that the
population proportion is within the interval
$0.520 \pm 0.045$, that is,
between 0.475 and 0.565.
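
The arithmetic for the voter example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `conservative_interval` is hypothetical, and it uses 2 standard deviations with the worst-case standard deviation of 0.5 divided by the square root of *n*.

```python
import math


def conservative_interval(successes, n, z=2.0):
    """Estimate, error bound, and interval for a population proportion,
    using the worst-case standard deviation 0.5 / sqrt(n)."""
    p_hat = successes / n
    bound = z * 0.5 / math.sqrt(n)
    return p_hat, bound, (p_hat - bound, p_hat + bound)


# 260 of 500 sampled voters favor the candidate
p_hat, bound, (low, high) = conservative_interval(260, 500)
print(f"{p_hat:.3f} +/- {bound:.3f} -> ({low:.3f}, {high:.3f})")
# prints: 0.520 +/- 0.045 -> (0.475, 0.565)
```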

This bound on the error of estimation of a population proportion is conservative in the sense that it does not depend on the actual population proportion. However, if $p$ is close to 0 or 1, then it will be too conservative because in this case, the value of $\sqrt{p(1-p)}$ would be much smaller than the upper bound. It can be seen from the plot that if $0.2 \le p \le 0.8$, then $0.4 \le \sqrt{p(1-p)} \le 0.5$, so the upper bound becomes too conservative when the population proportion is below 0.2 or above 0.8.

In some situations, we may have prior information in the form of a bound on $p$ that allows us to place a bound on $\sqrt{p(1-p)}$. Suppose, for example, that we wish to estimate the proportion of memory chips that do not meet specifications, and we know from past history that this proportion has never exceeded 0.15. In that case, we can say that $\sqrt{p(1-p)} \le \sqrt{(0.15)(0.85)} \approx 0.357$. If a sample of 400 memory chips is randomly selected from a production run, and it is found that 32 fail to meet specifications, then the estimated population proportion is $\hat{p} = 32/400 = 0.080$, and a bound on the error of estimation would be $2(0.357)/\sqrt{400} \approx 0.036$. We could present these results as follows: The estimated proportion of memory chips that do not meet specifications is 0.080. With 95% certainty, this proportion could be as low as 0.044 or as high as 0.116.
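
The memory chip calculation can be sketched in Python as follows. The function name `error_bound_with_prior` and its parameter `p_max` (the prior upper bound on the proportion) are hypothetical. The check that `p_max` is at most 0.5 is an added safeguard: the square root of p(1-p) only increases with p up to 0.5, so a prior bound above 0.5 would not give a valid error bound this way.

```python
import math


def error_bound_with_prior(n, p_max, z=2.0):
    """Error bound z * sqrt(p_max * (1 - p_max) / n), valid when the
    population proportion is known not to exceed p_max <= 0.5."""
    if not 0 < p_max <= 0.5:
        raise ValueError("p_max must be in (0, 0.5]")
    return z * math.sqrt(p_max * (1 - p_max) / n)


# 32 of 400 sampled chips fail; history says the proportion never exceeds 0.15
p_hat = 32 / 400
bound = error_bound_with_prior(400, 0.15)
print(f"{p_hat:.3f} +/- {bound:.3f} -> ({p_hat - bound:.3f}, {p_hat + bound:.3f})")
# prints: 0.080 +/- 0.036 -> (0.044, 0.116)
```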

If we do not have available any prior bounds on the population proportion, then
we could use $\hat{p}$ in place of $p$ in the error bound. That is, the
estimated bound on the error of estimation would be $2\sqrt{\hat{p}(1-\hat{p})/n}$.
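
To see how the estimated bound, 2 times the square root of p̂(1-p̂)/n, compares with the bounds computed earlier, here is a small sketch. The function name `estimated_error_bound` is illustrative, and the two calls reuse the voter and memory chip samples from the examples above.

```python
import math


def estimated_error_bound(successes, n, z=2.0):
    """Error bound with the sample proportion plugged in for the
    unknown population proportion: z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Voter sample: 260 of 500
print(round(estimated_error_bound(260, 500), 3))  # 0.045
# Memory chip sample: 32 of 400
print(round(estimated_error_bound(32, 400), 3))   # 0.027
```

For the voter sample the estimated bound is essentially the same as the conservative bound because the sample proportion is close to 0.5. For the memory chips it is noticeably smaller than the bound of 0.036 obtained from the prior limit of 0.15.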

One of the interpretations of the estimate of the proportion of voters who
favor our candidate is that we are 95% confident that this proportion is
between 0.475 and 0.565. This interval represents a range of
reasonable values for the population proportion. The confidence level of 95%
is determined by the use of 2 standard deviations for the error bound and the
property of normal curves that approximately 95% of a population falls within
2 standard deviations from the mean. However, this also implies that there
is a 5% chance that the estimation error is greater than the stated bound,
or that there is a 5% chance that the interval does not contain the
population proportion. If there are very serious consequences of reporting an
error bound that turns out to be too small, then we should decide what is an
acceptable risk that the error bound is too small. We can then use the
appropriate number of standard deviations so that the risk is acceptably
small. Suppose for example that we are willing to accept a risk of 1% that
the error bound is too small or that the resulting interval of reasonable
values does not include the population proportion. To accomplish this, we must
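find the z-score such that the area under the standard normal curve between *-z* and *z* is
0.99. To find this z-score in a standard normal table, we must look for the area 0.99/2 = 0.495 between 0 and *z*.
The z-score that is closest to that area is z=2.58. The resulting
interval is $0.520 \pm 2.58(0.5)/\sqrt{500}$, or $0.520 \pm 0.058$.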

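In this case we are 99% confident that the proportion of voters who favor our candidate is somewhere within this interval. Such intervals are called **confidence intervals**. A confidence interval for a population proportion has the form

$$\hat{p} \pm z\,\sigma_{\hat{p}}$$

where $z$ is the z-score for the chosen confidence level and $\sigma_{\hat{p}}$ is the standard deviation of $\hat{p}$ (or a bound on it, or an estimate of it, as described above).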
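
Instead of reading the z-score from a table, it can be computed from the inverse of the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_for_confidence` is hypothetical. Because `inv_cdf` works with the total area to the left of *z*, the table area of 0.495 between 0 and *z* corresponds to a cumulative area of 0.5 + 0.495 = 0.995.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """z-score such that the area between -z and z under the standard
    normal curve equals the given confidence level (0 < level < 1)."""
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    # Area to the left of z is 0.5 plus half of the central area.
    return NormalDist().inv_cdf(0.5 + level / 2)


for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

The value for 95% confidence is about 1.96, which is why 2 standard deviations is described as giving approximately 95%.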

Confidence intervals have two related properties: the level of confidence
and the precision as measured by the width of the confidence interval. These
properties are inversely related. If the confidence level is increased, then
the width is increased and so its precision is decreased. The only way to
increase the confidence level while maintaining or increasing precision is
to use a larger sample size. The sample size can be determined by specifying
the confidence level and the required precision. Suppose for example that we
would like to estimate the proportion who favor our candidate to within 0.025
with 95% confidence. To attain these goals requires that the confidence
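interval have the form $\hat{p} \pm e$, where *e* denotes the required
precision, 0.025. Since there is no prior bound available for the population
proportion, we must use the conservative standard deviation for the confidence
interval,
$0.5/\sqrt{n}$. Therefore, to attain these goals we must
have

$$e = \frac{z(0.5)}{\sqrt{n}}, \quad\text{that is,}\quad n = \left(\frac{z}{2e}\right)^2,$$

where $z$ is the z-score for the required confidence level. With $z = 2$ for 95% confidence, as used above, and $e = 0.025$, this gives $n = (2/0.05)^2 = 1600$.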
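
The conservative sample size calculation can be sketched as below. It assumes the requirement that z times 0.5 divided by the square root of *n* be no larger than the precision *e*, which rearranges to *n* at least (z / (2e)) squared. The function name `sample_size_conservative` is hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math


def sample_size_conservative(e, z=2.0):
    """Smallest n for which z * 0.5 / sqrt(n) <= e, i.e. n >= (z / (2e))^2."""
    if e <= 0:
        raise ValueError("e must be positive")
    return math.ceil((z / (2 * e)) ** 2)


# Estimate to within 0.025 using 2 standard deviations (about 95% confidence)
print(sample_size_conservative(0.025))
# The same precision with the more exact z-score of 1.96
print(sample_size_conservative(0.025, z=1.96))
```

Using 2 standard deviations gives a slightly larger sample size than the z-score of 1.96, so it errs on the safe side.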

In situations where we have a prior bound on the population proportion, we
can incorporate that bound into the sample size determination. If we would like
to estimate the proportion of memory chips that do not meet specifications and
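we have a prior bound, $p \le p_0$, for the proportion, then the confidence
interval will have the form,

$$\hat{p} \pm z\sqrt{\frac{p_0(1-p_0)}{n}}.$$

This gives

$$n = \frac{z^2\,p_0(1-p_0)}{e^2}.$$

If we require that the estimate of this proportion be within 0.02 of the population proportion with 90% confidence, and we have a prior bound on the population proportion of $p_0 = 0.15$ (the bound from the memory chip example above), then $z = 1.645$, $e = 0.02$, and so the sample size would be

$$n = \frac{(1.645)^2(0.15)(0.85)}{(0.02)^2} \approx 862.5,$$

which we round up to 863.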
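
A sketch of the sample size calculation with a prior bound follows. The values used here are assumptions for illustration: a prior bound of 0.15 (taken from the earlier memory chip example), a precision of 0.02, and a z-score of 1.645 for 90% confidence. The function name `sample_size_with_prior` and the parameter `p_max` are hypothetical.

```python
import math


def sample_size_with_prior(e, p_max, z):
    """Smallest n for which z * sqrt(p_max * (1 - p_max) / n) <= e.

    Valid when the population proportion is known not to exceed p_max <= 0.5.
    """
    if e <= 0:
        raise ValueError("e must be positive")
    if not 0 < p_max <= 0.5:
        raise ValueError("p_max must be in (0, 0.5]")
    return math.ceil(z ** 2 * p_max * (1 - p_max) / e ** 2)


# Within 0.02 of the true proportion, 90% confidence, proportion at most 0.15
print(sample_size_with_prior(e=0.02, p_max=0.15, z=1.645))
# prints: 863
```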

2013-12-17