Large Sample Estimation of a Population Proportion

Suppose we randomly select *n* individuals from the population of voters
and let *N* denote the number in the sample who favor a particular
candidate. Then $\hat{p} = N/n$, the sample proportion, is our estimate of the population proportion $p$. The value of this
estimate depends on the individuals who are selected for the sample. To
understand how we can make use of this fact to make a statement about
estimation error, consider the following *thought experiment* (an
experiment that we don't actually perform, but can think about). Suppose we
select every possible sample of size *n* from the population and for
each sample we obtain the sample proportion who favor this candidate. These
estimates will vary from 0 to 1 and the actual sampling experiment we perform,
selecting a random sample of size *n* and obtaining its sample
proportion, is equivalent to randomly selecting one proportion from the
population of proportions obtained from all possible samples of size
*n*. Although we could not perform this experiment in reality, we can
perform it mathematically. If we can determine the distribution of the
population of all possible sample proportions, then we can use this
distribution to make a **probability** statement about the estimation
error. The Central Limit Theorem states that if *n* is
large, then the distribution of *N* is approximately a normal
distribution with mean $np$ and variance $np(1-p)$. Therefore,
$\hat{p} = N/n$ has approximately a normal distribution with mean
$p$ and variance $p(1-p)/n$ (see the plot below). This distribution is
called the **sampling distribution** of $\hat{p}$. One of the properties
of normal curves is that approximately 95% of a normally distributed
population lies within 2 standard deviations of the mean. In this case that
means that approximately 95% of all possible samples of size *n* have
sample proportions that are within 2 standard deviations of their mean, $p$.
Therefore, when we randomly select our sample proportion from the population of
all possible sample proportions, the **probability** is approximately 0.95
that the error of estimation, the difference between the estimate and the
actual proportion, will be no more than 2 standard deviations,
$2\sqrt{p(1-p)/n}$. This represents a bound on the error of estimation. It
is not an absolute bound, but is a reasonable bound in the sense that there is
only a 5% chance that the error of estimation will exceed this bound.
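
The thought experiment above can be approximated by simulation. The following R sketch assumes, purely for illustration, a population proportion of 0.5, a sample size of 500, and 10000 simulated samples (none of these values is prescribed by the text). It checks how often the sample proportion lands within 2 standard deviations of the population proportion.

```r
# Illustrative simulation of the sampling distribution of the sample proportion.
# The values of p, n and reps are assumptions chosen for this sketch.
set.seed(1)
p <- 0.5        # assumed population proportion
n <- 500        # sample size
reps <- 10000   # number of simulated samples

# Each simulated sample yields a count N ~ Binomial(n, p); p_hat = N / n.
p_hat <- rbinom(reps, size = n, prob = p) / n

# Two standard deviations of the sample proportion
bound <- 2 * sqrt(p * (1 - p) / n)

# Fraction of simulated samples whose estimation error is within the bound
coverage <- mean(abs(p_hat - p) <= bound)
coverage
```

The resulting fraction should be close to 0.95, in line with the 2 standard deviation property of normal curves.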

For example, suppose we randomly select 500 voters and find that 260 of these
voters favor this candidate. Then our estimate of the population proportion is
$\hat{p} = 260/500 = 0.52$. We are about 95% certain that the error of this
estimate is no more than
$2\sqrt{p(1-p)/500}$. The problem that remains to
be solved is that this error bound depends on the value of $p$, which is
unknown. There are two approaches we can take to solve this problem. The first
approach is to note that the function

$$f(p) = \sqrt{p(1-p)}$$

is a bounded function of $p$ with upper bound $1/2$, attained at $p = 1/2$. The plot below shows how this function depends on $p$.
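
As a numerical companion to the plot, the following R sketch tabulates the function, assuming it is $f(p) = \sqrt{p(1-p)}$ as it appears in the error bound. The grid of values and the name `f` are choices made for this illustration.

```r
# f(p) = sqrt(p * (1 - p)), the factor that drives the error bound
f <- function(p) sqrt(p * (1 - p))

# Tabulate f on a coarse grid of population proportions
p <- seq(0, 1, by = 0.1)
data.frame(p = p, f = round(f(p), 3))

# Numerically locate the maximizer on [0, 1]; it should be close to p = 0.5
optimize(f, interval = c(0, 1), maximum = TRUE)$maximum
```

The table shows the function rising from 0 to its maximum of 0.5 at $p = 0.5$ and falling symmetrically back to 0.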

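This implies that the bound on the error of estimation is at most

$$2\sqrt{\frac{p(1-p)}{n}} \le \frac{1}{\sqrt{n}} = \frac{1}{\sqrt{500}} \approx 0.045.$$

Therefore, we can make the following statement about the proportion of voters who favor our candidate based on the information contained in our sample: the estimated proportion of voters who favor our candidate is 0.52 and, with 95% certainty, this proportion could be as low as 0.475 or as high as 0.565.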
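
The conservative bound for the voter example (500 sampled, 260 in favor) can be checked with a few lines of R. This is a minimal sketch; the variable names are chosen for illustration.

```r
# Voter example from the text: 260 of 500 sampled voters favor the candidate.
n <- 500
x <- 260
p_hat <- x / n                 # sample proportion, 0.52

# Conservative bound: 2 * sqrt(p * (1 - p) / n) <= 2 * (1/2) / sqrt(n) = 1 / sqrt(n)
bound <- 1 / sqrt(n)

# Range of reasonable values for the population proportion
interval <- c(lower = p_hat - bound, upper = p_hat + bound)
round(interval, 3)
```

Rounded to three decimals, the interval runs from 0.475 to 0.565.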

This bound on the error of estimation of a population proportion is conservative in the sense that
it does not depend on the actual population proportion. However, if $p$ is close to 0 or 1, then
it will be too conservative because in this case, the value of $\sqrt{p(1-p)}$ would be much smaller
than the upper bound. It can be seen from the plot that if
$0.2 \le p \le 0.8$, then
$0.4 \le \sqrt{p(1-p)} \le 0.5$, so the upper bound becomes too conservative when the population proportion
is below 0.2 or above 0.8. In some situations, we may have prior information in the form of a bound
on $p$ that allows us to place a bound on $\sqrt{p(1-p)}$. Suppose, for example, that we wish to estimate
the proportion of memory chips that do not meet specifications, and we know from past history that
this proportion has never exceeded 0.15. In that case, we can say that

$$\sqrt{p(1-p)} \le \sqrt{(0.15)(0.85)} \approx 0.357.$$

If a sample of 400 memory chips is randomly selected from a production run, and it is found that 32 fail to meet specifications, then the estimated population proportion is $\hat{p} = 32/400 = 0.08$, and a bound on the error of estimation would be $2\sqrt{(0.15)(0.85)/400} \approx 0.036$. We could present these results as follows: The estimated proportion of memory chips that do not meet specifications is 0.080. With 95% certainty, this proportion could be as low as 0.044 or as high as 0.116.
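
The memory chip figures can be reproduced with the short R sketch below, which uses the prior bound of 0.15 stated above. The variable names are illustrative.

```r
# Memory chip example: 32 of 400 sampled chips fail to meet specifications.
n <- 400
x <- 32
p0 <- 0.15                     # prior upper bound on the proportion (from past history)
p_hat <- x / n                 # 0.08

# sqrt(p * (1 - p)) is increasing on [0, 0.5], so p <= p0 <= 0.5
# implies sqrt(p * (1 - p)) <= sqrt(p0 * (1 - p0)).
bound <- 2 * sqrt(p0 * (1 - p0) / n)

interval <- c(lower = p_hat - bound, upper = p_hat + bound)
round(c(bound = bound, interval), 3)
```

This gives a bound of about 0.036 and an interval from 0.044 to 0.116, matching the values reported above.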

If we do not have any prior bounds on the population proportion, then we could use
$\hat{p}$ in place of $p$ in the error bound. That is, the estimated bound on the error of
estimation would be

$$2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
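
As a rough illustration of the estimated bound, the R sketch below substitutes the sample proportion for the unknown population proportion in both examples from the text. The helper name `estimated_bound` and its argument `k` (the number of standard deviations) are hypothetical, introduced only for this sketch.

```r
# Estimated error bound: substitute the sample proportion for the unknown p.
# `estimated_bound` is a hypothetical helper name used only in this sketch.
estimated_bound <- function(x, n, k = 2) {
  p_hat <- x / n
  k * sqrt(p_hat * (1 - p_hat) / n)
}

voters <- estimated_bound(260, 500)   # voter example
chips  <- estimated_bound(32, 400)    # memory chip example
round(c(voters = voters, chips = chips), 4)
```

For the voters the estimated bound is nearly identical to the conservative one because the sample proportion is close to 0.5. For the memory chips it is noticeably smaller than the bound based on the prior limit of 0.15.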

One of the interpretations of the estimate of the proportion of voters who
favor our candidate is that we are 95% confident that this proportion is
between 0.475 and 0.565. This interval represents a range of
reasonable values for the population proportion. The confidence level of 95%
is determined by the use of 2 standard deviations for the error bound and the
property of normal curves that approximately 95% of a population falls within
2 standard deviations from the mean. However, this also implies that there
is a 5% chance that the estimation error is greater than the stated bound,
or that there is a 5% chance that the interval does not contain the
population proportion. If there are very serious consequences of reporting an
error bound that turns out to be too small, then we should decide what is an
acceptable risk that the error bound is too small. We can then use the
appropriate number of standard deviations so that the risk is acceptably
small. Suppose for example that we are willing to accept a risk of 1% that
the error bound is too small or that the resulting interval of reasonable
values does not include the population proportion. To accomplish this, we must
find the z-score such that the area between *-z* and *z* is
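0.99. To find this z-score, note that the area above *z* must be *0.005*
and so the total area below *z* is *0.995*. We can use the **R**
quantile function `qnorm` for the normal distribution to obtain this value,

    z = qnorm(.995)

This gives $z \approx 2.576$. Using this value in place of 2 in the error bound for the voter example gives a bound of approximately 0.058, so the interval is $0.52 \pm 0.058$, or 0.462 to 0.578.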

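In this case we are 99% confident that the proportion of voters who favor our candidate is somewhere within this interval. Such intervals are called **confidence intervals**. A confidence interval for a population proportion has the general form

$$\hat{p} \pm z\,s,$$

where $s$ is the standard deviation of $\hat{p}$ (its conservative bound, a bound based on prior information, or its estimate) and $z$ is the z-score for which the area under the standard normal curve between $-z$ and $z$ equals the confidence level.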
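
The calculation for an arbitrary confidence level can be sketched in R as follows. The helper name `conservative_ci` is hypothetical, and the sketch assumes the conservative standard deviation $1/(2\sqrt{n})$. Other choices of standard deviation would change only the `half_width` line.

```r
# Confidence interval for a proportion at an arbitrary confidence level,
# using the conservative standard deviation 1 / (2 * sqrt(n)).
# `conservative_ci` is a hypothetical helper name used only in this sketch.
conservative_ci <- function(x, n, level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - level) / 2)      # area between -z and z equals `level`
  half_width <- z / (2 * sqrt(n))
  c(lower = p_hat - half_width, upper = p_hat + half_width)
}

ci95 <- conservative_ci(260, 500, level = 0.95)
ci99 <- conservative_ci(260, 500, level = 0.99)
round(rbind(ci95, ci99), 3)
```

The 95% interval here uses the exact z-score from `qnorm(.975)`, about 1.96, rather than the rounded value 2, so it is slightly narrower than the interval 0.475 to 0.565 quoted earlier.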

Confidence intervals have two inter-related properties: the level of confidence and the precision
as measured by the width of the confidence interval. These properties are inversely related. If the
confidence level is increased, then the width is increased and so its precision is decreased. The
only way to increase the confidence level while maintaining or increasing precision is to use a
larger sample size. The sample size can be determined by specifying the confidence level and the
required precision. Suppose for example that we would like to estimate the proportion who favor our
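candidate to within 0.02 with 95% confidence. These goals require that the confidence interval has
the form $\hat{p} \pm e$, where *e* denotes the required precision, *0.02*. Since there
is no prior bound available for the population proportion, we must use the conservative standard
deviation for the confidence interval,
$1/(2\sqrt{n})$. Therefore, to attain these goals
we must have

$$\frac{z}{2\sqrt{n}} \le e, \quad\text{that is,}\quad n \ge \left(\frac{z}{2e}\right)^2,$$

where $z$ is the z-score for 95% confidence.

    z = qnorm(.975)
    e = .02
    n = (z/(2*e))^2
    n
    ## [1] 2400.912

Rounding up, a sample of at least 2401 voters is required.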
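
The same calculation can be wrapped in a small function so that other confidence levels and precisions are easy to try. This is a sketch; `sample_size_conservative` is a hypothetical name, and rounding up with `ceiling` is added here because a sample size must be a whole number.

```r
# Sample size needed for precision e at a given confidence level,
# using the conservative standard deviation 1 / (2 * sqrt(n)).
# `sample_size_conservative` is a hypothetical helper name.
sample_size_conservative <- function(e, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  ceiling((z / (2 * e))^2)             # round up to a whole number of individuals
}

n95 <- sample_size_conservative(e = 0.02, level = 0.95)
n99 <- sample_size_conservative(e = 0.02, level = 0.99)  # higher confidence, same precision
c(n95 = n95, n99 = n99)
```

With the same precision, raising the confidence level from 95% to 99% requires a larger sample, which illustrates the trade-off between confidence and precision.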

If the actual population proportion is close to 0 or 1, then this sample size will be much larger
than is required for the stated goals. In such situations if we have a prior bound on the population
proportion, we can incorporate that bound to improve the sample size determination. If we would like
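to estimate the proportion of memory chips that do not meet specifications and we have a prior
bound, $p \le p_0$ with $p_0 \le 1/2$, for the proportion, then the confidence interval will have the form,

$$\hat{p} \pm z\sqrt{\frac{p_0(1-p_0)}{n}}.$$

Requiring the error bound to be no more than $e$ gives

$$n \ge \frac{z^2\,p_0(1-p_0)}{e^2}.$$

If we require that the estimate of this proportion be within 0.02 of the population proportion with 90% confidence, and we have a prior bound $p \le p_0$, then the required sample size follows from this formula with $e = 0.02$ and $z$ equal to the z-score for 90% confidence, `qnorm(.95)`.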
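
A sketch of this sample size calculation in R follows. It assumes the prior bound is the 0.15 from the earlier memory chip example. The helper name `sample_size_prior` is hypothetical, and the result is rounded up to a whole number of chips.

```r
# Sample size using a prior bound p0 on the population proportion.
# p0 = 0.15 is an assumption carried over from the earlier memory chip example.
# `sample_size_prior` is a hypothetical helper name.
sample_size_prior <- function(e, p0, level) {
  stopifnot(p0 > 0, p0 <= 0.5)         # bounding by sqrt(p0 * (1 - p0)) requires p0 <= 1/2
  z <- qnorm(1 - (1 - level) / 2)
  ceiling(z^2 * p0 * (1 - p0) / e^2)
}

n_prior <- sample_size_prior(e = 0.02, p0 = 0.15, level = 0.90)
n_conservative <- sample_size_prior(e = 0.02, p0 = 0.5, level = 0.90)  # no prior information
c(prior = n_prior, conservative = n_conservative)
```

Under these assumptions the prior bound reduces the required sample from 1691 chips to 863, roughly half.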

2016-02-04