Many important statistical problems can be expressed as the problem of determining some characteristic of a population when it is not possible or feasible to measure every individual in the population. For example, political candidates may wish to determine the proportion of voters in a state who intend to vote for them; an advertising agency may wish to determine the proportion of a target population who react favorably to an ad campaign; a manufacturer may wish to determine the mean warranty cost per unit of a product. Since it is not possible or feasible to contact every individual in the respective populations, the only reasonable alternative is to select a sample from the population in some way and use the information contained within the sample to estimate the population characteristic of interest.
At first thought, it would seem that we should select
a representative sample from the population, since such a sample
would mirror the properties of the population. Suppose, for example, that we
would like to determine the proportion of voters in a state who intend to
vote for a particular candidate for governor. Let $p$ denote this proportion.
A representative sample selected from this population should have a sample
proportion that is close to $p$. The problem, though, is how to select
such a sample. In fact, it is not possible to do this, for even if the
proportion in the sample were close to $p$, we would not know it because we
do not know the value of $p$.
Furthermore, an estimate derived from a sample has no value unless we can make
some statement about its accuracy. Suppose that $\hat{p}$ is the proportion
in the sample that favors that candidate. Then the estimation error would be
$\hat{p} - p$. Obviously we cannot make an exact statement about
this error since we do not know $p$. However, if the sample is selected
randomly so that each individual in the population has the same chance of being
selected, then it is possible to make a probability statement about the
estimation error. Random sampling is the only type of sampling with
which we can make reasonable statements about the estimation error.
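
To make the idea of a probability statement about the estimation error concrete, the following is a minimal simulation sketch, not a definitive procedure. It builds a hypothetical population in which the true proportion is known, repeatedly draws simple random samples, and records how often the sample proportion falls within a chosen margin of the true value. The function name `simulate_errors`, the true proportion 0.55, the population and sample sizes, and the margin 0.03 are all illustrative assumptions and do not come from the text.

```python
import random


def simulate_errors(p=0.55, population_size=100_000, sample_size=1_000,
                    n_samples=500, seed=0):
    """Draw repeated simple random samples and return the errors p_hat - p.

    All default values are illustrative assumptions, not data from the text.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Hypothetical population: 1 = favors the candidate, 0 = does not.
    n_favor = round(p * population_size)
    population = [1] * n_favor + [0] * (population_size - n_favor)

    errors = []
    for _ in range(n_samples):
        # rng.sample draws without replacement, and every individual
        # has the same chance of being selected.
        sample = rng.sample(population, sample_size)
        p_hat = sum(sample) / sample_size
        errors.append(p_hat - p)
    return errors


errors = simulate_errors()
margin = 0.03  # assumed tolerance for the estimation error
coverage = sum(abs(e) <= margin for e in errors) / len(errors)
print(f"Fraction of samples with |p_hat - p| <= {margin}: {coverage:.3f}")
```

In a real survey the true proportion is unknown, so this check cannot be run directly. The point of the sketch is only that, under random selection, the fraction of samples landing within a given margin is stable and predictable, which is what allows a probability statement about the error.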