The results of the previous section are derived from the Central Limit Theorem for proportions. We can use similar methods to estimate the mean of a population. We will first consider this estimation problem when the population has a normal distribution, and then we will examine the extension of these methods to populations that are not necessarily normally distributed.
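The CLT for sample means states that if the population has approximately a normal distribution with mean $\mu$ and standard deviation $\sigma$, then the distribution of the sample mean $\bar{X}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. We can interpret this the same way we interpreted the CLT for proportions. Imagine taking every possible sample of size n from the population and finding the mean for each sample. The histogram of these sample means will be approximately a normal distribution with mean $\mu$ and s.d. $\sigma/\sqrt{n}$. This implies that we can use $\bar{X}$ as an estimate of $\mu$. The error of estimation is then $|\bar{X} - \mu|$, and we can make the following probability statement about this error,
$$P\left(|\bar{X} - \mu| \le z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha,$$
which corresponds to the confidence interval
$$\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}.$$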
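The "every possible sample" description can be approximated by simulation. The following is a minimal sketch, assuming illustrative values for the population mean, the population s.d., and the sample size (none of them come from the text); it checks that the simulated sample means have a mean and s.d. close to the values predicted by the CLT, and that the error of estimation stays within the stated bound about 95% of the time.

```r
# Simulate the sampling distribution of the sample mean.
# mu, sigma, n, and reps are illustrative assumptions, not values from the text.
set.seed(1)
mu    <- 50
sigma <- 10
n     <- 25
reps  <- 10000

# Each replicate draws one sample of size n and records its mean.
xbars <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbars)   # should be close to mu
sd(xbars)     # should be close to sigma / sqrt(n) = 2

# Proportion of samples whose error |xbar - mu| is within z * sigma / sqrt(n).
alpha    <- 0.05
z        <- qnorm(1 - alpha / 2)
coverage <- mean(abs(xbars - mu) <= z * sigma / sqrt(n))
coverage      # should be close to 1 - alpha
```

A histogram of these means, for example `hist(xbars)`, should look approximately normal and be centered near `mu`.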
The problem here is that this confidence interval depends on $\sigma$, the population standard deviation. In most situations, $\sigma$ is unknown as well as $\mu$. Sometimes we may have prior information available that gives an upper bound for $\sigma$, say $\sigma \le \sigma_0$, which can be incorporated into the confidence interval,
$$\bar{X} \pm z_{\alpha/2}\,\frac{\sigma_0}{\sqrt{n}}.$$
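As a sketch of how a bound on the population s.d. might be used in practice, the hypothetical helper `z_conf_int` below computes a z-based interval from a sample mean, a bound `sigma0`, and the sample size. The function name and the input values are assumptions made for illustration only.

```r
# z-based confidence interval when an upper bound sigma0 on the
# population s.d. is available.
# z_conf_int is a hypothetical helper, not a built-in R function.
z_conf_int <- function(xbar, sigma0, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)            # upper alpha/2 quantile of N(0, 1)
  xbar + c(-z, z) * sigma0 / sqrt(n)   # lower and upper endpoints
}

# Illustrative inputs: sample mean 45, bound sigma0 = 15, sample size 20.
ci <- z_conf_int(xbar = 45, sigma0 = 15, n = 20)
ci
```

Because `sigma0` is an upper bound, the resulting interval is at least as wide as the one based on the true s.d., so its confidence level is at least the nominal one.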
This problem was solved in 1908 by a statistician named William Gosset while he was working for the Guinness brewery. Because of non-disclosure agreements in his employment contract, Gosset had to publish his solution under the pseudonym Student. For this reason, the distribution of $T = (\bar{X} - \mu)/(s/\sqrt{n})$ when $X_1, \ldots, X_n$ is a random sample from a normal distribution is called Student's t distribution. This distribution is similar to the standard normal distribution and represents an adjustment to the sampling distribution of $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ caused by replacing the constant $\sigma$ with a random variable s. As the sample size increases, s becomes a better estimate of $\sigma$, and so less adjustment is required. Therefore, the t-distribution depends on the sample size. This dependence is expressed by a function of the sample size called degrees of freedom, which for this problem is n-1. That is, the sampling distribution of $T$ is a t-distribution with n-1 degrees of freedom.
A plot that compares several t-distributions with the standard normal distribution is given below. Note that the t-distribution is symmetric and has relatively more area in the extremes and less area in the central region compared to the standard normal distribution. Also, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.
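The relationship between the t-distribution and the standard normal distribution can also be checked numerically. The sketch below uses the built-in functions `qt`, `pt`, `qnorm`, and `pnorm`; the particular degrees of freedom are arbitrary choices made for illustration.

```r
# Compare t-distribution quantiles and tail areas with the standard normal.
# The degrees of freedom below are arbitrary illustrative choices.
df <- c(2, 5, 10, 30, 100, 1000)

t_crit <- qt(0.975, df)   # 97.5th percentile of each t-distribution
z_crit <- qnorm(0.975)    # 97.5th percentile of N(0, 1)

# Two-sided tail area beyond +/- 2 for each distribution.
t_tail <- 2 * pt(-2, df)
z_tail <- 2 * pnorm(-2)

data.frame(df = df,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3),
           t_tail = round(t_tail, 4),
           z_tail = round(z_tail, 4))
```

The t quantiles and tail areas exceed the normal ones for every choice of degrees of freedom and shrink toward them as the degrees of freedom grow.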
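We can now use Gosset's result to obtain a confidence interval for $\mu$,
$$\bar{X} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},$$
where $t_{\alpha/2,\,n-1}$ is the value that cuts off an upper-tail area of $\alpha/2$ under the t-distribution with n-1 degrees of freedom.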
We can find the appropriate t-values in R using the quantile function for the t-distribution, qt(p,df), which has two arguments: a probability p and the degrees of freedom df. Suppose we obtain a random sample of size 20 from a population that is approximately normal and wish to estimate the population mean using a 95% confidence interval. If the sample mean is 45 and the sample s.d. is 12, then the t-value has 19 d.f. and the confidence interval is given by
    n = 20                                  # sample size
    alpha = .05                             # 1 - confidence level
    xm = 45                                 # sample mean
    xs = 12                                 # sample s.d.
    t = qt(1-alpha/2,n-1)                   # t-value with n-1 d.f.
    conf.int = xm + c(-t,t)*xs/sqrt(n)
    conf.int
    [1] 39.38383 50.61617
Note that the t-value in this case is 2.093, which is greater than the corresponding z-value of 1.96. This reflects the added uncertainty caused by needing to estimate the population s.d. Note also that if the sample size had been 80 instead of 20, then the confidence interval would have been narrower.
    n = 80
    alpha = .05
    xm = 45
    xs = 12
    t = qt(1-alpha/2,n-1)
    conf.int = xm + c(-t,t)*xs/sqrt(n)
    conf.int
    [1] 42.32953 47.67047
The probability statement associated with this confidence interval is
$$P\left(|\bar{X} - \mu| \le t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha.$$
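The coverage claim for the t-based interval can be checked by simulation. The sketch below assumes a normal population with mean 45 and s.d. 12 (illustrative values that echo the example above, treated here as if they were the true population values) and compares the coverage obtained with the t-value to the coverage obtained if the z-value were used with the sample s.d. instead.

```r
# Estimate the coverage of t-based and z-based intervals when sigma is
# replaced by the sample s.d. The population values are assumptions.
set.seed(2)
mu    <- 45
sigma <- 12
n     <- 20
alpha <- 0.05
reps  <- 10000

# For each simulated sample, record the estimation error and the
# estimated standard error s / sqrt(n).
sims <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  c(err = abs(mean(x) - mu), se = sd(x) / sqrt(n))
})

t_cover <- mean(sims["err", ] <= qt(1 - alpha / 2, n - 1) * sims["se", ])
z_cover <- mean(sims["err", ] <= qnorm(1 - alpha / 2) * sims["se", ])

t_cover   # should be close to 0.95
z_cover   # should fall somewhat below 0.95
```

The shortfall of the z-based interval illustrates why the adjustment supplied by the t-distribution is needed when the sample size is small.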
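Sample size determination.
If our estimate must satisfy requirements both for the level of confidence and for the precision of the estimate, then it is necessary to have some prior information that gives a bound on $\sigma$ or an estimate of $\sigma$. Let $\sigma_0$ denote this bound or estimate, and let $e$ denote the required precision. Then the confidence interval must have the form,
$$\bar{X} \pm e, \qquad \text{where } z_{\alpha/2}\,\frac{\sigma_0}{\sqrt{n}} \le e, \quad \text{that is,} \quad n \ge \left(\frac{z_{\alpha/2}\,\sigma_0}{e}\right)^2.$$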
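The sample size calculation can be wrapped in a small function. This is a minimal sketch: `sample_size` is a hypothetical helper name, and the inputs in the call are illustrative rather than taken from the text. The result is rounded up because the sample size must be a whole number that meets the precision requirement.

```r
# Smallest n for which z * sigma0 / sqrt(n) <= e.
# sample_size is a hypothetical helper, not a built-in R function.
sample_size <- function(sigma0, e, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)      # z-value for the required confidence
  ceiling((z * sigma0 / e)^2)    # round up to the next whole number
}

# Illustrative inputs: sigma0 = 10, precision e = 2, 95% confidence.
n_needed <- sample_size(sigma0 = 10, e = 2)
n_needed
```

Halving the required precision `e` roughly quadruples the required sample size, since `n` is proportional to `1/e^2`.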
Example. A random sample of 22 existing home sales during the last month showed that the
mean difference between list price and sales price was $4580 with a standard deviation of $1150.
Assume that the differences between list and sales prices have approximately a normal distribution
and construct a 95% confidence interval for the mean difference for all existing home sales. What
would you say if the mean difference between list and sales prices for the same month last year had
been $5500? Suppose you wish to estimate this mean to within $250 with 99% confidence. What
sample size would be required if you use the standard deviation of this sample as an estimate of $\sigma$?
Solution. The confidence interval can be obtained using R:
    alpha = .05                          # 1 - confidence level
    n = 22                               # sample size
    m = 4580                             # sample mean
    s = 1150                             # sample s.d.
    t = qt(1-alpha/2,n-1)                # t-value with n-1 d.f.
    conf.int = m + c(-t,t)*s/sqrt(n)
    conf.int
    [1] 4070.119 5089.881
The interpretation of this interval is that it contains reasonable values for the population mean, reasonable in the sense that we are risking a 5% chance that the actual population mean is not one of these values. If the mean difference between list and sales prices for the same month last year had been $5500, then we could say that the difference between list and sales price this year is less than last year since all of the reasonable values for this year's mean difference are lower than last year's mean. There is a risk of 5% that this conclusion is wrong. Also, if we need to make projections based on the value of the population mean, we could use the projection based on the sample mean as a nominal-case projection and use the endpoints of the interval as best-case/worst-case projections.
Note that the precision of this confidence interval is about $510 with 95% confidence. If we require a precision of $250 with 99% confidence, then we must use a larger sample size. We can use the sample standard deviation, s=1150, as an estimate of $\sigma$ for the purpose of sample size determination, but we can't use t-values here since we don't yet know the sample size. This implies that we must use the corresponding z-value instead of a t-value to determine n.
    alpha = .01                  # 1 - confidence level
    e = 250                      # required precision
    s = 1150                     # estimate of the population s.d.
    z = qnorm(1-alpha/2)         # z-value for 99% confidence
    n = (z*s/e)^2
    n
    [1] 140.3944
So a sample of size 141 would be required to meet these specifications. The precision actually attained by a confidence interval based on a sample of this size may not be very close to 250 if the sample standard deviation in our preliminary sample of size 22 is not a good estimate of the actual population standard deviation, or if the distribution of the data is not approximately normal.
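To see how sensitive this calculation is to the preliminary estimate of the population s.d., the sketch below recomputes the attained precision at n = 141 and the required sample size for a few candidate values of the s.d. The values 900 and 1400 are hypothetical alternatives chosen only for illustration; 1150 is the sample value used above.

```r
# Sensitivity of precision and required n to the assumed population s.d.
# 900 and 1400 are hypothetical values; 1150 is the preliminary sample s.d.
alpha <- 0.01
e     <- 250
n     <- 141
z     <- qnorm(1 - alpha / 2)

sigma_candidates <- c(900, 1150, 1400)

# Precision (half-width) attained with n = 141 under each candidate s.d.
precision <- z * sigma_candidates / sqrt(n)

# Sample size that each candidate s.d. would actually require.
n_required <- ceiling((z * sigma_candidates / e)^2)

data.frame(sigma = sigma_candidates,
           precision = round(precision, 1),
           n_required = n_required)
```

If the true s.d. is larger than the preliminary estimate, a sample of size 141 does not reach the required precision of 250.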
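Since the results discussed above are based on the Central Limit Theorem, we can apply them in the same way to the problem of estimating the mean of a population that does not necessarily have a normal distribution, provided the sample size is large enough for the sample mean to be approximately normally distributed. This would lead to the same confidence interval for $\mu$,
$$\bar{X} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}.$$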
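How well the interval works for a non-normal population can be explored by simulation. The sketch below assumes an exponential population with mean 1, a strongly skewed distribution chosen purely for illustration, and estimates the coverage of the nominal 95% t-based interval for a small and a larger sample size. The helper `t_coverage` is hypothetical.

```r
# Coverage of the nominal 95% t-interval for a skewed (exponential) population.
# The exponential population and the sample sizes are illustrative assumptions.
set.seed(3)
alpha <- 0.05
reps  <- 10000

# t_coverage is a hypothetical helper: the proportion of simulated samples
# of size n whose t-interval contains the true mean (1 for rate = 1).
t_coverage <- function(n) {
  hits <- replicate(reps, {
    x <- rexp(n, rate = 1)
    abs(mean(x) - 1) <= qt(1 - alpha / 2, n - 1) * sd(x) / sqrt(n)
  })
  mean(hits)
}

cov_small <- t_coverage(10)    # small sample from a skewed population
cov_large <- t_coverage(100)   # larger sample

c(n10 = cov_small, n100 = cov_large)
```

Under these assumptions the coverage for the small sample tends to fall noticeably below 95%, while the larger sample comes much closer, which is consistent with the approximation improving as the sample size grows.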