The Central Limit Theorem
In the last lecture, we talked about estimating the mean value of household income in a zip code. We found that the expected value of our estimate was equal to the true mean of household income. We found that the standard deviation of our estimate of the mean was equal to the (known) standard deviation of the population divided by the square root of the sample size.
The Central Limit Theorem states that as the sample size gets sufficiently large, the distribution of our estimate of the mean becomes approximately normal. The normal approximation typically is quite good with 25 observations or more.
The theorem holds even when the underlying distribution is not normal. The actual distribution of income in zip code 20902 is likely to be highly skewed. Nonetheless, the sampling distribution of our estimate of the mean income will approximate the normal distribution.
We have been speaking as if we know the true variance of the underlying distribution, and we are estimating only the mean. In practice, this is not a typical situation. Instead, in many real-world cases, we have to estimate both the mean and the variance of the underlying distribution. It turns out that we can handle this situation, but we have to adjust the values we take from the z table, particularly in small samples. We will get into this issue in a few days.
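The claim that a skewed population still gives a nearly normal sampling distribution can be checked by simulation. The sketch below is only an illustration: the lognormal "income" population and its parameters are assumptions made up for the example, not data from zip code 20902. It draws many samples of 25, then compares the spread of the sample means with the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical skewed "income" population (lognormal draws). The parameters
# are assumptions chosen for illustration, not real income data.
population = [random.lognormvariate(11, 0.6) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 25           # observations per sample
trials = 5_000   # number of repeated samples

# Draw repeated samples (with replacement) and record each sample mean.
sample_means = [
    statistics.fmean(random.choices(population, k=n)) for _ in range(trials)
]

se_theory = pop_sd / math.sqrt(n)               # sigma / sqrt(n)
se_observed = statistics.stdev(sample_means)    # spread actually seen

# Share of sample means within 1.96 standard errors of the true mean.
# A normal distribution would put about 95 percent of them there.
within = sum(abs(m - pop_mean) <= 1.96 * se_theory for m in sample_means) / trials

print(f"population mean      : {pop_mean:,.0f}")
print(f"mean of sample means : {statistics.fmean(sample_means):,.0f}")
print(f"theoretical SE       : {se_theory:,.0f}")
print(f"observed SE          : {se_observed:,.0f}")
print(f"share within 1.96 SE : {within:.3f}")
```

If the approximation is working, the observed and theoretical standard errors should be close, and the share within 1.96 standard errors should be near 95 percent.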
The Central Limit Theorem means that we can use z tables to find out the likelihood that a sample mean will deviate from the true mean by more than a certain amount. For example, suppose that the average height of adult women is 64.5 inches, with a standard deviation of 2.5 inches. What is the probability that the average height in a random sample of 40 women will be greater than 64.55 inches?
The expected value for the average height in our sample is the population mean of 64.5 inches. The standard deviation of the mean of our sample is equal to 2.5 divided by the square root of 40, or about 0.395 inches. Therefore, an average of 64.55 inches would represent a difference of 0.05 inches, or about .13 standard deviations. The chance of a sample average at least this far above the mean is about .45, or 45 percent.
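The arithmetic for the height example can be checked with a few lines of code instead of a z table. This is a minimal sketch using Python's standard library. The variable names are chosen for the example, and it assumes, as the lecture does, that the population standard deviation is known.

```python
import math
from statistics import NormalDist

mu = 64.5          # population mean height, inches
sigma = 2.5        # population standard deviation, inches
n = 40             # sample size
threshold = 64.55  # sample average we are asking about, inches

se = sigma / math.sqrt(n)            # standard deviation of the sample mean
z = (threshold - mu) / se            # distance from the mean in standard errors
p_above = 1 - NormalDist().cdf(z)    # upper-tail area of the standard normal

print(f"standard error : {se:.4f}")      # about 0.395
print(f"z value        : {z:.4f}")       # about 0.13
print(f"P(mean > 64.55): {p_above:.4f}") # about 0.45
```

Changing threshold or n shows how quickly the probability falls as the sample average moves away from 64.5 or as the sample grows.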
A related fact about averages in large samples is that the distribution of a sample proportion is approximately normal. A sample proportion is the number of successes in a set of binomial trials divided by the number of trials. Even though the underlying binomial distribution is not normal, the estimate of the proportion derived from a sample of observations is approximately normal, with the approximation better as the sample size gets larger.
The normal approximation works like this. Suppose that the probability that a girl will agree to dance with me is 20 percent. Then the standard deviation of a single yes-or-no response is the square root of (.2)(.8), which is .4.
Suppose I ask a sample of 25 girls to dance. The number who accept, divided by 25, is the sample proportion. This proportion will be approximately normally distributed with a mean of .2 and a standard deviation of .4 divided by the square root of 25, or .08.
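The mean of .2 and standard deviation of .4 divided by 5 can be checked by simulating the dance-request example many times. The sketch below is an illustration only. The number of repetitions and the seed are arbitrary choices, and each request is assumed to be an independent yes-or-no trial with the same 20 percent chance.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

p = 0.2          # chance that any one request is accepted
n = 25           # requests per sample
trials = 10_000  # number of simulated samples (arbitrary choice)

sd_single = math.sqrt(p * (1 - p))     # about 0.4
se_theory = sd_single / math.sqrt(n)   # about 0.08

# Each simulated sample: count acceptances among n requests, divide by n.
proportions = [
    sum(random.random() < p for _ in range(n)) / n for _ in range(trials)
]

print(f"mean of proportions : {statistics.fmean(proportions):.4f}")
print(f"observed std. dev.  : {statistics.stdev(proportions):.4f}")
print(f"theoretical std. dev: {se_theory:.4f}")
```

With only 25 requests the proportion can take just 26 distinct values, so the normal curve is a rough fit here. Raising n in the sketch shows the fit improving, as the text says it should.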