Sample Standard Deviation and Degrees of Freedom
The sample mean, x̄, is the sum of all of the values of x, divided by n. So why isn't the sample variance the sum of all of the squared deviations of the xᵢ from x̄, divided by n? Why do we divide by n-1 instead? Alternatively, why do we take Σ(xᵢ - x̄)²/n and multiply it by a "correction factor" of n/(n-1)?
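As a concrete check (the data values below are made up for illustration), Python's statistics module exposes both denominators: pvariance divides by n, while variance divides by n-1.

```python
# Compare the two denominators using Python's statistics module:
# pvariance divides by n, variance divides by n - 1.
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(data)
mean = sum(data) / n

biased = sum((x - mean) ** 2 for x in data) / n          # divide by n
unbiased = sum((x - mean) ** 2 for x in data) / (n - 1)  # divide by n - 1

assert abs(biased - statistics.pvariance(data)) < 1e-12
assert abs(unbiased - statistics.variance(data)) < 1e-12
# The two estimates differ by exactly the factor n / (n - 1):
assert abs(unbiased - biased * n / (n - 1)) < 1e-12
```

For this sample, biased is 4.0 and unbiased is 32/7 ≈ 4.571, which is 4.0 multiplied by 8/7.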
The answer is that you lose a degree of freedom when you use the sample to estimate the mean. One way to think of this is that if you tell me the sample mean and then tell me the deviation of every observation but one, I can tell you the value of the last observation without looking at it. That is because I know that the sum of the deviations from the mean must equal zero.
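A minimal sketch of that bookkeeping, with a made-up sample: knowing the mean and every deviation but the last pins down the final observation, because the deviations must sum to zero.

```python
# Given the sample mean and all deviations but one, the last
# observation is forced (hypothetical data for illustration).
data = [3.1, 4.7, 5.2, 6.0, 8.3]
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data[:-1]]  # every deviation but the last
# Deviations from the sample mean sum to zero, so the missing
# deviation is the negative of the sum of the others:
last_deviation = -sum(deviations)
recovered = mean + last_deviation

assert abs(recovered - data[-1]) < 1e-12  # recovers 8.3 without looking
```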
You would not lose a degree of freedom if you had a completely independent estimate of the mean. If you were given the true mean and calculated the deviations of your sample values of x from the true mean, then you could divide by n. If you were given an estimate of the mean from an independent sample, say y, and calculated the deviations of your sample values of x from y, then you could divide by n.
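A small Monte Carlo sketch of this claim (assumed normal draws, with arbitrarily chosen parameters): deviations taken from the true mean and divided by n average out to the true variance, while deviations taken from the sample mean and divided by n fall short.

```python
# Deviations from the TRUE mean, divided by n, give an unbiased
# variance estimate; deviations from the SAMPLE mean, divided by n,
# are biased low. Parameters here are arbitrary illustration values.
import random

random.seed(0)
mu, sigma, n, trials = 10.0, 2.0, 5, 200_000

sum_true, sum_sample = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    sum_true += sum((x - mu) ** 2 for x in xs) / n      # true mean known
    sum_sample += sum((x - xbar) ** 2 for x in xs) / n  # mean estimated

avg_true = sum_true / trials      # close to sigma**2 = 4.0
avg_sample = sum_sample / trials  # close to sigma**2 * (n-1)/n = 3.2

assert abs(avg_true - sigma**2) < 0.1
assert abs(avg_sample - sigma**2 * (n - 1) / n) < 0.1
```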
To see why you cannot divide by n, expand the first term of the sum of squared deviations. That is, expand (x₁ - x̄)². You get
x₁² - 2x₁x̄ + x̄² = x₁² - 2x₁Σ(xᵢ/n) + [Σ(xᵢ/n)]²
Each of the x's in your sample is uncorrelated with the others, but x₁ is certainly correlated with itself. Suppose, for simplicity, that the true mean is zero, so that E[xᵢxⱼ] = 0 for i ≠ j and E[x₁²] = σ². Taking expected values, the middle term becomes -2σ²/n and the last term becomes +σ²/n. Netting out, E[(x₁ - x̄)²] = σ² - 2σ²/n + σ²/n = [(n-1)/n]σ². The same holds for every term of the sum, so each contributes (1/n)[(n-1)/n]σ² to the uncorrected sample variance.
We see the factor (n-1)/n emerging. This is the bias in the uncorrected sample variance: its expected value is [(n-1)/n]σ², not σ². What it means is that the uncorrected sample variance, Σ(xᵢ - x̄)²/n, has to be multiplied by n/(n-1) in order to produce an unbiased estimate of the true population variance, σ². That multiplication is exactly what turns the denominator of n into the familiar n-1.
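A quick simulation sketch of the correction (again with arbitrary, made-up parameters): the uncorrected variance averages to [(n-1)/n]σ², and multiplying by n/(n-1) recovers σ².

```python
# Multiplying the uncorrected variance by n/(n-1) removes the bias.
# Monte Carlo with assumed N(0, sigma**2) draws; parameters arbitrary.
import random

random.seed(1)
sigma, n, trials = 3.0, 4, 200_000

total_uncorrected = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    total_uncorrected += sum((x - xbar) ** 2 for x in xs) / n

avg_uncorrected = total_uncorrected / trials
corrected = avg_uncorrected * n / (n - 1)

# avg_uncorrected is near sigma**2 * (n-1)/n = 6.75;
# corrected is near sigma**2 = 9.
assert abs(avg_uncorrected - sigma**2 * (n - 1) / n) < 0.2
assert abs(corrected - sigma**2) < 0.2
```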