AP Statistics Lectures
by Arnold Kling

Regression in Microsoft Excel

In Microsoft Excel, Regression is under "Tools/Data Analysis". (If you do not see "Data Analysis" under "tools", then you have to install it. To install it, go to "tools/add-ins" and add the "analysis toolpak.")

Here are the steps to do regression in Excel.

1. Set up the data in columns. It might look like this:

yx1x2
213
423
533
744
052

2. Click on "tools/data analysis/regression" and you will get a dialog box. Where it asks you to enter the Y range, go to your spreadsheet and highlight the data cells in the Y column and enter those. For X range, enter the cell range for both X columns combined. Also, check the box that says "residuals" and the box that says "residual plots." Then press "enter" and you get a new worksheet with "summary output."
3. In the Summary output, you get regression statistics. For the example above, these look like this:
 Multiple R 0.928 R Square 0.861 Adjusted R Square 0.723 Standard Error 1.421 Observations 5
Multiple R is the correlation coefficient between the actual and the fitted values. R Square is the square of the correlation coefficient Adjusted R Square means adjusted for degrees of freedom. The standard error is the standard error of the regression.

4. Also in the Summary output, you get "ANOVA," which means "analysis of variance." For the sample data, this shows
 df SS MS F Significance F Regression 2 25.2 12.6 6.22 0.138 Residual 2 4.04 2.02 Total 4 29.2

The regression means the fitted values, the residual means the residuals, and total means the overall Y values.

The degrees of freedom is the number of observations minus the number of X variables. Assuming that the regression includes a constant term (the default option), that counts as an X variable. Thus, if we have x1, x2, and a constant term, then there are three x variables, which if there are five observations means there are only two degrees of freedom for the regression

The SS column stands for sum of squares. It shows the Pythagorean relationship, where the sum of squared fitted values plus the sum of squared residuals add up to the sum of squared Y values.

The MS column probably stands for something like "mean square." I've never used it.

The F statistic is used to calculate the significance of the regression. It is analogous to the chi-square statistic in categorical data. The significance level for F is like a P-value. In this case, an F of .138 suggests that the regression is not signicant at the 10 percent level (because the sample size is so small).

Finally, Excel provides a table of coefficients, their standard errors, and confidence intervals. Below, I just show the coefficients and standard errors.

 Coefficient Standard Err t Stat Intercept -8.08 3.72 -2.18 X variable .263 .46 .57 X variable 3.63 1.03 3.52

This says that the regression equation is

Y = -8.08 + .263X1 + 3.63X2

The t-value for the coefficient on X1 is very low. This means that the coefficient is not significantly different from zero and that this variable is not adding any explanatory power to the equation. If your goal was strictly to predict Y, you would drop this variable and re-estimate an equation using only a constant term and X2 on the right-hand side.

Finally, the residuals and the "residual plots" let you look for patterns in the residuals. These can allow you to detect nonlinearity, influential points, and other complications.