AP Statistics Lectures

by Arnold Kling

Regression in Microsoft Excel

In Microsoft Excel, Regression is under "Tools/Data Analysis". (If you do not see "Data Analysis" under "tools", then you have to install it. To install it, go to "tools/add-ins" and add the "analysis toolpak.")

Here are the steps to do regression in Excel.

- Set up the data in columns. It might look like this:
y x1 x2 2 1 3 4 2 3 5 3 3 7 4 4 0 5 2 - Click on "tools/data analysis/regression" and you will get a dialog box. Where it asks you to enter the Y range, go to your spreadsheet and highlight the data cells in the Y column and enter those. For X range, enter the cell range for both X columns combined. Also, check the box that says "residuals" and the box that says "residual plots." Then press "enter" and you get a new worksheet with "summary output."
- In the Summary output, you get regression statistics. For the example above, these look like this:
Multiple R 0.928 R Square 0.861 Adjusted R Square 0.723 Standard Error 1.421 Observations 5 - Also in the Summary output, you get "ANOVA," which means "analysis of variance." For the sample data, this shows
df SS MS F Significance F Regression 2 25.2 12.6 6.22 0.138 Residual 2 4.04 2.02 Total 4 29.2 The regression means the fitted values, the residual means the residuals, and total means the overall Y values.

The degrees of freedom is the number of observations minus the number of X variables. Assuming that the regression includes a constant term (the default option), that counts as an X variable. Thus, if we have x1, x2, and a constant term, then there are three x variables, which if there are five observations means there are only two degrees of freedom for the regressionThe SS column stands for sum of squares. It shows the Pythagorean relationship, where the sum of squared fitted values plus the sum of squared residuals add up to the sum of squared Y values.

The MS column probably stands for something like "mean square." I've never used it.

The F statistic is used to calculate the significance of the regression. It is analogous to the chi-square statistic in categorical data. The significance level for F is like a P-value. In this case, an F of .138 suggests that the regression is not signicant at the 10 percent level (because the sample size is so small).Finally, Excel provides a table of coefficients, their standard errors, and confidence intervals. Below, I just show the coefficients and standard errors.

Coefficient Standard Err t Stat Intercept -8.08 3.72 -2.18 X variable .263 .46 .57 X variable 3.63 1.03 3.52 This says that the regression equation is

Y = -8.08 + .263X

_{1}+ 3.63X_{2}The t-value for the coefficient on X

_{1}is very low. This means that the coefficient is not significantly different from zero and that this variable is not adding any explanatory power to the equation. If your goal was strictly to predict Y, you would drop this variable and re-estimate an equation using only a constant term and X_{2}on the right-hand side.Finally, the residuals and the "residual plots" let you look for patterns in the residuals. These can allow you to detect nonlinearity, influential points, and other complications.

### A Project

Download a spreadsheet by clicking here and saving the file as something like "mytraffic.xls."

This is based on the monthly page views that we got when I was running a web site called Homefair. The goal of the project is to come up with a regression equation to predict page views in subsequent months. The challenge is that the relationship between page views and time is nonlinear.

You can start by estimating a linear regression with "traffic" as the dependent variable and "period" as the independent variable. You can then try a number of alternatives. For example, you can create a column that is the log of "traffic"--call this lntraffic. You can use that as the dependent variable. Also, you could try estimating two separate regressions, one for periods 1-40 and another for periods 41-79.

Other ideas you might consider would be inserting other columns for an indepemt variable. You might create a variable that is set equal to 1 each January and equal to 0 all other months. You can call this variable "January." See if using that variable along with "period" in the regression improves the fit, as measured by adjusted R-squared.

Plots of the residuals may suggest to you other ways to modify "period" to help improve the fit. Try some of them.