AP Statistics Audio Lectures

Topics in Regression

by Arnold Kling


Not all relationships are linear. Often, you can tell from a scatterplot that the relationship is not going to be linear. If the scatterplot is not conclusive, you may find evidence of nonlinearity in a plot of the residuals. That is, instead of plotting Y against X, you plot the residuals of a linear regression against X.

The residuals average zero by construction, so the residuals will be scattered above and below a horizontal line at zero. The pattern to watch out for is residuals that tend to all be above the line for a while, then below the line, then above the line again, or vice versa. If all of the residuals on the far left and far right have the same sign, and the residuals in the middle have the opposite sign, this indicates a nonlinear relationship and tells you that a linear regression was not the right choice.
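To see this residual pattern concretely, here is a minimal sketch using made-up curved data (Y = (X − 5)², which no straight line fits well). The linear fit is computed from the usual least-squares formulas:

```python
# Hypothetical curved data: Y = (X - 5)^2, which no straight line fits well.
xs = list(range(11))
ys = [(x - 5) ** 2 for x in xs]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Least-squares slope and intercept for the (misspecified) linear fit.
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 1) for e in residuals])
# The residuals on the far left and far right are positive, while those in
# the middle are negative -- the U-shaped pattern that signals nonlinearity.
```

Plotting these residuals against X would show the above-below-above pattern described in the text.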

There are many functional forms for nonlinear regression. The two discussed in this course are power regression and exponential regression.

- Exponential regression: Y = ab^{X}
- Power regression: Y = aX^{b}

What will these functions look like? You could try various values of a and b to see. Try plotting the power regression for these combinations of values:

- a = 1, b = 1/2
- a = 1, b = 2
- a = -1, b = 1/2
- a = -1, b = 2
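If you don't have a graphing calculator handy, a quick sketch like this evaluates Y = aX^{b} at a few X values for each suggested pair, which is enough to see the shapes (the X values 1, 4, 9 are chosen only because their square roots are whole numbers):

```python
# Evaluate the power function Y = a * X**b at a few X values for each
# suggested (a, b) pair, to see the shapes without a graphing calculator.
for a, b in [(1, 0.5), (1, 2), (-1, 0.5), (-1, 2)]:
    ys = [a * x ** b for x in (1, 4, 9)]
    print(f"a={a}, b={b}: {ys}")
```

With a = 1, b = 1/2 the curve rises at a decreasing rate (concave down); with b = 2 it rises at an increasing rate (concave up); the a = −1 cases flip each curve upside down.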

The exponential regression is best suited for data that rise at an increasing rate (concave up). For data that rise at a decreasing rate (concave down), the power regression, with 0 < b < 1, works best.

The calculator has functions for both power regression and exponential regression.

Exponential and power regressions can be thought of as linear regressions using data that has been transformed into logs. We will use natural logarithms here.

If we start with

ln(Y) = a + bX and take e to the power of both sides, we have

Y = e^{a+bX} = e^{a}e^{bX} = e^{a}(e^{b})^{X}

which is the exponential regression, with e^{a} playing the role of a and e^{b} playing the role of b in the form Y = ab^{X}.

If we instead start with

ln(Y) = ln(a) + bln(X) and take e to the power of both sides, we have

Y = e^{ln(a)}e^{b ln(X)} = a(e^{ln(X)})^{b} = aX^{b}

which is the power regression
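Both derivations can be checked numerically: fit an ordinary linear regression to the log-transformed data and undo the transformation. This sketch uses made-up data generated from known a and b, so the fits recover those values exactly:

```python
import math

def linreg(xs, ys):
    """Least-squares intercept and slope of ys on xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
            sum((x - xbar) ** 2 for x in xs)
    return ybar - slope * xbar, slope

xs = [1, 2, 3, 4, 5]

# Exponential: regress ln(Y) on X.  Made-up data Y = 2 * e**(0.3 * X).
# Note: the slope here is ln(b) in the Y = a * b**X form.
ys = [2 * math.exp(0.3 * x) for x in xs]
lna, b = linreg(xs, [math.log(y) for y in ys])
print(round(math.exp(lna), 3), round(b, 3))   # recovers a = 2, exponent = 0.3

# Power: regress ln(Y) on ln(X).  Made-up data Y = 2 * X**1.5.
ys = [2 * x ** 1.5 for x in xs]
lna, b = linreg([math.log(x) for x in xs], [math.log(y) for y in ys])
print(round(math.exp(lna), 3), round(b, 3))   # recovers a = 2, b = 1.5
```

This is exactly what a calculator's ExpReg and PwrReg commands do behind the scenes: a linear regression on transformed data.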

This explains why these two functional forms became popular: each turns into a linear regression after a log transformation. For the same reason, it is often useful to plot data on a logarithmic scale.

Suppose that we have a dependent variable, such as grade-point average as a college freshman, that can be predicted on the basis of two variables: high school GPA and SAT score.

We can estimate the equation Y = a + bX_{1} + cX_{2}.

This is called multiple regression (other X's could be included as well).
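A minimal sketch of fitting Y = a + bX_{1} + cX_{2}, using made-up data (Y is freshman GPA, X_{1} high school GPA, X_{2} SAT score in hundreds, generated from known coefficients). It solves the least-squares normal equations directly by Gaussian elimination:

```python
# Made-up data generated from a = 0.5, b = 0.6, c = 0.1.
x1 = [3.0, 3.5, 2.8, 3.9, 3.2, 2.5]          # high school GPA
x2 = [11, 13, 10, 14, 12, 9]                 # SAT score, in hundreds
y  = [0.5 + 0.6 * g + 0.1 * s for g, s in zip(x1, x2)]

# Build the normal equations (X'X) v = X'y for columns [1, X1, X2].
cols = [[1.0] * len(y), x1, [float(s) for s in x2]]
A = [[sum(ci * cj for ci, cj in zip(cols[i], cols[j])) for j in range(3)]
     for i in range(3)]
v = [sum(c * yy for c, yy in zip(cols[i], y)) for i in range(3)]

# Gauss-Jordan elimination: reduce A to the identity, leaving v = (a, b, c).
for i in range(3):
    piv = A[i][i]
    A[i] = [t / piv for t in A[i]]
    v[i] /= piv
    for k in range(3):
        if k != i:
            f = A[k][i]
            A[k] = [t - f * s for t, s in zip(A[k], A[i])]
            v[k] -= f * v[i]

print([round(t, 3) for t in v])   # recovers a = 0.5, b = 0.6, c = 0.1
```

In practice a calculator or statistics package does this for you; the point is that multiple regression is still ordinary least squares, just with more than one X column.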

A typical question is whether a relationship is significant. Worry about both practical and statistical significance, and focus on the slope coefficient, b.

The usual null hypothesis is that b = 0.

The standard error of the regression, s_{e}, equals sqrt[Σe_{i}^{2}/(n-2)], where the e_{i} are the residuals.

The standard error of b, s_{b}, equals s_{e}/sqrt[Σ(x_{i} - x̄)^{2}].

The ratio of b to its standard error has a t distribution with n-2 degrees of freedom, where n is the number of observations. In multiple regression, the degrees of freedom is n-k, where k is the number of estimated coefficients (including the intercept).
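Putting those formulas together on a small made-up data set (numbers chosen only for illustration):

```python
import math

# Hypothetical small data set to show the mechanics of the slope test.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

resid = [y - (a + b * x) for x, y in zip(xs, ys)]
se = math.sqrt(sum(e * e for e in resid) / (n - 2))   # standard error of regression
sb = se / math.sqrt(sxx)                              # standard error of the slope
t = b / sb                                            # t statistic, n - 2 = 3 df
print(round(b, 3), round(sb, 3), round(t, 1))
```

Here the data lie very close to a line, so the t statistic is large and the null hypothesis b = 0 would be rejected; compare t to a t distribution with n − 2 = 3 degrees of freedom.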

Suppose that the dependent variable is muscle strength and the independent variable is lean body mass (LBM). Example taken from Gerard E. Dallal. A printout might say,

Variable | Coeff | Std Err | t | p-value
---|---|---|---|---
const | -13.9 | 10.3 | -1.34 | .181
LBM | 3.016 | 0.22 | 13.8 | .000

What is the equation of the line? Y = -13.9 + 3.016X, where Y is muscle strength and X is LBM.

Is the slope significant? Yes, the p-value is essentially 0.

Other printout information:

- R^{2}: .760
- Adjusted R^{2}: .756 (adjusts for degrees of freedom; the adjustment is bigger in multiple regression)
- Regression Sum of Squares: 68789 (variation in y-hat)
- Residual Sum of Squares: 21770 (variation in the residuals)
- Total Sum of Squares: 90559

Note the Pythagorean relationship: Regression SS + Residual SS = Total SS, since 68789 + 21770 = 90559.

Note also that 68789/90559 = .760, which is R^{2}.
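A two-line check of the printout's arithmetic:

```python
# Check the Pythagorean relationship and R^2 using the printout's numbers.
reg_ss, resid_ss, total_ss = 68789, 21770, 90559
assert reg_ss + resid_ss == total_ss     # Regression SS + Residual SS = Total SS
print(round(reg_ss / total_ss, 3))       # R^2 = 0.76
```

So R^{2} is simply the fraction of the total variation in Y that the regression accounts for.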