AP Statistics Audio Lectures

Linear Regression

by Arnold Kling

Regression is used when the dependent variable is numerical (measured in dollars, inches, or some other numerical scale). Often, the independent variables are also numerical. In this course, we will focus on using just one independent variable (simple regression) as opposed to several independent variables (multiple regression).

For example, suppose we measure average income and average longevity in different countries. Each country gives us a data point. For example, the U.S. might have an average income of $30,000 and an average longevity of 77 years. Mexico might have an average income of $8000 and an average longevity of 70 years.

If we plot data for many countries, we might see a tendency for longevity to be higher when income is higher. We say that longevity and income are positively correlated.

Here is an example of a scatterplot that appeared in the *New York Times* accompanying an article in their health section.

Always plot data first. Looking at a scatterplot, you should draw an imaginary line that seems to best fit the points and ask:

- Is the slope clearly positive or negative? (positive correlation, negative correlation, or no correlation)
- Are the points close to the line? If so, the linear relationship is strong.
- Is there a nonlinear relationship?
- Are there influential points?
- Are there outliers?

The equation of the line is Y = a + bX. For example, suppose that we get a line of best fit that says that longevity = 60 + 0.8X, where X is average income measured in thousands of dollars ($15,000 would be 15). The set of actual points is (X_i, Y_i). The line predicts that for an average income of $8000, longevity will be 60 + 0.8(8) = 66.4 years. If the actual value of Y is 70 years, then the difference between the two, called the residual, is 70 - 66.4 = 3.6 years. For the point ($30,000, 77), the line predicts 60 + 0.8(30) = 84 years. The residual is 77 - 84 = -7 years.
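The arithmetic above can be reproduced with a short Python sketch. It assumes income is entered in thousands of dollars, as the worked numbers imply ($8000 is entered as 8). The function names `predict` and `residual` are illustrative choices, not part of the lecture.

```python
# Line of best fit from the example: longevity = 60 + 0.8 * income,
# with income in thousands of dollars (assumption: $8000 is entered as 8).
A = 60.0  # intercept a
B = 0.8   # slope b


def predict(income_thousands):
    """Predicted longevity in years for a given average income."""
    return A + B * income_thousands


def residual(income_thousands, actual_longevity):
    """Residual = actual value minus predicted value."""
    return actual_longevity - predict(income_thousands)


# The two example points from the text: (income in thousands, longevity)
points = [(8, 70), (30, 77)]
for x, y in points:
    print(f"X={x}: predicted={predict(x):.1f}, residual={residual(x, y):.1f}")
# X=8: predicted=66.4, residual=3.6
# X=30: predicted=84.0, residual=-7.0
```

A positive residual means the point lies above the line, and a negative residual means it lies below.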

$$\hat{Y}_i = a + bX_i$$

$$Y_i - \hat{Y}_i = e_i$$

$$\text{minimize } \sum_i e_i^2$$

Two equivalent formulas for b:

$$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2}$$

or

$$b = r\,\frac{s_Y}{s_X}$$

where r is the correlation coefficient:

$$r = \frac{s_{XY}}{s_X s_Y}$$

$$a = \bar{Y} - b\bar{X}$$

Here $\bar{X}$ and $\bar{Y}$ are the sample means. b is the slope of the line: the change in Y per one-unit change in X. a is the intercept of the line: the predicted value of Y when X is zero.
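As a rough illustration of the slope and intercept formulas, the following Python sketch computes b as the sum of cross-deviations divided by the sum of squared X deviations, and a as the mean of Y minus b times the mean of X. The function name `least_squares` and the five data points are made up for illustration and do not come from the lecture.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of (X_i - X_bar)(Y_i - Y_bar)
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of (X_i - X_bar)^2
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, chosen only so the arithmetic is easy to follow
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))  # 2.2 0.6
```

Because a is defined from the means, the fitted line always passes through the point of averages (the mean of X, the mean of Y).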

With the calculator, you would create two lists (A and B) and use calc/linreg to obtain a and b.

r measures the strength and direction of the linear relationship between X and Y. A positive r corresponds to a positive relationship. A negative r corresponds to a negative relationship.

r cannot be greater than 1 in absolute value. r close to 1 is a near-perfect fit (positive). r close to -1 is a near-perfect fit (negative).
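The correlation coefficient and its link to the slope can be checked with a small Python sketch. It computes r as the sample covariance divided by the product of the sample standard deviations, then recovers the slope as r times the ratio of the standard deviation of Y to that of X. The helper names and the data are hypothetical, and the sketch assumes the usual n - 1 divisor for sample statistics.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_cov(xs, ys):
    """Sample covariance s_XY (divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation coefficient r = s_XY / (s_X * s_Y)."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
b_from_r = r * sample_sd(ys) / sample_sd(xs)   # slope via b = r * s_Y / s_X
b_from_cov = sample_cov(xs, ys) / sample_sd(xs) ** 2  # slope via s_XY / s_X^2

print(round(r, 4))           # 0.7746
print(round(b_from_r, 4))    # 0.6
print(round(b_from_cov, 4))  # 0.6
```

Both routes give the same slope, and r stays between -1 and 1 whatever the units of X and Y.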

$R^2$ is the square of r. It is never negative and lies between 0 and 1. With a good fit, it is close to 1; with a bad fit, it is close to 0.

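Pythagorean relationship: the total variation in Y equals the variation explained by the line plus the sum of squared residuals.

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2$$

Dividing both sides by the total variation in Y implies

$$1 = R^2 + \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2}$$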
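The Pythagorean relationship can be verified numerically. The sketch below, on made-up data, fits the least-squares line and then compares three sums of squares: total (Y around its mean), explained (fitted values around the mean of Y), and residual. The variable names `sst`, `ssr`, and `sse` are a common convention assumed here, not notation from the lecture.

```python
# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

fitted = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

sst = sum((y - y_bar) ** 2 for y in ys)      # total sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)  # explained by the line
sse = sum(e ** 2 for e in residuals)         # sum of squared residuals

r_squared = ssr / sst

print(round(sst, 4), round(ssr, 4), round(sse, 4))  # 6.0 3.6 2.4
print(round(r_squared, 4))                          # 0.6
print(round(r_squared + sse / sst, 4))              # 1.0
```

The identity holds for a least-squares line with an intercept; for a line fitted some other way, the three sums need not add up.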

Correlation does not show causality. Other possibilities include reverse causality or a third factor that drives both variables. For example, does a higher savings rate lead to better health, or vice versa? Or are both the consequence of the way people make decisions?