Type I errors, Type II errors, and Congress

The Wall Street Journal reports,

Lawmakers of both parties questioned Sunday whether law-enforcement officials did enough to monitor the activities of suspected Boston Marathon bomber Tamerlan Tsarnaev before last week’s terrorist attack, given his apparent extremist beliefs.

The failure to stop Tsarnaev was a type I error. However, there are probably hundreds of young men in America with profiles that have at least as many “red flags” as he had, and few, if any, are likely to commit acts of terrorism. One sure bet is that for the next several years we will see a lot more type II errors, in which the FBI monitors innocent people.

Speaking of type I errors, type II errors, and Congress, I will be testifying at a hearing on mortgage finance on Wednesday morning for the House Committee on Financial Services. Part of what I plan to say:

It is impossible to make mortgage decisions perfectly. Sometimes, you make a reasonable decision to approve a loan, and later the borrower defaults. Sometimes, you make a reasonable decision to deny a loan, and yet the loan would have been repaid. Beyond that, good luck with home prices can make any approval seem reasonable and bad luck with home prices can make any approval seem unreasonable. During the bubble, Congress and regulators beat up on mortgage originators to get them to be less strict. Since then, Congress and regulators have been beating up on mortgage originators to be especially strict. I expect mortgage originators to make mistakes, but the fact is that they do a better job without the “advice” that they get from you.

Here is my talk on type I and type II errors for my housing course.
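
As a rough illustration of the tradeoff (a minimal sketch with made-up repayment rates and cutoffs, not anything from the testimony itself), imagine an originator who sees only a noisy credit signal and approves loans above some cutoff. Moving the cutoff trades one kind of error for the other; it cannot eliminate both.

```python
import random

random.seed(0)

def simulate(cutoff, n=100_000):
    """Count mistakes for a lender who approves whenever a noisy score clears a cutoff.
    Following the post's labels: type I = approving a loan that later defaults,
    type II = denying a loan that would have been repaid. All parameters are invented."""
    type1 = type2 = 0
    for _ in range(n):
        quality = random.gauss(0, 1)            # true (unobservable) repayment propensity
        score = quality + random.gauss(0, 1)    # the noisy signal the originator actually sees
        would_repay = quality > -0.5            # assumption: roughly 69% of applicants would repay
        approved = score >= cutoff
        if approved and not would_repay:
            type1 += 1
        elif not approved and would_repay:
            type2 += 1
    return type1 / n, type2 / n

for cutoff in (-1.0, 0.0, 1.0):
    t1, t2 = simulate(cutoff)
    print(f"cutoff {cutoff:+.1f}: bad loans approved {t1:.1%}, good loans denied {t2:.1%}")
```

Loosening the cutoff, as regulators urged during the bubble, shrinks the share of good loans denied while raising the share of bad loans approved; tightening it, as they urge now, does the reverse.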

This is Too Much

Jared Bernstein writes,

had R&R gone through the peer-review process, I’m fairly confident that a) the spreadsheet error would NOT have been found, but b) the paper would have been sent back to them for failing to provide even a cursory analysis of the possibility of reverse causality (slower growth leading to higher debt/GDP ratios vs. the R&R claim of the opposite). Re “a,” peer reviewers do not routinely replicate findings, though they should when possible (more work these days is with proprietary data sets which cannot legally be shared).

Pointer from Mark Thoma.

I have not been commenting on the Reinhart and Rogoff fracas. My view of empirical macroeconomics is that there are hardly any reliable findings, so I always brushed aside the notion that there is some adverse growth impact of having a debt to GDP ratio of 90 percent. But some people took it seriously. And now the left is howling that all of the austerity policies in the world are due to Reinhart and Rogoff, and they should be burned at the stake, or something.

But speaking of unreliable findings in empirical macroeconomics, this is the same Jared Bernstein who co-authored a memo for President Obama saying that the multiplier is 1.54, as if we know what it is with that precision. (I do not think we know with any precision that it even has a positive sign.) He has about as much right to complain about Reinhart and Rogoff as a crack-head has to complain about somebody who got drunk once.

And do read F. F. Wiley (pointer from Tyler Cowen).

More on Schooling, Deschooling, and the Null Hypothesis

Four links.
1. A NYT article on computerized grading of essays. I highlight the response of the Luddites:

“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman, a retired director of writing and a current researcher at M.I.T.

He is among a group of educators who last month began circulating a petition opposing automated assessment software. The group, which calls itself Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment, has collected nearly 2,000 signatures, including some from luminaries like Noam Chomsky.

“Let’s face the realities of automatic essay scoring,” the group’s statement reads in part. “Computers cannot ‘read.’ They cannot measure the essentials of effective written communication: accuracy, reasoning, adequacy of evidence, good sense, ethical stance, convincing argument, meaningful organization, clarity, and veracity, among others.”

Suppose, for the sake of argument, that the software does poorly now and can be fooled easily. My bet is that within five years there will be software that can pass a Turing test of the following sort.

a. Assign 100 essays to be graded by four humans and the computer.

b. Show the graded essays to professors, without telling them which set was computer-graded, and have them rank the five sets of essays in terms of how well they were graded.

c. See if the computer’s grading comes in higher than 5th.
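
Here is one crude way the outcome of such a test could be tallied, sketched with simulated scores rather than real essays, and with each grader's performance proxied by agreement with the consensus of the other graders, a cruder, automated stand-in for the professors' judgment in step (b). The noise levels below are invented.

```python
import random
import statistics

random.seed(1)

NUM_ESSAYS = 100

# Latent "true" quality of each essay, plus five graders (four humans, one machine)
# who each observe it with noise. The noise levels are assumptions for illustration.
true_quality = [random.gauss(0, 1) for _ in range(NUM_ESSAYS)]

def grade(noise_sd):
    return [q + random.gauss(0, noise_sd) for q in true_quality]

graders = {f"human_{i}": grade(0.6) for i in range(1, 5)}
graders["computer"] = grade(0.8)   # assume the machine is somewhat noisier

def agreement(name):
    """Correlation between one grader's scores and the average of the other graders'
    scores -- a crude proxy for how well that grader graded."""
    others = [scores for n, scores in graders.items() if n != name]
    consensus = [statistics.mean(vals) for vals in zip(*others)]
    return statistics.correlation(graders[name], consensus)

ranking = sorted(graders, key=agreement, reverse=True)
for place, name in enumerate(ranking, start=1):
    print(place, name, round(agreement(name), 3))
# The computer "passes" this crude version of the test if it does not come in 5th.
```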

While we are waiting for this test, the NYT article points to a nice paper by Mark D. Shermis summarizing results of a comparison of various software essay-grading systems.

2. Isegoria points to Bloom’s 2-Sigma Problem,

The two-sigma part refers to average performance of ordinary students going up by two standard deviations when they received one-to-one tutoring and worked on material until they mastered it, and the problem part refers to the fact that such tutoring doesn’t come cheap.

I am skeptical. It is possible that this educational intervention is so radically different from anything else that has ever been tried that it works much better than other interventions. But I would bet that if another set of researchers were to attempt to replicate this study, they would fail to find similar results. In social science in general, we do too little replication. This is particularly important when someone claims to have made a striking finding.
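
One reason to expect a striking estimate to shrink on replication, even with honest researchers, is selection on statistical significance in small samples. The sketch below uses invented numbers (a true effect of 0.4 standard deviations, 20 students per arm) and is not a claim about Bloom's actual studies; it only shows how the estimates that get noticed overstate the effect a replication would be chasing.

```python
import random
import statistics

random.seed(2)

TRUE_EFFECT = 0.4   # assumed true effect, in standard deviations
N_PER_ARM = 20      # assumed (small) sample size in each group
STUDIES = 2000      # many teams try the intervention; only "significant" results get noticed

def one_study():
    control = [random.gauss(0, 1) for _ in range(N_PER_ARM)]
    treated = [random.gauss(TRUE_EFFECT, 1) for _ in range(N_PER_ARM)]
    estimate = statistics.mean(treated) - statistics.mean(control)
    std_err = statistics.pstdev(control + treated) * (2 / N_PER_ARM) ** 0.5
    return estimate, estimate > 1.96 * std_err   # estimate, and whether it looks "significant"

published = [est for est, significant in (one_study() for _ in range(STUDIES)) if significant]
print("true effect:", TRUE_EFFECT)
print("average estimate among the 'significant' studies:",
      round(statistics.mean(published), 2))
```

In runs like this, the estimates that survive the significance filter average roughly twice the assumed true effect, which is exactly the sort of gap a careful replication would expose.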

3. In the comments on this post, I found this one particularly interesting and articulate:

I think K-12 public schools are about warehousing children, giving parents childcare, whether they are at work or simply want a break from being around their kids (the quality of parenting going on is incredibly wide-ranging).

…why the current system is still in place: Cost, Convenience, Comfortability and Childcare. Unfortunately, the one-size-fits-all approach is ineffective, makes young people passionately hate school (which breeds some serious anti-intellectual pathologies) and is becoming even more centralized in curriculum and control. (See Common Core curriculum adopted by 48 states.)

I think that the Childcare aspect deserves more notice. When President Obama supports universal pre-school, the “scientific” case is based almost entirely on taking kids out of homes of low-functioning parents. But what affluent parents hear is “Obama is going to pay for my child care,” and that is what makes the policy popular.

More generally, assume that as a parent you believe that your comparative advantage is to work rather than to spend the entire day with your child. Then ask yourself why you would prefer to have your child in school rather than home without supervision. Even if the child learns less at school than at home, you still might prefer the school, as long as you are convinced that it reduces the risk of your child getting into really bad trouble.

4. From Michael Strong, in a long comment pushing back on my post last week.

No one doubts that if one compares one group that receives significant practice in an activity against another group with no exposure to the activity at all, that a treatment effect exists.

Why then are so many people skeptical that interventions in education make a difference? Largely because the comparisons exist between idiotic variations within a government-dominated industry.

As a rejoinder, I might start by changing “receives significant practice” to “engages in significant practice.” “Learning a skill” and “engaging in significant practice” are so closely related that I would say that, to a first approximation, they are the same thing.

This leads me to the following restatement of the null hypothesis.

The null hypothesis is that when you attempt an educational intervention, such as a new teaching method, the overall economic value of the skills that an individual acquires from age 5 to 20 is not affected by that intervention. I will grant that if you take two equivalent groups of young people and give one group daily violin lessons and the other group daily clarinet lessons, then the first group is more likely to end up better violinists on average.

But when economists measure educational outcomes, they usually look at earnings, which result from the market value of skills acquired. To affect that, you have to affect the ability and willingness of a person to engage in practice in a combination of generally applicable fields and fields that are that person’s comparative advantage.

Aptitude and determination matter. Consider Malcolm Gladwell’s “10,000 hour rule” for becoming an expert at something. There is a huge selection bias going on in that rule. How many people who have little aptitude for shooting a basketball are going to keep practicing basketball for 10,000 hours?

When you consider how hard it is to move the needle half a standard deviation on a fourth-grade reading comprehension exam, the chances are slim that you are going to come up with something that affects long-term overall outcomes. Until we get the Young Lady’s Illustrated Primer.

Causal Density is a Bear

The Economist reports,

The mismatch between rising greenhouse-gas emissions and not-rising temperatures is among the biggest puzzles in climate science just now. It does not mean global warming is a delusion. Flat though they are, temperatures in the first decade of the 21st century remain almost 1°C above their level in the first decade of the 20th. But the puzzle does need explaining.

On a separate but related topic, Noah Smith writes,

DSGE models are highly sensitive to their assumptions. Look at the difference in the results between the Braun et al. paper and the Fernandez-Villaverde et al. paper. Those are pretty similar models! And yet the small differences generate vastly different conclusions about the usefulness of fiscal policy. Now realize that every year, macroeconomists produce a vast number of different DSGE models. Which of this vast array are we to use? How are we to choose from the near-infinite menu of very similar models, when small changes in the (obviously unrealistic) assumptions of the models will probably lead to vastly different conclusions? Not to mention the fact that an honest use of the full nonlinear versions of these models (which seems only appropriate in a major economic upheaval) wouldn’t even give you definite conclusions, but instead would present you with a menu of multiple possible equilibria?

James Manzi’s Uncontrolled pinpoints the problem, in what he calls causal density. When there are many factors that have an impact on a system, statistical analysis yields unreliable results. Computer simulations give you exquisitely precise unreliable results. Those who run such simulations and call what they do “science” are deceiving themselves.
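
A toy version of Manzi's point: generate data from a system with a few dozen correlated causes, estimate the same regression on two equally valid samples of modest size, and compare the coefficients. The sizes, correlations, and noise level below are all made up; the instability is what matters.

```python
import numpy as np

rng = np.random.default_rng(3)

K, N = 30, 60                   # many causal factors, few observations (made-up sizes)
beta = rng.normal(0, 1, K)      # the "true" effects

def draw_sample():
    common = rng.normal(0, 1, (N, 1))
    X = common + rng.normal(0, 1, (N, K))   # the factors share a common component
    y = X @ beta + rng.normal(0, 3, N)      # plus plenty of residual noise
    return X, y

fits = []
for _ in range(2):
    X, y = draw_sample()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fits.append(coef)

# The same specification, fit to two equally legitimate samples, gives noticeably
# different coefficients -- each of them reported to arbitrary precision.
for j in range(5):
    print(f"factor {j}: true {beta[j]:+.2f}, "
          f"sample A {fits[0][j]:+.2f}, sample B {fits[1][j]:+.2f}")
```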

Russ Roberts and Edward Leamer

I love this video, but that is because I agree so much with Leamer.

One thing I would point out about his charts is that he uses trend lines and implies that mean reversion is the norm. That is, for most of the postwar period, if you had a recession that took GDP below trend, you would then have above-trend growth. An alternative hypothesis is that real GDP follows a random walk with drift. That would mean that it always tends to grow at 3 percent, regardless of its recent behavior. The last three recessions seem to follow such a model.
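
To make the distinction concrete, here is a minimal sketch; the 3 percent drift, the reversion speed, and the size of the recession shock are placeholders.

```python
import random

random.seed(4)

DRIFT = 0.03      # assumed trend growth of 3 percent (log units)
PHI = 0.7         # assumed speed at which the trend gap closes in the mean-reverting case
SHOCK = -0.05     # a one-time recession that knocks output 5 percent below trend
YEARS = 8

gap = SHOCK                      # log deviation from trend, mean-reverting case
print("year  mean-reverting   random walk with drift")
for year in range(1, YEARS + 1):
    new_gap = PHI * gap                          # part of the gap closes each year
    mr_growth = DRIFT + (new_gap - gap)          # trend growth plus catch-up growth
    rw_growth = DRIFT + random.gauss(0, 0.005)   # no catch-up: growth stays near the drift
    gap = new_gap
    print(f"{year:>4}  {mr_growth:14.1%}  {rw_growth:21.1%}")
```

Under the first column, a recession is followed by several years of above-trend growth as the gap closes; under the second, output simply resumes growing at the drift rate and the lost ground is never made up.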

In the 1980s, some folks, notably Charles Nelson and Charles Plosser, argued strenuously against mean reversion and in favor of the random walk with drift. Note that this was back when Leamer describes output and employment as mean-reverting. I wonder whether what happens as data get revised over long periods of time is that random walks get turned into mean-reverting trends.

Note Tyler Cowen’s comment on the latest employment report:

we are recovering OK from the AD crisis, but the structural problems in the labor market are getting worse. It’s becoming increasingly clear those structural problems were there all along and also that they are a big part of the real story. On the AD side, mean-reversion really is taking hold, as it should and as is predicted by most of the best neo-Keynesian models.

Quintile Mobility: Built-in Properties

Timothy Taylor writes,

For example, for all those born into the bottom quintile, 44% are still in that quintile as adults. About half as many, 22%, rise to the second quintile by adulthood. The percentages go down from there. … Similarly, those born into the top income quintile are relatively likely to remain in the top. Among children born into the top quintile, 47% are still there as adults. Only 7% fall to the bottom quintile. The experiences of those born into the middle three quintiles are quite different. The distribution among income quintiles as adults is much more even for those born in these three middle groups, suggesting significant mobility for these individuals. … This pattern has led researchers to conclude that the U.S. income distribution has a fairly mobile middle, but considerable “stickiness at the ends” …

This result is nearly an arithmetical certainty. Suppose that everyone faces three equally probable outcomes:

–their income as adults puts them in the same quintile as their parents
–their income as adults rises enough to move up a quintile
–their income as adults falls enough (in relative terms) to move down a quintile

If this were the case, then people in the top quintile would have a 2/3 chance of remaining at the top, because those who draw the “move up” outcome have nowhere higher to go and stay in the top quintile. Similarly, people in the bottom quintile would have a 2/3 chance of remaining at the bottom, because those who draw the “move down” outcome likewise stay where they are. People in the middle quintiles would have only a 1/3 chance of remaining in their original quintile, because they can move in either direction. This pattern would lead researchers to conclude that the U.S. income distribution has a fairly mobile middle but considerable stickiness at the ends, even though by construction everyone in every quintile faces the same probability of moving up or down the income scale.
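
The arithmetic is easy to verify with a five-state transition matrix: interior quintiles move down, stay, or move up with probability 1/3 each, and the blocked move at each end (down from the bottom, up from the top) turns into staying put.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Rows = parents' quintile (1 = bottom), columns = children's quintile.
# Interior quintiles: 1/3 down, 1/3 stay, 1/3 up. At the ends, the blocked
# move (down from the bottom, up from the top) becomes staying put.
P = [
    [2 * third, third, 0, 0, 0],
    [third, third, third, 0, 0],
    [0, third, third, third, 0],
    [0, 0, third, third, third],
    [0, 0, 0, third, 2 * third],
]

for q, row in enumerate(P, start=1):
    print(f"quintile {q}: stays with probability {row[q - 1]}")

# Each column also sums to one, so the quintiles keep their 20% shares over time.
print("columns sum to one:", all(sum(col) == 1 for col in zip(*P)))
```

The apparent stickiness at the ends is purely a boundary effect of measuring mobility in quintiles.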

Tyler Cowen on James Buchanan

He writes,

He thought through the conflict between subjective and objective notions of value in economics, and the importance of methodologically individualist postulates, more deeply than perhaps any other economist. Most economists hate this work, or refuse to understand it, either because it lowers their status or because it is genuinely difficult to follow or because it requires philosophy.

This is the aspect of Buchanan that I picked up on, but that is only one of 13 items in Tyler’s post.

Unpacking the Term “Probability”

My new essay on probability concludes:

Producers and consumers live in a world of non-repeatable events…Treating probabilities as if they were objective is a conceptual error. It is analogous to the conceptual errors that treat value as objective…We will be less likely to overstate the robustness of equilibrium and the precision of economic models if we stop conflating subjective degrees of conviction with verifiable scientific concepts of probability.

I argue that one cannot assign an objective probability to a non-repeatable event, such as “Will Hurricane Sandy cause flooding in the New York subway system?” I could have used “Will Barack Obama win re-election?” as my illustrative example, given that Nate Silver famously assigned a very precise-sounding probability to that event.