Null Hypothesis watch

In 1987, Peter Rossi wrote,

The Iron Law of Evaluation: “The expected value of any net impact assessment of any large scale social program is zero.”

The Iron Law arises from the experience that few impact assessments of large scale social programs have found that the programs in question had any net impact. The law also means that, based on the evaluation efforts of the last twenty years, the best a priori estimate of the net impact assessment of any program is zero, i.e., that the program will have no effect.

The Stainless Steel Law of Evaluation: “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.”

This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero—or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches. [p. 5]

The Brass Law of Evaluation: “The more social programs are designed to change individuals, the more likely the net impact of the program will be zero.”

This law means that social programs designed to rehabilitate individuals by changing them in some way or another are more likely to fail. The Brass Law may appear to be redundant since all programs, including those designed to deal with individuals, are covered by the Iron Law. This redundancy is intended to emphasize the especially difficult task in designing and implementing effective programs that are designed to rehabilitate individuals.

I arrived at this by following Tyler Cowen’s recommendation to check out Gwern and then reading the latter’s essay on why correlation is so frequent and causation is so rare.

My comments on the Rossi article.

1. James Manzi had very similar thoughts in Uncontrolled. Is that correlation or causation? Concerning the “brass law,” Manzi said that you are more likely to effect change by taking people’s nature as given and changing their incentives.

2. Imagine how much more often we would see these sorts of results if it were not for social desirability bias in reporting on interventions.

Adversity and SAT scores

The WSJ had an article in the print edition on November 27 that I cannot find online (their search function is not helpful). The print article was called ‘Adversity’ Has Big Effect on SAT Scores. What I can find online instead is this: What Happens if SAT Scores Consider Adversity? Find Your School.

Anyway, the WSJ uses a Georgetown education researcher’s regression equation relating SAT scores to “adversity scores” to make inferences such as

Top public magnet schools performed exceptionally well in adjusted SAT scores, meaning their scores jump when adversity is accounted for.

To see why this is not a valid inference, suppose that there were two students of identical backgrounds but different ability levels. Presumably, the magnet school would select the student with higher ability, leaving the other student to attend a regular school. The more able student would get a higher SAT score, but that would say nothing about the magnet school’s “performance.”
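To make the selection story concrete, here is a minimal simulation (all numbers made up). The schools add nothing to scores, and the two students in each pair face identical adversity, yet the magnet schools come out ahead simply because they admit the higher-ability student:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pairs of students with identical adversity but different ability.
adversity = rng.normal(0, 1, n)
ability_a = rng.normal(0, 1, n)
ability_b = rng.normal(0, 1, n)

# The magnet school admits whichever student in each pair has higher ability.
magnet_ability = np.maximum(ability_a, ability_b)
regular_ability = np.minimum(ability_a, ability_b)

# SAT depends on ability and adversity; the school itself adds nothing.
def sat(ability):
    return 1000 + 100 * ability - 50 * adversity + rng.normal(0, 30, n)

magnet_sat = sat(magnet_ability)
regular_sat = sat(regular_ability)

# Adversity is identical across the two groups, so "adjusting" for it changes
# nothing, yet the magnet schools show a large advantage driven purely by selection.
print(magnet_sat.mean() - regular_sat.mean())
```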

I sent a letter to the editor of the WSJ about this, but they did not print it. Still, I hope that someone there gets the message that this was statistical malpractice.

Doubts about teacher value added

Marianne Bitler and others write,

Using administrative data from New York City, we find estimated teacher “effects” on height that are comparable in magnitude to actual teacher effects on math and ELA achievement, 0.22σ compared to 0.29σ and 0.26σ respectively. On its face, such results raise concerns about the validity of these models.

. . .our results provide a cautionary tale for the naïve application of VAMs to teacher evaluation and other settings. They point to the possibility of the misidentification of sizable teacher “effects” where none exist. These effects may be due in part to spurious variation driven by the typically small samples of children used to estimate a teacher’s individual effect.

VAMs = value-added models. Pointer from a reader. I note that some recent NBER working papers are now free downloads. Others are not. This one is.

Lest you miss the point, this paper shows that the same methods that purport to show an effect of teachers on student achievement also show an effect of teachers on student height. But the effect of teachers on height is almost surely spurious. So the effect of teachers on achievement may also be spurious.
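A back-of-the-envelope simulation (not the paper’s actual value-added model, just a sketch of the small-sample point) shows how “teacher effects” of roughly the reported magnitude can emerge even when teachers have no effect at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 1000, 25

# Standardized student height is pure noise; teachers have zero true effect.
heights = rng.normal(0, 1, size=(n_teachers, class_size))

# A naive "value-added" estimate: each teacher's classroom mean.
naive_effects = heights.mean(axis=1)

# The spread of these estimated "effects" is about 1/sqrt(class_size) = 0.2
# standard deviations -- roughly the magnitude reported for height.
print(naive_effects.std())
```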

1. This provides vindication for Jerry Muller’s The Tyranny of Metrics.

2. It provides support for the Null Hypothesis.

3. The research that seemed to show a big effect of teachers (e.g., Raj Chetty on kindergarten teachers) got a lot of play in the press. But that had social desirability bias going for it. I would be surprised if this paper receives similar notice.

On this year’s Nobel Prize in economics

It goes to Abhijit Banerjee, Esther Duflo and Michael Kremer for work on field experiments in the economics of (under-) development. Alex Tabarrok at Marginal Revolution has coverage, starting here.

I am currently drafting an essay suggesting Edward Leamer for the Nobel Prize. Last week, I wrote the following paragraph:

The significance of what Angrist and Pischke termed the “credibility revolution in empirical economics” can be seen in the John Bates Clark Medal awards given to researchers who participated in that revolution. Between 1995 and 2015, of the fourteen Clark Medal winners, by my estimate at least seven (Card, Levitt, Duflo, Finkelstein, Chetty, Gentzkow, and Fryer) are known for their empirical work using research designs intended to avoid the problems that Leamer highlighted with the multiple-regression approach.

This year’s Nobel, by including Duflo, would seem to strengthen my case for Leamer.

Vector autoregression

A commenter asks,

I’m curious what your opinion on Christopher Sims’s econometric work is now.

Sims is another macro-econometrician who was awarded a Nobel Prize for work that I think is of no use.

The problem in macro is causal density–there is a high ratio of plausible causal mechanisms to data. If you have dozens of causal variables and only a relative handful of data points, what do you do?

The conventional approach was for the investigator to impose many constraints on the regression model. This is mathematically equivalent to adding rows of “data” that do not come from the real world but instead are constructed by the investigator to conform exactly to the investigator’s pet theories. The net result is that you learn what the investigator wanted the data to look like. But other investigators can–and do–produce very different empirical narratives for the same real-world observations.
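Here is a toy illustration of that equivalence, in the spirit of Theil-style mixed estimation (variable names and numbers are made up): imposing the constraint “x2 has no effect” by decree gives essentially the same answer as appending a heavily weighted fake observation that encodes the constraint.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Constraint imposed directly: the investigator decrees that x2 has no effect
# and simply drops it from the regression.
b_constrained = ols(X[:, :2], y)

# The same constraint imposed by appending a fabricated "observation" that says
# beta_2 = 0 with enormous weight -- a crude version of mixed estimation.
w = 1e6
X_aug = np.vstack([X, [0.0, 0.0, w]])
y_aug = np.append(y, 0.0)
b_pseudo = ols(X_aug, y_aug)

print(b_constrained)   # intercept and slope on x1
print(b_pseudo[:2])    # essentially identical; the coefficient on x2 is forced to ~0
```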

Sims’ approach was for the investigator to narrow down the number of causal variables, so that the computer can produce a model without the investigator doctoring the data. But that is not a solution to the causal density problem. If there are many important causal variables in the real world, then in a non-experimental setting, restricting yourself to looking at a few variables at a time is pointless.
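For readers who have not seen one, here is a minimal two-variable VAR on fabricated data, using statsmodels; Sims’ real applications used actual macro series and more variables, but the structure is the same: each variable is regressed on lags of itself and of the others, with no exclusion restrictions imposed by the investigator.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(3)
T = 200

# Two made-up macro series; inflation responds to last period's growth.
gdp_growth = rng.normal(2, 1, T)
inflation = 0.5 * np.roll(gdp_growth, 1) + rng.normal(2, 1, T)  # harmless wraparound at t=0
data = pd.DataFrame({"gdp_growth": gdp_growth, "inflation": inflation})

# Each variable regressed on two lags of both variables, no restrictions imposed.
results = VAR(data).fit(maxlags=2)
print(results.summary())
```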

Correlation without causation

April L. Bleske-Rechek writes,

We reviewed a random sample of poster abstracts that had been accepted for presentation at an annual convention of the premier professional organization in psychology, the Association for Psychological Science. We were disappointed to find that over half of the abstracts that included cause and effect language did so without warrant (i.e., the research was correlational). Of course, poster presentations are held to a less rigorous standard than are formal talks or published journal articles, so in a follow-up study, we reviewed 660 articles from 11 different well-known journals in the discipline. Our findings replicated: over half of the articles with cause and effect language described studies that were actually correlational; in other words, the causal language was not warranted.

In biological families, children resemble their parents in vocabulary and verbal ability; in adoptive families, they do not. The key implication is that Hart and Risley’s finding of a link between parents’ verbal behavior and their children’s verbal ability does not warrant an inference that parents’ verbal behavior influences their children’s verbal ability.

Somebody should put together a YouTube course on “How to be skeptical of statistical studies.” I nominate Russ Roberts.

Probability and mass shootings

E. Fuller Torrey writes,

there are now some one million people with serious mental illness living among the general population who, 60 years ago, would have been treated in state mental hospitals. Multiple studies have reported that, at any given time, between 40% and 50% of them are receiving no treatment for their mental illness.

He blames de-institutionalization for the problem of mass murders.

Ordinarily, I try to avoid commenting on the stories that dominate the news for short periods of time. I have a lot of doubts about going ahead with this post, but here goes.

I have a problem with every policy proposal that I have seen for dealing with mass shootings. The problem comes from Bayes’ theorem, which reminds us that the probability of A given B is not, in general, the same as the probability of B given A.

When I taught AP statistics, I often used the 9/11 attacks as an example. Nearly all of the terrorists were Saudi nationals. But only a tiny percentage of Saudi nationals are terrorists. So a policy based on the assumption that Saudi nationals are the problem is going to involve a lot of costs relative to potential terrorist acts prevented.
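In symbols, P(terrorist | Saudi) = P(Saudi | terrorist) × P(terrorist) / P(Saudi). Plugging in numbers (only the 15-of-19 figure is real; the base rates are invented for the illustration):

```python
# Hypothetical base rates, chosen only to illustrate Bayes' theorem.
p_saudi_given_terrorist = 15 / 19        # 15 of the 19 hijackers were Saudi nationals
p_terrorist = 19 / 300_000_000           # invented share of terrorists in the relevant population
p_saudi = 0.001                          # invented share of that population that is Saudi

p_terrorist_given_saudi = p_saudi_given_terrorist * p_terrorist / p_saudi
print(p_terrorist_given_saudi)           # about 5e-05: tiny, even though P(Saudi | terrorist) is large
```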

The same thinking applies to guns. Guns account for 100 percent of mass shootings. But only a small percentage of guns are involved in such shootings. If guns provide a benefit to the people who do not use them for mass shootings, then trying to get at mass shootings by going after guns is going to involve a high ratio of costs to benefits.

The same probabilistic reasoning also applies to mental illness. Suppose that people with untreated mental illness account for 100 percent of mass shootings. There are still half a million untreated mentally ill who are not mass shooters. If the cost of mandatory treatment for them is high, then that may not be a good strategy for trying to reduce mass shootings.
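The arithmetic that worries me, with entirely made-up numbers just to show its shape:

```python
# Hypothetical figures for illustration only.
n_untreated = 500_000          # untreated seriously mentally ill (roughly half of one million)
cost_per_person = 20_000       # invented annual cost of mandatory treatment, in dollars
shootings_prevented = 10       # invented number of mass shootings the policy would prevent per year

cost_per_prevented_shooting = n_untreated * cost_per_person / shootings_prevented
print(f"${cost_per_prevented_shooting:,.0f} per prevented shooting")   # $1,000,000,000
```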

Perhaps there are other social benefits to forced institutionalization of those who are mentally ill. That might improve the benefit/cost analysis for such a policy. But even so, it would be difficult to defend from a humanitarian or libertarian viewpoint.

As long as mass shootings remain rare relative to the causal factors that are most often cited, it will be hard to come up with a cost-effective solution. This RAND meta-analysis supports my view.

Ben Thompson writes,

it was on 8chan — which was created after complaints that the extremely lightly-moderated anonymous-based forum 4chan was too heavy-handed — that a suspected terrorist gunman posted a rant explaining his actions before killing 20 people in El Paso. This was the third such incident this year: the terrorist gunmen in Christchurch, New Zealand and Poway, California did the same; 8chan celebrated all of them.

Hence, he supports censorship of 8chan. I disagree, although I find it a close call. My thoughts:

1. Correlation does not imply causation. The fact that terrorists were active on 8chan could mean that 8chan attracts individuals who are inclined toward violence, but it does not necessarily increase their propensity toward violence.

2. As a test, you may substitute “radical Islamic preacher” or “Palestinian primary school that teaches kids to hate Jews” for 8chan, and see whether you support the step of absolutely shutting down their right to speak. Maybe you do. I do not.

3. Probably the most effective way to use censorship to reduce mass shootings would be to refuse to allow the media to cover them. As it is, the mainstream media are giving mass shooters the notoriety that seems to be their main motivation. Sometimes there are suggestions that the media voluntarily exercise restraint, for instance in not naming the shooter. But as far as I know nobody wants to impose censorship on the mainstream media, even though they appear to be at least as guilty of aiding and abetting mass shooters as are the dark-web media.

4. Ultimately, censorship gives power to the censors. As time passes, the trend will be for censors to exercise more and more power with less and less wisdom, objectivity, and discretion. I think it is best to stay off the slippery slope altogether.

UPDATE: a commenter points to a very similar post by Craig Pirrong, aka the Streetwise Professor. Note that the first comment on that post repeats my point 3.

The genes that did not matter

For predicting depression. The authors of this study report

The implication of our study, therefore, is that previous positive main effect or interaction effect findings for these 18 candidate genes with respect to depression were false positives. Our results mirror those of well-powered investigations of candidate gene hypotheses for other complex traits, including those of schizophrenia and white matter microstructure.

Read Scott Alexander’s narrative about their findings.

As I understand it, a bunch of old studies looked at one gene at a time in moderate samples and found significant effects. This study looks at many genes at the same time in very large samples and finds that no one gene has significant effects.

The results are not reported in a way that I can clearly see what is happening, so the following is speculative:

1. It is possible that the prior reports of a significant association of a particular gene with greater incidence of depression are due to specification searches (trying out different “control” variables until you find a set that produces “significant” results).

2. It is possible that publication bias meant that although many attempts by other researchers to find “significant” results failed, those efforts were not reported.

3. These authors use a different, larger data sample, and perhaps in that sample the incidence of depression could be measured with greater error than in the smaller samples used by previous investigators. Having a larger data sample increases your chance of finding “significant” results, but measurement error reduces your chances of finding “significant” results. The authors are aware of the measurement-error issue and they conduct an exercise intended to show that this could not be the main source of their failure to replicate other studies.

4. If I understand it correctly, previous studies each tended to focus on a small number of genes, perhaps just one. This study includes many genes at once. If my understanding is correct, then in this new study the authors are now controlling for many more factors.

Think of it this way. Suppose you do a study of cancer incidence, and you find that growing up in a poor neighborhood is associated with a higher cancer death rate. Then somebody comes along and does a study that includes all of the factors that could affect cancer incidence. This study finds that growing up in a poor neighborhood has no effect. A reason that this could happen is that once you control for, say, propensity to smoke, the neighborhood effect disappears.
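A quick simulation of that cancer example (all numbers fabricated): smoking is the true cause, smoking is more common in poor neighborhoods, and the neighborhood “effect” appears only when smoking is left out of the regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 50_000

poor_neighborhood = rng.binomial(1, 0.3, n)
# Smoking is more common in poor neighborhoods, and it is the true causal factor.
smoker = rng.binomial(1, 0.2 + 0.3 * poor_neighborhood)
cancer_risk = 0.05 + 0.10 * smoker + rng.normal(0, 0.02, n)

# Omitting smoking: the neighborhood "effect" looks real.
print(sm.OLS(cancer_risk, sm.add_constant(poor_neighborhood)).fit().params)

# Controlling for smoking: the neighborhood effect disappears.
X = sm.add_constant(np.column_stack([poor_neighborhood, smoker]))
print(sm.OLS(cancer_risk, X).fit().params)
```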

In the case of depression, suppose that the true causal process is for 100 genes to influence depression together. A polygenic score explains, say, 20 percent of the variation in the incidence of depression across a population. Now you go back to an old study that just looks at one gene that happens to be relatively highly correlated with the polygenic score.

In finance, we say that a stock whose movements are highly correlated with those of the overall market is a high-beta stock. The fact that XYZ corporation’s share price is highly correlated with the S&P 500 does not mean that XYZ’s shares are what is causing the S&P to move. Similarly, a “high-beta” gene for depression would not signify causality, if instead a broad index of genes is what contributes to the underlying causal process.
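Here is a sketch of that “high-beta gene” story with fabricated data: 100 genes jointly drive depression; a candidate gene with no direct effect, but correlated with the polygenic index, looks important on its own, and its coefficient collapses once the index is included.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, n_genes = 20_000, 100

# 100 causal genes, each with a small effect; depression is their sum plus noise.
genes = rng.binomial(1, 0.5, size=(n, n_genes))
polygenic_score = genes @ np.full(n_genes, 0.05)
depression = polygenic_score + rng.normal(0, 1, n)

# A candidate gene with zero direct effect, but correlated with the index
# (loosely mimicking linkage with the truly causal genes).
candidate = (polygenic_score + rng.normal(0, 0.25, n) > np.median(polygenic_score)).astype(float)

# On its own, the candidate gene looks like it matters...
print(sm.OLS(depression, sm.add_constant(candidate)).fit().params)

# ...but its coefficient shrinks toward zero once the polygenic index is included.
X = sm.add_constant(np.column_stack([candidate, polygenic_score]))
print(sm.OLS(depression, X).fit().params)
```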

Further comments:

(1) and (2) are fairly standard explanations for a failure to replicate. But Alexander points out that in this case it is not just one or two studies that fail to replicate, but hundreds. That would make this a very, very sobering example.

If (3) is the explanation (i.e., more measurement error in the new study), then the older studies may have merit. It is the new study that is misleading.

If (4) is the explanation, then the “true” model of genes and depression is closer to a polygenic model. The single-gene results reflect correlation with other genes that influence the incidence of depression rather than direct causal effects.

If (4) is correct, then the “new” approach to genetic research, using large samples and looking at many genes at once, should be able to yield better predictions of the incidence of depression than the “old” single-gene, small-sample approach. But neither approach will yield useful information for treatment. The old approach gets you correlation without causation. The new approach results in a causal model that is too complex to be useful for treatment, because too many genes are involved and no one gene suggests any target for intervention.

I thank Russ Roberts for a discussion last week over lunch, without implicating him in any errors in my analysis.