Kevin Lang argues in NBER working paper 31666:
When economists analyze a well-conducted RCT or natural experiment and find a statistically significant effect, they conclude the null of no effect is unlikely to be true. But how frequently is this conclusion warranted? The answer depends on the proportion of tested nulls that are true and the power of the tests. I model the distribution of t-statistics in leading economics journals. Using my preferred model, 65% of narrowly rejected null hypotheses and 41% of all rejected null hypotheses with |t|<10 are likely to be false rejections. For the null to have only a .05 probability of being true requires a t of 5.48.
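The logic behind numbers like Lang's can be illustrated with Bayes' rule. The sketch below is a deliberately simple two-point model, not Lang's estimated mixture: it assumes the t-statistic is N(0,1) under the null and N(delta,1) under the alternative, with delta ≈ 2.8 (the noncentrality that gives 80% power at the 5% level) and a prior share of true nulls of one half. Both parameter choices are illustrative assumptions.

```python
import math

def normal_pdf(x, mean=0.0):
    """Density of a normal with unit variance at x."""
    return math.exp(-((x - mean) ** 2) / 2) / math.sqrt(2 * math.pi)

def prob_null_given_t(t, prior_true, delta):
    """Posterior probability that the null is true after observing t.
    Two-point illustration (an assumption, not Lang's model):
    t ~ N(0,1) under the null, t ~ N(delta,1) under the alternative."""
    f0 = normal_pdf(t)                 # likelihood of t under the null
    f1 = normal_pdf(t, mean=delta)     # likelihood of t under the alternative
    return prior_true * f0 / (prior_true * f0 + (1 - prior_true) * f1)

# A t just at the 1.96 cutoff leaves a substantial chance the null is true;
# a much larger t is needed to drive that probability toward zero.
print(prob_null_given_t(1.96, prior_true=0.5, delta=2.8))
print(prob_null_given_t(5.48, prior_true=0.5, delta=2.8))
```

The qualitative point survives any reasonable parameter choice: a "narrow" rejection at t ≈ 1.96 moves the posterior far less than a rejection at t ≈ 5.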
A reminder not to be overly impressed when presented with statistically significant coefficients, from FiveThirtyEight.com.
In separate blog posts, Russ Roberts and John Cochrane have called for humility on the part of economists. Asking “What do economists know?,” Roberts and Cochrane point out—correctly—that economics is not as strong on quantification as some economists and many pseudo-economists pretend, and as is often expected of economists.
Economics is not the same as applied statistics, although the latter can help clarify, at least to some extent, the empirical relevance of economic theories. Correlation does not imply causation. Identifying assumptions that aim to establish causal claims based on correlation analysis deserve skepticism, especially when the process that led to the empirical results remains in the dark (see notes on replicability here, here, here).
Sound economics heavily relies on consistency checking, or bullshit detection in Cochrane’s words. It insists on keeping accounting identities in mind and never forgetting about incentives. And it is acutely aware of the fact that good models are nothing more than consistent stories—but at least they are consistent stories.
The Economist recounts the reasons why the quality of GDP measurement is lacking and why GDP measures cannot answer all the questions they are used to address.
The Economist doubts that science is self-correcting as “many more dodgy results are published than are subsequently corrected or withdrawn.”
Referees do a bad job. Publishing pressure leads researchers to publish their (correct and incorrect) results multiple times. Replication studies are hard and thankless. And everyone seems to be getting the statistics wrong.
A researcher commits a type I error when she incorrectly rejects a hypothesis although it is true (a false positive), and a type II error when she incorrectly accepts a hypothesis although it is false (a false negative). A good testing procedure minimises the type II error for a specified type I error; that is, it maximises the power of the test. While employing a test with a power of 80% is considered good practice, actual hypothesis testing often suffers from much lower power. As a consequence, many or even a majority of the apparent “results” identified by a test may be wrong, even while most of the “non-results” are correctly identified. Quoting from the article:
… consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5%—that is, 45 of them—will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.
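The arithmetic in the quote is easy to reproduce. A minimal sketch, using the article's own numbers (1,000 hypotheses, 100 of them true, a 5% significance level):

```python
def positive_breakdown(n_hypotheses, true_share, power, alpha):
    """Count true and false positives when testing many hypotheses at once."""
    n_true = n_hypotheses * true_share      # hypotheses that are actually true
    n_false = n_hypotheses - n_true         # hypotheses that are actually wrong
    true_pos = power * n_true               # true effects the test detects
    false_pos = alpha * n_false             # type I errors that look like results
    share_specious = false_pos / (true_pos + false_pos)
    return true_pos, false_pos, share_specious

# Power 0.8: 80 true positives, 45 false positives -> a third of results specious.
print(positive_breakdown(1000, 0.1, 0.8, 0.05))
# Power 0.4: 40 true positives, still 45 false positives -> more than half specious.
print(positive_breakdown(1000, 0.1, 0.4, 0.05))
```

Note that lowering power shrinks only the true positives; the 45 false positives depend on the significance level alone, which is why the specious share rises so sharply.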