The Economist reports about “outcome switching”—promoting empirical evidence collected in the context of a specific hypothesis test (that didn’t succeed) as support for a different hypothesis.
Outcome switching is a good example of the ways in which science can go wrong. This is a hot topic at the moment, with fields from psychology to cancer research going through a “replication crisis”, in which published results evaporate when people try to duplicate them.
The Economist doubts that science is self-correcting as “many more dodgy results are published than are subsequently corrected or withdrawn.”
Referees do a bad job. Publishing pressure leads researchers to publish their (correct and incorrect) results multiple times. Replication studies are hard and thankless. And everyone seems to be getting the statistics wrong.
A researcher suffers from a type I error when she incorrectly rejects an hypothesis although it is true (false positive); and from a type II error when she incorrectly accepts an hypothesis although it is wrong (false negative). A good testing procedure minimises the type II error given a specified type I error that is, it maximises the power of the test. While employing a test with a power of 80% is considered good practice actual hypothesis testing often suffers from much lower power. As a consequence, many or even a majority of apparent “results” identified by a test might be wrong while most of the “non-results” are correctly identified. Quoting from the article:
… consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5%—that is, 45 of them—will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.