A reminder not to be overly impressed when presented with statistically significant coefficients, from FiveThirtyEight.com.
On his blog, Gilles Saint-Paul comments on the publication process in economics.
Of course I was wrong in all accounts. The publication process in economics is not a publication process, it is a validation process by which we acquire a certain rank in a certain pecking order. Submitting a paper to a journal has nothing to do with research dissemination, it is far more similar to taking an exam or participating in a sports competition. The actual dissemination takes place mostly orally, in seminars and conferences; these seminars and conferences are also important validation events, because they allow authors to signal some of their characteristics that may influence their position in the pecking order, while not being easy to infer from their papers.
Now, when you take an exam as a student, you are graded by your professor, not by a fellow student – who would be a competitor if this exam is actually a contest. …
Yet this is the way our own profession is organized. Each submission is ‘peer reviewed’, that is, it has to be accepted by anonymous referees who happen to be participating in the same beauty contest as the author(s), most often in the same subcategory. At a minimum, as believers of cost-benefit analysis, we should consider that the journal editors and referees themselves perform a cost-benefit analysis when deciding whether or not to publish a paper. I must say that if I apply such a theory to explain my own experience with acceptances and rejections, I easily get an R2 of 80%.
In an Independent Review of BIS Research, Franklin Allen, Charles Bean and José De Gregorio conclude that
… BIS research clearly ‘punches above its weight’ compared to its central bank peers. Finally, the relative performance of the BIS has clearly improved over the past five years, a tribute to the influence of the previous (Steve Cecchetti) and current (Claudio Borio and Hyun Shin) leadership …
They recommend, among other points:
The research programme should have a more clearly defined long-term focus, be less driven by short-term needs, and seek to be more holistic in approach.
The internal culture should be more open to challenge and research should avoid focussing on generating results to support the ‘house view’.
More and more researchers are adopting the programming language Julia.
The Economist reports on research by Paul Smaldino and Richard McElreath indicating that studies in psychology, neuroscience and medicine have low statistical power (the probability of correctly rejecting a false null hypothesis). If, nevertheless, almost all published studies report significant results (i.e., rejections of null hypotheses), then this is suspicious: with low power, even genuine effects should often fail to show up as significant.
Furthermore, Smaldino and McElreath’s research suggests that
the process of replication, by which published results are tested anew, is incapable of correcting the situation no matter how rigorously it is pursued.
With the help of a model of competing research laboratories, Smaldino and McElreath simulate how empirical scientific research evolves. Labs that churn out more new results also tend to produce more false positives; more careful labs try to rule out false positives but publish less. The most “successful” labs, those with the longest publication records, are the ones whose methods are copied by newly founded labs. As a consequence, less careful labs proliferate. Replication, the re-testing of randomly selected published findings, does not stop this process (see the sketch below).
poor methods still won—albeit more slowly. This was true in even the most punitive version of the model, in which labs received a penalty 100 times the value of the original “pay-off” for a result that failed to replicate, and replication rates were high (half of all results were subject to replication efforts).
Smaldino and McElreath conclude that “top-performing laboratories will always be those who are able to cut corners”—even in a world with frequent replication. The Economist concludes that
[u]ltimately, therefore, the way to end the proliferation of bad science is not to nag people to behave better, or even to encourage replication, but for universities and funding agencies to stop rewarding researchers who publish copiously over those who publish fewer, but perhaps higher-quality papers.
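The selection mechanism behind this result can be illustrated with a minimal simulation sketch. The code below is not Smaldino and McElreath’s actual model: lab “effort”, the payoff and test-count functions, and the population size are illustrative assumptions; only the 100-fold penalty and 50% replication rate mirror the figures in the quote above.

```python
import random

# Minimal, illustrative sketch of a selection dynamic among labs.
# Parameters and functional forms are hypothetical, not the authors' calibration.
random.seed(0)

N_LABS = 100          # labs per generation
GENERATIONS = 200     # rounds of "cultural" selection
BASE_RATE = 0.1       # share of tested hypotheses that are actually true
ALPHA = 0.05          # nominal type I error rate of a fully careful lab

def mean_effort(replication_rate=0.0, penalty=100.0):
    # effort in (0, 1]: 1 = most careful; low effort = sloppier but more prolific
    efforts = [random.uniform(0.1, 1.0) for _ in range(N_LABS)]
    for _ in range(GENERATIONS):
        payoffs = []
        for e in efforts:
            power = 0.8 * e                        # careful labs detect true effects more often
            n_tests = 1 + int(10 * (1 - e))        # sloppy labs run more tests per period
            false_pos_rate = min(1.0, ALPHA / e)   # sloppy labs inflate their type I error rate
            payoff = 0.0
            for _ in range(n_tests):
                is_true = random.random() < BASE_RATE
                p_positive = power if is_true else false_pos_rate
                if random.random() < p_positive:
                    payoff += 1.0                  # one publication
                    if not is_true and random.random() < replication_rate:
                        payoff -= penalty          # published false positive fails to replicate
            payoffs.append(payoff)
        # "reproduction": the next generation copies current labs in proportion to payoff
        weights = [max(p, 0.0) + 1e-9 for p in payoffs]
        efforts = random.choices(efforts, weights=weights, k=N_LABS)
    return sum(efforts) / N_LABS

for rate in (0.0, 0.5):
    print(f"replication rate {rate:.1f}: mean effort after selection = {mean_effort(rate):.2f}")
```

Whether low-effort labs still dominate once the penalty is switched on depends on the illustrative parameters chosen here; in the authors’ richer model, as quoted above, poor methods win even in the most punitive version.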
only the top 10–20 percent of a typical graduating class of economics PhD students are likely to accumulate a research record that might lead to tenure at a medium-level research university. … graduating from a top department is neither necessary nor sufficient for becoming a successful research economist. Top researchers come from across the ranks of PhD-granting institutions, and lower-ranked departments produce stars with some regularity, although with lower frequency than the higher-ranked departments. Most of the graduates of even the very highest-ranked departments produce little, if any, published research.
The Economist discussed the article here.
The Economist doubts that science is self-correcting as “many more dodgy results are published than are subsequently corrected or withdrawn.”
Referees do a bad job. Publishing pressure leads researchers to publish their (correct and incorrect) results multiple times. Replication studies are hard and thankless. And everyone seems to be getting the statistics wrong.
A researcher commits a type I error when she incorrectly rejects a hypothesis although it is true (false positive), and a type II error when she incorrectly accepts a hypothesis although it is false (false negative). A good testing procedure minimises the probability of a type II error for a given probability of a type I error; that is, it maximises the power of the test. While employing a test with a power of 80% is considered good practice, actual hypothesis testing often suffers from much lower power. As a consequence, many or even a majority of apparent “results” identified by a test might be wrong, while most of the “non-results” are correctly identified. Quoting from the article:
… consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5%—that is, 45 of them—will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.
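The arithmetic in this example is easy to verify. The short sketch below simply recomputes the quote’s numbers (1,000 hypotheses, 100 of them true, a 5% type I error rate, and power of 0.8 or 0.4); the function name is illustrative.

```python
# Recompute the false-discovery arithmetic from the quoted example.
def positive_results(n_hypotheses=1000, share_true=0.1, alpha=0.05, power=0.8):
    n_true = n_hypotheses * share_true
    n_false = n_hypotheses - n_true
    true_positives = power * n_true        # correctly detected true hypotheses
    false_positives = alpha * n_false      # type I errors among false hypotheses
    total = true_positives + false_positives
    return true_positives, false_positives, false_positives / total

for power in (0.8, 0.4):
    tp, fp, fdr = positive_results(power=power)
    print(f"power {power}: {tp:.0f} true positives, {fp:.0f} false positives, "
          f"{fdr:.0%} of positive results are wrong")
```

It reproduces the figures in the quote: with power 0.8, 45 of 125 positive results (about a third) are false; with power 0.4, 45 of 85 (more than half) are.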