Failure to Replicate: Is Psychology in Crisis?
By Naomi Stapleton, Psychology, 2016
In a landmark study, Brian Nosek’s “Reproducibility Project” found significant results for only 36 percent of the 100 psychology studies it replicated. These results, or perhaps the lack thereof, have left the field in turmoil.
This study has been widely interpreted as an attack on psychology. For many, this confirms long-held suspicions about the legitimacy of psychological science, especially in light of recent reports of data tampering and fraud.
Nosek emphasizes that his intention with the project was not to create controversy about psychology, but rather to open an honest discussion about the research process itself. In all, 270 contributing authors conducted the 100 replications, with guidance and review of the experimental protocols provided by the original research teams. Replication authors chose experiments based on their own expertise, but this did not affect the likelihood of successful replication.
Many in the field support Nosek in these intentions, saying that replication is a natural and necessary part of science. One of the study’s coauthors, Cody Christopherson from Southern Oregon University, said, “This project is not evidence that anything is broken. It’s impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.” Lisa Feldman-Barrett, a professor at Northeastern University, echoed this idea in her recent New York Times op-ed. She argued that psychology is not in crisis because “failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.” In her view, a failed replication pushes researchers to refine the theory being tested. If both versions of a study are carefully designed and conducted, then the small differences between them may themselves point to the deciding factor.
The question remains whether psychological studies are in fact being appropriately designed and carefully conducted. Brian Earp, a research associate in science and ethics at the University of Oxford, argued that Feldman-Barrett was making a big leap in assuming this to be true, and he called for an immediate review of what he sees as a seriously flawed process. Outright data fraud clearly must not be tolerated, but Earp highlighted several lower-profile yet critical problems as well. The first is publication bias: journals strongly favor exciting, statistically significant results that confirm the researchers’ hypothesis (a hypothesis that may have been settled on only in hindsight). This pressure for headline-worthy publications means that, until now, replications have rarely been conducted at all.
This pressure also drives researchers to try as many statistical methods as it takes to tease significance out of muddy results, a practice known as “p-hacking.” Some statisticians now question the reliance on p-values altogether, and in particular the 0.05 threshold that conventionally separates “significant” from “non-significant” results. When the statistician Ronald Fisher introduced p-values in the 1920s, he intended them as an informal, initial guide to interpreting results. Steven Goodman from Stanford University agrees, saying: “The numbers are where the scientific discussion should start, not end.” He argues that a single arbitrary cutoff is too crude a basis on which to declare a result scientifically significant.
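To see why p-hacking matters, consider a minimal, hypothetical simulation (not part of the original study; the group sizes, the ten outcome measures, and the use of NumPy and SciPy are illustrative assumptions). It generates data with no real effect, then “tests everything and keeps the best p-value,” showing how the false-positive rate balloons far past the nominal 5 percent.

```python
# Hypothetical sketch of p-hacking: all numbers and tools are assumptions
# for illustration, not details from the Reproducibility Project.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 1000   # simulated studies
n_per_group = 30       # participants per group
n_outcomes = 10        # outcome measures the researcher could choose to test

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution: there is no true effect.
    group_a = rng.normal(size=(n_outcomes, n_per_group))
    group_b = rng.normal(size=(n_outcomes, n_per_group))
    # "p-hack": run a t-test on every outcome and keep the smallest p-value.
    p_values = [stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)]
    if min(p_values) < 0.05:
        false_positives += 1

print(f"Null studies reporting a 'significant' effect: "
      f"{false_positives / n_experiments:.0%}")
# With ten outcomes tested, roughly 40% of these no-effect studies clear the
# 0.05 bar, versus the nominal 5% if a single outcome were specified in advance.
```

The point of the sketch is simply that flexibility in analysis, multiplied across many outcomes or methods, manufactures “significance” from noise, which is exactly the wiggle room that pre-registration is meant to close.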
Arguably, these issues plague many areas of scientific research, not only psychology. A 2011 report in Nature found that retractions of published papers had increased roughly tenfold over the preceding decade. Efforts to address these problems before publication are under way across scientific disciplines. For example, the blog Retraction Watch documents and publicizes the removal of papers from journals, making the correction process more transparent. Several journals now publish “pre-registered” studies, which require the research team to submit their hypothesis and planned analyses for approval before testing them. Dorothy Bishop from the University of Oxford strongly supports making pre-registration mandatory. She explained, “Simply put, if you are required to specify in advance what your hypothesis is and how you plan to test it, then there is no wiggle room for cherry-picking the most eye-catching results after you have done the study.”
In truth, Nosek’s Reproducibility Project is the first of its kind; we do not yet know what its results mean for the field. “Science needs to involve taking risks and pushing frontiers, so even an optimal science will generate false positives,” said Sanjay Srivastava from the University of Oregon. “If 36 percent of replications are getting statistically significant results, it is not at all clear what that number should be.”
Science (2015). DOI: 10.1126/science.aac4716.