In this article, I present five reasons why working reproducibly pays off in the long run and is in the self-interest of every ambitious, career-oriented scientist.
Reason number 1: reproducibility helps to avoid disaster
“How bright promise in cancer testing fell apart” was the headline of a New York Times article published in summer 2011 [1] that highlighted the work of Keith Baggerly and Kevin Coombes, two biostatisticians at M.D. Anderson Cancer Center. Baggerly and Coombes had exposed lethal data analysis problems in a series of high-impact papers by breast cancer researchers from Duke University [2].
The issues discovered by Baggerly and Coombes could have easily been spotted by any co-author before submitting the paper. The data sets are not huge and can easily be spot-checked on a standard laptop. You do not have to be a statistics wizard to realize that patient numbers differ, labels got swapped or samples appear multiple times with conflicting annotations in the same data set. Why did no one notice these issues before it was too late? Because the data and analysis were not transparent and required forensic bioinformatics to untangle [2].
For me, this example provides a powerful motivation to be more transparent and reproducible in my own work. Even smaller disasters can be embarrassing. Here is an example from my own research. Our experimental collaboration partners were validating a pathway model that we had generated computationally. When writing the paper, however, we hit a crucial roadblock: no matter how hard we tried, we could not reproduce our initial pathway model. Maybe the data had changed, maybe the code was different, or maybe we just couldn’t remember the parameter settings of our method correctly. Had we published this result, we would not have been able to demonstrate how the validated hypothesis was generated from the initial data. We would have published a miracle.
This experience showed me two things. First of all, a project is more than a beautiful result. You need to record in detail how you got there. And second, starting to work reproducibly early on will save you time later. We wasted years of our own and our collaborators’ time by not being able to reproduce our own results. All of this could have been avoided by keeping better track of how the data and analyses evolved over time.
Reason number 2: reproducibility makes it easier to write papers
Transparency in your analysis makes writing papers much easier. For example, in a dynamic document (Box 1) all results automatically update when the data are changed. You can be confident your numbers, figures and tables are up-to-date. Additionally, transparent analyses are more engaging: more eyes can look over them, and it is much easier to spot mistakes.
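To make the idea of a dynamic document concrete, here is a minimal sketch in Python. It is not how the tools in Box 1 work internally; it only illustrates the principle that every reported number is recomputed from the data whenever the text is rendered. The template sentence, the function name `render_report` and the measurements are all hypothetical.

```python
from statistics import mean
from string import Template

# Hypothetical report sentence; the placeholders are filled from the data at render time.
REPORT = Template("We analyzed $n samples with a mean expression of $avg.")

def render_report(measurements):
    """Recompute every reported number from the data and fill the template."""
    return REPORT.substitute(n=len(measurements), avg=f"{mean(measurements):.2f}")

# Made-up measurements: the second call simulates the data set growing by one sample.
first_draft = render_report([2.0, 4.0, 6.0])
revised = render_report([2.0, 4.0, 6.0, 8.0])

print(first_draft)
print(revised)
```

Because the numbers are never typed into the text by hand, the revised sentence cannot drift out of sync with the revised data.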
Here is another example from my own work. In a different project [3], a collaborating clinician and I were discussing why some survival results in a multi-centre study did not come out as expected. Because all the data and analysis code were available to us in an easy-to-read file, we could explore the question ourselves. By simply generating a table of the variable describing tumor stage, we were able to spot the problem: what we expected to see were the stage numbers 1–4, what we saw were entries like ‘XXX’, ‘Fred’ and ‘999’. The people who had given us the data had apparently done a poor job in curating it. Looking into the data ourselves was much quicker and more engaging than going to the postdoc working on the project and saying, ‘Figure this out for us’. My collaborator and I are much too busy to spend too much time on low-level data cleaning, and without the well documented analysis we would not have been able to contribute; but because we had very transparent data and code, it cost us just five minutes to spot a mistake.
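The kind of check we ran can be sketched in a few lines. The following Python example is an illustration under stated assumptions, not the code from the project: the values in `stages` are made up (the invalid entries mirror the ones quoted above), and the helper names `tabulate` and `unexpected_values` are hypothetical.

```python
from collections import Counter

# Hypothetical tumor stage column; the invalid entries mirror those quoted in the text.
stages = ["1", "2", "XXX", "3", "4", "Fred", "2", "999", "1"]

# The only codes we expect to see: stages 1 to 4.
VALID_STAGES = {"1", "2", "3", "4"}

def tabulate(values):
    """Count how often each distinct value occurs."""
    return Counter(values)

def unexpected_values(values, valid):
    """Return the counts of all entries that are not valid codes."""
    return {v: n for v, n in tabulate(values).items() if v not in valid}

# Print the frequency table and flag anything outside the expected codes.
table = tabulate(stages)
for value, count in sorted(table.items()):
    flag = "" if value in VALID_STAGES else "  <-- unexpected"
    print(f"{value:>5} {count:>3}{flag}")

bad = unexpected_values(stages, VALID_STAGES)
```

A frequency table like this takes seconds to produce, which is why having the data and code at hand made the problem so quick to find.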
Reason number 3: reproducibility helps reviewers see it your way
Most of us like to moan about peer review. One of the complaints I hear most often is: the reviewers didn’t even read the paper and had no idea what we were really doing.
This starkly contrasts with my experience during the review process of a recent paper [4], for which we had made the data and well-documented code easily accessible to the reviewers. One of the reviewers proposed a slight change to some analyses, and because he had access to the complete analysis, he could directly try out his ideas on our data and see how the results changed. The reviewer was completely on board; the only thing left to discuss was the best way to analyze the data. This is exactly how a constructive review should be. And it would have been impossible without a transparent and reproducible presentation of our analyses.
Reason number 4: reproducibility enables continuity of your work
I would be surprised if you hadn’t heard the following remarks before; maybe you have even said them yourself: “I am so busy, I can’t remember all the details of all my projects” or “I did this analysis 6 months ago. Of course I can’t remember all the details after such a long time” or “My principal investigator (PI) said I should continue the project of a previous postdoc, but that postdoc is long gone and hasn’t saved any scripts or data”.
Think about it: all of these issues can be solved by documenting data and code well and by making them easily accessible. This point is particularly important for PIs who work on challenging long-term projects. How can you ensure the continuity of work in your lab if progress is not documented reproducibly? In my own group, I don’t even discuss results with students if they are not documented well. No proof of reproducibility, no result!
Reason number 5: reproducibility helps to build your reputation
For several papers, we have made our data, code and analyses available as an Experiment Package on Bioconductor [5]. When I came up for tenure, I cited all of these packages as research output of my lab. Generally, making your analyses available in this way will help you to build a reputation for being an honest and careful researcher. Should there ever be a problem with one of your papers, you will be in a very good position to defend yourself and to show that you reported everything in good faith.
The recent paper published in Science, “Scientific standards. Promoting an open research culture” [6], summarizes eight standards and three levels of reproducibility guidelines. Using tools such as R and knitr (Box 1) will make it likely that you comply easily with the highest-level guideline, and again, that is good for your reputation.