June 16, 2016 / Damien Irving

How to write a reproducible paper

As mentioned in a previous call for volunteers, I dedicated part of my PhD to proposing a solution to the reproducibility crisis in modern computational research. In a nutshell, the crisis has arisen because most papers do not make the data and code underpinning their key findings available, which means it is impossible to replicate and verify the results. A good amount of progress has been made with respect to documenting and publishing data in recent years, so I specifically focused on software/code. I looked at many aspects of the issue, including the reasons why people don’t publish their code, computational best practices, and journal publishing standards, much of which is covered in an essay I published with the Bulletin of the American Meteorological Society. That essay is an interesting read if you’ve got the time (in my humble opinion!), but for this post I wanted to cut to the chase and outline how one might go about writing a reproducible paper.

On the surface, the reproducible papers I wrote as part of my PhD (i.e. as a kind of proof of concept; see here and here) look similar to any other paper. The only difference is a short computation section placed within the traditional methods section of the paper. That computation section begins with a brief, high-level summary of the major software packages that were used, with citations provided to any papers dedicated to documenting that software. Authors of scientific software are increasingly publishing overviews of their software in journals like the Journal of Open Research Software and Journal of Open Source Software, so it’s important to give them the academic credit they deserve.

Following this high level summary, the computation section points the reader to three key supplementary items:

  1. A more detailed description of the software used
  2. A copy of any code written by the authors to produce the key results
  3. A description of the data processing steps taken in producing each key result (i.e. a step-by-step account of how the software and code were actually used)

I’ll look at each of these components in turn, considering both the bare minimum you’d need to do in order to be reproducible and the extra steps you could take to make things easier for the reader.

 

1. Software description

While the broad software overview provided in the computation section is a great way to give academic credit to those who write scientific software, it doesn’t provide sufficient detail to recreate the software environment used in the study. In order to provide this level of detail, the bare minimum you’d need to do is follow the advice of the Software Sustainability Institute. They suggest documenting the name, version number, release date, institution and DOI or URL of each software package, which could be included in a supplementary text file.
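
To give a concrete sense of what one entry in that file might look like, here’s an illustrative example (the version number is a placeholder and the date is deliberately left generic, so don’t treat these as details of a real release):

    Software: Climate Data Operators (CDO)
    Version: 1.7.1
    Release date: YYYY-MM-DD
    Institution: Max Planck Institute for Meteorology
    DOI/URL: <link to the release or the project homepage>

One such entry per package is usually enough for a knowledgeable reader to reconstruct your environment by hand.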

While such a list means your environment is now technically reproducible, you’ve left it up to the reader to figure out how to get all those software packages and libraries installed and playing together nicely. In some cases this is fine (e.g. it might be easy enough for a reader to install the handful of MATLAB toolboxes you used), but in other cases you might want to save the reader (and your future self) the pain of software installation by making use of a tool that can automatically install a specified software environment. The simplest of these is conda, which I discussed in detail in a previous post. It is primarily used for the management of Python packages, but can be used for other software as well. I install my complete environment with conda, which includes non-Python command line utilities like the Climate Data Operators, and then make that environment openly available on my channel at anaconda.org. Beyond conda there are more complex tools like Docker and Nix, which can literally install your entire environment (down to the precise operating system) on a different machine. There’s lots of debate (e.g. here) about the potential and suitability of these tools as a solution to reproducible research, but it’s fair to say that their complexity puts them out of reach for most weather and climate scientists.
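
As a rough sketch of how the conda approach works (the environment and package names below are placeholders; substitute whatever your analysis actually uses):

    $ conda create -n mypaper python numpy matplotlib   # create a named environment
    $ conda install -n mypaper -c conda-forge cdo       # add non-Python tools from conda-forge
    $ conda env export -n mypaper > environment.yml     # record the full environment to a file
    $ conda env create -f environment.yml               # a reader recreates it from that file

The exported environment.yml file pins exact version numbers, so it can double as the detailed software description discussed above.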

 

2. Code

The next supplementary item you’ll need to provide is a copy of the code you wrote to execute those software packages. For a very simple analysis that might consist of a single script for each key result (e.g. each figure), but it’s more likely to consist of a whole library/collection of code containing many interconnected scripts. The bare minimum you’d need to do to make your paper reproducible is to make an instantaneous snapshot of that library (i.e. at the time of paper submission or acceptance) available as supplementary material.

As with the software description, this bare minimum ensures your paper is reproducible, but it leaves a few problems for both you and the reader. The first is that in order to provide an instantaneous snapshot, you’d need to make sure that all your results were produced with the latest version of your code library. In many cases this isn’t practical (e.g. Figure 3 might have been generated five months ago and you don’t want to re-run the whole time-consuming process), so you’ll probably want to manage your code library with a version control system like Git, Subversion or Mercurial, which makes it easy to retrieve previous versions. If you’re using a version control system, you might as well hook it up to an external hosting service like GitHub or Bitbucket, so your code is backed up elsewhere. If you make your GitHub or Bitbucket repository publicly accessible, readers can view the very latest version of your code (in case you’ve made any improvements since publishing the paper) and can propose updates or bug fixes via the commenting, review and code browsing features those websites provide.
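
To sketch what this looks like with Git (the tag and file names here are hypothetical):

    $ git add plot_figure3.py                               # stage a new or modified script
    $ git commit -m "Add script for Figure 3"
    $ git tag -a paper-v1 -m "Code as at paper submission"
    $ git archive --format=tar.gz -o code-snapshot.tar.gz paper-v1

The tagged archive is what you’d submit as supplementary material, while the repository itself retains the full history behind every result.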

 

3. Data processing steps

A code library and software description on their own are not much use to a reader; they also need to know how that code was used in generating the results presented. The simplest way to do this is to make your scripts executable at the command line, so you can then keep a record of the series of command line entries required to produce a given result. Two of the best-known data analysis tools in the weather and climate sciences – the netCDF Operators (NCO) and Climate Data Operators (CDO) – do exactly this, storing that record in the global attributes of the output netCDF file. I’ve written a Software Carpentry lesson showing how to generate these records yourself, including keeping track of the corresponding version control revision number, so you know exactly which version of the code was executed.
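
For scripts you’ve written yourself, you can mimic that behaviour by appending each command, together with the current revision number, to a simple log file. A minimal sketch, assuming a hypothetical plotting script managed with Git:

    $ python plot_figure3.py monthly.nc figure3.png
    $ echo "$(date): python plot_figure3.py monthly.nc figure3.png (git revision $(git rev-parse --short HEAD))" >> figure3.log

Submitting one such log file per key result is the bare minimum referred to below.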

As before, while these bare minimum log files ensure that your workflow is reproducible, they are not particularly comprehensible. Manually recreating workflows from these log files would be a tedious and time-consuming process, even for moderately complex analyses. To make things a little easier for the reader (and your future self), it’s a good idea to include a README file in your code library explaining the sequence of commands required to produce common/key results. You might also provide a Makefile that automatically builds and executes common workflows (Software Carpentry have a nice lesson on that too). Beyond that, the options get more complex, with workflow management packages like VisTrails providing a graphical interface that allows users to drag and drop the various components of their workflow.
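
To give a flavour of the Makefile approach (the file and script names are hypothetical), each rule simply lists a target, the files it depends on, and the command that produces it:

    # Hypothetical two-step workflow: raw data -> monthly means -> figure
    figure3.png: monthly.nc plot_figure3.py
            python plot_figure3.py monthly.nc figure3.png

    monthly.nc: input.nc
            cdo monmean input.nc monthly.nc

Running make figure3.png then rebuilds only the steps whose inputs have changed since the last run.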

 

Summary

In order to ensure that your research is reproducible, you need to add a short computation section to your papers. That section should cite the major software packages used in your work, before linking to three key supplementary items: (1) a description of your software environment, (2) a copy of your code library and (3) details of the data processing steps taken in producing each key result. The bare minimum you’d need to do for these supplementary items is summarised in the table below, along with extension options that will make life easier for both the reader and your future self.

If you can think of other extension options to include in this summary, please let me know in the comments below!

 

Software description
  Minimum: Document the name, version number, release date, institution and DOI or URL of each software package
  Extension: Provide a conda environment.yml file, or use Docker / Nix

Code library
  Minimum: Provide a copy of your code library
  Extension: Version control that library and host it in a publicly accessible repository on GitHub or Bitbucket

Processing steps
  Minimum: Provide a separate log file for each key result
  Extension: Include a README file (and possibly a Makefile) in the code library; provide output (e.g. a flowchart) from a workflow management system like VisTrails

Comments
  1. Rebecca Orrison / Sep 29 2016 05:18

    Hello – pre-grad school climate scientist here.

I read through this post as well as your previous call for volunteers. Can you point me to some information on the current status of reproducibility? I’ve recently seen work that I would say fails a reasonable standard, and I’d be interested in your perspective on collective efforts, particularly in the climate sciences.

    • Damien Irving / Sep 29 2016 08:40

Hi, Rebecca. In my BAMS essay (http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-15-00010.1) I cite a few reproducible papers in the climate sciences (see below), but the reality is that essentially all climate science papers are not reproducible, because none of them provide their code. As a community we should be working hard to fix this issue as a matter of urgency, but instead it’s barely on the radar.

      Irving, D., & Simmonds, I. (2015). A Novel Approach to Diagnosing Southern Hemisphere Planetary Wave Activity and Its Influence on Regional Climate Variability. Journal of Climate, 28(23), 9041–9057. http://doi.org/10.1175/JCLI-D-15-0287.1

      Stevens, B. (2015). Rethinking the lower bound on aerosol radiative forcing. Journal of Climate, 28(12), 4794–4819. http://doi.org/10.1175/JCLI-D-14-00656.1

      Irving, D., & Simmonds, I. (2016). A New Method for Identifying the Pacific–South American Pattern and Its Influence on Regional Climate Variability. Journal of Climate, 29(17), 6109–6125. http://doi.org/10.1175/JCLI-D-15-0843.1

