June 16, 2016 / Damien Irving

How to write a reproducible paper

As mentioned in a previous call for volunteers, I dedicated part of my PhD to proposing a solution to the reproducibility crisis in modern computational research. In a nutshell, the crisis has arisen because most papers do not make the data and code underpinning their key findings available, which means it is impossible to replicate and verify the results. A good amount of progress has been made with respect to documenting and publishing data in recent years, so I specifically focused on software/code. I looked at many aspects of the issue including the reasons why people don’t publish their code, computational best practices and journal publishing standards, much of which is covered in an essay I published with the Bulletin of the American Meteorological Society. That essay is an interesting read if you’ve got the time (in my humble opinion!), but for this post I wanted to cut to the chase and outline how one might go about writing a reproducible paper.

On the surface, the reproducible papers I wrote as part of my PhD (i.e. as a kind of proof of concept; see here and here/here) look similar to any other paper. The only difference is a short computation section placed within the traditional methods section of the paper. That computation section begins with a brief, high-level summary of the major software packages that were used, with citations provided to any papers dedicated to documenting that software. Authors of scientific software are increasingly publishing overviews of their software in journals like the Journal of Open Research Software and Journal of Open Source Software, so it’s important to give them the academic credit they deserve.

Following this high level summary, the computation section points the reader to three key supplementary items:

  1. A more detailed description of the software used
  2. A copy of any code written by the authors to produce the key results
  3. A description of the data processing steps taken in producing each key result (i.e. a step-by-step account of how the software and code were actually used)

I’ll look at each of these components in turn, considering both the bare minimum you’d need to do in order to be reproducible and the extra steps you could take to make things easier for the reader.

 

1. Software description

While the broad software overview provided in the computation section is a great way to give academic credit to those who write scientific software, it doesn’t provide sufficient detail to recreate the software environment used in the study. In order to provide this level of detail, the bare minimum you’d need to do is follow the advice of the Software Sustainability Institute. They suggest documenting the name, version number, release date, institution and DOI or URL of each software package, which could be included in a supplementary text file.
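
If you want to generate the bare bones of such a list programmatically, a few lines of Python will do it. The sketch below is just an illustration (the package list and output file name are hypothetical, and the release date, institution and DOI or URL of each package would still need to be added by hand):

import sys

import numpy
import matplotlib

# Record the version of Python and of each key package in a supplementary text file
packages = {'numpy': numpy.__version__, 'matplotlib': matplotlib.__version__}

with open('software_description.txt', 'w') as outfile:
    outfile.write('Python {}\n'.format(sys.version.split()[0]))
    for name, version in packages.items():
        outfile.write('{} {}\n'.format(name, version))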

While such a list means your environment is now technically reproducible, you’ve left it up to the reader to figure out how to get all those software packages and libraries installed and playing together nicely. In some cases this is fine (e.g. it might be easy enough for a reader to install the handful of MATLAB toolboxes you used), but in other cases you might want to save the reader (and your future self) the pain of software installation by making use of a tool that can automatically install a specified software environment. The simplest of these is conda, which I discussed in detail in a previous post. It is primarily used for the management of Python packages, but can be used for other software as well. I install my complete environment with conda, which includes non-Python command line utilities like the Climate Data Operators, and then make that environment openly available on my channel at anaconda.org. Beyond conda there are more complex tools like Docker and Nix, which can literally install your entire environment (down to the precise operating system) on a different machine. There’s lots of debate (e.g. here) about the potential and suitability of these tools as a solution to reproducible research, but it’s fair to say that their complexity puts them out of reach for most weather and climate scientists.

 

2. Code

The next supplementary item you’ll need to provide is a copy of the code you wrote to execute those software packages. For a very simple analysis that might consist of a single script for each key result (e.g. each figure), but it’s more likely to consist of a whole library/collection of code containing many interconnected scripts. The bare minimum you’d need to do to make your paper reproducible is to make an instantaneous snapshot of that library (i.e. at the time of paper submission or acceptance) available as supplementary material.

As with the software description, this bare minimum ensures your paper is reproducible, but it leaves a few problems for both you and the reader. The first is that in order to provide an instantaneous snapshot, you’d need to make sure that all your results were produced with the latest version of your code library. In many cases this isn’t practical (e.g. Figure 3 might have been generated five months ago and you don’t want to re-run the whole time consuming process), so you’ll probably want to manage your code library with a version control system like Git, Subversion or Mercurial, so you can easily access previous versions. If you’re using a version control system you might as well hook it up to an external hosting service like GitHub or Bitbucket, so you’ve got your code backed up elsewhere. If you make your GitHub or Bitbucket repository publicly accessible then readers can view the very latest version of your code (in case you’ve made any improvements since publishing the paper), as well as submit proposed updates or bug fixes via the useful interface (which includes commenting, chat and code viewing features) that those websites provide.

 

3. Data processing steps

A code library and software description on their own are not much use to a reader; they also need to know how that code was used in generating the results presented. The simplest way to do this is to make your scripts executable at the command line, so you can then keep a record of the series of command line entries required to produce a given result. Two of the most well known data analysis tools in the weather and climate sciences – the netCDF Operators (NCO) and Climate Data Operators (CDO) – do exactly this, storing that record in the global attributes of the output netCDF file. I’ve written a Software Carpentry lesson showing how to generate these records yourself, including keeping track of the corresponding version control revision number, so you know exactly which version of the code was executed.
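
To give a flavour of what that lesson covers, here's a minimal Python sketch of how a script might write such a log entry (the output file name is hypothetical, and the script is assumed to live in a Git repository): it records a time stamp, the command line entry and the current Git revision.

import datetime
import subprocess
import sys

# Time stamp and command line entry, in the style used by NCO and CDO
entry = '{}: {}'.format(
    datetime.datetime.now().strftime('%a %b %d %H:%M:%S %Y'),
    ' '.join(sys.argv))

# Append the current Git revision so the exact version of the code is known
git_hash = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('utf-8').strip()
entry += ' (Git hash: {})'.format(git_hash)

with open('figure3.log', 'a') as logfile:
    logfile.write(entry + '\n')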

As before, while these bare minimum log files ensure that your workflow is reproducible, they are not particularly comprehensible. Manually recreating workflows from these log files would be a tedious and time consuming process, even for just moderately complex analyses. To make things a little easier for the reader (and your future self), it’s a good idea to include a README file in your code library explaining the sequence of commands required to produce common/key results. You might also provide a Makefile that automatically builds and executes common workflows (Software Carpentry have a nice lesson on that too). Beyond that the options get more complex, with workflow management packages like VisTrails providing a graphical interface that allows users to drag and drop the various components of their workflow.

 

Summary

In order to ensure that your research is reproducible, you need to add a short computation section to your papers. That section should cite the major software packages used in your work, before linking to three key supplementary items: (1) a description of your software environment, (2) a copy of your code library and (3) details of the data processing steps taken in producing each key result. The bare minimum you’d need to do for these supplementary items is summarised in the table below, along with extension options that will make life easier for both the reader and your future self.

If you can think of other extension options to include in this summary, please let me know in the comments below!

 

Software description
  Minimum: Document the name, version number, release date, institution and DOI or URL of each software package
  Extension: Provide a conda environment.yml file; use Docker / Nix

Code library
  Minimum: Provide a copy of your code library
  Extension: Version control that library and host it in a publicly accessible code repository on GitHub or Bitbucket

Processing steps
  Minimum: Provide a separate log file for each key result
  Extension: Include a README file (and possibly a Makefile) in the code library; provide output (e.g. a flowchart) from a workflow management system like VisTrails
April 13, 2016 / Damien Irving

Keeping up with Continuum

I’m going to spend the next few hundred characters gushing over a for-profit company called Continuum Analytics. I know that seems a little weird for a blog that devotes much of its content to open science, but stick with me. It turns out that if you want to keep up with the latest developments in data science, then you need to be on top of what this company is doing.

If you’ve heard the name Continuum Analytics before, it’s probably in relation to a widely used Python distribution called Anaconda. In a nutshell, Travis Oliphant (who was the primary creator of NumPy) and his team at Continuum developed Anaconda, gave it away for free to the world, and then built a thriving business around it. Continuum makes its money by providing training, consultation and support to paying customers who use Anaconda (and who are engaged in data science/analytics more generally), in much the same way that RedHat provides support to customers using Linux.

The great thing about companies like RedHat and Continuum is that because their business fundamentally depends on open source software, they contribute a great deal back to the open source community. If you’ve ever been to a SciPy conference (something I would highly recommend), you would have noticed that there are always a few presentations from Continuum staff, whose primary job appears to be to simply work on the coolest open source projects going around. What’s more, the company seems to have a knack for supporting projects that make life much, much easier for regular data scientists (i.e. people who know how to analyse data in Python, but for whom things like system administration and web programming are out of reach). For instance, the projects they support (see the full list here) can help you install software without having to know anything about system admin (conda), create interactive web visualisations without knowing Javascript (bokeh), process data arrays larger than the available RAM without knowing anything about multi-core parallel processing (dask) and even speed up your code without having to resort to a low level language (numba).

Of these examples, the most important achievement (in my opinion) is the conda package manager, which I’ve talked about previously. Once you’ve installed either Anaconda (which comes with 75 of the most popular Python data science libraries already installed) or Miniconda (which essentially just comes with conda and nothing else), you can then use conda to install pretty much any library you’d like with one simple command line entry. That’s right. If you want pandas, just type conda install pandas and it will be there, along with its dependencies, playing nicely with all your other libraries. If you decide you’d like to access pandas from the jupyter notebook, just type conda install jupyter and you’re done. There are about 330 libraries available directly like this and because they are maintained by the Continuum team, they are guaranteed to work.

While this is all really nice, other Python distributions like Canopy also come with a package manager for installing widely used libraries. What sets conda apart is the ease with which the wider community can contribute. If you’ve written a library that you’d like people to be able to install easily, you can write an associated installation package and post it at Anaconda Cloud. For instance, Andrew Dawson (a climate scientist at Oxford) has written eofs, a Python library for doing EOF analysis. Rather than have users of his software mess around installing the dependencies for eofs, he has posted a conda package for eofs at his channel on Anaconda Cloud. Just type conda install -c https://conda.anaconda.org/ajdawson eofs and you’re done; it will install eofs and all its dependencies for you. Some users (e.g. the US Integrated Ocean Observing System) even go a step further and post packages for a wide variety of Python libraries that are relevant to the work they do. This vast archive of community contributed conda packages means there isn’t a single library I use in my daily work that isn’t available via either conda install or Anaconda Cloud. In fact, a problem I often face is that there is more than one installation package for a particular library (i.e. which one do I use? And if I get an error, where should I ask for assistance?). To solve this problem, conda-forge has recently been launched. The idea is that it will house a single instance of every community contributed package, in order to (a) avoid duplication of effort, and (b) make it clear where questions (and suggested updates / bug fixes) should be directed.

The final mind-blowing feature of conda is the ease with which you can manage different environments. Rather than lump all your Python libraries in together, it can be nice to have a clean and completely separate environment for each discrete aspect of the work you do (e.g. I have separate environments for my ocean data analysis, atmosphere data analysis and for testing new libraries). This will sound familiar to anyone who has used virtualenv, but again the value of conda environments is the ease with which the community can share. As an example, I’ve shared the details of my ocean data analysis environment (right down to the precise version of every single Python library). I started by exporting the details of the environment by typing conda env export -n ocean-environment -f blog-example, before posting it to my channel at Anaconda Cloud (conda env upload -f blog-example). Anyone can now come along and recreate that environment on their own computer by typing conda env create damienirving/blog-example (and then source activate blog-example to get it running). This is obviously huge for the reproducibility of my work, so for my next paper I’ll be posting a corresponding conda environment to Anaconda Cloud.

If you want to know more about Continuum, I highly recommend this Talk Python To Me podcast with Travis Oliphant.

January 12, 2016 / Damien Irving

Podcasting comes to weather and climate science

Over the past few years, podcasts have begun to emerge as the next great storytelling platform. The format is open to anyone with a laptop, a microphone, and access to the web, which means it’s kind of like blogging, only your audience isn’t restricted to consuming your content via words on a screen. They can listen to you in the car on the way to work, on the stationary bike at the gym or at any other time a little background noise is needed to pass the time away.

While I’m as excited as the next podcast enthusiast about the new season of Serial, what’s even more exciting is that a number of podcasts for weather and climate science nerds have been launched in recent months. These ones have really caught my ear:

  • Forecast: Climate Conversations with Michael White – a podcast about climate science and climate scientists, hosted by Nature’s editor for climate science
  • Mostly Weather – a team from the MetOffice explores a new, mostly weather based topic each month
  • Climate History Podcast – interviews with people in climate change research, journalism, and policymaking. It is the official podcast of the Climate History Network and the popular website HistoricalClimatology.com
  • The Method – a podcast that tells the stories of what is working in science and what is not. It launches in mid-2016 and sounds right up the alley of this blog
  • (Depending on where you live, a Google Search might also turn up a weekly podcast or two that discusses the current weather in your region)

There are also a number of data science podcasts out there, which can be useful depending on the type of data analysis that you do. I’ve found some of the Talk Python to Me episodes to be very relevant to my daily work.

If you know of any other great weather and climate science podcasts, please share the details in the comments below!

November 5, 2015 / Damien Irving

A call for reproducible research volunteers

Around the time that I commenced my PhD (May 2012… yes, I know I should have finished by now!) there were lots of editorial-style articles popping up in prestigious journals like Nature and Science about the reproducibility crisis in computational research. Most papers do not make the data and code underpinning their key findings available, nor do they adequately specify the software packages and libraries used to execute that code, which means it’s impossible to replicate and verify their results. Upon reading a few of these articles, I decided that I’d try and make sure that the results presented in my PhD research were fully reproducible from a code perspective (my research uses publicly available reanalysis data, so the data availability component of the crisis wasn’t so relevant to me).

While this was an admirable goal, I quickly discovered that despite the many editorials pointing to the problem, I could find very few (none, in fact) regular weather/climate papers that were actually reproducible. (By “regular” I mean papers where code was not the main focus of the work, like it might be in a paper describing a new climate model.) A secondary aim of my thesis therefore became to consult the literature on (a) why people don’t publish their code, and (b) best practices for scientific computing. I would then use that information to devise an approach to publishing reproducible research that reduced the barriers for researchers while also promoting good programming practices.

My first paper using that approach was recently accepted for publication with the Journal of Climate (see the post-print here on Authorea) and the Bulletin of the American Meteorological Society have just accepted an essay I’ve written explaining the rationale behind the approach. In a nutshell, it requires the author to provide three key supplementary items:

  1. A description of the software packages and operating system used
  2. A (preferably version controlled and publicly accessible) code repository, and
  3. A collection of supplementary log files that capture the data processing steps taken in producing each key result

The essay then goes on to suggest how academic journals (and institutions that have an internal review process) might implement this as a formal minimum standard for the communication of computational results. I’ve contacted the American Meteorological Society (AMS) Board on Data Stewardship about this proposed minimum standard (they’re the group who decide the rules that AMS journals impose around data and code availability) and they’ve agreed to discuss it when they meet at the AMS Annual Meeting in January.

This is where you come in. I’d really love to find a few volunteers who would be willing to try and meet the proposed minimum standard when they write their next journal paper. These volunteers could then give feedback on the experience, which would help inform the Board on Data Stewardship in developing a formal policy around code availability. If you think you might like to volunteer, please get in touch!

 

September 4, 2015 / Damien Irving

Managing your data

If you’re working on a project that involves collecting (e.g. from a network of weather stations) or generating (e.g. running a model) data, then it’s likely that one of the first things you did was develop a data management plan. Many funding agencies (e.g. the National Science Foundation) actually formally require this, and such plans usually involve outlining your practices for collecting, organising, backing up, and storing the data you’ll be generating.

What many people don’t realise is that even if you aren’t collecting or generating your own data (e.g. you might simply download a reanalysis or CMIP5 dataset), you should still start your project by developing a data management plan. That plan obviously doesn’t need to consider everything a data collection/generation project does (e.g. you don’t need to think about archiving the data at a site like Figshare), but there are a few key things all data analysis projects need to consider, regardless of whether they collected and/or generated the original data or not.
 
1. Data Reference Syntax

The first thing to define is your Data Reference Syntax (DRS) – a convention for naming your files. As an example, let’s look at a file from the data archive managed by Australia’s Integrated Marine Observing System (IMOS).

.../thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz

That’s a lot of information to take in, so let’s focus on the structure of the file directory first:

.../thredds/dodsC/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/

From this we can deduce, without even inspecting the contents of the file, that we have data from the IMOS project that is run by the eMarine Information Infrastructure (eMII). It was collected in 2012 at the Turquoise Coast, Western Australia (TURQ) site of the Australian Coastal Ocean Radar Network (ACORN), which is a network of high frequency radars that measure the ocean surface current. The data type has a sub-DRS of its own, which tells us that the data represents the 1-hourly average surface current for a single month (October 2012), and that it is archived on a regularly spaced spatial grid and has not been quality controlled. The file is located in the “demos” directory, as it has been generated for the purpose of providing an example for users at the very helpful Australian Ocean Data Network user code library.

Just in case the file gets separated from this informative directory structure, much of the information is repeated in the file name itself, along with some more detailed information about the start and end time of the data, and the last time the file was modified:

<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz

In the first instance this level of detail seems like a bit of overkill, but consider the scope of the IMOS data archive. It is the final resting place for data collected by the entire national array of oceanographic observing equipment in Australia, which monitors the open oceans and coastal marine environment covering physical, chemical and biological variables. Since the data are so well labelled, locating all monthly timescale ACORN data from the Turquoise Coast and Rottnest Shelf sites (which represents hundreds of files) would be as simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc

While it’s unlikely that your research will ever involve cataloging data from such a large observational network, it’s still a very good idea to develop your own personal DRS for the data you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures, as these can be very hard to change later on. The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
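
The same trick works from within Python. Here's a minimal sketch using the standard library's glob module to gather those monthly ACORN files (glob doesn't do the {TURQ,ROT} brace expansion, so we loop over the two site codes instead):

import glob

# Collect all monthly ACORN files from the TURQ and ROT sites,
# relying entirely on the well planned directory structure
file_list = []
for site in ['TURQ', 'ROT']:
    file_list.extend(glob.glob('*/ACORN/monthly_*/{}/*/*.nc'.format(site)))

print('{} files found'.format(len(file_list)))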
 
2. Data provenance

In defining my own DRS, I added some extra fields to cater for the intermediary files that typically get created throughout the data analysis process. For instance, I added a field to indicate the temporal aspects of the data (e.g. to indicate if the data are an anomaly relative to some base period) and another for the spatial aspects (e.g. to indicate whether the data have been re-gridded). While keeping track of this information via the DRS is a nice thing to do (it definitely helps with bash wildcards and visual identification of files), more detailed information needs to be recorded for the data to be truly reproducible. A good approach to recording such information is the procedure followed by the Climate Data Operators (CDO) and NetCDF Operators (NCO). Whenever an NCO or CDO utility (e.g. ncks, ncatted, cdo mergetime) is executed at the command line, a time stamp followed by a copy of the command line entry is automatically appended to the global attributes of the output netCDF file, thus maintaining a complete history of the data processing steps. Here’s an example:

Tue Jun 30 07:35:49 2015: cdo runmean,30 va_ERAInterim_500hPa_daily_native.nc va_ERAInterim_500hPa_030day-runmean_native.nc

You might be thinking, “this is all well and good, but what about data processing steps that don’t use NCO, CDO or even netCDF files?” It turns out that if you write a script (e.g. in Python, R or whatever language you’re using) that can be executed from the command line, then it only takes an extra few lines of code to parse the associated command line entry and append that information to the global attributes of a netCDF file (or a corresponding metadata text file if dealing with file formats that don’t carry their metadata with them). To learn how to do this using Python, check out the Software Carpentry lesson on Data Management in the Ocean, Weather and Climate Sciences.
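
To illustrate the idea (the full details are in that lesson), here's a minimal sketch using the netCDF4 library; the output file name is borrowed from the CDO example above, and the script is assumed to have been run from the command line.

import datetime
import sys

import netCDF4

# Build a new history entry: time stamp plus the command line entry that was executed
new_entry = '{}: {}'.format(
    datetime.datetime.now().strftime('%a %b %d %H:%M:%S %Y'),
    ' '.join(sys.argv))

# Prepend it to the global history attribute of the output file
dset = netCDF4.Dataset('va_ERAInterim_500hPa_030day-runmean_native.nc', 'a')
dset.history = new_entry + '\n' + getattr(dset, 'history', '')
dset.close()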
 
3. Backup

Once you’ve defined your DRS and have implemented the NCO/CDO approach to data provenance, the final thing to think about is backing up your data. This is something I’ve discussed in detail in a previous post, but the crux of the story is that if your starting point files (i.e. the data files required at the very first step of your data processing) can be easily downloaded (e.g. reanalysis or CMIP5 data), then you probably don’t need your local copy to be backed up. All of your code should be version controlled and backed up via an external hosting service like GitHub or Bitbucket, so you can simply re-download the data and re-run your analysis scripts if disaster strikes. If, on the other hand, you generated your starting point files from scratch (e.g. you collected weather observations or ran a model that would take months to re-run), then backup is absolutely critical and would be part of your data management plan.

 

June 3, 2015 / Damien Irving

The CWSLab workflow tool: an experiment in community code development

Give anyone working in the climate sciences half a chance and they’ll chew your ear off about CMIP5. It’s the largest climate modelling project ever conducted and formed the basis for much of the IPCC Fifth Assessment Report, so everyone has an opinion on which are the best models, the level of confidence we should attach to projections derived from the models, etc, etc. What they probably won’t tell you about is the profound impact that CMIP5 has had on climate data processing and management. In the lead up to CMIP5 (2010/11), I was working at CSIRO in a support scientist role. When I think back on that time, I refer to it as The Great Data Duplication Panic. In analysing output from CMIP3 and earlier modelling projects, scientists simply downloaded data onto their local server (or even personal computer) and did their own analysis in isolation. At the CSIRO Aspendale campus alone there must have been a dozen copies of the CMIP3 dataset floating around. Given the sheer size of the CMIP5 archive (~3 petabytes!), we recognised very quickly that this kind of data duplication just wasn’t going to fly this time around.

Support scientists at CSIRO and the Bureau of Meteorology were particularly panicked about two types of data duplication: download duplication (i.e. duplication of the original dataset) and processing duplication (e.g. duplication of similarly processed data such as a common horizontal regridding or extraction of the Australian region). It was out of this panic that the Climate and Weather Science Laboratory (CWSLab) was born (although it wasn’t called that back then).

Download duplication

The download duplication problem has essentially been addressed by two major components of the CWSLab project. The NCI data library stores a variety of local and international climate and weather datasets (including CMIP5), while the NCI computational infrastructure is built directly on top of that library so you can do your data processing in situ (i.e. as opposed to downloading the data to your own machine). The computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have the CWS Virtual Desktops. These virtual desktops have more grunt than your personal laptop or desktop (4 CPUs, 20 GB RAM, 66 GB storage) and were deemed the best way to provide scientists with remote access to data exploration tools like MATLAB and UV-CDAT that involve a graphical user interface.

While solving the download duplication problem has been a fantastic achievement, it was aided by the fact that the solution didn’t require everyday climate scientists to change their behaviour in any appreciable way. They simply login to a machine at NCI rather than their local server and proceed with their data analysis as per normal. The processing duplication problem on the other hand will require a change in behaviour and may therefore be more difficult to solve…

Processing duplication

The CWSLab answer to the processing duplication problem is the CWSLab workflow tool, which can be run from the CWS Virtual Desktop. The tool is a plugin/add-on to the VisTrails workflow and provenance management system (see this previous post for a detailed discussion of workflow automation) and allows you to build, run and capture metadata for analyses involving the execution of multiple command line programs (see this example Nino 3.4 workflow). The code associated with the VisTrails plugin is hosted in three separate public GitHub repositories:

  • cwsl-ctools: A collection of command line programs used in performing common climate data analysis tasks. The programs can be written in any programming language, they just have to be able to parse the command line.
  • cwsl-mas: The source code for the plugin. In essence, it contains a wrapper for each of the command line programs in the cwsl-ctools repo which tells VisTrails how to implement that program.
  • cwsl-workflows: A collection of example workflows that use the VisTrails plugin.

The CWSLab workflow tool writes output files using a standardised data reference syntax, which is how it’s able to solve the processing duplication problem. For instance, if someone has already regridded the ACCESS1-0 model to a 1 by 1 degree global grid, the system will be able to find that file rather than re-creating/duplicating it.

A community model for help and code development

Unlike the NCI infrastructure and data library which have dedicated staff, the group of support scientists behind the VisTrails plugin have very small and infrequent time allocations on the project. This means that if the workflow tool is to succeed in the long run, all scientists who do climate model analysis at NCI will need to pitch in on both code development and requests for help.

Fortunately, GitHub is perfectly set up to accommodate both tasks. Scientists can “fork” a copy of the cwsl code repositories to their own GitHub account, make any changes to the code that they’d like to see implemented (e.g. a new script for performing linear regression), and then submit a “pull request” to the central cwsl repo. The community can then view the proposed changes and discuss them before finally accepting or rejecting them. Similarly, instead of a help desk, requests for assistance are posted to the cwsl-mas chat room on Gitter. These Gitter rooms are linked to GitHub code repositories and are specifically designed for chatting about code. People post questions, and anyone in the community who knows the answer can post a reply. If the question is too long/complex for the chat room, it can be posted as an issue on the relevant GitHub repo for further community discussion.

Multiple birds with one stone

By adopting a community approach, the workflow tool addresses a number of other issues besides data duplication.

  • Code review. Software developers review each other’s code all the time, but scientists never do. The Mozilla Science Lab have now run two iterations of their Code Review for Scientists project to figure out when and how scientific code should be reviewed, and their findings are pretty clear. Code review at the completion of a project (e.g. when you submit a paper to a journal) is fairly pointless, because the reviewer hasn’t been intimately involved in the code development process (i.e. they can make cosmetic suggestions but nothing of substance). Instead, code review needs to happen throughout a scientific research project. The pull request system used by the CWSLab workflow tool allows for this kind of ongoing review.
  • Code duplication. Any scientist that is new to climate model data analysis has to spend a few weeks (at least) writing code to do basic data input/output and processing. The cwsl-ctools repo means they no longer need to reinvent the wheel – they have access to a high quality (i.e. lots of people have reviewed it) code repository for all those common and mundane data analysis tasks.
  • Reproducible research. The reproducibility crisis in computational research has been a topic of conversation in the editorial pages of Nature and Science for a number of years now, however very few papers in today’s climate science journals include sufficient documentation (i.e. details of the software and code used) for readers to reproduce key results. The CWSLab workflow tool automatically captures detailed metadata about a given workflow (right down to the precise version of the code that was executed; see here for details) and therefore makes the generation of such documentation easy.

Conclusion

The CWSLab workflow tool is an ambitious and progressive initiative that will require a shift in the status quo if it is to succeed. Researchers will need to overcome the urge to develop code in isolation and the embarrassment associated with sharing their code. They’ll also have to learn new skills like version control with git and GitHub and how to write scripts that can parse the command line. These things are not impossible (e.g. Software Carpentry teaches command line programs and version control in a single afternoon) and the benefits are clear, so here’s hoping it takes off!

April 6, 2015 / Damien Irving

Workflow automation

In previous posts (e.g. What’s in your bag?) I’ve discussed the various tools I use for data analysis. I use NCO for making simple edits to the attributes of netCDF files, CDO for routine calculations on netCDF files and a whole range of Python libraries for doing more complicated analysis and visualisation. In years gone by I’ve also included NCL and Fortran in the mix. Such diversity is pretty common (i.e. almost nobody uses a single programming language or tool for all their analysis) so this post is my attempt at an overview of workflow automation. In other words, how should one go about tying together the various tools they use to produce a coherent, repeatable data analysis pipeline?

The first thing to note is that the community has not converged on a single best method for workflow automation. Instead, there appear to be three broad options depending on the complexity of your workflow and the details of the tools involved:

  1. Write programs that act like any other command line tool and then combine them with a shell script or build manager
  2. Use an off-the-shelf workflow management system
  3. Write down the processing steps in a lab notebook and re-execute them manually

Let’s consider these approaches one by one:

 

1. Command line

Despite the fact that its commands are infuriatingly terse and cryptic, the Unix shell has been around longer than most of its users have been alive. It has survived so long because of the ease with which (a) repetitive tasks can be automated and (b) existing programs can be combined in new ways. Given that NCO and CDO are command line tools (i.e. you’re probably going to be using the command line anyway), it could be argued that the command line is the most natural home for workflow automation in the weather and climate sciences. For instance, in order to integrate my Python scripts with the rest of my workflow, I use the argparse library to make those scripts act like any other command line program. They can be executed from the command line, ingest arguments and options from the command line, combine themselves with other command line programs via pipes and filters, and output help information just like a regular command line program.
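
As a sketch of what this looks like in practice, here's a stripped-down version of how a script like calc_streamfunction.py might use argparse (the argument names and the body of the calculation are placeholders):

import argparse

def main(inargs):
    """Calculate the streamfunction (placeholder for the real calculation)."""
    print('Reading zonal wind from', inargs.uwind_file)
    print('Reading meridional wind from', inargs.vwind_file)
    print('Writing streamfunction to', inargs.outfile)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Calculate the streamfunction from the zonal and meridional wind')
    parser.add_argument('uwind_file', type=str, help='input netCDF file containing the zonal wind')
    parser.add_argument('vwind_file', type=str, help='input netCDF file containing the meridional wind')
    parser.add_argument('outfile', type=str, help='output netCDF file')
    main(parser.parse_args())

Running python calc_streamfunction.py -h then prints the help information, just like any other command line program.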

Armed with my collection of command line-native Python scripts, the easiest way to link multiple processing steps is to store them in a shell script. For instance, I could execute the following hypothetical workflow by storing all the steps (i.e. command line entries) in a shell script called run-streamfunction.sh.

  1. Edit the “units” attribute of the original zonal and meridional wind data netCDF files (NCO)
  2. Calculate the streamfunction from the zonal and meridional wind (calc_streamfunction.py)
  3. Calculate the streamfunction anomaly by subtracting the climatological mean at each timestep (CDO)
  4. Apply a 30 day running mean to the streamfunction anomaly data (CDO)
  5. Plot the average streamfunction anomaly for a time period of interest (plot_streamfunction.py)

This would be a perfectly valid approach if I was dealing with a small dataset, but let’s say I wanted to process 6 hourly data from the JRA-55 reanalysis dataset over the period 1958-2014 for the entire globe. The calc_streamfunction.py script I wrote would take days to run on the server in my department in this situation, so I’d rather not execute every single step in run-streamfunction.sh every time I change the time period used for the final plot. What I need is a build manager – a smarter version of run-streamfunction that can figure out whether previous steps have already been executed and if they need to be updated.

The most widely used build manager on Unix and its derivatives is called Make. Like the Unix shell it is old, cryptic and idiosyncratic, but it’s also fast, free and well-documented, which means it has stood the test of time. I started using Make to manage my workflows about a year ago and it has revolutionised the way I work. I like it because of the documentation and also the fact that it’s available no matter what machine I’m on, however there are other options (e.g. doit, makeflow, snakemake, ruffus) if you’d like something a little less cryptic.
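
For anyone curious about what one of the less cryptic options looks like, here's a hedged sketch of part of the streamfunction example expressed as a doit file (dodo.py); the file names are hypothetical. Each task lists its dependencies and targets, so doit only re-runs a step when its inputs have changed.

# dodo.py -- run with the 'doit' command

def task_calc_streamfunction():
    """Calculate the streamfunction from the zonal and meridional wind."""
    return {
        'file_dep': ['ua_JRA55_6hourly.nc', 'va_JRA55_6hourly.nc', 'calc_streamfunction.py'],
        'targets': ['sf_JRA55_6hourly.nc'],
        'actions': ['python calc_streamfunction.py ua_JRA55_6hourly.nc va_JRA55_6hourly.nc sf_JRA55_6hourly.nc'],
    }

def task_plot_streamfunction():
    """Plot the streamfunction anomaly for the period of interest."""
    return {
        'file_dep': ['sf_JRA55_6hourly.nc', 'plot_streamfunction.py'],
        'targets': ['sf_anomaly.png'],
        'actions': ['python plot_streamfunction.py sf_JRA55_6hourly.nc sf_anomaly.png'],
    }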

To learn how to apply the command line approach to your own workflow automation, check out the Software Carpentry lessons on the Unix shell and on automation with Make.

 

2. Workflow management systems

The command line discussion above suggests the use of shell scripts for automating small, simple data processing pipelines, and build managers like Make and doit for pipelines that are either slightly more complicated or have steps that you’d rather not repeat unnecessarily (e.g. steps that take many hours/days to run). For many weather and climate scientists (myself included), this is as far as you’ll need to go. Make and doit have all the functionality you’ll ever really need for automatically executing a data analysis pipeline, and by following the process laid out in the data management lesson linked to above you’ll be able to document that pipeline (i.e. produce a record of the provenance of your data).

But what if you’re working on a project that is too big and complicated for a simple Makefile or two? The management of uber-complex workflows such as those associated with running coupled climate models or processing the whole CMIP5 data archive can benefit greatly from specialised workflow management systems like VisTrails, pyRDM, Sumatra or Pegasus. These systems can do things like manage resource allocation for parallel computing tasks, execute steps that aren’t run from the command line, automatically publish data to a repository like Figshare and produce nice flowcharts and web interfaces to visualise the entire workflow.

I’ve never used one of these systems, so I’d love to hear from anyone who has. In particular, I’m curious to know whether such tools could be used for smaller/simpler workflows, or whether the overhead associated with setting up and learning the system cancels out any benefit over simpler options like Make and doit.

 

3. The semi-manual approach

While writing command line programs is a relatively simple and natural thing to do in Python, it’s not the type of workflow that is typically adopted by users of more self-contained environments like MATLAB and IDL. From my limited experience/exposure, it appears that users of these environments tend not to link the data processing that happens within MATLAB or IDL with processing that happens outside of it. For instance, they might pre-process their data using NCO or CDO at the command line, before feeding the resulting data files into MATLAB or IDL to perform additional data processing and visualisation. This break in the workflow implies that some manual intervention is required to check whether previous processing steps need to be executed and to initiate the next step in the process (i.e. something that Make or doit would do automatically). Manual checking and initiation is not particularly problematic for smaller pipelines, but can be error prone (and time consuming) as workflows get larger and more complex.

Since I’m by no means an expert in MATLAB or IDL, I’d love to hear how regular users of those tools manage their workflows.

January 31, 2015 / Damien Irving

Plugging into the computational best practice discussion

From the time we write our first literature review as a graduate or honours year student, we are taught the importance of plugging into the conversation around our research topic of interest. We subscribe to journal email alerts, set up automated searches on databases like Web of Science, join the departmental journal reading club and attend relevant sessions at conferences. If we want to get a job one day, we also figure out pretty quickly which email lists and job boards to keep an eye on (see this post for an overview). A discussion that people tend not to be so engaged with, however, is that around computational best practices. Modern weather and climate scientists spend a lot of time writing and debugging code, but discussions around the best tools and tricks for doing this are a little harder to find. As such, here’s my attempt at a consolidated list of the best places to tune in.

Online:

  • Twitter is an absolute gold mine when it comes to quality discussions about computational best practice. Start by following accounts like @swcarpentry, @datacarpentry, @mozillascience, @victoriastodden and @openscience and you’ll soon identify the other key people to follow.
  • Nature has recently started a Toolbox section on its website, which features regular articles about scientific software, apps and online tools. (I recently featured in a question and answer piece about the software used in the weather and climate sciences.)
  • The Mozilla Science Lab Forum hosts all sorts of discussions about open science and computational research.
  • This blog! (and also the PyAOS blog if you’re into Python)

Offline (i.e. in person!):

  • Two-day workshops hosted by Software Carpentry or its new sister organisation Data Carpentry are the perfect place to get up-to-date with the latest tips and tricks. Check their websites for upcoming workshops and if there isn’t one in your region, email them to tee one up for your home department, institution or conference. If you enjoy the experience, stay involved as a volunteer helper and/or instructor and you’ll always be a part of the computational conversation.
  • Don’t tell my colleagues, but I’ve always found scientific computing conferences like SciPy, PyCon or the Research Bazaar conference to be way more useful than the regular academic conferences I usually go to. (Check out my reflections on PyCon Australia here)
  • Local open science and/or data science meetups are really starting to grow in popularity. For instance, the Research Bazaar project hosts a weekly “Hacky Hour” at a bar on campus at the University of Melbourne, while a bunch of scientists have got together to form Data Science Hobart in Tasmania. If such a meetup doesn’t exist in your area, get a bunch of colleagues together and start one up!
October 30, 2014 / Damien Irving

Software installation explained

Software installation has got to be one of the most frustrating, confusing and intimidating things that research scientists have to deal with. In fact, I’ve waxed poetic in the past (see this post) about the need to solve the software installation problem. Not only is it a vitally important issue for the sanity of everyday research scientists, but it’s also critically important to the open science movement. What’s the point of having everyone share their code and data, if nobody can successfully install the software that code depends on? This post is my attempt to summarise the current state of play regarding software installation in the weather and climate sciences. Things are far from perfect, but there are some encouraging things happening in this space.

 

In Theory…

There are four major ways in which you might go about installing a certain software package. From easiest to hardest, they go as follows:

1. Download an installer

This is the dream scenario. Upon navigating to the website of the software package you’re after, you discover a downloads page which detects your operating system and presents you with a link to download the appropriate installer (sometimes called a “setup program”). You run the installer on your machine, clicking yes to agree to the terms and conditions and checking the box to include a shortcut on your desktop, and hey presto the software works as advertised. If you’re using a proprietary package like MATLAB or IDL then this has probably been your experience. It takes many developer hours to create, maintain and support software installers, so this is where (some of) your license fees are going. Free software that is very widely used (e.g. Git) is also often available via an installer, however in most cases you get what you pay for when it comes to software installation…

2. Use a package manager

In the absence of an installer, your next best bet is to see whether the software you’re after is available via a package manager. Every Linux distribution comes with a package manager (e.g. apt-get and the associated Ubuntu Software Centre on Ubuntu), while there are a range of different managers available for Mac (e.g. Homebrew) and Windows (e.g. OneGet will come standard with Windows 10). The great thing about these managers is that they handle all the software dependencies associated with an install. For instance, if the command line tool you’re installing allows for the manipulation of netCDF files, then chances are that tool depends on the relevant netCDF libraries being installed on your machine too. Package managers are smart enough to figure this out, and will install all the dependencies along the way. They will also alert you to software updates (and install them for you if you like), which means in many cases a package manager install might even be preferable to downloading an installer.

The only downside to package managers is that there is often a time lag between when a new version of a software package is released and when it gets updated on the manager.  If you want the “bleeding edge” version of a particular software package or if that package isn’t available via a package manager (only fairly widely used packages make it to that stage), then you slide further down the list to option 3…

3. Install from binaries

We are now beyond the point of just clicking a button and having the install happen before our eyes, so we need to learn a little more about software installation to figure out what’s going on. At the core of any software is the source code, which is usually a bunch of text files (e.g. .c, .cpp and .h files in the case of software written in C/C++). In order to run that source code, you must first feed it through a compiler. Compiling then generates binaries (on Windows these are typically .exe or .dll files). To relieve users of the burden of having to compile the code themselves, software developers will often collect up all the relevant binaries in a zip file (or tarball) and make them available on the web (e.g. on a website like SourceForge). You then just have to unzip those binaries in an appropriate location on your machine. This sounds easy enough in theory, but in order to get the software working correctly there’s often an extra step – you essentially have to do the work of a package manager and install the software dependencies as well. This is occasionally impossible and almost always difficult.

(Note that an installer is basically just a zip file full of binaries that can unzip itself and copy the binaries to the right places on your computer.)

4. Install from the source code

If you’re feeling particularly brave and/or need the very latest version of a software package (e.g. perhaps a beta-version that hasn’t even been formally released yet), you can often download the source code from a site like GitHub. You now have to do the compilation step yourself, so there’s an added degree of difficulty. It turns out that even super experienced programmers avoid source code installs unless they absolutely have to.

 

In Practice…

Ok, so that’s a nice high level summary of the software installation hierarchy, but how does it actually play out in reality? To demonstrate, consider my personal software requirements (see this post for details):

  • NCO for simple manipulation of netCDF files
  • CDO for simple data analysis tasks on netCDF files
  • Python for more complex data analysis tasks
  • UV-CDAT for quickly viewing the contents of netCDF files

This is how the installation of each of these packages plays out on a modern Ubuntu, Mac and Windows machine (I keep a running log of my software installation troubles and solutions here if you’re interested):

NCO & CDO

NCO and CDO are available via both the Ubuntu Software Centre and Homebrew, so installation on Ubuntu and Mac is a breeze (although there are a few bugs with the Homebrew install for CDO). Things are a little more difficult for Windows. There are binaries available for both, however it doesn’t appear that the CDO binaries are particularly well supported.

Python

Getting the Python standard library (i.e. the core libraries that come with any Python installation) working on your machine is a pretty trivial task these days. In fact, it comes pre-installed on Ubuntu and Mac. Until recently, what wasn’t so easy was getting all the extra libraries relevant to the weather and climate sciences playing along nicely with the standard library. The problem stems from the fact that while the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific / number crunching libraries are written (at least partly) in faster languages like C (because speed is important when data arrays get really large). Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite Python libraries working together was problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released (for free) Anaconda, which bundles together around 200 of the most popular Python libraries for science, maths, engineering and data analysis. What’s more, for libraries that aren’t part of the core 200 and can’t be installed easily with pip, they have developed their own package manager called conda (see here and here for some great background posts about conda). People can write conda packages for their favourite Python libraries (which is apparently a fairly simple task for experienced developers) and post them on anaconda.org, and those conda packages can be used to install the libraries (and all their dependencies) on your own machine.

In terms of my personal Python install, the main extra libraries I care about are iris and cartopy (for plotting), xarray (for climate data analysis), eofs (for EOF analysis) and windspharm (for wind related quantities like the streamfunction and velocity potential). There are Linux, Mac and Windows flavoured conda packages for all four at the fantastic IOOS channel at anaconda.org, so installing them is as simple as entering something like this at the command line:

conda install -c http://conda.anaconda.org/ioos xarray

The availability of these packages for all three operating systems is something that has only happened very recently and won’t necessarily be the case for less widely used packages. The pattern I’ve noticed is that Linux packages tend to appear first, followed by Mac packages soon after. Widely used packages eventually get Windows packages as well, but in many cases this can take a while (if it happens at all).

UV-CDAT

UV-CDAT has binaries available for Ubuntu and Mac, in addition to binaries for the dependencies (which is very nice of them). There are no binaries for Windows at this stage.

 

In Conclusion…

If you’re struggling when it comes to software installation, rest assured you definitely aren’t alone. The software installation problem is a source of frustration for all of us and is a key roadblock on the path to open science, so it’s great that solutions like anaconda.org are starting to pop up. In the meantime (i.e. while you’re waiting for a silver bullet solution), probably the best thing you can do is have a serious think about your operating system. I don’t like to take sides when it comes to programming languages, tools or operating systems, but the reality (as borne out in the example above) is that developers work on Linux machines, which means they first and foremost make their software installable on Linux machines. Macs are an afterthought that they do often eventually get around to (because Mac OS X is Unix-based, so it’s not too hard), while Windows is an after-afterthought that often never gets addressed (because Windows is not Unix-based and is therefore often too hard) unless you’re dealing with a proprietary package that can afford the time and effort. If you want to make your life easy when it comes to scientific computing in the weather and climate sciences, you should therefore seriously consider working on a Linux machine, or at least on a Mac as a compromise.

August 28, 2014 / Damien Irving

Speeding up your code

In today’s modern world of big data and high resolution numerical models, it’s pretty easy to write a data analysis script that would take days/weeks (or even longer) to run on your personal (or departmental) computer. With buzz words like high performance computing, cloud computing, vectorisation, supercomputing and parallel programming floating around, what’s not so easy is figuring out the best course of action for speeding up that code. This post is my attempt to make sense of all the options…

 

Step 1: Vectorisation

The first thing to do with any slow script is to use a profiling tool to locate exactly which part/s of the code are taking so long to run. All programming languages have profilers, and they’re normally pretty simple to use. If your code is written in a high-level language like Python, R or Matlab, then the bottleneck is most likely a loop of some sort (e.g. a “for” or “while” loop). These languages are designed such that the associated code is relatively concise and easy for humans to read (which speeds up the code development process), but the trade-off is that they’re relatively slow for computers to run, especially when it comes to looping.
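
In Python, for instance, the built-in cProfile module can give a function-by-function breakdown of where the time goes (main() here is just a stand-in for whatever your script's entry point happens to be):

import cProfile

def main():
    """Stand-in for the analysis you want to profile."""
    total = 0
    for i in range(10 ** 6):
        total += i ** 0.5
    return total

# Print the profile, with the most time-consuming calls listed first
cProfile.run('main()', sort='cumulative')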

If a slow loop is at fault, then your first course of action should be to see if that loop can be vectorised. For instance, let’s say you’ve got some temperature data on a time/latitude/longitude grid, and you want to convert all the values from Kelvin to Celsius. You could loop through the three-dimensional grid and subtract 273.15 at each point, but if your data came from a high resolution global climate model this could take a while. Instead, you should take advantage of the fact that your high-level programming language almost certainly supports vectorised operations (a.k.a. array programming). This means there will be a way to apply the same operation to an entire array of data at once, without looping through each element one by one. In Python, the NumPy library provides these array operations. Under the hood NumPy is actually written in a low-level language called C, but as a user you never see or interact with that C code (which is lucky, because low-level code is terse and not so human friendly). You simply get the speed of C (and the years of work that have gone into optimising array operations), with the convenience of coding in Python.
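To make this concrete, here’s a minimal sketch of the Kelvin to Celsius example done both ways (the array is just random numbers standing in for real model output):

import numpy as np

# Fake temperature data on a (time, latitude, longitude) grid, in Kelvin
temperature_k = 273.15 + 30 * np.random.rand(365, 145, 192)

# The slow way: visit every element one by one
temperature_c = np.empty(temperature_k.shape)
for t in range(temperature_k.shape[0]):
    for y in range(temperature_k.shape[1]):
        for x in range(temperature_k.shape[2]):
            temperature_c[t, y, x] = temperature_k[t, y, x] - 273.15

# The fast (vectorised) way: apply the operation to the whole array at once
temperature_c = temperature_k - 273.15

The two approaches give exactly the same answer; the second is just dramatically faster because the looping happens in C rather than Python.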

If you get creative – and there are lots of useful NumPy routines (and their equivalents in other languages) out there to help with this – almost any loop can be vectorised. If for some reason your loop isn’t amenable to vectorisation (e.g. you might be making quite complicated decisions at each grid point using lots of “if” statements), then another option would be to re-write that section of code in a low-level language like C or Fortran. Most high-level languages allow you to interact with functions written in C or Fortran, so you can then incorporate that function into your code.
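To give a flavour of that creativity, here’s a minimal sketch showing how a simple decision at each grid point (which you might instinctively write as an “if” statement inside a loop) can be vectorised with numpy.where; the 25°C threshold is just a made-up example:

import numpy as np

# Fake temperature data in Celsius on a (time, latitude, longitude) grid
temperature_c = 40 * np.random.rand(365, 145, 192) - 10

# Instead of looping over every grid point and asking "if the temperature
# exceeds 25 then flag it, otherwise don't", apply the test to the whole array
hot_flag = np.where(temperature_c > 25, 1, 0)

# Boolean masks work the same way, e.g. count the flagged grid points
num_hot = (temperature_c > 25).sum()

Genuinely complicated decision trees may still defeat this approach, which is when the low-level language option comes into play.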

 

Step 2: Exploit the full capabilities of your hardware

Ok, so let’s say you’ve vectorised all the loops that you can, you’ve written any other slow loops in a low-level language, and your code is still too slow (or perhaps you can’t be bothered learning a low-level language and have skipped that option, which is totally understandable). The next thing to try is parallel programming.

One of the motivations for parallel programming has been the diminishing increase in single-core performance with each new generation of Central Processing Unit (CPU). In response, computer makers have turned to multi-core processors, which contain more than one processing core. Most desktop computers, laptops, and even tablets and smart phones have two or more CPU cores these days (devices are usually advertised as “dual-core” or “quad-core”). In addition to multi-core CPUs, Graphics Processing Units (GPUs) have also become more powerful in recent years and are increasingly being used not just for drawing graphics to the screen, but for general purpose computation.

By default, your Python, R or Matlab code will run on a single CPU. The level of complexity involved in farming that task out to multiple processors (i.e. multiple CPUs and perhaps even a GPU) depends on the nature of the code itself. If the code is “embarrassingly parallel” (yep, that’s the term they use) then the process is usually no more complicated than renaming your loop. In Matlab, for instance, you simply click a little icon to launch some “workers” (i.e. multiple CPUs) and then change the word “for” in your code to “parfor.” Simple as that.

A problem is embarrassingly parallel so long as there exists no dependency (or need for communication) between the parallel tasks. For instance, let’s say you’ve got a loop that calculates a particular statistical quantity for an array of temperature data from a European climate model, then an American model, Japanese model and finally an Australian model. You could easily farm that task out to the four CPUs on your quad-core laptop, because there is no dependency between each task – the calculation of the statistic for one model doesn’t require any information from the same calculation performed for the other models.
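In Python, the standard library’s multiprocessing module plays a similar role to Matlab’s parfor. Here’s a minimal sketch of the four-model example (the file names, the tas variable and the statistic itself are all hypothetical placeholders):

from multiprocessing import Pool

import xarray as xr

def calc_statistic(filename):
    """Calculate the statistic of interest for one model (placeholder logic)."""
    dataset = xr.open_dataset(filename)
    return float(dataset['tas'].mean())

model_files = ['european_model.nc', 'american_model.nc',
               'japanese_model.nc', 'australian_model.nc']

if __name__ == '__main__':
    # Farm the four independent calculations out to four CPU cores
    with Pool(processes=4) as pool:
        results = pool.map(calc_statistic, model_files)
    print(results)

Because the four calculations never need to talk to each other, that really is all there is to it.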

 

Step 3: Consider more and/or better hardware

Ok, so you’ve vectorised your code, re-written other slow loops in a low-level language (or not), and farmed off any embarrassingly parallel parts of the code to multiple processors on your machine. If your code is still too slow, then you’ve essentially reached the limit of your hardware (whether that be your personal laptop/desktop or the server in your local department/organisation) and you’ll need to consider running your code elsewhere. As a researcher, your choices for elsewhere are either a supercomputing facility or a cloud computing service, and for both there will probably be some process (and perhaps a fee) involved in applying for time and space.

In the case of cloud computing, you’re basically given access to lots of remote computers (in fact, you’ll probably have no idea where those computers are physically located) that are connected via a network (i.e. the internet). In many cases these computers are no better or more advanced than your personal laptop, but instead of being limited to one quad-core machine (for instance) you can now have lots of them. It’s not hard to imagine that this can be very useful for embarrassingly parallel problems like the one described earlier. There are about 50 climate models in the CMIP5 archive (i.e. many more than just a single European, American, Japanese and Australian model), but they could all be analysed at once with access to a dozen quad-core machines. There are tools like the MATLAB Distributed Computing Server to help deal with the complexities of running code across multiple machines at once (i.e. a cluster), so it’s really not much more difficult than using multiple cores on your own laptop.
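On the Python side, the same embarrassingly parallel pattern scales up with tools like IPython’s parallel machinery (distributed as the ipyparallel package in newer versions). The sketch below assumes a cluster of engines has already been started with the ipcluster command, and the file locations and statistic are again hypothetical placeholders:

from glob import glob
from IPython.parallel import Client  # 'from ipyparallel import Client' in newer versions

def calc_statistic(filename):
    """Calculate the statistic of interest for one model (placeholder logic)."""
    import xarray as xr  # imported inside the function so every engine has it
    dataset = xr.open_dataset(filename)
    return float(dataset['tas'].mean())

# Connect to the running cluster of engines (which could span many machines)
rc = Client()

# A load-balanced view farms each file out to whichever engine is free
view = rc.load_balanced_view()

cmip5_files = glob('cmip5_archive/*.nc')  # hypothetical directory of model output
results = view.map_sync(calc_statistic, cmip5_files)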

The one thing we haven’t considered so far is parallel computing for non-embarrassing problems like running a complex global climate model (as opposed to analysing its output). The calculation of the surface temperature at each latitude/longitude point as time progresses, for instance, depends on the temperature, humidity and sunshine on the days prior, as well as at the surrounding grid points. These are known as distributed computing problems, because each process happening in parallel needs to communicate and share information with the other processes happening at the same time. Cloud computing isn’t great in this case, because the processors involved typically aren’t very close to one another and communication across the network is relatively slow. In a supercomputer, on the other hand, all the processors are very close to one another and communication is very fast. There’s a real science behind distributed computing, and it typically requires a total re-think of the way your code and problem are structured and framed (i.e. you can’t just replace “for” with “parfor”). To cut a long story short, if your problem isn’t embarrassingly parallel and distributed computing is the answer, there won’t be a simple tool to help you out. You’re going to need professional assistance.

(Note: This isn’t to say that supercomputers aren’t good for embarrassingly parallel problems too. A supercomputer has thousands and thousands of cores – more than enough to solve most problems fast. It’s just that in theory you have access to more cores in cloud computing, because you can just keep adding more machines. If you’re dealing with the volumes of data that Amazon or Google do, then this is an important distinction between cloud and super-computing.)

 

Concluding remarks

If you’re a typical weather/climate scientist, then well vectorised code written in a high-level language will run fast enough for pretty much any research problem you’d ever want to tackle. In other words, for most use cases speed depends most critically on how the code is written, not what language it’s written in or how/where it’s run. However, if you do find yourself tackling more complex problems, it’s important to be aware of the options available for speeding up your code and the level of complexity involved. Farming out an embarrassingly parallel problem to the four CPUs on your local machine is probably worth the small amount of time and effort involved in setting it up, whereas applying for access to cloud computing before you’ve exhausted the options of vectorisation and local parallel computing would probably not be a wise investment of your time, particularly if the speed increases aren’t going to be significant.