June 3, 2015 / Damien Irving

The CWSLab workflow tool: an experiment in community code development

Give anyone working in the climate sciences half a chance and they’ll chew your ear off about CMIP5. It’s the largest climate modelling project ever conducted and formed the basis for much of the IPCC Fifth Assessment Report, so everyone has an opinion on which are the best models, the level of confidence we should attach to projections derived from the models, and so on. What they probably won’t tell you about is the profound impact that CMIP5 has had on climate data processing and management. In the lead-up to CMIP5 (2010/11), I was working at CSIRO in a support scientist role. When I think back on that time, I refer to it as The Great Data Duplication Panic. In analysing output from CMIP3 and earlier modelling projects, scientists simply downloaded data onto their local server (or even personal computer) and did their own analysis in isolation. At the CSIRO Aspendale campus alone there must have been a dozen copies of the CMIP3 dataset floating around. Given the sheer size of the CMIP5 archive (~3 petabytes!), we recognised very quickly that this kind of data duplication just wasn’t going to fly.

Support scientists at CSIRO and the Bureau of Meteorology were particularly panicked about two types of data duplication: download duplication (i.e. duplication of the original dataset) and processing duplication (e.g. duplication of similarly processed data such as a common horizontal regridding or extraction of the Australian region). It was out of this panic that the Climate and Weather Science Laboratory (CWSLab) was born (although it wasn’t called that back then).

Download duplication

The download duplication problem has essentially been addressed by two major components of the CWSLab project. The NCI data library stores a variety of local and international climate and weather datasets (including CMIP5), while the NCI computational infrastructure is built directly on top of that library so you can do your data processing in situ (i.e. as opposed to downloading the data to your own machine). The computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for particularly complex and/or data-intensive tasks, while for everyday work there are the CWS Virtual Desktops. These virtual desktops have more grunt than your personal laptop or desktop (4 CPUs, 20 GB RAM, 66 GB storage) and were deemed the best way to provide scientists with remote access to data exploration tools like MATLAB and UV-CDAT that involve a graphical user interface.

While solving the download duplication problem has been a fantastic achievement, it was aided by the fact that the solution didn’t require everyday climate scientists to change their behaviour in any appreciable way. They simply log in to a machine at NCI rather than their local server and proceed with their data analysis as normal. The processing duplication problem, on the other hand, will require a change in behaviour and may therefore be more difficult to solve…

Processing duplication

The CWSLab answer to the processing duplication problem is the CWSLab workflow tool, which can be run from the CWS Virtual Desktop. The tool is a plugin/add-on to the VisTrails workflow and provenance management system (see this previous post for a detailed discussion of workflow automation) and allows you to build, run and capture metadata for analyses involving the execution of multiple command line programs (see this example Nino 3.4 workflow). The code associated with the VisTrails plugin is hosted in three separate public GitHub repositories:

  • cwsl-ctools: A collection of command line programs used in performing common climate data analysis tasks. The programs can be written in any programming language; they just have to be able to parse the command line.
  • cwsl-mas: The source code for the plugin. In essence, it contains a wrapper for each of the command line programs in the cwsl-ctools repo which tells VisTrails how to implement that program.
  • cwsl-workflows: A collection of example workflows that use the VisTrails plugin.
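To give a concrete flavour of what “able to parse the command line” means, here is a minimal sketch of a cwsl-ctools-style program using Python’s standard argparse module. The script name, options and behaviour are invented for illustration; they are not taken from the actual repository.

```python
# Hypothetical sketch of a cwsl-ctools-style command line program.
# Any language works, provided the program reads its arguments
# from the command line like this one does.
import argparse


def make_parser():
    parser = argparse.ArgumentParser(
        description="Compute an area-mean time series (illustrative only).")
    parser.add_argument("infile", help="input netCDF file")
    parser.add_argument("outfile", help="output netCDF file")
    parser.add_argument("--region", default="global",
                        help="named region to average over")
    return parser


if __name__ == "__main__":
    # Demonstration with explicit arguments rather than sys.argv:
    args = make_parser().parse_args(["tas_Amon_ACCESS1-0.nc", "out.nc"])
    print("Would process", args.infile, "->", args.outfile)
```

Because every program exposes its inputs and outputs this way, a workflow system can chain them together without caring what language each one is written in.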

The CWSLab workflow tool writes output files using a standardised data reference syntax, which is how it’s able to solve the processing duplication problem. For instance, if someone has already regridded the ACCESS1-0 model to a 1 by 1 degree global grid, the system will be able to find that file rather than re-creating/duplicating it.
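To illustrate why a standardised syntax enables this, the sketch below shows how a tool can construct the expected output path from the data attributes and check whether that file already exists before recomputing it. The path template here is invented for the example; the real CWSLab data reference syntax differs.

```python
# Illustrative only: a toy data reference syntax (DRS). Because every
# output file name is fully determined by its data attributes, a tool
# can look up whether the processed file already exists.
import os

DRS_TEMPLATE = ("{root}/{model}/{experiment}/{variable}/"
                "{variable}_{model}_{experiment}_{grid}.nc")


def find_existing(root, model, experiment, variable, grid):
    """Return the path of an existing file, or None if it must be created."""
    path = DRS_TEMPLATE.format(root=root, model=model,
                               experiment=experiment,
                               variable=variable, grid=grid)
    return path if os.path.exists(path) else None
```

If `find_existing` returns a path, the workflow reuses that file; only a `None` result triggers the (potentially expensive) processing step.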

A community model for help and code development

Unlike the NCI infrastructure and data library which have dedicated staff, the group of support scientists behind the VisTrails plugin have very small and infrequent time allocations on the project. This means that if the workflow tool is to succeed in the long run, all scientists who do climate model analysis at NCI will need to pitch in on both code development and requests for help.

Fortunately, GitHub is perfectly set up to accommodate both tasks. Scientists can “fork” a copy of the cwsl code repositories to their own GitHub account, make any changes to the code that they’d like to see implemented (e.g. a new script for performing linear regression), and then submit a “pull request” to the central cwsl repo. The community can then view the proposed changes and discuss them before finally accepting or rejecting them. Similarly, instead of a help desk, requests for assistance are posted to the cwsl-mas chat room on Gitter, a chat service that links to GitHub code repositories and is designed specifically for discussing code. People post questions, and anyone in the community who knows the answer can post a reply. If a question is too long or complex for the chat room, it can be posted as an issue on the relevant GitHub repo for further community discussion.

Multiple birds with one stone

By adopting a community approach, the workflow tool addresses a number of other issues besides data duplication.

  • Code review. Software developers review each other’s code all the time, but scientists rarely do. The Mozilla Science Lab have now run two iterations of their Code Review for Scientists project to figure out when and how scientific code should be reviewed, and their findings are pretty clear. Code review at the completion of a project (e.g. when you submit a paper to a journal) is fairly pointless, because the reviewer hasn’t been intimately involved in the code development process (i.e. they can make cosmetic suggestions but nothing of substance). Instead, code review needs to happen throughout a scientific research project. The pull request system used by the CWSLab workflow tool allows for this kind of ongoing review.
  • Code duplication. Any scientist who is new to climate model data analysis has to spend a few weeks (at least) writing code to do basic data input/output and processing. The cwsl-ctools repo means they no longer need to reinvent the wheel – they have access to a high-quality (i.e. widely reviewed) code repository for all those common and mundane data analysis tasks.
  • Reproducible research. The reproducibility crisis in computational research has been a topic of conversation in the editorial pages of Nature and Science for a number of years now, yet very few papers in today’s climate science journals include sufficient documentation (i.e. details of the software and code used) for readers to reproduce key results. The CWSLab workflow tool automatically captures detailed metadata about a given workflow (right down to the precise version of the code that was executed; see here for details) and therefore makes the generation of such documentation easy.
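The kind of provenance record just described can be sketched generically: capture the exact command that was run, a timestamp, and the version (git commit) of the code that ran it. This is an illustrative sketch of the idea, not the CWSLab implementation.

```python
# Generic sketch of automatic provenance capture: build a one-line
# history record (command + timestamp + code version) suitable for
# writing into an output file's metadata. Not the CWSLab implementation.
import datetime
import subprocess


def provenance_entry(command, repo_dir="."):
    """Return a history record for the given command."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, stderr=subprocess.DEVNULL).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not a git repo, or git unavailable
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return "%s: %s (code version %s)" % (timestamp, command, commit)
```

Writing such a line into each output file’s history attribute is what makes it possible to later reconstruct exactly how a result was produced.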

Conclusion

The CWSLab workflow tool is an ambitious and progressive initiative that will require a shift in the status quo if it is to succeed. Researchers will need to overcome the urge to develop code in isolation and the embarrassment associated with sharing their code. They’ll also have to learn new skills like version control with git and GitHub and how to write scripts that can parse the command line. These things are not impossible (e.g. Software Carpentry teaches command-line programs and version control in a single afternoon) and the benefits are clear, so here’s hoping it takes off!


3 Comments

  1. hypergeometric / Jun 3 2015 13:30

    Reblogged this on Hypergeometric.

  2. davidfratantoni / Jun 4 2015 05:03

    Reblogged this on David Fratantoni's Blog.

Trackbacks

  1. A vision for CMIP6 in Australia | Dr Climate
