February 15, 2017 / Damien Irving

The research police

You know who I’m talking about. I’m sure every research community has them. Those annoying do-gooders who constantly advocate for things to be done the right way. When you’re trying to take a shortcut, it’s their nagging voice in the back of your mind. You appreciate that what they’re saying is important, but with so much work and so little time, you don’t always want to hear it. Since I’m fond of creating lists on this blog, here’s my research police of the weather, ocean and climate sciences:

 

Statistics

Dan Wilks is a highly regarded statistics guru in the atmospheric sciences. He is the author of the most clearly written statistics textbook I’ve ever come across, as well as great articles such as this recent essay in BAMS, which is sure to make you feel bad if you’ve ever plotted significance stippling.

 

Data visualisation

Ed Hawkins’ climate spiral visualisation received worldwide media coverage in 2016 (and even featured in the opening ceremony of the Rio Olympics). He makes the list of research police due to his end the rainbow campaign, which advocates for the use of more appropriate colour scales in climate science.

 

Communication

David Schultz is the Chief Editor of Monthly Weather Review and has authored well over 100 research articles, but is probably best known as the “Eloquent Science guy.” His book and blog are must-reads for anyone wanting to improve their academic writing, reviewing and speaking.

 

Programming

Unfortunately I’m going to have to self-nominate here, as I’ve been a strong advocate for publishing reproducible computational results for a number of years now (see related post and BAMS essay). To help researchers do this, I’ve taught at over 20 Software Carpentry workshops and I’m the lead author of their climate-specific lesson materials.

 

If I’ve missed any other research police, please let me know in the comments!

January 11, 2017 / Damien Irving

Need help with reproducible research? These organisations have got you covered.

The reproducibility crisis in modern research is a multi-faceted problem. If you’re working in the life sciences, for instance, experimental design and poor statistical power are big issues. For the weather, ocean and climate sciences, the big issue is code and software availability. We don’t document the details of the code and software used to analyse and visualise our data, which means it’s impossible to interrogate our methods and reproduce our results.

(For the purposes of this post, research “software” is something that has been packaged and released for use by the wider community, whereas research “code” is something written just for personal use. For instance, I might have written some code to perform and plot an EOF analysis, which calls and executes functions from the eofs software package that is maintained by Andrew Dawson at Oxford University.)
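
To make the distinction concrete, here’s a minimal sketch of what that personal “code” might look like (the array shape and dimension ordering are made up for illustration); all of the actual EOF mathematics is handled by the eofs “software”:

import numpy as np
from eofs.standard import Eof

# Hypothetical anomaly data with dimensions (time, lat, lon)
anomalies = np.random.rand(120, 73, 144)

# The heavy lifting is done by the eofs package
solver = Eof(anomalies)
eof1 = solver.eofs(neofs=1)                   # leading spatial pattern
pc1 = solver.pcs(npcs=1)                      # associated principal component time series
variance = solver.varianceFraction(neigs=1)   # fraction of variance explained
print(variance)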

Unbeknown to most weather, ocean and climate scientists, there are a number of groups out there that want to help you make your work more reproducible. Here’s a list of the key players and what they’re up to…

 

Software Sustainability Institute (SSI)

The SSI is the go-to organisation for people who write and maintain scientific software. They provide training and support, advocate for formal career paths for scientific software developers and manage the Journal of Open Research Software, where you can publish the details of your software so that people can cite your work. They focus mainly on researchers in the UK, so it’s my hope that organisations like SSI will start popping up in other countries around the world.

 

OntoSoft

The OntoSoft project in the US has a bit of overlap with the SSI (e.g. they’re working on “software commons” infrastructure where people can submit their geoscientific software so that it can be searched and discovered by others). In addition, their Geoscientific Paper of the Future (GPF) initiative has been looking at the broader issue of how researchers should go about publishing the details of the digital aspects of their research (i.e. data, code, software and provenance/workflow). In a special GPF issue of Earth and Space Science, researchers from a variety of geoscience disciplines share their experiences in trying to document their digital research methods. The lead paper from that issue gives a fantastic overview of the options available to researchers. (My own work in this area gives a slightly more practical overview but in general covers many of the same ideas.)

 

Software Carpentry

The global network of volunteer Software Carpentry instructors run hundreds of two-day workshops around the world each year, teaching the skills needed to write reusable, testable and ultimately reproducible code (i.e. to do the things suggested by the GPF). Their teaching materials have been developed and refined for more than a decade and every instructor undergoes formal training, which means you won’t find a better learning experience anywhere. To get a workshop happening at your own institution, you simply need to submit a request at their website. They’ll then assist with finding local instructors and all the other logistics that go along with running a workshop. A sibling organisation called Data Carpentry has recently been launched, so it’s also worth checking to see if their more discipline-specific, data-centric lessons would be a better fit.

 

Mozilla Science Lab

Once you’ve walked out of a two-day Software Carpentry workshop, it can be hard to find ongoing support for your coding. The best form of support usually comes from an engaged and well connected local community, so the Mozilla Science Lab assists researchers in forming and maintaining in-person study groups. If there isn’t already a study group in your area, I’d highly recommend their study group handbook. It has a bunch of useful advice and resources for getting one started, plus they periodically run online orientation courses to go through the handbook content in detail.

 

Hopefully one or more of those organisations will be useful in your attempts to make your work more reproducible – please let me know in the comments if there are other groups/resources that I’ve missed!

 

October 4, 2016 / Damien Irving

The weather/climate Python stack

It would be an understatement to say that Python has exploded onto the data science scene in recent years. PyCon and SciPy conferences are held somewhere in the world every few months now, at which loads of new and/or improved data science libraries are showcased to the community. When the videos from these conferences are made available online (which is almost immediately at pyvideo.org), I’m always filled with a mixture of joy and dread. The ongoing rapid development of new libraries means that data scientists are (hopefully) continually able to do more and more cool things with less and less time and effort, but at the same time it can be difficult to figure out how they all relate to one another. To assist in making sense of this constantly changing landscape, this post summarises the current state of the weather and climate Python software “stack” (i.e. the collection of libraries used for data analysis and visualisation). My focus is on libraries that are widely used and that have good (and likely long-term) support, but I’m happy to hear of others that you think I might have missed!

[Figure: The weather/climate Python stack.]

 

Core

The dashed box in the diagram represents the core of the stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has a bunch of built-in functions for reading and writing common data formats like .csv, but if your data is stored in netCDF format then the default library for getting data into/out of those files is netCDF4.
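
For example, reading a variable from a netCDF file into a NumPy array only takes a couple of lines (the file path and variable name here are hypothetical):

from netCDF4 import Dataset

dset = Dataset('tas_data.nc')
tas = dset.variables['tas'][:]   # the [:] returns the data as a NumPy (masked) array
print(tas.shape)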

Once you’ve read your data in, you’re probably going to want to do some statistical analysis. The NumPy library has some built-in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default.
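
As a simple illustration, a linear trend calculation (something beyond NumPy’s built-in statistics) might look like this, using synthetic data for the sake of the example:

import numpy as np
from scipy import stats

years = np.arange(1979, 2017)
temperature = 0.02 * (years - 1979) + np.random.rand(len(years))   # synthetic data
slope, intercept, r_value, p_value, std_err = stats.linregress(years, temperature)
print(slope)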

The NumPy library doesn’t come with any plotting capability, so if you want to visualise your NumPy data arrays then the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for simple, static plots (e.g. bar charts, contour plots and line graphs saved as .png, .eps or .pdf files). The cartopy library provides additional functionality for common map projections, while Bokeh allows for the creation of interactive plots where you can zoom and scroll.
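
A bare-bones cartopy map, for instance, looks like this (the contourf line is just a comment showing where your data would go):

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
# Data would typically be added with something like:
# ax.contourf(lons, lats, data, transform=ccrs.PlateCarree())
plt.savefig('map.png')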

While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible – they can’t do everything the core stack can – but they can do common tasks with far less effort…

 

Generic additions

Let’s first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data.loc['Fred', 'height'], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
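
Here’s a minimal illustration of the labelled array idea, using a made-up medical dataset:

import pandas as pd

data = pd.DataFrame({'height': [178, 165], 'weight': [75, 62]},
                    index=['Fred', 'Wilma'])
print(data.loc['Fred', 'height'])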

One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
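
A quick sketch of what that looks like in practice (the file name and variable are hypothetical):

import xarray as xr

dset = xr.open_dataset('ua_data.nc')                 # hypothetical file with lat, lon and time dimensions
ua = dset['ua']
point = ua.sel(lat=-20, lon=50, method='nearest')    # nearest grid point to 20S, 50E
day = point.sel(time='2015-04-27')                   # label-based time selection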

 

Discipline-specific additions

While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the MetOffice have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.

In terms of choosing between xarray and Iris, some people like the slightly more weather/climate-centric experience offered by Iris, while others don’t like the restrictions that this places on their work and prefer the generic xarray experience (e.g. to use Iris your netCDF data files have to be CF compliant or close to it). Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.
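
To give a feel for Iris, here’s a minimal sketch (the file name is hypothetical; the variable is identified by its CF standard name):

import iris

cube = iris.load_cube('tas_data.nc', 'air_temperature')

# Extract a latitude band and calculate the time mean
constraint = iris.Constraint(latitude=lambda cell: -45 < cell < -10)
subset = cube.extract(constraint)
time_mean = subset.collapsed('time', iris.analysis.MEAN)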

 

Simplifying data exploration

While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), making minor tweaks to a plot or iterating over multiple time steps is still rather cumbersome. In an attempt to overcome this issue, a library called HoloViews was recently released. By using matplotlib and Bokeh under the hood, it allows for the generation of static or interactive plots where tweaking and iterating are super easy (especially in the Jupyter Notebook, which is where more and more people are doing their data exploration these days). Since HoloViews doesn’t have support for geographic plots, GeoViews has been created on top of it (which incorporates cartopy and can handle Iris or xarray data arrays).
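
As a rough sketch of the HoloViews approach (assuming a hypothetical netCDF file with lon, lat and time dimensions), the following produces one map per time step, with an interactive slider to move between them:

import xarray as xr
import holoviews as hv
hv.extension('bokeh')

dset = xr.open_dataset('tas_data.nc')
images = hv.Dataset(dset['tas']).to(hv.Image, ['lon', 'lat'])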

 

Sub-discipline-specific libraries

So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Continuum Analytics supports conda, Bokeh and HoloViews/GeoViews), large institutions (e.g. the MetOffice supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of weather and climate science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries (which are able to handle data arrays from NumPy, Iris or xarray). Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy. It would be impossible to list all the sub-discipline-specific libraries in this post, but the PyAOS community is an excellent resource if you’re trying to find out what’s available in your area of research.

 

Installing the stack

While the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific/number-crunching libraries are written (at least partly) in faster languages like C, because speed is important when data arrays get really large. Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite scientific Python libraries working together used to be problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released a package manager called conda, which is able to handle non-Python dependencies. The documentation for almost all modern scientific Python packages will suggest that you use conda for installation.

 

Navigating the stack

All of the additional libraries discussed in this post essentially exist to hide the complexity of the core libraries (in software engineering this is known as abstraction). Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib. GeoViews was built to hide some of the complexity of Iris, cartopy and Bokeh. So if you want to start exploring your data, start at the top right of the stack and work your way down and left as required. If GeoViews doesn’t have quite the right functions for a particular plot that you want to create, drop down a level and use some Iris and cartopy functions. If Iris doesn’t have any functions for a statistical procedure that you want to apply, go back down another level and use SciPy. By starting at the top right and working your way back, you’ll ensure that you never re-invent the wheel. Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4) for extracting the metadata contained within a netCDF file, for instance, only to find that Iris automatically keeps this information upon reading a file. In this way, a solid working knowledge of the scientific Python stack can save you a lot of time and effort.

 

June 16, 2016 / Damien Irving

How to write a reproducible paper

As mentioned in a previous call for volunteers, I dedicated part of my PhD to proposing a solution to the reproducibility crisis in modern computational research. In a nutshell, the crisis has arisen because most papers do not make the data and code underpinning their key findings available, which means it is impossible to replicate and verify the results. A good amount of progress has been made with respect to documenting and publishing data in recent years, so I specifically focused on software/code. I looked at many aspects of the issue including the reasons why people don’t publish their code, computational best practices and journal publishing standards, much of which is covered in an essay I published with the Bulletin of the American Meteorological Society. That essay is an interesting read if you’ve got the time (in my humble opinion!), but for this post I wanted to cut to the chase and outline how one might go about writing a reproducible paper.

On the surface, the reproducible papers I wrote as part of my PhD (i.e. as a kind of proof of concept; see here and here) look similar to any other paper. The only difference is a short computation section placed within the traditional methods section of the paper. That computation section begins with a brief, high-level summary of the major software packages that were used, with citations provided to any papers dedicated to documenting that software. Authors of scientific software are increasingly publishing overviews of their software in journals like the Journal of Open Research Software and Journal of Open Source Software, so it’s important to give them the academic credit they deserve.

Following this high level summary, the computation section points the reader to three key supplementary items:

  1. A more detailed description of the software used
  2. A copy of any code written by the authors to produce the key results
  3. A description of the data processing steps taken in producing each key result (i.e. a step-by-step account of how the software and code were actually used)

I’ll look at each of these components in turn, considering both the bare minimum you’d need to do in order to be reproducible and the extra steps you could take to make things easier for the reader.

 

1. Software description

While the broad software overview provided in the computation section is a great way to give academic credit to those who write scientific software, it doesn’t provide sufficient detail to recreate the software environment used in the study. In order to provide this level of detail, the bare minimum you’d need to do is follow the advice of the Software Sustainability Institute. They suggest documenting the name, version number, release date, institution and DOI or URL of each software package, which could be included in a supplementary text file.
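
If your analysis happens to be in Python, one quick (if partial) way to capture the version numbers is to print them from within your environment; the release dates, institutions and DOIs/URLs still need to be recorded by hand:

import numpy, scipy, matplotlib, xarray

for module in (numpy, scipy, matplotlib, xarray):
    print(module.__name__, module.__version__)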

While such a list means your environment is now technically reproducible, you’ve left it up to the reader to figure out how to get all those software packages and libraries installed and playing together nicely. In some cases this is fine (e.g. it might be easy enough for a reader to install the handful of MATLAB toolboxes you used), but in other cases you might want to save the reader (and your future self) the pain of software installation by making use of a tool that can automatically install a specified software environment. The simplest of these is conda, which I discussed in detail in a previous post. It is primarily used for the management of Python packages, but can be used for other software as well. I install my complete environment with conda, which includes non-Python command line utilities like the Climate Data Operators, and then make that environment openly available on my channel at anaconda.org. Beyond conda there are more complex tools like Docker and Nix, which can literally install your entire environment (down to the precise operating system) on a different machine. There’s lots of debate (e.g. here) about the potential and suitability of these tools as a solution to reproducible research, but it’s fair to say that their complexity puts them out of reach for most weather and climate scientists.

 

2. Code

The next supplementary item you’ll need to provide is a copy of the code you wrote to execute those software packages. For a very simple analysis that might consist of a single script for each key result (e.g. each figure), but it’s more likely to consist of a whole library/collection of code containing many interconnected scripts. The bare minimum you’d need to do to make your paper reproducible is to make an instantaneous snapshot of that library (i.e. at the time of paper submission or acceptance) available as supplementary material.

As with the software description, this bare minimum ensures your paper is reproducible, but it leaves a few problems for both you and the reader. The first is that in order to provide an instantaneous snapshot, you’d need to make sure that all your results were produced with the latest version of your code library. In many cases this isn’t practical (e.g. Figure 3 might have been generated five months ago and you don’t want to re-run the whole time consuming process), so you’ll probably want to manage your code library with a version control system like Git, Subversion or Mercurial, so you can easily access previous versions. If you’re using a version control system you might as well hook it up to an external hosting service like GitHub or Bitbucket, so you’ve got your code backed up elsewhere. If you make your GitHub or Bitbucket repository publicly accessible then readers can view the very latest version of your code (in case you’ve made any improvements since publishing the paper), as well as submit proposed updates or bug fixes via the useful interface (which includes commenting, chat and code viewing features) that those websites provide.

 

3. Data processing steps

A code library and software description on their own are not much use to a reader; they also need to know how that code was used in generating the results presented. The simplest way to do this is to make your scripts executable at the command line, so you can then keep a record of the series of command line entries required to produce a given result. Two of the most well-known data analysis tools in the weather and climate sciences – the netCDF Operators (NCO) and Climate Data Operators (CDO) – do exactly this, storing that record in the global attributes of the output netCDF file. I’ve written a Software Carpentry lesson showing how to generate these records yourself, including keeping track of the corresponding version control revision number, so you know exactly which version of the code was executed.

As before, while these bare minimum log files ensure that your workflow is reproducible, they are not particularly comprehensible. Manually recreating workflows from these log files would be a tedious and time consuming process, even for just moderately complex analyses. To make things a little easier for the reader (and your future self), it’s a good idea to include a README file in your code library explaining the sequence of commands required to produce common/key results. You might also provide a Makefile that automatically builds and executes common workflows (Software Carpentry have a nice lesson on that too). Beyond that the options get more complex, with workflow management packages like VisTrails providing a graphical interface that allows users to drag and drop the various components of their workflow.

 

Summary

In order to ensure that your research is reproducible, you need to add a short computation section to your papers. That section should cite the major software packages used in your work, before linking to three key supplementary items: (1) a description of your software environment, (2) a copy of your code library and (3) details of the data processing steps taken in producing each key result. The bare minimum you’d need to do for these supplementary items is summarised in the table below, along with extension options that will make life easier for both the reader and your future self.

If you can think of other extension options to include in this summary, please let me know in the comments below!

 

  • Software description. Minimum: document the name, version number, release date, institution and DOI or URL of each software package. Extension: provide a conda environment.yml file; use Docker / Nix.
  • Code library. Minimum: provide a copy of your code library. Extension: version control that library and host it in a publicly accessible code repository on GitHub or Bitbucket.
  • Processing steps. Minimum: provide a separate log file for each key result. Extension: include a README file (and possibly a Makefile) in the code library; provide output (e.g. a flowchart) from a workflow management system like VisTrails.
April 13, 2016 / Damien Irving

Keeping up with Continuum

I’m going to spend the next few hundred characters gushing over a for-profit company called Continuum Analytics. I know that seems a little weird for a blog that devotes much of its content to open science, but stick with me. It turns out that if you want to keep up with the latest developments in data science, then you need to be on top of what this company is doing.

If you’ve heard the name Continuum Analytics before, it’s probably in relation to a widely used Python distribution called Anaconda. In a nutshell, Travis Oliphant (who was the primary creator of NumPy) and his team at Continuum developed Anaconda, gave it away for free to the world, and then built a thriving business around it. Continuum makes its money by providing training, consultation and support to paying customers who use Anaconda (and who are engaged in data science/analytics more generally), in much the same way that RedHat provides support to customers using Linux.

The great thing about companies like RedHat and Continuum is that because their business fundamentally depends on open source software, they contribute a great deal back to the open source community. If you’ve ever been to a SciPy conference (something I would highly recommend), you would have noticed that there are always a few presentations from Continuum staff, whose primary job appears to be to simply work on the coolest open source projects going around. What’s more, the company seems to have a knack for supporting projects that make life much, much easier for regular data scientists (i.e. people who know how to analyse data in Python, but for whom things like system administration and web programming are out of reach). For instance, the projects they support (see the full list here) can help you install software without having to know anything about system admin (conda), create interactive web visualisations without knowing Javascript (bokeh), process data arrays larger than the available RAM without knowing anything about multi-core parallel processing (dask) and even speed up your code without having to resort to a low level language (numba).

Of these examples, the most important achievement (in my opinion) is the conda package manager, which I’ve talked about previously. Once you’ve installed either Anaconda (which comes with 75 of the most popular Python data science libraries already installed) or Miniconda (which essentially just comes with conda and nothing else), you can then use conda to install pretty much any library you’d like with one simple command line entry. That’s right. If you want pandas, just type conda install pandas and it will be there, along with its dependencies, playing nicely with all your other libraries. If you decide you’d like to access pandas from the Jupyter notebook, just type conda install jupyter and you’re done. There are about 330 libraries available directly like this, and because they are maintained by the Continuum team they are guaranteed to work.

While this is all really nice, other Python distributions like Canopy also come with a package manager for installing widely used libraries. What sets conda apart is the ease with which the wider community can contribute. If you’ve written a library that you’d like people to be able to install easily, you can write an associated installation package and post it at Anaconda Cloud. For instance, Andrew Dawson (a climate scientist at Oxford) has written eofs, a Python library for doing EOF analysis. Rather than have users of his software mess around installing the dependencies for eofs, he has posted a conda package for eofs at his channel on Anaconda Cloud. Just type conda install -c https://conda.anaconda.org/ajdawson eofs and you’re done; it will install eofs and all its dependencies for you. Some users (e.g. the US Integrated Ocean Observing System) even go a step further and post packages for a wide variety of Python libraries that are relevant to the work they do. This vast archive of community-contributed conda packages means there isn’t a single library I use in my daily work that isn’t available via either conda install or Anaconda Cloud. In fact, a problem I often face is that there is more than one installation package for a particular library (i.e. which one do I use? And if I get an error, where should I ask for assistance?). To solve this problem, conda-forge has recently been launched. The idea is that it will house the lone instance of every community-contributed package, in order to (a) avoid duplication of effort, and (b) make it clear where questions (and suggested updates / bug fixes) should be directed.

The final mind-blowing feature of conda is the ease with which you can manage different environments. Rather than lump all your Python libraries in together, it can be nice to have a clean and completely separate environment for each discrete aspect of the work you do (e.g. I have separate environments for my ocean data analysis, atmosphere data analysis and for testing new libraries). This will sound familiar to anyone who has used virtualenv, but again the value of conda environments is the ease with which the community can share. As an example, I’ve shared the details of my ocean data analysis environment (right down to the precise version of every single Python library). I started by exporting the details of the environment by typing conda env export -n ocean-environment -f blog-example, before posting it to my channel at Anaconda Cloud (conda env upload -f blog-example). Anyone can now come along and recreate that environment on their own computer by typing conda env create damienirving/blog-example (and then source activate blog-example to get it running). This is obviously huge for the reproducibility of my work, so for my next paper I’ll be posting a corresponding conda environment to Anaconda Cloud.

If you want to know more about Continuum, I highly recommend this Talk Python To Me podcast with Travis Oliphant.

January 12, 2016 / Damien Irving

Podcasting comes to weather and climate science

Over the past few years, podcasts have begun to emerge as the next great storytelling platform. The format is open to anyone with a laptop, a microphone, and access to the web, which means it’s kind of like blogging, only your audience isn’t restricted to consuming your content via words on a screen. They can listen to you in the car on the way to work, on the stationary bike at the gym or at any other time a little background noise is needed to pass the time away.

While I’m as excited as the next podcast enthusiast about the new season of Serial, what’s even more exciting is that a number of podcasts for weather and climate science nerds have been launched in recent months. These ones have really caught my ear:

  • Forecast: Climate Conversations with Michael White – a podcast about climate science and climate scientists, hosted by Nature’s editor for climate science
  • Mostly Weather – a team from the MetOffice explores a new, mostly weather-based topic each month
  • Climate History Podcast – interviews with people in climate change research, journalism, and policymaking. It is the official podcast of the Climate History Network and the popular website HistoricalClimatology.com
  • The Method – a podcast that tells the stories of what is working in science and what is not. It launches in mid-2016 and sounds right up the alley of this blog
  • (Depending on where you live, a Google Search might also turn up a weekly podcast or two that discusses the current weather in your region)

There are also a number of data science podcasts out there, which can be useful depending on the type of data analysis that you do. I’ve found some of the Talk Python to Me episodes to be very relevant to my daily work.

If you know of any other great weather and climate science podcasts, please share the details in the comments below!

November 5, 2015 / Damien Irving

A call for reproducible research volunteers

Around the time that I commenced my PhD (May 2012… yes, I know I should have finished by now!) there were lots of editorial-style articles popping up in prestigious journals like Nature and Science about the reproducibility crisis in computational research. Most papers do not make the data and code underpinning their key findings available, nor do they adequately specify the software packages and libraries used to execute that code, which means it’s impossible to replicate and verify their results. Upon reading a few of these articles, I decided that I’d try and make sure that the results presented in my PhD research were fully reproducible from a code perspective (my research uses publicly available reanalysis data, so the data availability component of the crisis wasn’t so relevant to me).

While this was an admirable goal, I quickly discovered that despite the many editorials pointing to the problem, I could find very few (none, in fact) regular weather/climate papers that were actually reproducible. (By “regular” I mean papers where code was not the main focus of the work, like it might be in a paper describing a new climate model.) A secondary aim of my thesis therefore became to consult the literature on (a) why people don’t publish their code, and (b) best practices for scientific computing. I would then use that information to devise an approach to publishing reproducible research that reduced the barriers for researchers while also promoting good programming practices.

My first paper using that approach was recently accepted for publication with the Journal of Climate (see the post-print here on Authorea) and the Bulletin of the American Meteorological Society have just accepted an essay I’ve written explaining the rationale behind the approach. In a nutshell, it requires the author to provide three key supplementary items:

  1. A description of the software packages and operating system used
  2. A (preferably version controlled and publicly accessible) code repository, and
  3. A collection of supplementary log files that capture the data processing steps taken in producing each key result

The essay then goes on to suggest how academic journals (and institutions that have an internal review process) might implement this as a formal minimum standard for the communication of computational results. I’ve contacted the American Meteorological Society (AMS) Board on Data Stewardship about this proposed minimum standard (they’re the group who decide the rules that AMS journals impose around data and code availability) and they’ve agreed to discuss it when they meet at the AMS Annual Meeting in January.

This is where you come in. I’d really love to find a few volunteers who would be willing to try and meet the proposed minimum standard when they write their next journal paper. These volunteers could then give feedback on the experience, which would help inform the Board on Data Stewardship in developing a formal policy around code availability. If you think you might like to volunteer, please get in touch!

 

September 4, 2015 / Damien Irving

Managing your data

If you’re working on a project that involves collecting (e.g. from a network of weather stations) or generating (e.g. running a model) data, then it’s likely that one of the first things you did was develop a data management plan. Many funding agencies (e.g. the National Science Foundation) actually formally require this, and such plans usually involve outlining your practices for collecting, organising, backing up, and storing the data you’ll be generating.

What many people don’t realise is that even if you aren’t collecting or generating your own data (e.g. you might simply download a reanalysis or CMIP5 dataset), you should still start your project by developing a data management plan. That plan obviously doesn’t need to consider everything a data collection/generation project does (e.g. you don’t need to think about archiving the data at a site like Figshare), but there are a few key things all data analysis projects need to consider, regardless of whether they collected and/or generated the original data or not.
 
1. Data Reference Syntax

The first thing to define is your Data Reference Syntax (DRS) – a convention for naming your files. As an example, let’s look at a file from the data archive managed by Australia’s Integrated Marine Observing System (IMOS).

.../thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz

That’s a lot of information to take in, so let’s focus on the structure of the file directory first:

.../thredds/dodsC/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/

From this we can deduce, without even inspecting the contents of the file, that we have data from the IMOS project that is run by the eMarine Information Infrastructure (eMII). It was collected in 2012 at the Turquoise Coast, Western Australia (TURQ) site of the Australian Coastal Ocean Radar Network (ACORN), which is a network of high frequency radars that measure the ocean surface current. The data type has a sub-DRS of its own, which tells us that the data represents the 1-hourly average surface current for a single month (October 2012), and that it is archived on a regularly spaced spatial grid and has not been quality controlled. The file is located in the “demos” directory, as it has been generated for the purpose of providing an example for users at the very helpful Australian Ocean Data Network user code library.

Just in case the file gets separated from this informative directory structure, much of the information is repeated in the file name itself, along with some more detailed information about the start and end time of the data, and the last time the file was modified:

<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz

In the first instance this level of detail seems like a bit of overkill, but consider the scope of the IMOS data archive. It is the final resting place for data collected by the entire national array of oceanographic observing equipment in Australia, which monitors the open oceans and coastal marine environment covering physical, chemical and biological variables. Since the data are so well labelled, locating all monthly timescale ACORN data from the Turquoise Coast and Rottnest Shelf sites (which represents hundreds of files) would be as simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc

While it’s unlikely that your research will ever involve cataloging data from such a large observational network, it’s still a very good idea to develop your own personal DRS for the data you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures, as these can be very hard to change later on. The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
 
2. Data provenance

In defining my own DRS, I added some extra fields to cater for the intermediary files that typically get created throughout the data analysis process. For instance, I added a field to indicate the temporal aspects of the data (e.g. to indicate if the data are an anomaly relative to some base period) and another for the spatial aspects (e.g. to indicate whether the data have been re-gridded). While keeping track of this information via the DRS is a nice thing to do (it definitely helps with bash wildcards and visual identification of files), more detailed information needs to be recorded for the data to be truly reproducible. A good approach to recording such information is the procedure followed by the Climate Data Operators (CDO) and NetCDF Operators (NCO). Whenever an NCO or CDO utility (e.g. ncks, ncatted, cdo mergetime) is executed at the command line, a time stamp followed by a copy of the command line entry is automatically appended to the global attributes of the output netCDF file, thus maintaining a complete history of the data processing steps. Here’s an example:

Tue Jun 30 07:35:49 2015: cdo runmean,30 va_ERAInterim_500hPa_daily_native.nc va_ERAInterim_500hPa_030day-runmean_native.nc

You might be thinking, “this is all well and good, but what about data processing steps that don’t use NCO, CDO or even netCDF files?” It turns out that if you write a script (e.g. in Python, R or whatever language you’re using) that can be executed from the command line, then it only takes an extra few lines of code to parse the associated command line entry and append that information to the global attributes of a netCDF file (or a corresponding metadata text file if dealing with file formats that don’t carry their metadata with them). To learn how to do this using Python, check out the Software Carpentry lesson on Data Management in the Ocean, Weather and Climate Sciences.
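
Here’s a simplified sketch of that idea for the netCDF case (it’s not the exact code from the lesson), using the netCDF4 library:

import sys
import datetime
from netCDF4 import Dataset

def update_history(fname):
    """Prepend a timestamped record of the current command line entry
    to the history attribute of a netCDF file."""
    time_stamp = datetime.datetime.now().strftime('%a %b %d %H:%M:%S %Y')
    entry = '{}: {}'.format(time_stamp, ' '.join(sys.argv))
    dset = Dataset(fname, 'a')
    old_history = getattr(dset, 'history', '')
    dset.history = entry + '\n' + old_history if old_history else entry
    dset.close()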
 
3. Backup

Once you’ve defined your DRS and have implemented the NCO/CDO approach to data provenance, the final thing to think about is backing up your data. This is something I’ve discussed in detail in a previous post, but the crux of the story is that if your starting point files (i.e. the data files required at the very first step of your data processing) can be easily downloaded (e.g. reanalysis or CMIP5 data), then you probably don’t need your local copy to be backed up. All of your code should be version controlled and backed up via an external hosting service like GitHub and Bitbucket, so you can simply re-download the data and re-run your analysis scripts if disaster strikes. If you generated your starting point files from scratch on the other hand (e.g. you collected weather observations or ran a model that would take months to re-run), then backup is absolutely critical and would be part of your data management plan.

 

June 3, 2015 / Damien Irving

The CWSLab workflow tool: an experiment in community code development

Give anyone working in the climate sciences half a chance and they’ll chew your ear off about CMIP5. It’s the largest climate modelling project ever conducted and formed the basis for much of the IPCC Fifth Assessment Report, so everyone has an opinion on which are the best models, the level of confidence we should attach to projections derived from the models, etc, etc. What they probably won’t tell you about is the profound impact that CMIP5 has had on climate data processing and management. In the lead up to CMIP5 (2010/11), I was working at CSIRO in a support scientist role. When I think back on that time, I refer to it as The Great Data Duplication Panic. In analysing output from CMIP3 and earlier modelling projects, scientists simply downloaded data onto their local server (or even personal computer) and did their own analysis in isolation. At the CSIRO Aspendale campus alone there must have been a dozen copies of the CMIP3 dataset floating around. Given its sheer size (~3 petabytes!), we recognised very quickly that this kind of data duplication just wasn’t going to fly for CMIP5.

Support scientists at CSIRO and the Bureau of Meteorology were particularly panicked about two types of data duplication: download duplication (i.e. duplication of the original dataset) and processing duplication (e.g. duplication of similarly processed data such as a common horizontal regridding or extraction of the Australian region). It was out of this panic that the Climate and Weather Science Laboratory (CWSLab) was born (although it wasn’t called that back then).

Download duplication

The download duplication problem has essentially been addressed by two major components of the CWSLab project. The NCI data library stores a variety of local and international climate and weather datasets (including CMIP5), while the NCI computational infrastructure is built directly on top of that library so you can do your data processing in situ (i.e. as opposed to downloading the data to your own machine). The computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have the CWS Virtual Desktops. These virtual desktops have more grunt than your personal laptop or desktop (4 CPUs, 20 GB RAM, 66 GB storage) and were deemed the best way to provide scientists with remote access to data exploration tools like MATLAB and UV-CDAT that involve a graphical user interface.

While solving the download duplication problem has been a fantastic achievement, it was aided by the fact that the solution didn’t require everyday climate scientists to change their behaviour in any appreciable way. They simply login to a machine at NCI rather than their local server and proceed with their data analysis as per normal. The processing duplication problem on the other hand will require a change in behaviour and may therefore be more difficult to solve…

Processing duplication

The CWSLab answer to the processing duplication problem is the CWSLab workflow tool, which can be run from the CWS Virtual Desktop. The tool is a plugin/add-on to the VisTrails workflow and provenance management system (see this previous post for a detailed discussion of workflow automation) and allows you to build, run and capture metadata for analyses involving the execution of multiple command line programs (see this example Nino 3.4 workflow). The code associated with the VisTrails plugin is hosted in three separate public GitHub repositories:

  • cwsl-ctools: A collection of command line programs used in performing common climate data analysis tasks. The programs can be written in any programming language, they just have to be able to parse the command line.
  • cwsl-mas: The source code for the plugin. In essence, it contains a wrapper for each of the command line programs in the cwsl-ctools repo which tells VisTrails how to execute that program.
  • cwsl-workflows: A collection of example workflows that use the VisTrails plugin.

The CWSLab workflow tool writes output files using a standardised data reference syntax, which is how it’s able to solve the processing duplication problem. For instance, if someone has already regridded the ACCESS1-0 model to a 1 by 1 degree global grid, the system will be able to find that file rather than re-creating/duplicating it.

A community model for help and code development

Unlike the NCI infrastructure and data library which have dedicated staff, the group of support scientists behind the VisTrails plugin have very small and infrequent time allocations on the project. This means that if the workflow tool is to succeed in the long run, all scientists who do climate model analysis at NCI will need to pitch in on both code development and requests for help.

Fortunately, GitHub is perfectly set up to accommodate both tasks. Scientists can “fork” a copy of the cwsl code repositories to their own GitHub account, make any changes to the code that they’d like to see implemented (e.g. a new script for performing linear regression), and then submit a “pull request” to the central cwsl repo. The community can then view the proposed changes and discuss them before finally accepting or rejecting. Similarly, instead of a help desk, requests for assistance are posted to the cwsl-mas chat room on Gitter. These rooms are a new feature associated with GitHub code repositories that are specifically designed for chatting about code. People post questions, and anyone in the community who knows the answer can post a reply. If the question is too long/complex for the chat room, it can be posted as an issue on the relevant GitHub repo for further community discussion.

Multiple birds with one stone

By adopting a community approach, the workflow tool addresses a number of other issues besides data duplication.

  • Code review. Software developers review each other’s code all the time, but scientists never do. The Mozilla Science Lab have now run two iterations of their Code Review for Scientists project to figure out when and how scientific code should be reviewed, and their findings are pretty clear. Code review at the completion of a project (e.g. when you submit a paper to a journal) is fairly pointless, because the reviewer hasn’t been intimately involved in the code development process (i.e. they can make cosmetic suggestions but nothing of substance). Instead, code review needs to happen throughout a scientific research project. The pull request system used by the CWSLab workflow tool allows for this kind of ongoing review.
  • Code duplication. Any scientist that is new to climate model data analysis has to spend a few weeks (at least) writing code to do basic data input/output and processing. The cwsl-ctools repo means they no longer need to reinvent the wheel – they have access to a high quality (i.e. lots of people have reviewed it) code repository for all those common and mundane data analysis tasks.
  • Reproducible research. The reproducibility crisis in computational research has been a topic of conversation in the editorial pages of Nature and Science for a number of years now; however, very few papers in today’s climate science journals include sufficient documentation (i.e. details of the software and code used) for readers to reproduce key results. The CWSLab workflow tool automatically captures detailed metadata about a given workflow (right down to the precise version of the code that was executed; see here for details) and therefore makes the generation of such documentation easy.

Conclusion

The CWSLab workflow tool is an ambitious and progressive initiative that will require a shift in the status quo if it is to succeed. Researchers will need to overcome the urge to develop code in isolation and the embarrassment associated with sharing their code. They’ll also have to learn new skills like version control with git and GitHub and how to write scripts that can parse the command line. These things are not impossible (e.g. Software Carpentry teaches command line programs and version control in a single afternoon) and the benefits are clear, so here’s hoping it takes off!

April 6, 2015 / Damien Irving

Workflow automation

In previous posts (e.g. What’s in your bag?) I’ve discussed the various tools I use for data analysis. I use NCO for making simple edits to the attributes of netCDF files, CDO for routine calculations on netCDF files and a whole range of Python libraries for doing more complicated analysis and visualisation. In years gone by I’ve also included NCL and Fortran in the mix. Such diversity is pretty common (i.e. almost nobody uses a single programming language or tool for all their analysis) so this post is my attempt at an overview of workflow automation. In other words, how should one go about tying together the various tools they use to produce a coherent, repeatable data analysis pipeline?

The first thing to note is that the community has not converged on a single best method for workflow automation. Instead, there appear to be three broad options depending on the complexity of your workflow and the details of the tools involved:

  1. Write programs that act like any other command line tool and then combine them with a shell script or build manager
  2. Use an off-the-shelf workflow management system
  3. Write down the processing steps in a lab notebook and re-execute them manually

Let’s consider these approaches one by one:

 

1. Command line

Despite the fact that its commands are infuriatingly terse and cryptic, the Unix shell has been around longer than most of its users have been alive. It has survived so long because of the ease with which (a) repetitive tasks can be automated and (b) existing programs can be combined in new ways. Given that NCO and CDO are command line tools (i.e. you’re probably going to be using the command line anyway), it could be argued that the command line is the most natural home for workflow automation in the weather and climate sciences. For instance, in order to integrate my Python scripts with the rest of my workflow, I use the argparse library to make those scripts act like any other command line program. They can be executed from the command line, ingest arguments and options from the command line, combine themselves with other command line programs via pipes and filters, and output help information just like a regular command line program.
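
As a sketch of what that looks like for a hypothetical calc_streamfunction.py (the argument names are made up for illustration):

import argparse

parser = argparse.ArgumentParser(description='Calculate the streamfunction from the zonal and meridional wind.')
parser.add_argument('uwind_file', help='Input zonal wind netCDF file')
parser.add_argument('vwind_file', help='Input meridional wind netCDF file')
parser.add_argument('outfile', help='Output streamfunction netCDF file')
args = parser.parse_args()

# ... read the wind data, calculate the streamfunction, write args.outfile ...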

Armed with my collection of command line-native Python scripts, the easiest way to link multiple processing steps is to store them in a shell script. For instance, I could execute the following hypothetical workflow by storing all the steps (i.e. command line entries) in a shell script called run-streamfunction.sh.

  1. Edit the “units” attribute of the original zonal and meridional wind data netCDF files (NCO)
  2. Calculate the streamfunction from the zonal and meridional wind (calc_streamfunction.py)
  3. Calculate the streamfunction anomaly by subtracting the climatological mean at each timestep (CDO)
  4. Apply a 30 day running mean to the streamfunction anomaly data (CDO)
  5. Plot the average streamfunction anomaly for a time period of interest (plot_streamfunction.py)

This would be a perfectly valid approach if I was dealing with a small dataset, but let’s say I wanted to process 6 hourly data from the JRA-55 reanalysis dataset over the period 1958-2014 for the entire globe. The calc_streamfunction.py script I wrote would take days to run on the server in my department in this situation, so I’d rather not execute every single step in run-streamfunction.sh every time I change the time period used for the final plot. What I need is a build manager – a smarter version of run-streamfunction that can figure out whether previous steps have already been executed and if they need to be updated.

The most widely used build manager on Unix and its derivatives is called Make. Like the Unix shell it is old, cryptic and idiosyncratic, but it’s also fast, free and well-documented, which means it has stood the test of time. I started using Make to manage my workflows about a year ago and it has revolutionised the way I work. I like it because of the documentation and also the fact that it’s available no matter what machine I’m on; however, there are other options (e.g. doit, makeflow, snakemake, ruffus) if you’d like something a little less cryptic.
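
To give a flavour of the less cryptic options, here’s what one (hypothetical) step of the streamfunction pipeline might look like as a doit task; the action is only re-run if the input files are newer than the target:

# dodo.py (execute by typing "doit" at the command line)
def task_streamfunction():
    """Calculate the streamfunction from the edited wind files."""
    return {
        'file_dep': ['ua_edited.nc', 'va_edited.nc'],
        'targets': ['streamfunction.nc'],
        'actions': ['python calc_streamfunction.py ua_edited.nc va_edited.nc streamfunction.nc'],
    }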

To learn how to apply the command line approach to your own workflow automation, check out the Software Carpentry lessons on the Unix shell, Make and data management.

 

2. Workflow management systems

The command line discussion above suggests the use of shell scripts for automating small, simple data processing pipelines, and build managers like Make and doit for pipelines that are either slightly more complicated or have steps that you’d rather not repeat unnecessarily (e.g. steps that take many hours/days to run). For many weather and climate scientists (myself included), this is as far as you’ll need to go. Make and doit have all the functionality you’ll ever really need for automatically executing a data analysis pipeline, and by following the process laid out in the data management lesson linked to above you’ll be able to document that pipeline (i.e. produce a record of the provenance of your data).

But what if you’re working on a project that is too big and complicated for a simple Makefile or two? The management of uber-complex workflows such as those associated with running coupled climate models or processing the whole CMIP5 data archive can benefit greatly from specialised workflow management systems like VisTrails, pyRDM, Sumatra or Pegasus. These systems can do things like manage resource allocation for parallel computing tasks, execute steps that aren’t run from the command line, automatically publish data to a repository like Figshare and produce nice flowcharts and web interfaces to visualise the entire workflow.

I’ve never used one of these systems, so I’d love to hear from anyone who has. In particular, I’m curious to know whether such tools could be used for smaller/simpler workflows, or whether the overhead associated with setting up and learning the system cancels out any benefit over simpler options like Make and doit.

 

3. The semi-manual approach

While writing command line programs is a relatively simple and natural thing to do in Python, it’s not the type of workflow that is typically adopted by users of more self-contained environments like MATLAB and IDL. From my limited experience/exposure, it appears that users of these environments tend not to link the data processing that happens within MATLAB or IDL with processing that happens outside of it. For instance, they might pre-process their data using NCO or CDO at the command line, before feeding the resulting data files into MATLAB or IDL to perform additional data processing and visualisation. This break in the workflow implies that some manual intervention is required to check whether previous processing steps need to be executed and to initiate the next step in the process (i.e. something that Make or doit would do automatically). Manual checking and initiation is not particularly problematic for smaller pipelines, but can be error prone (and time consuming) as workflows get larger and more complex.

Since I’m by no means an expert in MATLAB or IDL, I’d love to hear how regular users of those tools manage their workflows.