May 8, 2017 / Damien Irving

A vision for CMIP6 in Australia

Most climate researchers would be well aware that phase 6 of the Coupled Model Intercomparison Project (CMIP6) is now underway. The experiments have been designed, the modelling groups are gearing up to run them, and data should begin to come online sometime next year (see this special issue of Geoscientific Model Development for project details). As is always the case with a new iteration of CMIP, this one is going to be bigger and better than the last. By better I mean cooler experiments and improved model documentation (via the shiny new Earth System Documentation website), and by bigger I mean more data. At around 3 Petabytes in total size, CMIP5 was already so big that it was impractical for most individual research institutions to host their own copy. In Australia, the major climate research institutions (e.g. Bureau of Meteorology, CSIRO, ARC Centre of Excellence for Climate System Science – ARCCSS) got around this problem by enlisting the help of the National Computational Infrastructure (NCI) in Canberra. A similar arrangement is currently being planned for CMIP6, so I wanted to share my views (as someone who has spent a large part of the last decade wrangling CMIP3 and CMIP5 data) on what is required to help Australian climate researchers analyse that data with a minimum of fuss.

(Note: In focusing solely on researcher-related issues, I’m obviously ignoring vitally important issues around data storage, funding and so on. Assuming all that gets sorted, this post looks at how the researcher experience might be improved.)

 

1. A place to analyse the data

In addition to its sheer size, it’s important to note that the CMIP6 dataset will be in flux for many years as modelling groups begin to contribute data (and then revise and re-issue erroneous data) from 2018 onwards. For both these reasons, it’s not practical for individual researchers and/or institutions to be creating their own duplicate copies of the dataset. Recognising this issue (which is not unique to the CMIP projects), NCI have built a whole computational infrastructure directly on top of their data library, so that researchers can do their data processing without having to copy/move data anywhere. This computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have their Virtual Desktop Infrastructure. These virtual desktops have more grunt than your personal laptop or desktop computer (4 CPUs, 20 GB RAM, 66 GB storage) and come with a whole bunch of data exploration tools pre-installed. Better still, they are isolated from the rest of the system in the sense that unlike when you’re working on Raijin (or any other shared supercomputer), you don’t have to submit processes that will take longer than 15 or so minutes to the queuing system. I’ve found the virtual desktops to be ideal for analysing CMIP5 data (I do all my CMIP5 data analysis on them, including large full-depth ocean data processing) and can’t see any reason why they wouldn’t be equally suitable for CMIP6.

 

2. A way to locate and download data

Once you’ve logged into a virtual desktop, you need to be able to (a) locate the CMIP data of interest that’s already been downloaded to the NCI data library, and (b) find out if there’s data of interest available elsewhere on the international Earth System Grid. In the case of CMIP5, Paola Petrelli (with help from the rest of the Computational Modelling Support team at the ARCCSS) has developed an excellent package called ARCCSSive that does both these things. For data located elsewhere on the grid, it also gives you the option of automatically sending a request to Paola for the data to be downloaded to the NCI data library. (They also have a great help channel on Slack if you get stuck and have questions.)

Developing and maintaining a package like ARCCSSive is no trivial task, particularly as the Earth System Grid Federation (ESGF) continually shift the goalposts by tweaking and changing the way the data is made available. In my opinion, one of the highest priority tasks for CMIP6 would be to develop and maintain an ARCCSSive-like tool that researchers can use for data lookup and download requests.

 

3. A way to systematically report and handle errors in the data

Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community-agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks, and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it will be many months (if ever) before a file gets corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.

The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the farthest we got was a wiki that is not widely used. (Paola has also added warning/errata functionality to the ARCCSSive package so that users can filter out bad data.)

In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and, for simple cases, submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that’s available in the NCI data library).

 

4. Community maintained code for common tasks

Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Nino 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try and tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous post. I still love the idea in principle (and commend the BoM and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. The broader climate community is still coming to grips with the idea of managing its personal code with a version control system; it’s a pretty big leap to utilising and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first, and once the culture of code sharing and community contribution is established the CWSLab workflow tool could be revisited.

 

In summary, as we look towards CMIP6 in Australia, here’s how things stand from the perspective of a scientist who’s been wrangling CMIP data for years:

  1. The NCI virtual desktops are ready to go and fit for purpose.
  2. The ARCCSS software for locating and downloading CMIP5 data is fantastic. Developing and maintaining a similar tool for CMIP6 should be a high priority.
  3. The ESGF (or, failing that, the NCI) could lead a community-wide effort to identify and fix bogus CMIP data files.
  4. A community-maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time.
April 11, 2017 / Damien Irving

Attention scientists: Frustrated with politics? Pick a party and get involved.

The March for Science is coming up on 22 April, so I’m taking a quick detour from my regular focus on research best practice. I’ve been invited to speak at the march in Hobart, Australia, so I thought I’d share what I’m going to say…

In today’s world of alternative facts and hyper-partisan public debate, there are growing calls for scientists to get involved in politics. This might take the form of speaking out on your area of expertise, participating in a non-partisan advocacy group and/or getting involved with a political party. If you think the latter sounds like the least attractive option of the three, you’re not alone. Membership of political parties has been in decline for years, to the point where many sporting clubs have more members. While this might sound like a good reason not to join a political party, I’ve found that it means your involvement can have a bigger impact than ever before.

A little over twelve months ago, I moved to Hobart to take up a postdoctoral fellowship. As part of a new start in a new town, I decided to get actively involved with the Tasmanian Greens. Fast forward a year and I’m now the Convenor of the Denison Branch of the Party. Bob Brown (the father of the environment movement in Australia) started his political career as a Member for Denison in the Tasmanian Parliament and our current representative (Cassy O’Connor MP) is the leader of the Tasmanian Greens, so it’s been a fascinating and humbling experience so far.

Upon taking the plunge into politics, the first thing that struck me was the overwhelming reliance on volunteers. The Tasmanian Greens have very few staff, which means there is an infinite number of ways for volunteers to get involved. If your motivation lies in changing party policy in your area of expertise, you can take a lead role in re-writing that policy and campaigning for the support of the membership. If you’re happy with party policy and want to help achieve outcomes, your professional skills can definitely be put to good use. My data science skills have been in particularly high demand, and I’m now busily involved in managing our database of members and supporters. Besides this practical contribution, the experience has also been great for my mental wellbeing. Rather than simply despair at the current state of politics (which most often means ranting to like-minded friends and followers on social media), I now have an outlet for actively improving the situation.

If you’re a scientist (or simply someone who cares about the importance of knowledge, evidence and objectivity in the political process) and aren’t currently involved with a political party, I’d highly recommend giving it a go. Any party would benefit from the unique knowledge and skills you bring to the table. As with most volunteer experiences, you’ll also get out a whole lot more than you put in.

There are going to be over 400 marches around the world, so check the map and get along to the one nearest you (or better still, contact the organiser and offer to speak).

February 15, 2017 / Damien Irving

The research police

You know who I’m talking about. I’m sure every research community has them. Those annoying do-gooders who constantly advocate for things to be done the right way. When you’re trying to take a shortcut, it’s their nagging voice in the back of your mind. You appreciate that what they’re saying is important, but with so much work and so little time, you don’t always want to hear it. Since I’m fond of creating lists on this blog, here’s my research police of the weather, ocean and climate sciences:

 

Statistics

Dan Wilks is widely regarded as a statistics guru in the atmospheric sciences. He is the author of the most clearly written statistics textbook I’ve ever come across, as well as great articles such as this recent essay in BAMS, which is sure to make you feel bad if you’ve ever plotted significance stippling.

 

Data visualisation

Ed Hawkins’ climate spiral visualisation received worldwide media coverage in 2016 (and even featured in the opening ceremony of the Rio Olympics). He makes the list of research police due to his end the rainbow campaign, which advocates for the use of more appropriate colour scales in climate science.

 

Communication

David Schultz is the Chief Editor of Monthly Weather Review and has authored well over 100 research articles, but is probably best known as the “Eloquent Science guy.” His book and blog are a must read for anyone wanting to improve their academic writing, reviewing and speaking.

 

Programming

Unfortunately I’m going to have to self-nominate here, as I’ve been a strong advocate for publishing reproducible computational results for a number of years now (see related post and BAMS essay). To help researchers do this, I’ve taught at over 20 Software Carpentry workshops and I’m the lead author of their climate-specific lesson materials.

 

If I’ve missed any other research police, please let me know in the comments!

January 11, 2017 / Damien Irving

Need help with reproducible research? These organisations have got you covered.

The reproducibility crisis in modern research is a multi-faceted problem. If you’re working in the life sciences, for instance, experimental design and poor statistical power are big issues. For the weather, ocean and climate sciences, the big issue is code and software availability. We don’t document the details of the code and software used to analyse and visualise our data, which means it’s impossible to interrogate our methods and reproduce our results.

(For the purposes of this post, research “software” is something that has been packaged and released for use by the wider community, whereas research “code” is something written just for personal use. For instance, I might have written some code to perform and plot an EOF analysis, which calls and executes functions from the eofs software package that is maintained by Andrew Dawson at Oxford University.)

Unbeknown to most weather, ocean and climate scientists, there are a number of groups out there that want to help you make your work more reproducible. Here’s a list of the key players and what they’re up to…

 

Software Sustainability Institute (SSI)

The SSI is the go-to organisation for people who write and maintain scientific software. They provide training and support, advocate for formal career paths for scientific software developers and manage the Journal of Open Research Software, where you can publish the details of your software so that people can cite your work. They focus mainly on researchers in the UK, so it’s my hope that organisations like SSI will start popping up in other countries around the world.

 

OntoSoft

The OntoSoft project in the US has a bit of overlap with the SSI (e.g. they’re working on “software commons” infrastructure where people can submit their geoscientific software so that it can be searched and discovered by others), but in addition their Geoscientific Paper of the Future (GPF) initiative has been looking at the broader issue of how researchers should go about publishing the details of the digital aspects of their research (i.e. data, code, software and provenance/workflow). In a special GPF issue of Earth and Space Science, researchers from a variety of geoscience disciplines share their experiences in trying to document their digital research methods. The lead paper from that issue gives a fantastic overview of the options available to researchers. (My own work in this area gives a slightly more practical overview but in general covers many of the same ideas.)

 

Software Carpentry

The global network of volunteer Software Carpentry instructors run hundreds of two-day workshops around the world each year, teaching the skills needed to write reusable, testable and ultimately reproducible code (i.e. to do the things suggested by the GPF). Their teaching materials have been developed and refined for more than a decade and every instructor undergoes formal training, which means you won’t find a better learning experience anywhere. To get a workshop happening at your own institution, you simply need to submit a request at their website. They’ll then assist with finding local instructors and all the other logistics that go along with running a workshop. A sibling organisation called Data Carpentry has recently been launched, so it’s also worth checking to see if their more discipline-specific, data-centric lessons would be a better fit.

 

Mozilla Science Lab

Once you’ve walked out of a two-day Software Carpentry workshop, it can be hard to find ongoing support for your coding. The best form of support usually comes from an engaged and well connected local community, so the Mozilla Science Lab assists researchers in forming and maintaining in-person study groups. If there isn’t already a study group in your area, I’d highly recommend their study group handbook. It has a bunch of useful advice and resources for getting one started, plus they periodically run online orientation courses to go through the handbook content in detail.

 

Hopefully one or more of those organisations will be useful in your attempts to make your work more reproducible – please let me know in the comments if there are other groups/resources that I’ve missed!

 

October 4, 2016 / Damien Irving

The weather/climate Python stack

It would be an understatement to say that Python has exploded onto the data science scene in recent years. PyCon and SciPy conferences are held somewhere in the world every few months now, at which loads of new and/or improved data science libraries are showcased to the community. When the videos from these conferences are made available online (which is almost immediately at pyvideo.org), I’m always filled with a mixture of joy and dread. The ongoing rapid development of new libraries means that data scientists are (hopefully) continually able to do more and more cool things with less and less time and effort, but at the same time it can be difficult to figure out how they all relate to one another. To assist in making sense of this constantly changing landscape, this post summarises the current state of the weather and climate Python software “stack” (i.e. the collection of libraries used for data analysis and visualisation). My focus is on libraries that are widely used and that have good (and likely long-term) support, but I’m happy to hear of others that you think I might have missed!

[Figure: The weather/climate Python stack.]

 

Core

The dashed box in the diagram represents the core of the stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has a bunch of built in functions for reading and writing common data formats like .csv, but if your data is stored in netCDF format then the default library for getting data into/out of those files is netCDF4.
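
To make that concrete, here's a minimal sketch (the file and variable names are placeholders) of pulling a variable out of a netCDF file with netCDF4 and handing it to NumPy:

import numpy as np
from netCDF4 import Dataset

# Open a hypothetical netCDF file and extract a variable as a (masked) NumPy array
dset = Dataset('tas_example.nc')
tas = dset.variables['tas'][:]

print(tas.shape)
print(np.max(tas))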

Once you’ve read your data in, you’re probably going to want to do some statistical analysis. The NumPy library has some built in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default.
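
For example, using a made-up temperature time series, NumPy happily gives you a mean and standard deviation, but a linear trend is a job for SciPy:

import numpy as np
from scipy import stats

# A made-up series of annual mean temperatures
years = np.arange(1979, 2017)
temps = 14.0 + 0.02 * (years - 1979) + np.random.normal(0, 0.1, years.size)

print(temps.mean(), temps.std())  # simple statistics: NumPy
slope, intercept, r_value, p_value, std_err = stats.linregress(years, temps)  # linear regression: SciPy
print(slope, p_value)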

The NumPy library doesn’t come with any plotting capability, so if you want to visualise your NumPy data arrays then the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for simple (e.g. bar charts, contour plots, line graphs), static (e.g. .png, .eps, .pdf) plots. The cartopy library provides additional functionality for common map projections, while Bokeh allows for the creation of interactive plots where you can zoom and scroll.
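
To give a flavour of how those pieces fit together, here's a minimal sketch of a static global map (the data array is just random numbers standing in for a real field):

import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# A random field standing in for real gridded data
lons = np.linspace(0, 360, 73)
lats = np.linspace(-90, 90, 37)
data = np.random.rand(lats.size, lons.size)

ax = plt.axes(projection=ccrs.PlateCarree())  # cartopy provides the map projection
ax.contourf(lons, lats, data, transform=ccrs.PlateCarree())
ax.coastlines()
plt.savefig('example_map.png')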

While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible – they can’t do everything like the core stack can – but they can do common tasks with far less effort…

 

Generic additions

Let’s first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data[‘Fred’, ‘height’], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
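
As a quick toy illustration (in practice the label-based lookup goes through the .loc indexer):

import pandas as pd

# A toy medical dataset: rows are people, columns are characteristics
data = pd.DataFrame({'height': [178, 165], 'weight': [75, 62]},
                    index=['Fred', 'Mary'])

print(data.loc['Fred', 'height'])  # select by label rather than numeric index
print(data['height'].mean())       # simple statistics come built in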

One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
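
Here's a rough sketch of that labelled selection with xarray (the file, variable and coordinate names are placeholders):

import xarray as xr

dset = xr.open_dataset('ua_example.nc')  # hypothetical file of zonal wind data
ua = dset['ua']

# Select by real-world coordinate values rather than numeric indexes
profile = ua.sel(lat=-20, lon=50, method='nearest')  # nearest grid point to 20S, 50E
april = ua.sel(time='2015-04')                       # all time steps in April 2015
print(april.mean(dim='time'))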

 

Discipline-specific additions

While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the MetOffice have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.
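
A minimal Iris sketch might look something like this (the file name and standard name are placeholders, and the file needs to be reasonably CF compliant, as discussed below):

import iris
import iris.analysis

# Load a hypothetical CF compliant file as an Iris cube
cube = iris.load_cube('tas_example.nc', 'air_temperature')
print(cube)

# Weather/climate-flavoured operations, e.g. collapsing over latitude and longitude
mean_tas = cube.collapsed(['latitude', 'longitude'], iris.analysis.MEAN)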

In terms of choosing between xarray and Iris, some people like the slightly more weather/climate-centric experience offered by Iris, while others don’t like the restrictions that places on their work and prefer the generic xarray experience (e.g. to use Iris your netCDF data files have to be CF compliant or close to it). Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.

 

Simplifying data exploration

While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), making minor tweaks to a plot or iterating over multiple time steps is still rather cumbersome. In an attempt to overcome this issue, a library called HoloViews was recently released. By using matplotlib and Bokeh under the hood, it allows for the generation of static or interactive plots where tweaking and iterating are super easy (especially in the Jupyter Notebook, which is where more and more people are doing their data exploration these days). Since HoloViews doesn’t have support for geographic plots, GeoViews has been created on top of it (which incorporates cartopy and can handle Iris or xarray data arrays).
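
As a rough sketch of the kind of iteration HoloViews makes easy (the exact API may differ between versions, so treat this as illustrative only), you can wrap a set of time steps in a HoloMap and get a slider over them for free in the Jupyter Notebook:

import numpy as np
import holoviews as hv
hv.extension('bokeh')  # render via Bokeh for interactivity

# Random fields standing in for successive time steps of real data
fields = {t: hv.Image(np.random.rand(37, 73)) for t in range(10)}
hmap = hv.HoloMap(fields, kdims='time')  # in a notebook, this displays with a time slider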

 

Sub-discipline-specific libraries

So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Continuum Analytics supports conda, Bokeh and HoloViews/GeoViews), large institutions (e.g. the MetOffice supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of weather and climate science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries (which are able to handle data arrays from NumPy, Iris or xarray). Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy. It would be impossible to list all the sub-discipline-specific libraries in this post, but the PyAOS community is an excellent resource if you’re trying to find out what’s available in your area of research.
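
For example, a rough sketch of an EOF analysis with the eofs library (using a random array in place of real sea surface temperature anomalies) might look like this:

import numpy as np
from eofs.standard import Eof

# A random (time, lat, lon) array standing in for SST anomalies
sst = np.random.rand(120, 30, 60)

solver = Eof(sst)
eof1 = solver.eofs(neofs=1)  # leading spatial pattern
pc1 = solver.pcs(npcs=1)     # associated principal component time series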

 

Installing the stack

While the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific, number-crunching libraries are written (at least partly) in faster languages like C, because speed is important when data arrays get really large. Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite scientific Python libraries working together used to be problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released a package manager called conda, which is able to handle non-Python dependencies. The documentation for almost all modern scientific Python packages will suggest that you use conda for installation.

 

Navigating the stack

All of the additional libraries discussed in this post essentially exist to hide the complexity of the core libraries (in software engineering this is known as abstraction). Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib. GeoViews was built to hide some of the complexity of Iris, cartopy and Bokeh. So if you want to start exploring your data, start at the top right of the stack and work your way down and left as required. If GeoViews doesn’t have quite the right functions for a particular plot that you want to create, drop down a level and use some Iris and cartopy functions. If Iris doesn’t have any functions for a statistical procedure that you want to apply, go back down another level and use SciPy. By starting at the top right and working your way back, you’ll ensure that you never re-invent the wheel. Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4) for extracting the metadata contained within a netCDF file, for instance, only to find that Iris automatically keeps this information upon reading a file. In this way, a solid working knowledge of the scientific Python stack can save you a lot of time and effort.

 

June 16, 2016 / Damien Irving

How to write a reproducible paper

As mentioned in a previous call for volunteers, I dedicated part of my PhD to proposing a solution to the reproducibility crisis in modern computational research. In a nutshell, the crisis has arisen because most papers do not make the data and code underpinning their key findings available, which means it is impossible to replicate and verify the results. A good amount of progress has been made with respect to documenting and publishing data in recent years, so I specifically focused on software/code. I looked at many aspects of the issue including the reasons why people don’t publish their code, computational best practices and journal publishing standards, much of which is covered in an essay I published with the Bulletin of the American Meteorological Society. That essay is an interesting read if you’ve got the time (in my humble opinion!), but for this post I wanted to cut to the chase and outline how one might go about writing a reproducible paper.

On the surface, the reproducible papers I wrote as part of my PhD (i.e. as a kind of proof of concept; see here and here) look similar to any other paper. The only difference is a short computation section placed within the traditional methods section of the paper. That computation section begins with a brief, high-level summary of the major software packages that were used, with citations provided to any papers dedicated to documenting that software. Authors of scientific software are increasingly publishing overviews of their software in journals like the Journal of Open Research Software and Journal of Open Source Software, so it’s important to give them the academic credit they deserve.

Following this high level summary, the computation section points the reader to three key supplementary items:

  1. A more detailed description of the software used
  2. A copy of any code written by the authors to produce the key results
  3. A description of the data processing steps taken in producing each key result (i.e. a step-by-step account of how the software and code were actually used)

I’ll look at each of these components in turn, considering both the bare minimum you’d need to do in order to be reproducible and the extra steps you could take to make things easier for the reader.

 

1. Software description

While the broad software overview provided in the computation section is a great way to give academic credit to those who write scientific software, it doesn’t provide sufficient detail to recreate the software environment used in the study. In order to provide this level of detail, the bare minimum you’d need to do is follow the advice of the Software Sustainability Institute. They suggest documenting the name, version number, release date, institution and DOI or URL of each software package, which could be included in a supplementary text file.

While such a list means your environment is now technically reproducible, you’ve left it up to the reader to figure out how to get all those software packages and libraries installed and playing together nicely. In some cases this is fine (e.g. it might be easy enough for a reader to install the handful of MATLAB toolboxes you used), but in other cases you might want to save the reader (and your future self) the pain of software installation by making use of a tool that can automatically install a specified software environment. The simplest of these is conda, which I discussed in detail in a previous post. It is primarily used for the management of Python packages, but can be used for other software as well. I install my complete environment with conda, which includes non-Python command line utilities like the Climate Data Operators, and then make that environment openly available on my channel at anaconda.org. Beyond conda there are more complex tools like Docker and Nix, which can literally install your entire environment (down to the precise operating system) on a different machine. There’s lots of debate (e.g. here) about the potential and suitability of these tools as a solution to reproducible research, but it’s fair to say that their complexity puts them out of reach for most weather and climate scientists.

 

2. Code

The next supplementary item you’ll need to provide is a copy of the code you wrote to execute those software packages. For a very simple analysis that might consist of a single script for each key result (e.g. each figure), but it’s more likely to consist of a whole library/collection of code containing many interconnected scripts. The bare minimum you’d need to do to make your paper reproducible is to make an instantaneous snapshot of that library (i.e. at the time of paper submission or acceptance) available as supplementary material.

As with the software description, this bare minimum ensures your paper is reproducible, but it leaves a few problems for both you and the reader. The first is that in order to provide an instantaneous snapshot, you’d need to make sure that all your results were produced with the latest version of your code library. In many cases this isn’t practical (e.g. Figure 3 might have been generated five months ago and you don’t want to re-run the whole time consuming process), so you’ll probably want to manage your code library with a version control system like Git, Subversion or Mercurial, so you can easily access previous versions. If you’re using a version control system you might as well hook it up to an external hosting service like GitHub or Bitbucket, so you’ve got your code backed up elsewhere. If you make your GitHub or Bitbucket repository publicly accessible then readers can view the very latest version of your code (in case you’ve made any improvements since publishing the paper), as well as submit proposed updates or bug fixes via the useful interface (which includes commenting, chat and code viewing features) that those websites provide.

 

3. Data processing steps

A code library and software description on their own are not much use to a reader; they also need to know how that code was used in generating the results presented. The simplest way to do this is to make your scripts executable at the command line, so you can then keep a record of the series of command line entries required to produce a given result. Two of the most well known data analysis tools in the weather and climate sciences – the netCDF Operators (NCO) and Climate Data Operators (CDO) – do exactly this, storing that record in the global attributes of the output netCDF file. I’ve written a Software Carpentry lesson showing how to generate these records yourself, including keeping track of the corresponding version control revision number, so you know exactly which version of the code was executed.

As before, while these bare minimum log files ensure that your workflow is reproducible, they are not particularly comprehensible. Manually recreating workflows from these log files would be a tedious and time consuming process, even for just moderately complex analyses. To make things a little easier for the reader (and your future self), it’s a good idea to include a README file in your code library explaining the sequence of commands required to produce common/key results. You might also provide a Makefile that automatically builds and executes common workflows (Software Carpentry have a nice lesson on that too). Beyond that the options get more complex, with workflow management packages like VisTrails providing a graphical interface that allows users to drag and drop the various components of their workflow.

 

Summary

In order to ensure that your research is reproducible, you need to add a short computation section to your papers. That section should cite the major software packages used in your work, before linking to three key supplementary items: (1) a description of your software environment, (2) a copy of your code library and (3) details of the data processing steps taken in producing each key result. The bare minimum you’d need to do for these supplementary items is summarised in the table below, along with extension options that will make life easier for both the reader and your future self.

If you can think of other extension options to include in this summary, please let me know in the comments below!

 

Software description
  Minimum: Document the name, version number, release date, institution and DOI or URL of each software package.
  Extension: Provide a conda environment.yml file; use Docker / Nix.

Code library
  Minimum: Provide a copy of your code library.
  Extension: Version control that library and host it in a publicly accessible code repository on GitHub or Bitbucket.

Processing steps
  Minimum: Provide a separate log file for each key result.
  Extension: Include a README file (and possibly a Makefile) in the code library; provide output (e.g. a flowchart) from a workflow management system like VisTrails.
April 13, 2016 / Damien Irving

Keeping up with Continuum

I’m going to spend the next few hundred characters gushing over a for-profit company called Continuum Analytics. I know that seems a little weird for a blog that devotes much of its content to open science, but stick with me. It turns out that if you want to keep up with the latest developments in data science, then you need to be on top of what this company is doing.

If you’ve heard the name Continuum Analytics before, it’s probably in relation to a widely used Python distribution called Anaconda. In a nutshell, Travis Oliphant (who was the primary creator of NumPy) and his team at Continuum developed Anaconda, gave it away for free to the world, and then built a thriving business around it. Continuum makes its money by providing training, consultation and support to paying customers who use Anaconda (and who are engaged in data science/analytics more generally), in much the same way that RedHat provides support to customers using Linux.

The great thing about companies like RedHat and Continuum is that because their business fundamentally depends on open source software, they contribute a great deal back to the open source community. If you’ve ever been to a SciPy conference (something I would highly recommend), you would have noticed that there are always a few presentations from Continuum staff, whose primary job appears to be to simply work on the coolest open source projects going around. What’s more, the company seems to have a knack for supporting projects that make life much, much easier for regular data scientists (i.e. people who know how to analyse data in Python, but for whom things like system administration and web programming are out of reach). For instance, the projects they support (see the full list here) can help you install software without having to know anything about system admin (conda), create interactive web visualisations without knowing Javascript (bokeh), process data arrays larger than the available RAM without knowing anything about multi-core parallel processing (dask) and even speed up your code without having to resort to a low-level language (numba).

Of these examples, the most important achievement (in my opinion) is the conda package manager, which I’ve talked about previously. Once you’ve installed either Anaconda (which comes with 75 of the most popular Python data science libraries already installed) or Miniconda (which essentially just comes with conda and nothing else), you can then use conda to install pretty much any library you’d like with one simple command line entry. That’s right. If you want pandas, just type conda install pandas and it will be there, along with its dependencies, playing nicely with all your other libraries. If you decide you’d like to access pandas from the jupyter notebook, just type conda install jupyter and you’re done. There are about 330 libraries available directly like this and because they are maintained by the Continuum team, they are guaranteed to work.

While this is all really nice, other Python distributions like Canopy also come with a package manager for installing widely used libraries. What sets conda apart is the ease with which the wider community can contribute. If you’ve written a library that you’d like people to be able to install easily, you can write an associated installation package and post it at Anaconda Cloud. For instance, Andrew Dawson (a climate scientist at Oxford) has written eofs, a Python library for doing EOF analysis. Rather than have users of his software mess around installing the dependencies for eofs, he has posted a conda package for eofs at his channel on Anaconda Cloud. Just type conda install -c https://conda.anaconda.org/ajdawson eofs and you’re done; it will install eofs and all its dependencies for you. Some users (e.g. like the US Integrated Ocean Observing System) even go a step further and post packages for a wide variety of Python libraries that are relevant to the work they do. This vast archive of community contributed conda packages means there isn’t a single library I use in my daily work that isn’t available via either conda install or Anaconda Cloud. In fact, a problem I often face is that there is more than one installation package for a particular library (i.e. which one do I use? And if I get an error, where should I ask for assistance?). To solve this problem, conda-forge has recently been launched. The idea is that it will house the lone instance of every community contributed package, in order to (a) avoid duplication of effort, and (b) make it clear where questions (and suggested updates / bug fixes) should be directed.

The final mind blowing feature of conda is the ease with which you can manage different environments. Rather than lump all your Python libraries in together, it can be nice to have a clean and completely separate environment for each discrete aspect of the work you do (e.g. I have separate environments for my ocean data analysis, atmosphere data analysis and for testing new libraries). This will sound familiar to anyone who has used virtualenv, but again the value of conda environments is the ease with which the community can share. As an example, I’ve shared the details of my ocean data analysis environment (right down to the precise version of every single Python library). I started by exporting the details of the environment by typing conda env export -n ocean-environment -f blog-example, before posting it to my channel at Anaconda Cloud (conda env upload -f blog-example). Anyone can now come along and recreate that environment on their own computer by typing conda env create damienirving/blog-example (and then source activate blog-example to get it running). This is obviously huge for the reproducibility of my work, so for my next paper I’ll be posting a corresponding conda environment to Anaconda Cloud.

If you want to know more about Continuum, I highly recommend this Talk Python To Me podcast with Travis Oliphant.

January 12, 2016 / Damien Irving

Podcasting comes to weather and climate science

Over the past few years, podcasts have begun to emerge as the next great storytelling platform. The format is open to anyone with a laptop, a microphone, and access to the web, which means it’s kind of like blogging, only your audience isn’t restricted to consuming your content via words on a screen. They can listen to you in the car on the way to work, on the stationary bike at the gym or at any other time a little background noise is needed to pass the time away.

While I’m as excited as the next podcast enthusiast about the new season of Serial, what’s even more exciting is that a number of podcasts for weather and climate science nerds have been launched in recent months. These ones have really caught my ear:

  • Forecast: Climate Conversations with Michael White – a podcast about climate science and climate scientists, hosted by Nature’s editor for climate science
  • Mostly Weather – a team from the MetOffice explores a new, mostly weather based topic each month
  • Climate History Podcast – interviews with people in climate change research, journalism, and policymaking. It is the official podcast of the Climate History Network and the popular website HistoricalClimatology.com
  • The Method – a podcast that tells the stories of what is working in science and what is not. It launches in mid-2016 and sounds right up the alley of this blog
  • (Depending on where you live, a Google Search might also turn up a weekly podcast or two that discusses the current weather in your region)

There’s also a number of data science podcasts out there, which can be useful depending on the type of data analysis that you do. I’ve found some of the Talk Python to Me episodes to be very relevant to my daily work.

If you know of any other great weather and climate science podcasts, please share the details in the comments below!

November 5, 2015 / Damien Irving

A call for reproducible research volunteers

Around the time that I commenced my PhD (May 2012… yes, I know I should have finished by now!) there were lots of editorial-style articles popping up in prestigious journals like Nature and Science about the reproducibility crisis in computational research. Most papers do not make the data and code underpinning their key findings available, nor do they adequately specify the software packages and libraries used to execute that code, which means it’s impossible to replicate and verify their results. Upon reading a few of these articles, I decided that I’d try and make sure that the results presented in my PhD research were fully reproducible from a code perspective (my research uses publicly available reanalysis data, so the data availability component of the crisis wasn’t so relevant to me).

While this was an admirable goal, I quickly discovered that despite the many editorials pointing to the problem, I could find very few (none, in fact) regular weather/climate papers that were actually reproducible. (By “regular” I mean papers where code was not the main focus of the work, like it might be in a paper describing a new climate model.) A secondary aim of my thesis therefore became to consult the literature on (a) why people don’t publish their code, and (b) best practices for scientific computing. I would then use that information to devise an approach to publishing reproducible research that reduced the barriers for researchers while also promoting good programming practices.

My first paper using that approach was recently accepted for publication with the Journal of Climate (see the post-print here on Authorea) and the Bulletin of the American Meteorological Society have just accepted an essay I’ve written explaining the rationale behind the approach. In a nutshell, it requires the author to provide three key supplementary items:

  1. A description of the software packages and operating system used
  2. A (preferably version controlled and publicly accessible) code repository, and
  3. A collection of supplementary log files that capture the data processing steps taken in producing each key result

The essay then goes on to suggest how academic journals (and institutions that have an internal review process) might implement this as a formal minimum standard for the communication of computational results. I’ve contacted the American Meteorological Society (AMS) Board on Data Stewardship about this proposed minimum standard (they’re the group who decide the rules that AMS journals impose around data and code availability) and they’ve agreed to discuss it when they meet at the AMS Annual Meeting in January.

This is where you come in. I’d really love to find a few volunteers who would be willing to try and meet the proposed minimum standard when they write their next journal paper. These volunteers could then give feedback on the experience, which would help inform the Board on Data Stewardship in developing a formal policy around code availability. If you think you might like to volunteer, please get in touch!

 

September 4, 2015 / Damien Irving

Managing your data

If you’re working on a project that involves collecting (e.g. from a network of weather stations) or generating (e.g. running a model) data, then it’s likely that one of the first things you did was develop a data management plan. Many funding agencies (e.g. the National Science Foundation) actually formally require this, and such plans usually involve outlining your practices for collecting, organising, backing up, and storing the data you’ll be generating.

What many people don’t realise is that even if you aren’t collecting or generating your own data (e.g. you might simply download a reanalysis or CMIP5 dataset), you should still start your project by developing a data management plan. That plan obviously doesn’t need to consider everything a data collection/generation project does (e.g. you don’t need to think about archiving the data at a site like Figshare), but there are a few key things all data analysis projects need to consider, regardless of whether they collected and/or generated the original data or not.
 
1. Data Reference Syntax

The first thing to define is your Data Reference Syntax (DRS) – a convention for naming your files. As an example, let’s look at a file from the data archive managed by Australia’s Integrated Marine Observing System (IMOS).

.../thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz

That’s a lot of information to take in, so let’s focus on the structure of the file directory first:

.../thredds/dodsC/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/

From this we can deduce, without even inspecting the contents of the file, that we have data from the IMOS project that is run by the eMarine Information Infrastructure (eMII). It was collected in 2012 at the Turquoise Coast, Western Australia (TURQ) site of the Australian Coastal Ocean Radar Network (ACORN), which is a network of high frequency radars that measure the ocean surface current. The data type has a sub-DRS of its own, which tells us that the data represents the 1-hourly average surface current for a single month (October 2012), and that it is archived on a regularly spaced spatial grid and has not been quality controlled. The file is located in the “demos” directory, as it has been generated for the purpose of providing an example for users at the very helpful Australian Ocean Data Network user code library.

Just in case the file gets separated from this informative directory structure, much of the information is repeated in the file name itself, along with some more detailed information about the start and end time of the data, and the last time the file was modified:

<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz

In the first instance this level of detail seems like a bit of overkill, but consider the scope of the IMOS data archive. It is the final resting place for data collected by the entire national array of oceanographic observing equipment in Australia, which monitors the open oceans and coastal marine environment covering physical, chemical and biological variables. Since the data are so well labelled, locating all monthly timescale ACORN data from the Turquoise Coast and Rottnest Shelf sites (which represents hundreds of files) would be as simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc

While it’s unlikely that your research will ever involve cataloging data from such a large observational network, it’s still a very good idea to develop your own personal DRS for the data you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures, as these can be very hard to change later on. The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
 
2. Data provenance

In defining my own DRS, I added some extra fields to cater for the intermediary files that typically get created throughout the data analysis process. For instance, I added a field to indicate the temporal aspects of the data (e.g. to indicate if the data are an anomaly relative to some base period) and another for the spatial aspects (e.g. to indicate whether the data have been re-gridded). While keeping track of this information via the DRS is a nice thing to do (it definitely helps with bash wildcards and visual identification of files), more detailed information needs to be recorded for the data to be truly reproducible. A good approach to recording such information is the procedure followed by the Climate Data Operators (CDO) and NetCDF Operators (NCO). Whenever an NCO or CDO utility (e.g. ncks, ncatted, cdo mergetime) is executed at the command line, a time stamp followed by a copy of the command line entry is automatically appended to the global attributes of the output netCDF file, thus maintaining a complete history of the data processing steps. Here’s an example:

Tue Jun 30 07:35:49 2015: cdo runmean,30 va_ERAInterim_500hPa_daily_native.nc va_ERAInterim_500hPa_030day-runmean_native.nc

You might be thinking, “this is all well and good, but what about data processing steps that don’t use NCO, CDO or even netCDF files?” It turns out that if you write a script (e.g. in Python, R or whatever language you’re using) that can be executed from the command line, then it only takes an extra few lines of code to parse the associated command line entry and append that information to the global attributes of a netCDF file (or a corresponding metadata text file if dealing with file formats that don’t carry their metadata with them). To learn how to do this using Python, check out the Software Carpentry lesson on Data Management in the Ocean, Weather and Climate Sciences.
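
For instance, here's a minimal sketch (using xarray purely for brevity, with made-up file names and a 30-day running mean standing in for your actual processing) of a script that records its own command line entry in the history attribute of its output file, in the same style as NCO and CDO:

import sys
import datetime
import xarray as xr

infile, outfile = sys.argv[1], sys.argv[2]

dset = xr.open_dataset(infile)
runmean = dset.rolling(time=30).mean()  # e.g. a 30-day running mean

# Prepend an NCO/CDO-style entry (time stamp plus command line) to the history attribute
time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
new_entry = "%s: python %s" % (time_stamp, " ".join(sys.argv))
runmean.attrs['history'] = new_entry + "\n" + dset.attrs.get('history', '')

runmean.to_netcdf(outfile)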
 
3. Backup

Once you’ve defined your DRS and have implemented the NCO/CDO approach to data provenance, the final thing to think about is backing up your data. This is something I’ve discussed in detail in a previous post, but the crux of the story is that if your starting point files (i.e. the data files required at the very first step of your data processing) can be easily downloaded (e.g. reanalysis or CMIP5 data), then you probably don’t need your local copy to be backed up. All of your code should be version controlled and backed up via an external hosting service like GitHub and Bitbucket, so you can simply re-download the data and re-run your analysis scripts if disaster strikes. If you generated your starting point files from scratch on the other hand (e.g. you collected weather observations or ran a model that would take months to re-run), then backup is absolutely critical and would be part of your data management plan.