May 8, 2017 / Damien Irving

A vision for CMIP6 in Australia

Most climate researchers would be well aware that phase 6 of the Coupled Model Intercomparison Project (CMIP6) is now underway. The experiments have been designed, the modelling groups are gearing up to run them, and data should begin to come online sometime next year (see this special issue of Geoscientific Model Development for project details). As is always the case with a new iteration of CMIP, this one is going to be bigger and better than the last. By better I mean cooler experiments and improved model documentation (via the shiny new Earth System Documentation website), and by bigger I mean more data. At around 3 petabytes in total size, CMIP5 was already so big that it was impractical for most individual research institutions to host their own copy. In Australia, the major climate research institutions (e.g. Bureau of Meteorology, CSIRO, ARC Centre of Excellence for Climate System Science – ARCCSS) got around this problem by enlisting the help of the National Computational Infrastructure (NCI) in Canberra. A similar arrangement is currently being planned for CMIP6, so I wanted to share my views (as someone who has spent a large part of the last decade wrangling CMIP3 and CMIP5 data) on what is required to help Australian climate researchers analyse that data with a minimum of fuss.

(Note: In focusing solely on researcher-related issues, I’m obviously ignoring vitally important technical issues related to data storage and funding issues etc. Assuming all that gets sorted, this post looks at how the researcher experience might be improved.)


1. A place to analyse the data

In addition to its sheer size, it’s important to note that the CMIP6 dataset will be in flux for many years as modelling groups begin to contribute data (and then revise and re-issue erroneous data) from 2018 onwards. For both these reasons, it’s not practical for individual researchers and/or institutions to create their own duplicate copies of the dataset. Recognising this issue (which is not unique to the CMIP projects), NCI have built a whole computational infrastructure directly on top of their data library, so that researchers can do their data processing without having to copy/move data anywhere. This infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for highly complex and/or data-intensive tasks, while for everyday work there is the Virtual Desktop Infrastructure. These virtual desktops have more grunt than your personal laptop or desktop computer (4 CPUs, 20 GB RAM, 66 GB storage) and come with a whole bunch of data exploration tools pre-installed. Better still, they are isolated from the rest of the system: unlike on Raijin (or any other shared supercomputer), you don’t have to submit processes that will run for longer than 15 minutes or so to a queuing system. I’ve found the virtual desktops to be ideal for analysing CMIP5 data (I do all my CMIP5 data analysis on them, including large full-depth ocean data processing) and can’t see any reason why they wouldn’t be equally suitable for CMIP6.


2. A way to locate and download data

Once you’ve logged into a virtual desktop, you need to be able to (a) locate the CMIP data of interest that’s already been downloaded to the NCI data library, and (b) find out if there’s data of interest available elsewhere on the international Earth System Grid. In the case of CMIP5, Paola Petrelli (with help from the rest of the Computational Modelling Support team at the ARCCSS) has developed an excellent package called ARCCSSive that does both these things. For data located elsewhere on the grid, it also gives you the option of automatically sending a request to Paola for the data to be downloaded to the NCI data library. (They also have a great help channel on Slack if you get stuck and have questions.)
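For readers who haven’t used a tool like ARCCSSive, the basic task it automates looks something like the sketch below: walking a CMIP-style directory tree to find the files matching a given model/experiment/variable combination. (The directory layout and filenames here are illustrative stand-ins that mimic the CMIP5 data reference syntax, not the actual NCI paths.)

```python
import tempfile
from pathlib import Path

# Build a toy directory tree that mimics a CMIP5-style layout
# (institute/model/experiment/frequency/realm/table/ensemble/variable).
# These names are illustrative, not the actual NCI data library paths.
root = Path(tempfile.mkdtemp())
leaf = (root / "CSIRO-BOM" / "ACCESS1-0" / "historical" / "mon" /
        "atmos" / "Amon" / "r1i1p1" / "tas")
leaf.mkdir(parents=True)
(leaf / "tas_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc").touch()

# Find all monthly surface air temperature files for the historical
# experiment, across every institute, model, realm and ensemble member
matches = sorted(root.glob("*/*/historical/mon/*/Amon/*/tas/*.nc"))
for path in matches:
    print(path.name)
```

A real lookup tool has to do much more than this (query a database of what’s on disk, talk to the ESGF to find what exists elsewhere, handle multiple dataset versions), which is exactly why maintained software like ARCCSSive is so valuable.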

Developing and maintaining a package like ARCCSSive is no trivial task, particularly as the Earth System Grid Federation (ESGF) continually shift the goalposts by tweaking and changing the way the data is made available. In my opinion, one of the highest priority tasks for CMIP6 would be to develop and maintain an ARCCSSive-like tool that researchers can use for data lookup and download requests.


3. A way to systematically report and handle errors in the data

Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it can take many months (if it happens at all) for a file to be corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.

The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the farthest we got was a wiki that is not widely used. (Paola has also added warning/errata functionality to the ARCCSSive package so that users can filter out bad data.)

In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and for simple cases submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that’s available in the NCI data library).
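To make the idea concrete, here’s a minimal sketch (with entirely hypothetical model names and fixes) of what a community fix registry might look like in spirit: known problems are recorded against a dataset identifier along with the code that repairs the data, and registered fixes are applied automatically before the data reaches your analysis.

```python
# Toy sketch of a community errata-and-fixes registry. Known problems are
# recorded against a dataset identifier together with a function that
# repairs the data. All identifiers and fixes here are hypothetical.

FIXES = {}

def register_fix(model, experiment, variable):
    """Decorator that records a fix for a particular dataset."""
    def decorator(func):
        FIXES[(model, experiment, variable)] = func
        return func
    return decorator

@register_fix("MODEL-X", "historical", "pr")
def clip_negative_rainfall(values):
    """Replace (physically impossible) negative rainfall rates with zero."""
    return [max(v, 0.0) for v in values]

def load_with_fixes(model, experiment, variable, values):
    """Apply any registered fix before handing data to the analysis code."""
    fix = FIXES.get((model, experiment, variable))
    return fix(values) if fix else values

data = load_with_fixes("MODEL-X", "historical", "pr", [1.2, -0.3, 0.8])
print(data)  # the negative rainfall rate has been clipped to zero
```

The point of the design is that the fix lives in one shared, reviewable place rather than in a dozen personal directories, so the next researcher who hits the same bad file gets the correction for free.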


4. Community maintained code for common tasks

Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Nino 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try to tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous post. I still love the idea in principle (and commend the BoM and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. Many climate researchers are still coming to grips with managing their personal code under version control; it’s a pretty big leap from there to using and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first; once the culture of code sharing and community contribution is established, the CWSLab workflow tool could be revisited.
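To illustrate the kind of duplicated task I mean, here’s a minimal sketch of the Nino 3.4 calculation (the area-average sea surface temperature anomaly over 5°S–5°N, 170°W–120°W) using synthetic data. The grid, the climatology value and the unweighted mean are all simplifying assumptions; real code would read SST from a CMIP file and area-weight the average by the cosine of latitude.

```python
import numpy as np

# Synthetic stand-in for a gridded SST field; real code would read this
# from a CMIP netCDF file.
lat = np.arange(-88.75, 90, 2.5)            # grid cell centre latitudes
lon = np.arange(1.25, 360, 2.5)             # grid cell centre longitudes (0-360)
sst = np.full((lat.size, lon.size), 20.0)   # uniform 20 degC field

# Select the Nino 3.4 box: 5S-5N, 170W-120W (i.e. 190E-240E on a 0-360 grid)
lat_mask = (lat >= -5) & (lat <= 5)
lon_mask = (lon >= 190) & (lon <= 240)
box = sst[np.ix_(lat_mask, lon_mask)]

# Subtract a stand-in long-term mean for the box to get the anomaly.
# (A real calculation would use a monthly climatology and area weighting.)
climatology = 19.0
nino34 = box.mean() - climatology
print(round(float(nino34), 2))
```

Every research group in the country has some version of this calculation sitting in a personal script somewhere, which is precisely the duplication a shared repository would eliminate.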


In summary, as we look towards CMIP6 in Australia, here’s how things look from the perspective of a scientist who’s been wrangling CMIP data for years:

  1. The NCI virtual desktops are ready to go and fit for purpose.
  2. The ARCCSS software for locating and downloading CMIP5 data is fantastic. Developing and maintaining a similar tool for CMIP6 should be a high priority.
  3. The ESGF (or failing that, NCI) could lead a community-wide effort to identify and fix bogus CMIP data files.
  4. A community maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time.
