July 8, 2021 / Damien Irving

We wrote a book!

For the best part of a decade, I’ve volunteered as an instructor with The Carpentries, a global community committed to teaching foundational coding and data science skills to researchers. From humble beginnings, things have grown to the point that hundreds of Carpentries workshops are hosted around the globe every year. The audience at these workshops is typically researchers who are self-taught programmers (i.e. they can cobble together enough Python or R to clean, analyse and plot their research data), and we expose them to a number of coding best practices that have solid foundations in research and experience and that improve productivity and reliability.

A big part of The Carpentries’ success is the two-day format (it’s not an overwhelming time commitment for busy researchers), but over the years I’ve often wondered what we’d teach if we had more time. With an entire semester, for instance, you could take a researcher through the entire lifecycle of a data analysis project, from the initial setup and code development through to a fully automated data processing pipeline and published software package. A few years ago, Greg Wilson (who co-founded The Carpentries) assembled a small group of Carpentries instructors to try and write such a book. I very happily joined in, and I’m excited to say that Research Software Engineering with Python is now available for purchase (all proceeds go to The Carpentries). The content of the book and the associated code are licensed under CC-BY 4.0 and MIT licenses respectively, so there’s also a freely available web version of the book. A corresponding book for R users (which I’m not involved with) is also currently under development (see this landing page for related projects).

The book follows Amira and Sami as they work together to write a software package to address a real research question. The data analysis task relates to a fascinating result in the field of quantitative linguistics. Zipf’s Law states that the second most common word in a large body of text appears half as often as the most common, the third most common appears a third as often, and so on. To test whether Zipf’s Law holds for a collection of classic novels that are freely available from Project Gutenberg, Amira and Sami write a software package that counts and analyses the word frequency distribution in any arbitrary body of text.

In the process of writing and publishing this Python package to verify Zipf’s Law, the book covers how to do the following:

  • Organise small and medium-sized data science projects.
  • Use the Unix shell to efficiently manage your data and code.
  • Write Python programs that can be used on the command line.
  • Use Git and GitHub to track and share your work.
  • Work productively in a small team where everyone is welcome.
  • Use Make to automate complex workflows.
  • Enable users to configure your software without modifying it directly.
  • Test your software and know which parts have not yet been tested.
  • Find, handle, and fix errors in your code.
  • Publish your code and research in open and reproducible ways.
  • Create Python packages that can be installed in standard ways.

The book was written to be used as the material for a semester-long course at the university level (complete with exercises and solutions), although it can also be used for independent self-study. Comments and suggestions are more than welcome at the book’s GitHub repository – we’d be particularly keen to hear from anyone (before, during or after) who uses it as a textbook for a semester course or an extended Carpentries-style workshop.

February 11, 2020 / Damien Irving

#AMS100 and the US trifecta

For a long time now, my academic bucket list has had three big US conferences on it: the AGU Fall Meeting, Ocean Sciences and the AMS Annual Meeting. I knocked off the first two in 2016 and 2018 respectively, and earlier this year I was lucky enough to complete the set when I jetted off for the 100th edition of the AMS Annual Meeting in Boston.

As if knowing 5,500 weather nerds would be descending on the city, an anomalous southerly flow produced Boston’s warmest January day on record (74°F) on the eve of the conference. I’d been warned to pack my winter woollies, but I spent the afternoon walking the iconic Freedom Trail in a t-shirt! Things quickly returned to normal the next day (a high of 39°F) and it was time to get down to business.

The first thing that strikes you at an AMS Annual Meeting is the sheer size of the weather industry in the US. Along with the usual collection of government agencies and university departments, the exhibition hall was filled with weather intelligence companies, equipment manufacturers and defense contractors. The conference program is also unusual, in that it is essentially a whole bunch of mini conferences happening at the same time. I presented in the “12th Symposium on Aerosol – Cloud – Climate Interactions” and “10th Symposium on Advances in Modeling and Analysis Using Python” and there were forty or so other conferences and symposia hosted over the course of the week by various AMS Boards and Committees.

The highlight of the week for me was the Python Symposium. I’ve been to PyCon conferences in the past and lamented the fact that such events aren’t on the radar for most academics. The AMS Committee on Environmental Information Processing Technologies essentially brings PyCon to academia, and the specific focus on the analysis of weather and climate data makes it a great learning experience. As our datasets get larger and our computing resources more complex, the pressure is on to train, prepare and upskill the “physical data scientists” of the future. Talks on artificial intelligence and machine learning were littered throughout the conference program, although in some sense it still feels like a solution looking for a problem in the weather/climate space.

There were a number of initiatives at the AMS Annual Meeting that might be worth thinking about at other conferences (e.g. my local AMOS conference). For instance, the AMS has recently launched a podcast called AMS On The Air. Throughout the conference, the podcast hosts interviewed (in front of a live audience) many of the keynote speakers and notable attendees. I found myself listening to a couple of these short interviews at the end of each day to help figure out which keynotes to attend later in the week. There’s also an alumni evening where universities host get-togethers for past and present students and staff, and recordings of some of the conference talks are posted online for the benefit of those who can’t make it along.

The biggest disappointment for me was the climate-related content at the conference. Given that the submission deadline for papers to be included in the Second Order Draft of the IPCC Sixth Assessment Report was 31 December, I was expecting many more talks and posters presenting new results from CMIP6. A possible explanation for gaps like this in the program is that not all AMS Boards and Committees elect to hold their annual get-togethers at the Annual Meeting.

Having now completed the trifecta, my overall impression from the big three US conferences is that while the AMS Annual Meeting and Ocean Sciences are clearly the premier events for meteorologists and oceanographers respectively, climate science is a little lost in the middle. This possibly isn’t surprising given the interconnected nature of the climate system, but it makes it difficult when planning/justifying a trip overseas.

December 30, 2018 / Damien Irving

Data analysis and ocean model grids

Since making the shift from an atmosphere-based PhD to an ocean-based postdoc, I’ve spent more time than I’d like getting my head around ocean model grids. In particular, the convention for modelling groups participating in CMIP5 was to archive ocean variables on the native model grid. Some of the CMIP5 ocean models run on a rectilinear grid (i.e. a regular latitude/longitude grid like the atmosphere model), but most are run on some sort of curvilinear grid (i.e. a grid where the coordinate lines are curved). The details of these curvilinear grids are in most cases not widely available, but the (netCDF) data files do contain auxiliary coordinate information that specifies the location of each curvilinear grid point in latitude-longitude space.

At some point, most analysis of CMIP5 ocean data involves remapping from the native curvilinear grid to a regular latitude/longitude grid. I’ve found that the level of complexity involved in this remapping depends on whether you’re dealing with a scalar or vector quantity, and whether there are conservation properties that need to be maintained (e.g. conservation of energy or moisture).

The discussion that follows is my attempt to step through each level of complexity, exploring the software packages and analysis approaches currently used for remapping from a curvilinear to rectilinear grid.

1. Scalar quantity

The simplest case involves remapping a scalar quantity such as temperature or salinity from a curvilinear to rectilinear grid. There are a number of libraries out there for doing this, such as:
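
Whatever package you choose, the core operation is the same: for each point on the target rectilinear grid, estimate a value from nearby points on the curvilinear source grid. Here is a minimal nearest-neighbour sketch in plain NumPy (the function and variable names are my own, not from any particular package; real remapping tools use bilinear or conservative weights and proper spherical distances):

```python
import numpy as np

def remap_nearest(lon2d, lat2d, data, target_lon, target_lat):
    """Remap a scalar field from a curvilinear grid (described by 2D
    auxiliary lon/lat coordinates) to a rectilinear grid by picking
    the nearest source point for each target point."""
    # Flatten the source grid so each point is a (lon, lat, value) triple
    src_lon = lon2d.ravel()
    src_lat = lat2d.ravel()
    src_val = data.ravel()
    out = np.empty((target_lat.size, target_lon.size))
    for j, lat in enumerate(target_lat):
        for i, lon in enumerate(target_lon):
            # Naive squared-distance search: fine for a demo, slow at scale
            d2 = (src_lon - lon) ** 2 + (src_lat - lat) ** 2
            out[j, i] = src_val[np.argmin(d2)]
    return out
```

In practice you would use an established package rather than rolling your own, but the sketch makes it clear why the auxiliary coordinates are essential: without them there is no way to place the curvilinear cells in latitude-longitude space.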

2. Scalar quantity with conservation

The next level of complexity involves remapping a scalar quantity such as ocean heat content (OHC), where it can be important to conserve the grid sum/total. The remapping packages listed above approximately conserve the grid mean, but unfortunately not the grid sum if there are missing values (i.e. due to the presence of land). This is because some ocean area/volume is gained or lost as the continents subtly change shape between the old and new grid. One of the reasons why CMIP5 ocean data is archived on the native model grid is that providing remapped data would make it impossible to close energy budgets.

To work around this problem, people tend to devise their own scheme for moving from curvilinear to rectilinear space. This typically involves making use of the auxiliary coordinate information contained in the CMIP5 data files. For instance, in calculating the zonally integrated OHC (while conserving the global total), Nummelin et al (2017) essentially iterate over each one degree latitude band, adding up the OHC from all grid cells whose auxiliary coordinate falls within that band. Another (potentially simpler) option can be to replace the land mask / missing values with zeros. In this case the grid sum will be conserved after remapping with the packages listed above, with the caveat that the new values along the coastline should be interpreted as a scaled/reduced value according to land/sea fraction.
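
The latitude-band approach is easy to sketch in NumPy (a schematic version with hypothetical names, not Nummelin et al’s actual code). The grid total is conserved exactly because every source cell lands in exactly one band:

```python
import numpy as np

def zonal_sum(aux_lat, field, band_edges):
    """Sum a per-cell field (e.g. OHC per grid cell) into latitude bands,
    using the auxiliary latitude coordinate of each curvilinear grid cell.
    band_edges is a 1D array of monotonically increasing band boundaries."""
    sums = np.zeros(len(band_edges) - 1)
    for k in range(len(band_edges) - 1):
        # Boolean mask of cells whose auxiliary latitude falls in this band
        in_band = (aux_lat >= band_edges[k]) & (aux_lat < band_edges[k + 1])
        sums[k] = field[in_band].sum()
    return sums
```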

3. Vector quantity

The most complex case is when remapping vector quantities such as heat flux (hf) or water velocity. The x and y components of such variables are archived as two separate variables (e.g. hfx and hfy). The complexity arises because the x and y directions on a curvilinear grid are not everywhere the same as the x (eastward) and y (northward) directions on a geographic latitude/longitude grid. This means you can’t apply the aforementioned software packages or work-arounds to move from curvilinear to rectilinear space – any remapping requires the use of information from both the native x and y components.

The most obvious solution to this problem would be to convert the native x and y components into eastward and northward components using the grid angle at each grid cell. Unfortunately, the CMIP5 modelling groups didn’t archive a grid angle variable, and even if they did it turns out that complications can arise. Using grid angles obtained for the NorESM1-M model, Outten et al (2018) calculated the northward heat transport from the native hfx and hfy components and found small differences compared to the northward heat transport output directly from the model (the hfbasin variable). Investigation revealed that the grid angle was given for the p points of the grid cells (i.e. the center of each grid cell), while the hfy and hfx were given on the top and bottom and left and right edges of the grid cell, respectively, as per a standard C-grid configuration. Thus the angles given were not precisely accurate for the locations of hfy and hfx. It’s possible that these issues might be overcome using the xgcm package (which interpolates variables from one position to another on a C-grid), but the use of that package requires detailed knowledge of the grid configuration (i.e. you need to know that it’s a C-grid in the first place), which isn’t readily accessible for most CMIP5 models.
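
For reference, if a trustworthy grid angle were available at the locations where the vector components are defined, the rotation itself is trivial. A hedged sketch, assuming the angle is measured anticlockwise from geographic east, in radians:

```python
import numpy as np

def rotate_to_geographic(x_comp, y_comp, angle):
    """Rotate native grid-relative vector components to eastward/northward
    components, given the angle (radians) of the grid x-axis measured
    anticlockwise from geographic east. As discussed above, this is only
    accurate if the angle applies at the points where the components live."""
    east = x_comp * np.cos(angle) - y_comp * np.sin(angle)
    north = x_comp * np.sin(angle) + y_comp * np.cos(angle)
    return east, north
```

The hard part, as the NorESM1-M example shows, is not the rotation but knowing the correct angle at the hfx and hfy locations in the first place.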

In the end, the “zigzag” solution that Outten et al (2018) came up with was somewhat similar to the Nummelin et al (2017) scalar work-around described above. In their method, grid cells were selected along a single line of latitude. A zonal boundary was then identified from the edges of these grid cells, and the fluxes across this boundary were summed to give the meridional transport at the respective latitude. At latitudes close to the model grid poles where the grid is curved, the identified cells at a single latitude were not on the same row, thus the transport across the boundary included heat transport in both the y and x directions. The process was repeated at each latitude to obtain the complete meridional ocean heat transport.

Where possible, others get around the vector quantity issue by converting to scalar quantities. For instance, when dealing with water velocity it is possible to use Helmholtz decomposition to write each vector in terms of the streamfunction and velocity potential. These scalar quantities can be remapped to a rectilinear grid, and then gradients can be calculated on the new grid to recover the eastward and northward components of the water velocity.
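
A sketch of that final gradient step, under the assumption that the streamfunction (psi) and velocity potential (phi) have already been remapped to a regular grid. Sign conventions vary between textbooks; this sketch uses u = -∂ψ/∂y + ∂φ/∂x and v = ∂ψ/∂x + ∂φ/∂y:

```python
import numpy as np

def velocity_from_potentials(psi, phi, dy, dx):
    """Recover eastward (u) and northward (v) velocity on a regular grid
    from the streamfunction (psi) and velocity potential (phi).
    Arrays are (y, x) ordered; dy and dx are the grid spacings."""
    # np.gradient returns derivatives along axis 0 (y) then axis 1 (x)
    dpsi_dy, dpsi_dx = np.gradient(psi, dy, dx)
    dphi_dy, dphi_dx = np.gradient(phi, dy, dx)
    u = -dpsi_dy + dphi_dx  # rotational + divergent parts
    v = dpsi_dx + dphi_dy
    return u, v
```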


Once you move beyond the simple case of remapping a scalar quantity (where conservation isn’t important) from an ocean curvilinear to rectilinear grid, it’s important to be aware of the limitations of “off the shelf” software packages. For scalar quantities where conservation of the grid sum/total is important, an easy solution can be to set all missing values to zero before remapping. For vector quantities it is sometimes possible to convert to and from relevant scalar quantities, but in many cases an easy solution is lacking. If grid angles and configurations were archived/documented by modelling groups, it would be possible to make use of existing software packages (and/or write new packages) that ingest that information in order to perform the vector remapping.

September 10, 2018 / Damien Irving

Data Carpentry for atmosphere and ocean scientists

This post originally appeared on the Data Carpentry blog.

Back in late 2012, I was a couple of years into my first job out of college. My undergraduate studies had left me somewhat underprepared for the coding associated with analysing climate model data for a national science organisation, so I was searching online for assistance with Python programming. I stumbled upon the website of an organisation called Software Carpentry, which at the time was a relatively small group of volunteers running two-day scientific computing “bootcamps” for researchers. I reached out to ask if they’d be interested in running a workshop alongside the 2013 Annual Conference of the Australian Meteorological and Oceanographic Society (AMOS), and to my surprise Greg Wilson – the co-founder of the organisation – flew out to Australia to teach at our event in Melbourne and another in Sydney (the first ever bootcamps outside of North America and Europe). I trained up as an instructor soon after, and from 2014-2017 I hosted Software Carpentry workshops alongside the AMOS conference, as well as other ad hoc workshops in various meteorology and oceanography departments.

While these workshops were very popular and well received (Software Carpentry workshops always are), in the back of my mind I wanted to have a go at running a workshop designed specifically for atmosphere and ocean scientists. Instead of teaching generic skills in the hope that people would figure out how to apply them in their own context, I wanted to cut out the middle step and run a workshop in the atmosphere and ocean science context. This idea of discipline (or data-type) specific workshops was the driving force behind the establishment of Data Carpentry, so this year with their assistance I’ve developed lesson materials for a complete one-day workshop:

The workshop centers around the task of writing a Python script that calculates and plots the seasonal rainfall climatology (i.e. the average rainfall) from the output of any arbitrary climate model. Such data is typically stored in netCDF file format and follows a strict “climate and forecasting” metadata convention. Along the way, we learn about the PyAOS stack (i.e. the ecosystem of libraries used in the atmosphere and ocean sciences), how to manage and share a software environment using conda, how to write modular/reusable code, how to write scripts that behave like other command line programs, version control, defensive programming strategies and how to capture and record the provenance of the data files and figures that we produce.
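
The lessons build this workflow with xarray, but the core calculation is just a grouped mean over calendar seasons. A stripped-down NumPy illustration (the names and structure here are mine, not the lesson’s):

```python
import numpy as np

SEASONS = {'DJF': (12, 1, 2), 'MAM': (3, 4, 5),
           'JJA': (6, 7, 8), 'SON': (9, 10, 11)}

def seasonal_climatology(pr, months):
    """Mean rainfall for each season, given monthly-mean rainfall (pr)
    with time as the first axis and the calendar month (1-12) of each
    time step. xarray's groupby('time.season') does the equivalent."""
    return {name: pr[np.isin(months, m)].mean(axis=0)
            for name, m in SEASONS.items()}
```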

I’ve run the workshop twice now (at the 2018 AMOS Conference in Sydney and at Woods Hole Oceanographic Institution last month), which means I’ve completed the alpha stage of the Data Carpentry lesson development cycle. Moving from the alpha to beta stage involves having people other than me teach, which is where you come in. If you’re a qualified Carpentries instructor and would be interested in teaching the lessons (some experience with the netCDF file format and xarray Python library is useful), please get in touch with either myself or Francois Michonneau (Curriculum Development Lead for Data Carpentry). You can also request a workshop at your institution by contacting us and we’ll reach out to instructors. There is no fee for a pilot workshop, but you would need to cover travel expenses for instructors. I’d also be happy to hear any general feedback about the lesson materials at the associated GitHub repository.

March 16, 2018 / Damien Irving

Collaborative lesson development

If you’re a regular reader of this blog, you’ve no doubt heard me talk about The Carpentries – a global community committed to teaching foundational computational and data science skills to researchers. Hundreds of Software Carpentry and Data Carpentry workshops are held around the world every year, which is a monumental effort for an initiative that only got started in earnest five or so years ago.

While these workshops have had a massive impact on the computational literacy of the research community, in my opinion the most revolutionary thing about The Carpentries is not what we teach, but how we teach it. More specifically, the revolution lies in what we do in the background to develop and maintain our lessons. When a volunteer instructor is preparing to teach a workshop, they have an extensive collection of open and easily accessible lesson materials to select from. Unlike a static textbook produced by a small group of authors, these lessons are continually refined and updated by a large and diverse community of contributors, which means they are (by a wide margin) the state-of-the-art lessons in their discipline.

This process of community lesson development is probably best explained by reflecting on my own personal experience. I participated in The Carpentries instructor training program back in 2013, which among other things gave me a grounding in various evidence-based best-practices of teaching (i.e. an understanding of the fundamental pedagogical principles underpinning the lessons). Upon teaching my first few workshops, I started to contribute back to the lessons by fixing typos and other minor issues identified by participants. As I got more experience with teaching the materials, I started to make more substantive contributions, proposing and participating in re-writes of entire sections. Now with over twenty workshops under my belt, I’m writing a whole new set of Data Carpentry lessons specifically for atmosphere and ocean scientists (see here for a sneak peek).

The contrast between this process and that typically followed by university lecturers could not be more stark. While contributing back to The Carpentries lesson materials can be a little tedious at times, it is WAY less work (and results in a far superior product) than if I had to develop the materials for my workshops myself. Most lecturers get some hand-me-down materials from the staff member that went before them (if they’re lucky), and then they’re on their own. What’s more, anything they learn about teaching their discipline better has no impact beyond the four walls of their own classroom.

By following the community-based lead of The Carpentries, the quality of teaching at universities could be improved, while at the same time saving substantial time for lecturers. For instance, I can think of at least five universities that teach detailed courses on the weather and climate of Australia. If the teachers of these courses got together (perhaps facilitated by the Australian Meteorological and Oceanographic Society) and started to collaboratively develop and maintain a set of lesson materials, the educational experience for students (and all the good things that come from that; e.g. student retention, graduate quality and numbers) could improve markedly.

Of course, the one thing I’ve skipped over here is all the ingredients that make community lesson development work. What platforms are best for hosting the materials? What do you do when people disagree on the direction of the lessons? How do you structure the lessons for ease of use and contribution? To try and assist people with this, a bunch of Carpentries people (myself included) got together recently and published Ten Simple Rules for Collaborative Lesson Development. It distills everything we’ve learned over the years, and is hopefully a useful resource for anyone thinking of giving it a try.



February 8, 2018 / Damien Irving

Semantic versioning for our major climate reports?

I’m attending the joint AMOS / ICSHMO Conference in Sydney this week, where among other topics there’s been a lively discussion about preliminary plans for the next generation of climate projections for Australia.

In the past, new projections have been released by our major national science agencies (i.e. the Bureau of Meteorology and CSIRO) every 5-10 years: 1992, 1996, 2001, 2007 and 2015 (Whetton et al, 2016). This approach seems appropriate for the 1990s, but with such widespread use of the internet nowadays and the rapid pace of advances in climate research, the custom of publishing a periodical static report (and/or website) seems somewhat dated.

An alternative approach would be to follow the lead of the software development community. After a software package is released, it is common practice to keep track of subsequent updates and improvements via the use of semantic versioning. The convention for labeling each new version release is MAJOR.MINOR.PATCH (e.g. version 2.5.3), where you increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes

The authors of the software then update the changelog for the project, which lists the notable changes for each new version. (The new changelog entry is often emailed to users of the software.)
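
The bump-and-reset rules are simple enough to state as code (a toy illustration, not any particular versioning tool):

```python
def bump(version, level):
    """Increment a MAJOR.MINOR.PATCH version string, resetting the
    lower-order fields, as semantic versioning prescribes."""
    major, minor, patch = (int(p) for p in version.split('.'))
    if level == 'major':
        major, minor, patch = major + 1, 0, 0
    elif level == 'minor':
        minor, patch = minor + 1, 0
    elif level == 'patch':
        patch += 1
    else:
        raise ValueError(f"unknown level: {level}")
    return f"{major}.{minor}.{patch}"
```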

It’s easy to imagine a situation where national climate projections could be posted online, with updates released on an as-needed basis by increments to the:

  • MAJOR version when there is a fundamental change to the projections themselves (e.g. a critical new dataset becomes available, like CMIP6)
  • MINOR version when content is added for a highly topical issue (e.g. an international climate target/agreement is reached, a major climate event occurs such as the warming hiatus)
  • PATCH version when minor research updates are added

The most obvious advantage of this approach is that frequent incremental updates ensure that the projections remain up-to-date with both user demands and the latest science. Most of the work would still get done in the lead up to the release of a new major version, but incremental improvements could be made in the meantime, rather than waiting years for the next report.

October 24, 2017 / Damien Irving

Best practices for scientific software

Code written by a research scientist typically lies somewhere on a continuum ranging from “scientific code” that was simply hacked together for individual use (e.g. to produce a figure for a journal paper) to “scientific software” that has been formally packaged and released for use by the wider community.

I’ve written at length (e.g. how to write a reproducible paper) about the best practices that apply to the scientific code end of the spectrum, so in this post I wanted to turn my attention to scientific software. In other words, what’s involved in turning scientific code into something that anyone can use?

My attempt at answering this question is based on my experiences as an Associate Editor with the Journal of Open Research Software. I’m focusing on Python since (a) most new scientific software in the weather/ocean/climate sciences is written in that language, and (b) it’s the language I’m most familiar with.


Hosting

First off, you’ll need to create a repository on a site like GitHub or Bitbucket to host your (version controlled) software. As well as providing the means to make your code available to the community, these sites have features that help with things like community discussion and software release management. One of the first things you’ll need to include in your repository is a software license. Jake VanderPlas has an excellent post on why you need a license and how to pick one.

Packaging / installation

If you want people to use your software, you need to make it as easy as possible for them to install it. In Python, this means packaging the code in such a way that it can be made available via the Python Package Index (PyPI). If your code and all the libraries it depends on are written purely in Python, then this is all you need to do. People will simply be able to “pip install” your software from the command line.

If your software has non-Python dependencies (e.g. netCDF libraries), then it’s a good idea to make sure that it can also be installed via conda. Using recipes that developers (i.e. you, in this case) submit to conda-forge, this popular package manager installs software and all its dependencies at once. I’ve talked extensively about conda in a previous post.


Documentation

While it might seem like the documentation pages for your favourite Python libraries were painstakingly typed by hand, they were almost certainly created using software that automatically takes all the information from the docstrings in your code and formats them nicely for display on the web. In most cases, people use Sphinx to generate the documentation and Read the Docs to publish it (here’s a nice description of that whole process).


Supporting users

In providing assistance to users, software projects will typically use a combination of encouraging people to submit issues on their GitHub/Bitbucket page (for technical questions that will possibly require a change to the code) and platforms like Google Groups and/or Gitter (a chat client provided by GitLab) for more general questions about how to use the software.

The bonus of GitHub issues, Google Groups and Gitter is that anyone can view the questions and answers, not just the lead developers of the software. This means that random people from the community can chime in with answers (reducing your workload) and it also helps reduce the incidence of getting the same question from many people.


Testing

If you want users (and your future self) to trust that your code actually works, you’ll need to develop a suite of tests using one of the many testing libraries available in Python. You can then use a platform like Travis CI to automatically run those tests each time you change your code, to make sure you haven’t broken anything. Many people add a little code coverage badge to the README file in their code repository using Coveralls, to indicate how much of the code is covered by the tests.
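
At its simplest, a test is just a function containing assertions that a runner like pytest discovers and executes. A small hypothetical example (the Magnus-formula constants are standard, but the function itself is purely for illustration):

```python
import math

def saturation_vapour_pressure(temperature):
    """Approximate saturation vapour pressure (hPa) over water,
    given temperature in degrees Celsius (Magnus formula)."""
    return 6.112 * math.exp(17.62 * temperature / (243.12 + temperature))

def test_freezing_point():
    # At 0C the exponential term is 1, so the result is the coefficient
    assert abs(saturation_vapour_pressure(0.0) - 6.112) < 1e-9

def test_monotonic_increase():
    # Saturation vapour pressure must rise with temperature
    assert saturation_vapour_pressure(20.0) > saturation_vapour_pressure(10.0)
```

Running `pytest` in the repository root will find and execute any function whose name starts with `test_`; a continuous integration service like Travis CI simply does the same on every push.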

Academic publishing

To make sure you get the academic credit you deserve for the hard work associated with releasing and maintaining scientific software, it’s important to publish an academic article about your software (i.e. so that people can cite it in the methods sections of their papers). If there isn’t an existing journal dedicated to the type of software you’ve written (e.g. Geoscientific Model Development), then the Journal of Open Research Software or Journal of Open Source Software are good options.


This is obviously a very broad overview of what’s involved in packaging and releasing scientific software. Depending on where you sit on the scientific code / scientific software spectrum, not all of the things listed above will be necessary. For instance, if you’re writing code that only needs to be used by a group of 5 people working on the same computer system, hosting on GitHub, testing using Travis CI and the use of GitHub issues and Gitter for discussion might be useful, but perhaps not packaging with PyPI or a journal paper with the Journal of Open Research Software.

A great resource for more detailed advice is the Software Sustainability Institute (their online guides are particularly useful). It’s also worth checking out the gold standards in the weather/ocean/climate space. In terms of individual researchers releasing their own software, this would be the eofs and windspharm packages from Andrew Dawson. Packages like MetPy (UCAR / Unidata), Py-ART (ARM Climate Research Facility) and Iris / Cartopy (MetOffice) are good examples of what can be achieved with some institutional support.

October 20, 2017 / Damien Irving

Talk Python To Me

I’m a big fan of the Talk Python To Me podcast, so I was very excited to be invited on the show this week to record an episode about how Python is used in climate science!

If you like the podcast, the episodes with Jonah Duckles from Software Carpentry and Travis Oliphant from Continuum Analytics are super interesting. I’ve also added some new entries to my list of weather/climate science podcasts, so there’s plenty out there to listen to! 🙂

May 8, 2017 / Damien Irving

A vision for CMIP6 in Australia

Most climate researchers would be well aware that phase 6 of the Coupled Model Intercomparison Project (CMIP6) is now underway. The experiments have been designed, the modelling groups are gearing up to run them, and data should begin to come online sometime next year (see this special issue of Geoscientific Model Development for project details). As is always the case with a new iteration of CMIP, this one is going to be bigger and better than the last. By better I mean cooler experiments and improved model documentation (via the shiny new Earth System Documentation website), and by bigger I mean more data. At around 3 Petabytes in total size, CMIP5 was already so big that it was impractical for most individual research institutions to host their own copy. In Australia, the major climate research institutions (e.g. Bureau of Meteorology, CSIRO, ARC Centre of Excellence for Climate System Science – ARCCSS) got around this problem by enlisting the help of the National Computational Infrastructure (NCI) in Canberra. A similar arrangement is currently being planned for CMIP6, so I wanted to share my views (as someone who has spent a large part of the last decade wrangling CMIP3 and CMIP5 data) on what is required to help Australian climate researchers analyse that data with a minimum of fuss.

(Note: In focusing solely on researcher-related issues, I’m obviously ignoring vitally important technical issues related to data storage and funding issues etc. Assuming all that gets sorted, this post looks at how the researcher experience might be improved.)


1. A place to analyse the data

In addition to its sheer size, it’s important to note that the CMIP6 dataset will be in flux for many years as modelling groups begin to contribute data (and then revise and re-issue erroneous data) from 2018 onwards. For both these reasons, it’s not practical for individual researchers and/or institutions to be creating their own duplicate copies of the dataset. Recognising this issue (which is not unique to the CMIP projects), NCI have built a whole computational infrastructure directly on top of their data library, so that researchers can do their data processing without having to copy/move data anywhere. This computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have their Virtual Desktop Infrastructure. These virtual desktops have more grunt than your personal laptop or desktop computer (4 CPUs, 20 GB RAM, 66 GB storage) and come with a whole bunch of data exploration tools pre-installed. Better still, they are isolated from the rest of the system: unlike on Raijin (or any other shared supercomputer), you don’t have to submit processes that will run for longer than 15 or so minutes to a queuing system. I’ve found the virtual desktops to be ideal for analysing CMIP5 data (I do all my CMIP5 data analysis on them, including large full-depth ocean data processing) and can’t see any reason why they wouldn’t be equally suitable for CMIP6.


2. A way to locate and download data

Once you’ve logged into a virtual desktop, you need to be able to (a) locate the CMIP data of interest that’s already been downloaded to the NCI data library, and (b) find out if there’s data of interest available elsewhere on the international Earth System Grid. In the case of CMIP5, Paola Petrelli (with help from the rest of the Computational Modelling Support team at the ARCCSS) has developed an excellent package called ARCCSSive that does both these things. For data located elsewhere on the grid, it also gives you the option of automatically sending a request to Paola for the data to be downloaded to the NCI data library. (They also have a great help channel on Slack if you get stuck and have questions.)

Developing and maintaining a package like ARCCSSive is no trivial task, particularly as the Earth System Grid Federation (ESGF) continually shift the goalposts by tweaking and changing the way the data is made available. In my opinion, one of the highest priority tasks for CMIP6 would be to develop and maintain an ARCCSSive-like tool that researchers can use for data lookup and download requests.


3. A way to systematically report and handle errors in the data

Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it will be many months (if ever) before a file gets corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.
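To make the idea of those value checks concrete, here’s a minimal sketch of the kind of sanity checking a researcher might run over a precipitation field before trusting (or submitting) it. The function name and thresholds are illustrative assumptions, not part of any official CMIP quality-control tooling:

```python
import numpy as np

def check_precipitation(pr):
    """Basic sanity checks on a precipitation rate field (kg m-2 s-1).
    Thresholds are illustrative only, not official CMIP criteria."""
    problems = []
    if np.any(~np.isfinite(pr)):
        problems.append("NaN or infinite values found")
    if np.any(pr < 0):
        problems.append("negative precipitation rates found")
    # 0.006 kg m-2 s-1 is roughly 500 mm/day, so anything above is suspect
    if np.any(pr > 0.006):
        problems.append("implausibly large precipitation rates found")
    return problems

# A toy field containing one physically impossible (negative) value
field = np.array([[1e-5, 2e-5], [-1e-6, 3e-5]])
print(check_precipitation(field))  # → ['negative precipitation rates found']
```

In practice these checks run over entire netCDF files (and also validate the metadata against the CF conventions), but the principle is the same: catch impossible values before they reach downstream analysis.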

The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the farthest we got was a wiki that is not widely used. (Paola has also added warning/errata functionality to the ARCCSSive package so that users can filter out bad data.)

In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and for simple cases submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that’s available in the NCI data library).


4. Community maintained code for common tasks

Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Nino 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try and tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous post. I still love the idea in principle (and commend the BoM and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. The broader climate community is still coming to grips with the idea of managing their personal code with a version control system; it’s a pretty big leap to utilising and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first, and once the culture of code sharing and community contribution is established the CWSLab workflow tool could be revisited.
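To give a sense of how simple (and therefore how needlessly duplicated) these common tasks are, here’s a rough sketch of a Nino 3.4 calculation. It assumes a plain (time, lat, lon) array with longitudes running 0–360, skips the grid-cell area weighting and fixed climatological base period that a careful analysis would use, and the function name is my own:

```python
import numpy as np

def nino34_index(sst, lats, lons):
    """Area-mean SST anomaly over the Nino 3.4 region (5S-5N, 170W-120W).
    A sketch only: no area weighting, and anomalies are taken relative
    to the mean of the whole record rather than a fixed base period."""
    lat_mask = (lats >= -5) & (lats <= 5)
    lon_mask = (lons >= 190) & (lons <= 240)  # 170W-120W in 0-360 longitude
    region_mean = sst[:, lat_mask][:, :, lon_mask].mean(axis=(1, 2))
    return region_mean - region_mean.mean()

# Toy data: 24 monthly SST fields on a coarse global grid
lats = np.arange(-90, 91, 10.0)
lons = np.arange(0, 360, 10.0)
sst = 20 + np.random.default_rng(0).normal(0, 0.5, (24, lats.size, lons.size))
index = nino34_index(sst, lats, lons)
print(index.shape)  # (24,)
```

Every research group ends up writing some variant of this (plus the harder parts: reading the netCDF files, handling different model grids and calendars), which is exactly the duplication a shared community repository would remove.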


In summary, as we look towards CMIP6 in Australia, here’s how things look from the perspective of a scientist who’s been wrangling CMIP data for years:

  1. The NCI virtual desktops are ready to go and fit for purpose.
  2. The ARCCSS software for locating and downloading CMIP5 data is fantastic; developing and maintaining a similar tool for CMIP6 should be a high priority.
  3. The ESGF (or failing that, NCI) could lead a community-wide effort to identify and fix bogus CMIP data files.
  4. A community-maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time.

April 11, 2017 / Damien Irving

Attention scientists: Frustrated with politics? Pick a party and get involved.

The March for Science is coming up on 22 April, so I’m taking a quick detour from my regular focus on research best practice. I’ve been invited to speak at the march in Hobart, Australia, so I thought I’d share what I’m going to say…

In today’s world of alternative facts and hyper-partisan public debate, there are growing calls for scientists to get involved in politics. This might take the form of speaking out on your area of expertise, participating in a non-partisan advocacy group and/or getting involved with a political party. If you think the latter sounds like the least attractive option of the three, you’re not alone. Membership of political parties has been in decline for years, to the point where many sporting clubs have more members. While this might sound like a good reason not to join a political party, I’ve found that it means your involvement can have a bigger impact than ever before.

A little over twelve months ago, I moved to Hobart to take up a postdoctoral fellowship. As part of a new start in a new town, I decided to get actively involved with the Tasmanian Greens. Fast forward a year and I’m now the Convenor of the Denison Branch of the Party. Bob Brown (the father of the environment movement in Australia) started his political career as a Member for Denison in the Tasmanian Parliament and our current representative (Cassy O’Connor MP) is the leader of the Tasmanian Greens, so it’s been a fascinating and humbling experience so far.

Upon taking the plunge into politics, the first thing that struck me was the overwhelming reliance on volunteers. The Tasmanian Greens have very few staff, which means there is an infinite number of ways for volunteers to get involved. If your motivation lies in changing party policy in your area of expertise, you can take a lead role in re-writing that policy and campaigning for the support of the membership. If you’re happy with party policy and want to help achieve outcomes, your professional skills can definitely be put to good use. My data science skills have been in particularly high demand, and I’m now busily involved in managing our database of members and supporters. Besides this practical contribution, the experience has also been great for my mental wellbeing. Rather than simply despair at the current state of politics (which most often means ranting to like-minded friends and followers on social media), I now have an outlet for actively improving the situation.

If you’re a scientist (or simply someone who cares about the importance of knowledge, evidence and objectivity in the political process) and aren’t currently involved with a political party, I’d highly recommend giving it a go. Any party would benefit from the unique knowledge and skills you bring to the table. As with most volunteer experiences, you’ll also get out a whole lot more than you put in.

There are going to be over 400 marches around the world, so check the map and get along to the one nearest you (or better still, contact the organiser and offer to speak).