September 10, 2018 / Damien Irving

Data Carpentry for atmosphere and ocean scientists

This post originally appeared on the Data Carpentry blog.

Back in late 2012, I was a couple of years into my first job out of college. My undergraduate studies had left me somewhat underprepared for the coding associated with analysing climate model data for a national science organisation, so I was searching online for assistance with Python programming. I stumbled upon the website of an organisation called Software Carpentry, which at the time was a relatively small group of volunteers running two-day scientific computing “bootcamps” for researchers. I reached out to ask if they’d be interested in running a workshop alongside the 2013 Annual Conference of the Australian Meteorological and Oceanographic Society (AMOS), and to my surprise Greg Wilson – the co-founder of the organisation – flew out to Australia to teach at our event in Melbourne and another in Sydney (the first ever bootcamps outside of North America and Europe). I trained up as an instructor soon after, and from 2014-2017 I hosted Software Carpentry workshops alongside the AMOS conference, as well as other ad hoc workshops in various meteorology and oceanography departments.

While these workshops were very popular and well received (Software Carpentry workshops always are), in the back of my mind I wanted to have a go at running a workshop designed specifically for atmosphere and ocean scientists. Instead of teaching generic skills in the hope that people would figure out how to apply them in their own context, I wanted to cut out the middle step and run a workshop in the atmosphere and ocean science context. This idea of discipline (or data-type) specific workshops was the driving force behind the establishment of Data Carpentry, so this year with their assistance I’ve developed lesson materials for a complete one-day workshop:

The workshop centres on the task of writing a Python script that calculates and plots the seasonal rainfall climatology (i.e. the average rainfall for each season) from the output of any arbitrary climate model. Such data is typically stored in the netCDF file format and follows a strict “climate and forecasting” metadata convention. Along the way, we learn about the PyAOS stack (i.e. the ecosystem of libraries used in the atmosphere and ocean sciences), how to manage and share a software environment using conda, how to write modular/reusable code, how to write scripts that behave like other command line programs, how to use version control, defensive programming strategies, and how to capture and record the provenance of the data files and figures that we produce.
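For a sense of what the central computation involves: with xarray the climatology is essentially a one-liner (something like `dset['pr'].groupby('time.season').mean('time')`). The grouping logic that performs can be sketched in plain Python with made-up monthly rainfall values:

```python
def seasonal_climatology(rainfall, months):
    """Average rainfall for each standard season (DJF, MAM, JJA, SON).

    rainfall -- sequence of monthly rainfall totals
    months   -- matching sequence of month numbers (1-12)
    """
    seasons = {'DJF': (12, 1, 2), 'MAM': (3, 4, 5),
               'JJA': (6, 7, 8), 'SON': (9, 10, 11)}
    climatology = {}
    for name, season_months in seasons.items():
        values = [r for r, m in zip(rainfall, months) if m in season_months]
        climatology[name] = sum(values) / len(values)
    return climatology

# Two years of made-up monthly rainfall totals (mm), with wet winters (JJA)
months = [m for _ in range(2) for m in range(1, 13)]
rainfall = [120.0 if m in (6, 7, 8) else 60.0 for m in months]
climatology = seasonal_climatology(rainfall, months)
print(climatology['JJA'])  # 120.0
```

The real lessons work with netCDF files and handle things like missing data and unit conversion, but the split-by-season-then-average idea is the same.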

I’ve run the workshop twice now (at the 2018 AMOS Conference in Sydney and at Woods Hole Oceanographic Institution last month), which means I’ve completed the alpha stage of the Data Carpentry lesson development cycle. Moving from the alpha to beta stage involves having people other than me teach, which is where you come in. If you’re a qualified Carpentries instructor and would be interested in teaching the lessons (some experience with the netCDF file format and xarray Python library is useful), please get in touch with either myself or Francois Michonneau (Curriculum Development Lead for Data Carpentry). You can also request a workshop at your institution by contacting us and we’ll reach out to instructors. There is no fee for a pilot workshop, but you would need to cover travel expenses for instructors. I’d also be happy to hear any general feedback about the lesson materials at the associated GitHub repository.

March 16, 2018 / Damien Irving

Collaborative lesson development

If you’re a regular reader of this blog, you’ve no doubt heard me talk about The Carpentries – a global community committed to teaching foundational computational and data science skills to researchers. Hundreds of Software Carpentry and Data Carpentry workshops are held around the world every year, which is a monumental effort for an initiative that only got started in earnest five or so years ago.

While these workshops have had a massive impact on the computational literacy of the research community, in my opinion the most revolutionary thing about The Carpentries is not what we teach, but how we teach it. More specifically, the revolution lies in what we do in the background to develop and maintain our lessons. When a volunteer instructor is preparing to teach a workshop, they have an extensive collection of open and easily accessible lesson materials to select from. Unlike a static textbook produced by a small group of authors, these lessons are continually refined and updated by a large and diverse community of contributors, which means they are (by a wide margin) the state-of-the-art lessons in their discipline.

This process of community lesson development is probably best explained by reflecting on my own personal experience. I participated in The Carpentries instructor training program back in 2013, which among other things gave me a grounding in various evidence-based best practices of teaching (i.e. an understanding of the fundamental pedagogical principles underpinning the lessons). Upon teaching my first few workshops, I started to contribute back to the lessons by fixing typos and other minor issues identified by participants. As I gained more experience with teaching the materials, I started to make more substantive contributions, proposing and participating in re-writes of entire sections. Now with over twenty workshops under my belt, I’m writing a whole new set of Data Carpentry lessons specifically for atmosphere and ocean scientists (see here for a sneak peek).

The contrast between this process and that typically followed by university lecturers could not be more stark. While contributing back to The Carpentries lesson materials can be a little tedious at times, it is WAY less work (and results in a much superior product) than if I had to develop the materials for my workshops myself. Most lecturers get some hand-me-down materials from the staff member who came before them (if they’re lucky), and then they’re on their own. What’s more, anything they learn about teaching their discipline better has no impact beyond the four walls of their own classroom.

By following the community-based lead of The Carpentries, the quality of teaching at universities could be improved, while at the same time saving substantial time for lecturers. For instance, I can think of at least five universities that teach detailed courses on the weather and climate of Australia. If the teachers of these courses got together (perhaps facilitated by the Australian Meteorological and Oceanographic Society) and started to collaboratively develop and maintain a set of lesson materials, the educational experience for students (and all the good things that come from that; e.g. student retention, graduate quality and numbers) could improve markedly.

Of course, the one thing I’ve skipped over here is all the ingredients that make community lesson development work. What platforms are best for hosting the materials? What do you do when people disagree on the direction of the lessons? How do you structure the lessons for ease of use and contribution? To try and assist people with this, a bunch of Carpentries people (myself included) got together recently and published Ten Simple Rules for Collaborative Lesson Development. It distills everything we’ve learned over the years, and is hopefully a useful resource for anyone thinking of giving it a try.



February 8, 2018 / Damien Irving

Semantic versioning for our major climate reports?

I’m attending the joint AMOS / ICSHMO Conference in Sydney this week, where among other topics there’s been a lively discussion about preliminary plans for the next generation of climate projections for Australia.

In the past, new projections have been released by our major national science agencies (i.e. the Bureau of Meteorology and CSIRO) every 5-10 years: 1992, 1996, 2001, 2007 and 2015 (Whetton et al., 2016). This approach seemed appropriate for the 1990s, but with such widespread use of the internet nowadays and the rapid pace of advances in climate research, the custom of publishing a periodic static report (and/or website) seems somewhat dated.

An alternative approach would be to follow the lead of the software development community. After a software package is released, it is common practice to keep track of subsequent updates and improvements via the use of semantic versioning. The convention for labeling each new version release is MAJOR.MINOR.PATCH (e.g. version 2.5.3), where you increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes

The authors of the software then update the changelog for the project, which lists the notable changes for each new version. (The new changelog entry is often emailed to users of the software.)
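To make the convention concrete, here is a minimal sketch in Python (the function names are my own, not from any particular library). Note that parsing into tuples of integers matters, because naive string comparison would wrongly rank version "2.10.0" below "2.9.9":

```python
def parse_version(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = version.split('.')
    return (int(major), int(minor), int(patch))

def bump(version, part):
    """Return a new version string with the given part incremented."""
    major, minor, patch = parse_version(version)
    if part == 'major':
        return f'{major + 1}.0.0'
    if part == 'minor':
        return f'{major}.{minor + 1}.0'
    return f'{major}.{minor}.{patch + 1}'

print(bump('2.5.3', 'minor'))  # 2.6.0
# Integer tuples compare correctly where raw strings would not
print(parse_version('2.10.0') > parse_version('2.9.9'))  # True
```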

It’s easy to imagine a situation where national climate projections could be posted online, with updates released on an as-needs basis by increments to the:

  • MAJOR version when there is a fundamental change to the projections themselves (e.g. a critical new dataset becomes available, like CMIP6)
  • MINOR version when content is added for a highly topical issue (e.g. an international climate target/agreement is reached, a major climate event occurs such as the warming hiatus)
  • PATCH version when minor research updates are added

The most obvious advantage of this approach is that frequent incremental updates ensure that the projections remain up-to-date with both user demands and the latest science. Most of the work would still get done in the lead up to the release of a new major version, but incremental improvements could be made in the meantime, rather than waiting years for the next report.

October 24, 2017 / Damien Irving

Best practices for scientific software

Code written by a research scientist typically lies somewhere on a continuum ranging from “scientific code” that was simply hacked together for individual use (e.g. to produce a figure for a journal paper) to “scientific software” that has been formally packaged and released for use by the wider community.

I’ve written at length (e.g. how to write a reproducible paper) about the best practices that apply to the scientific code end of the spectrum, so in this post I wanted to turn my attention to scientific software. In other words, what’s involved in turning scientific code into something that anyone can use?

My attempt at answering this question is based on my experiences as an Associate Editor with the Journal of Open Research Software. I’m focusing on Python since (a) most new scientific software in the weather/ocean/climate sciences is written in that language, and (b) it’s the language I’m most familiar with.


Hosting

First off, you’ll need to create a repository on a site like GitHub or Bitbucket to host your (version controlled) software. As well as providing the means to make your code available to the community, these sites have features that help with things like community discussion and software release management. One of the first things you’ll need to include in your repository is a software license. Jake VanderPlas has an excellent post on why you need a license and how to pick one.

Packaging / installation

If you want people to use your software, you need to make it as easy as possible for them to install it. In Python, this means packaging the code in such a way that it can be made available via the Python Package Index (PyPI). If your code and all the libraries it depends on are written purely in Python, then this is all you need to do. People will simply be able to “pip install” your software from the command line.
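As a rough sketch, a minimal `setup.py` for a pure-Python package might look something like the following (the package name, description and dependencies here are hypothetical placeholders):

```python
# setup.py -- minimal packaging sketch for a pure-Python package
from setuptools import setup, find_packages

setup(
    name='mypackage',
    version='0.1.0',
    description='Tools for analysing climate model output',
    packages=find_packages(),
    install_requires=['numpy', 'xarray'],
)
```

With that in place, building a source distribution (`python setup.py sdist`) and uploading it to PyPI with a tool like twine is typically all that stands between your code and `pip install mypackage`.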

If your software has non-Python dependencies (e.g. netCDF libraries), then it’s a good idea to make sure that it can also be installed via conda. Using recipes that developers (i.e. you, in this case) submit to conda-forge, this popular package manager installs software and all its dependencies at once. I’ve talked extensively about conda in a previous post.


Documentation

While it might seem like the documentation pages for your favourite Python libraries were painstakingly typed by hand, they were almost certainly created using software that automatically takes all the information from the docstrings in your code and formats it nicely for display on the web. In most cases, people use Sphinx to generate the documentation and Read the Docs to publish it (here’s a nice description of that whole process).
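For example, a docstring written in the NumPy documentation style (one of the conventions Sphinx can parse, via its napoleon extension) might look like this; the function itself is just a made-up illustration:

```python
def moving_average(data, window):
    """Compute a simple moving average.

    Parameters
    ----------
    data : list of float
        Input time series.
    window : int
        Number of consecutive points to average over.

    Returns
    -------
    list of float
        Averaged series of length ``len(data) - window + 1``.
    """
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]

print(moving_average([1.0, 2.0, 3.0, 4.0], 2))  # [1.5, 2.5, 3.5]
```

Write docstrings like this consistently and the published documentation pages largely generate themselves.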


Support

In providing assistance to users, software projects will typically use a combination of encouraging people to submit issues on their GitHub/Bitbucket page (for technical questions that will possibly require a change to the code) and platforms like Google Groups and/or Gitter (a chat client provided by GitLab) for more general questions about how to use the software.

The bonus of GitHub issues, Google Groups and Gitter is that anyone can view the questions and answers, not just the lead developers of the software. This means that random people from the community can chime in with answers (reducing your workload) and it also helps reduce the incidence of getting the same question from many people.


Testing

If you want users (and your future self) to trust that your code actually works, you’ll need to develop a suite of tests using one of the many testing libraries available in Python. You can then use a platform like Travis CI to automatically run those tests each time you change your code, to make sure you haven’t broken anything. Many people add a little code coverage badge to the README file in their code repository using Coveralls, to indicate how much of the code is covered by the tests.
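As a sketch of what this looks like with pytest (the file and function names here are hypothetical), any functions whose names start with `test_` are discovered and run automatically:

```python
# test_convert.py -- a minimal pytest-style test file

def kelvin_to_celsius(temp):
    """Convert a temperature from Kelvin to degrees Celsius."""
    if temp < 0:
        raise ValueError('Kelvin temperatures cannot be negative')
    return temp - 273.15

def test_freezing_point():
    assert kelvin_to_celsius(273.15) == 0.0

def test_absolute_zero():
    assert kelvin_to_celsius(0) == -273.15

# pytest discovers and runs the test_* functions automatically;
# they can also be called directly
test_freezing_point()
test_absolute_zero()
```

Running `pytest` in the repository directory executes the tests; a Travis CI configuration file (`.travis.yml`) can then tell Travis to do the same on every push.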

Academic publishing

To make sure you get the academic credit you deserve for the hard work associated with releasing and maintaining scientific software, it’s important to publish an academic article about your software (i.e. so that people can cite it in the methods sections of their papers). If there isn’t an existing journal dedicated to the type of software you’ve written (e.g. Geoscientific Model Development), then the Journal of Open Research Software or Journal of Open Source Software are good options.


This is obviously a very broad overview of what’s involved in packaging and releasing scientific software. Depending on where you sit on the scientific code / scientific software spectrum, not all of the things listed above will be necessary. For instance, if you’re writing code that only needs to be used by a group of 5 people working on the same computer system, hosting on GitHub, testing using Travis CI and the use of GitHub issues and Gitter for discussion might be useful, but perhaps not packaging with PyPI or a journal paper with the Journal of Open Research Software.

A great resource for more detailed advice is the Software Sustainability Institute (their online guides are particularly useful). It’s also worth checking out the gold standards in the weather/ocean/climate space. In terms of individual researchers releasing their own software, this would be the eofs and windspharm packages from Andrew Dawson. Packages like MetPy (UCAR / Unidata), Py-ART (ARM Climate Research Facility) and Iris / Cartopy (MetOffice) are good examples of what can be achieved with some institutional support.

October 20, 2017 / Damien Irving

Talk Python To Me

I’m a big fan of the Talk Python To Me podcast, so I was very excited to be invited on the show this week to record an episode about how Python is used in climate science!

If you like the podcast, the episodes with Jonah Duckles from Software Carpentry and Travis Oliphant from Continuum Analytics are super interesting. I’ve also added some new entries to my list of weather/climate science podcasts, so there’s plenty out there to listen to! 🙂

May 8, 2017 / Damien Irving

A vision for CMIP6 in Australia

Most climate researchers would be well aware that phase 6 of the Coupled Model Intercomparison Project (CMIP6) is now underway. The experiments have been designed, the modelling groups are gearing up to run them, and data should begin to come online sometime next year (see this special issue of Geoscientific Model Development for project details). As is always the case with a new iteration of CMIP, this one is going to be bigger and better than the last. By better I mean cooler experiments and improved model documentation (via the shiny new Earth System Documentation website), and by bigger I mean more data. At around 3 Petabytes in total size, CMIP5 was already so big that it was impractical for most individual research institutions to host their own copy. In Australia, the major climate research institutions (e.g. Bureau of Meteorology, CSIRO, ARC Centre of Excellence for Climate System Science – ARCCSS) got around this problem by enlisting the help of the National Computational Infrastructure (NCI) in Canberra. A similar arrangement is currently being planned for CMIP6, so I wanted to share my views (as someone who has spent a large part of the last decade wrangling CMIP3 and CMIP5 data) on what is required to help Australian climate researchers analyse that data with a minimum of fuss.

(Note: In focusing solely on researcher-related issues, I’m obviously ignoring vitally important technical issues related to data storage and funding issues etc. Assuming all that gets sorted, this post looks at how the researcher experience might be improved.)


1. A place to analyse the data

In addition to its sheer size, it’s important to note that the CMIP6 dataset will be in flux for many years as modelling groups begin to contribute data (and then revise and re-issue erroneous data) from 2018 onwards. For both these reasons, it’s not practical for individual researchers and/or institutions to be creating their own duplicate copies of the dataset. Recognising this issue (which is not unique to the CMIP projects), NCI have built a whole computational infrastructure directly on top of their data library, so that researchers can do their data processing without having to copy/move data anywhere. This computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have their Virtual Desktop Infrastructure. These virtual desktops have more grunt than your personal laptop or desktop computer (4 CPUs, 20 GB RAM, 66 GB storage) and come with a whole bunch of data exploration tools pre-installed. Better still, they are isolated from the rest of the system in the sense that unlike when you’re working on Raijin (or any other shared supercomputer), you don’t have to submit processes that will take longer than 15 or so minutes to the queuing system. I’ve found the virtual desktops to be ideal for analysing CMIP5 data (I do all my CMIP5 data analysis on them, including large full-depth ocean data processing) and can’t see any reason why they wouldn’t be equally suitable for CMIP6.


2. A way to locate and download data

Once you’ve logged into a virtual desktop, you need to be able to (a) locate the CMIP data of interest that’s already been downloaded to the NCI data library, and (b) find out if there’s data of interest available elsewhere on the international Earth System Grid. In the case of CMIP5, Paola Petrelli (with help from the rest of the Computational Modelling Support team at the ARCCSS) has developed an excellent package called ARCCSSive that does both these things. For data located elsewhere on the grid, it also gives you the option of automatically sending a request to Paola for the data to be downloaded to the NCI data library. (They also have a great help channel on Slack if you get stuck and have questions.)

Developing and maintaining a package like ARCCSSive is no trivial task, particularly as the Earth System Grid Federation (ESGF) continually shift the goalposts by tweaking and changing the way the data is made available. In my opinion, one of the highest priority tasks for CMIP6 would be to develop and maintain an ARCCSSive-like tool that researchers can use for data lookup and download requests.


3. A way to systematically report and handle errors in the data

Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it will be many months (if ever) before a file gets corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.

The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the farthest we got was a wiki that is not widely used. (Paola has also added warning/errata functionality to the ARCCSSive package so that users can filter out bad data.)

In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and for simple cases submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that’s available in the NCI data library).


4. Community maintained code for common tasks

Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Nino 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try and tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous post. I still love the idea in principle (and commend the BoM and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. The broader climate community is still coming to grips with the idea of managing its personal code with a version control system; it’s a pretty big leap to utilising and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first, and once the culture of code sharing and community contribution is established the CWSLab workflow tool could be revisited.


In summary, as we look towards CMIP6 in Australia, here’s how things look from the perspective of a scientist who’s been wrangling CMIP data for years:

  1. The NCI virtual desktops are ready to go and fit for purpose
  2. The ARCCSS software for locating and downloading CMIP5 data is fantastic. Developing and maintaining a similar tool for CMIP6 should be a high priority.
  3. The ESGF (or failing that, NCI) could lead a community-wide effort to identify and fix bogus CMIP data files
  4. A community maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time

April 11, 2017 / Damien Irving

Attention scientists: Frustrated with politics? Pick a party and get involved.

The March for Science is coming up on 22 April, so I’m taking a quick detour from my regular focus on research best practice. I’ve been invited to speak at the march in Hobart, Australia, so I thought I’d share what I’m going to say…

In today’s world of alternative facts and hyper-partisan public debate, there are growing calls for scientists to get involved in politics. This might take the form of speaking out on your area of expertise, participating in a non-partisan advocacy group and/or getting involved with a political party. If you think the latter sounds like the least attractive option of the three, you’re not alone. Membership of political parties has been in decline for years, to the point where many sporting clubs have more members. While this might sound like a good reason not to join a political party, I’ve found that it means your involvement can have a bigger impact than ever before.

A little over twelve months ago, I moved to Hobart to take up a postdoctoral fellowship. As part of a new start in a new town, I decided to get actively involved with the Tasmanian Greens. Fast forward a year and I’m now the Convenor of the Denison Branch of the Party. Bob Brown (the father of the environment movement in Australia) started his political career as a Member for Denison in the Tasmanian Parliament and our current representative (Cassy O’Connor MP) is the leader of the Tasmanian Greens, so it’s been a fascinating and humbling experience so far.

Upon taking the plunge into politics, the first thing that struck me was the overwhelming reliance on volunteers. The Tasmanian Greens have very few staff, which means there is an infinite number of ways for volunteers to get involved. If your motivation lies in changing party policy in your area of expertise, you can take a lead role in re-writing that policy and campaigning for the support of the membership. If you’re happy with party policy and want to help achieve outcomes, your professional skills can definitely be put to good use. My data science skills have been in particularly high demand, and I’m now busily involved in managing our database of members and supporters. Besides this practical contribution, the experience has also been great for my mental wellbeing. Rather than simply despair at the current state of politics (which most often means ranting to like-minded friends and followers on social media), I now have an outlet for actively improving the situation.

If you’re a scientist (or simply someone who cares about the importance of knowledge, evidence and objectivity in the political process) and aren’t currently involved with a political party, I’d highly recommend giving it a go. Any party would benefit from the unique knowledge and skills you bring to the table. As with most volunteer experiences, you’ll also get out a whole lot more than you put in.

There are going to be over 400 marches around the world, so check the map and get along to the one nearest you (or better still, contact the organiser and offer to speak).

February 15, 2017 / Damien Irving

The research police

You know who I’m talking about. I’m sure every research community has them. Those annoying do-gooders who constantly advocate for things to be done the right way. When you’re trying to take a shortcut, it’s their nagging voice in the back of your mind. You appreciate that what they’re saying is important, but with so much work and so little time, you don’t always want to hear it. Since I’m fond of creating lists on this blog, here’s my research police of the weather, ocean and climate sciences:



Statistics

Dan Wilks is a widely regarded statistics guru in the atmospheric sciences. He is the author of the most clearly written statistics textbook I’ve ever come across, as well as great articles such as this recent essay in BAMS, which is sure to make you feel bad if you’ve ever plotted significance stippling.


Data visualisation

Ed Hawkins’ climate spiral visualisation received worldwide media coverage in 2016 (and even featured in the opening ceremony of the Rio Olympics). He makes the list of research police due to his end the rainbow campaign, which advocates for the use of more appropriate colour scales in climate science.



Scientific writing

David Schultz is the Chief Editor of Monthly Weather Review and has authored well over 100 research articles, but is probably best known as the “Eloquent Science guy.” His book and blog are a must-read for anyone wanting to improve their academic writing, reviewing and speaking.



Reproducible research

Unfortunately I’m going to have to self-nominate here, as I’ve been a strong advocate for publishing reproducible computational results for a number of years now (see related post and BAMS essay). To help researchers do this, I’ve taught at over 20 Software Carpentry workshops and I’m the lead author of their climate-specific lesson materials.


If I’ve missed any other research police, please let me know in the comments!

January 11, 2017 / Damien Irving

Need help with reproducible research? These organisations have got you covered.

The reproducibility crisis in modern research is a multi-faceted problem. If you’re working in the life sciences, for instance, experimental design and poor statistical power are big issues. For the weather, ocean and climate sciences, the big issue is code and software availability. We don’t document the details of the code and software used to analyse and visualise our data, which means it’s impossible to interrogate our methods and reproduce our results.

(For the purposes of this post, research “software” is something that has been packaged and released for use by the wider community, whereas research “code” is something written just for personal use. For instance, I might have written some code to perform and plot an EOF analysis, which calls and executes functions from the eofs software package that is maintained by Andrew Dawson at Oxford University.)

Unbeknown to most weather, ocean and climate scientists, there are a number of groups out there that want to help you make your work more reproducible. Here’s a list of the key players and what they’re up to…


Software Sustainability Institute (SSI)

The SSI is the go-to organisation for people who write and maintain scientific software. They provide training and support, advocate for formal career paths for scientific software developers and manage the Journal of Open Research Software, where you can publish the details of your software so that people can cite your work. They focus mainly on researchers in the UK, so it’s my hope that organisations like SSI will start popping up in other countries around the world.



OntoSoft

The OntoSoft project in the US has a bit of overlap with the SSI (e.g. they’re working on “software commons” infrastructure where people can submit their geoscientific software so that it can be searched and discovered by others), but in addition their Geoscientific Paper of the Future (GPF) initiative has been looking at the broader issue of how researchers should go about publishing the details of the digital aspects of their research (i.e. data, code, software and provenance/workflow). In a special GPF issue of Earth and Space Science, researchers from a variety of geoscience disciplines share their experiences in trying to document their digital research methods. The lead paper from that issue gives a fantastic overview of the options available to researchers. (My own work in this area gives a slightly more practical overview but in general covers many of the same ideas.)


Software Carpentry

The global network of volunteer Software Carpentry instructors run hundreds of two-day workshops around the world each year, teaching the skills needed to write reusable, testable and ultimately reproducible code (i.e. to do the things suggested by the GPF). Their teaching materials have been developed and refined for more than a decade and every instructor undergoes formal training, which means you won’t find a better learning experience anywhere. To get a workshop happening at your own institution, you simply need to submit a request at their website. They’ll then assist with finding local instructors and all the other logistics that go along with running a workshop. A sibling organisation called Data Carpentry has recently been launched, so it’s also worth checking to see if their more discipline-specific, data-centric lessons would be a better fit.


Mozilla Science Lab

Once you’ve walked out of a two-day Software Carpentry workshop, it can be hard to find ongoing support for your coding. The best form of support usually comes from an engaged and well connected local community, so the Mozilla Science Lab assists researchers in forming and maintaining in-person study groups. If there isn’t already a study group in your area, I’d highly recommend their study group handbook. It has a bunch of useful advice and resources for getting one started, plus they periodically run online orientation courses to go through the handbook content in detail.


Hopefully one or more of those organisations will be useful in your attempts to make your work more reproducible – please let me know in the comments if there are other groups or resources that I’ve missed!


October 4, 2016 / Damien Irving

The weather/climate Python stack

It would be an understatement to say that Python has exploded onto the data science scene in recent years. PyCon and SciPy conferences are held somewhere in the world every few months now, at which loads of new and/or improved data science libraries are showcased to the community. When the videos from these conferences are made available online (which happens almost immediately), I’m always filled with a mixture of joy and dread. The ongoing rapid development of new libraries means that data scientists are (hopefully) continually able to do more and more cool things with less and less time and effort, but at the same time it can be difficult to figure out how they all relate to one another. To assist in making sense of this constantly changing landscape, this post summarises the current state of the weather and climate Python software “stack” (i.e. the collection of libraries used for data analysis and visualisation). My focus is on libraries that are widely used and that have good (and likely long-term) support, but I’m happy to hear of others that you think I might have missed!


The weather/climate Python stack.



The dashed box in the diagram represents the core of the stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has a bunch of built-in functions for reading and writing common data formats like .csv, but if your data is stored in netCDF format then the default library for getting data into/out of those files is netCDF4.
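As a quick taste of the core stack, here’s a minimal sketch of writing and reading a small array with NumPy (the array values and file name are purely illustrative):

```python
import numpy as np

# A tiny 2x2 array of made-up rainfall values
data = np.array([[1.5, 2.0], [0.5, 3.5]])

# NumPy's built-in text I/O handles simple formats like .csv...
np.savetxt("rainfall.csv", data, delimiter=",")

# ...and reads them back as plain NumPy arrays
loaded = np.loadtxt("rainfall.csv", delimiter=",")
print(loaded.shape)  # (2, 2)
```

(For netCDF files you’d reach for the netCDF4 library instead, since NumPy itself has no notion of that format.)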

Once you’ve read your data in, you’re probably going to want to do some statistical analysis. The NumPy library has some built-in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default.
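To make that division of labour concrete, here’s a small sketch contrasting NumPy’s built-in statistics with a slightly more involved SciPy calculation (the rainfall values are made up for illustration):

```python
import numpy as np
from scipy import stats

# Made-up annual rainfall anomalies
rainfall = np.array([3.1, 0.0, 5.2, 2.4, 8.9, 1.0])

# Simple statistics come built into NumPy...
print(rainfall.mean(), rainfall.max(), rainfall.std())

# ...while SciPy handles more complex analysis,
# e.g. a linear regression against time
years = np.arange(6)
result = stats.linregress(years, rainfall)
print(result.slope, result.pvalue)
```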

The NumPy library doesn’t come with any plotting capability, so if you want to visualise your NumPy data arrays then the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for producing simple, static plots (e.g. bar charts, contour plots, line graphs) in formats like .png, .eps and .pdf. The cartopy library provides additional functionality for common map projections, while Bokeh allows for the creation of interactive plots where you can zoom and scroll.
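A minimal matplotlib example along these lines might look as follows (the data and output file name are illustrative, and the non-interactive Agg backend is used so the plot is saved to file rather than displayed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: save plots rather than display them
import matplotlib.pyplot as plt
import numpy as np

# Made-up monthly rainfall values
months = np.arange(1, 13)
rainfall = np.random.rand(12) * 100

fig, ax = plt.subplots()
ax.plot(months, rainfall)
ax.set_xlabel("Month")
ax.set_ylabel("Rainfall (mm)")
fig.savefig("rainfall.png")  # .eps or .pdf work the same way
```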

While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible – they can’t do everything like the core stack can – but they can do common tasks with far less effort…


Generic additions

Let’s first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data.loc['Fred', 'height'], rather than having to remember the numeric indexes corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
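The medical dataset example above might look like this in pandas (the names and values are of course hypothetical):

```python
import pandas as pd

# A hypothetical medical dataset with labelled rows and columns
data = pd.DataFrame(
    {"height": [180, 165], "weight": [80, 60]},
    index=["Fred", "Mary"],
)

# Refer to elements by label rather than numeric position
fred_height = data.loc["Fred", "height"]
print(fred_height)  # 180
```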

One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.) With the recent announcement of funding for the Pangeo project, the future is bright in terms of further development of xarray for the purposes of big data analysis in the geosciences. (That project will also support the development of some discipline specific libraries for things like thermodynamics, regridding and vector calculus.)
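Here’s a small sketch of the labelled array concept in xarray, using made-up rainfall values on a tiny latitude/longitude grid:

```python
import numpy as np
import xarray as xr

# A tiny 2x2 lat/lon grid of made-up rainfall values
da = xr.DataArray(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    dims=("lat", "lon"),
    coords={"lat": [-20, -10], "lon": [50, 60]},
    name="rainfall",
)

# Select by actual coordinate values (20 South, 50 East)
# rather than numeric position
value = da.sel(lat=-20, lon=50)
print(float(value))  # 1.0
```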


Discipline-specific additions

While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the Met Office have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.

In addition to Iris, you may also come across CDAT. It was the precursor to xarray and Iris in the sense that it was the first package for weather and climate scientists built on top of the core Python stack. The funding and direction of that project has since shifted towards developing a graphical interface for managing large workflows and visualising data (i.e. as opposed to further developing the capabilities of the underlying Python libraries), which is why I now consider it to have been overtaken by Iris and xarray.

In terms of choosing between xarray and Iris, some people like the slightly more weather/climate-centric experience offered by Iris, while others don’t like the restrictions that places on their work and prefer the generic xarray experience (e.g. to use Iris your netCDF data files have to be CF compliant or close to it). Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.


Simplifying data exploration

While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), there’s still a fair bit of messing around involved in tweaking the various aspects of a plot (e.g. colour schemes, plot size, labels, map projections, etc). This tweaking burden is an issue across all data science fields and programming languages, so developers of the latest generation of visualisation tools are moving towards something called declarative visualisation. The basic concept is that the user simply has to describe the characteristics of their data, and then the software figures out the optimal way to visualise it (i.e. it makes all the tweaking decisions for you).

The two major Python libraries in the declarative visualisation space are HoloViews and Altair. The former (which has been around much longer) uses matplotlib or Bokeh under the hood, which means it allows for the generation of static or interactive plots. Since HoloViews doesn’t have support for geographic plots, GeoViews has been created on top of it (which incorporates cartopy and can handle Iris or xarray data arrays).


Sub-discipline-specific libraries

So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Continuum Analytics supports conda, Bokeh and HoloViews/GeoViews), large institutions (e.g. the Met Office supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of weather and climate science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries (which are able to handle data arrays from NumPy, Iris or xarray). Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy. It would be impossible to list all the sub-discipline-specific libraries in this post, but the PyAOS community is an excellent resource if you’re trying to find out what’s available in your area of research.


Installing the stack

While the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific/number-crunching libraries are written (at least partly) in faster languages like C, because speed is important when data arrays get really large. Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite scientific Python libraries working together used to be problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released a package manager called conda, which is able to handle non-Python dependencies. The documentation for almost all modern scientific Python packages will suggest that you use conda for installation.
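For example, the stack described in this post might be captured in a conda environment.yml file along the following lines (the environment name and exact library list are illustrative):

```yaml
# environment.yml -- an illustrative conda environment specification
name: pyaos
channels:
  - conda-forge
dependencies:
  - python=3
  - numpy
  - scipy
  - matplotlib
  - netcdf4
  - xarray
  - iris
  - cartopy
```

Running `conda env create -f environment.yml` then builds the whole environment in one go, non-Python dependencies included.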


Navigating the stack

All of the additional libraries discussed in this post essentially exist to hide the complexity of the core libraries (in software engineering this is known as abstraction). Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib. GeoViews was built to hide some of the complexity of Iris, cartopy and Bokeh. So if you want to start exploring your data, start at the top right of the stack and move your way down and left as required. If GeoViews doesn’t have quite the right functions for a particular plot that you want to create, drop down a level and use some Iris and cartopy functions. If Iris doesn’t have any functions for a statistical procedure that you want to apply, go back down another level and use SciPy. By starting at the top right and working your way back, you’ll ensure that you never re-invent the wheel. Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4) for extracting the metadata contained within a netCDF file, for instance, only to find that Iris automatically keeps this information upon reading a file. In this way, a solid working knowledge of the scientific Python stack can save you a lot of time and effort.