October 24, 2017 / Damien Irving

Best practices for scientific software

Code written by a research scientist typically lies somewhere on a continuum ranging from “scientific code” that was simply hacked together for individual use (e.g. to produce a figure for a journal paper) to “scientific software” that has been formally packaged and released for use by the wider community.

I’ve written at length (e.g. how to write a reproducible paper) about the best practices that apply to the scientific code end of the spectrum, so in this post I wanted to turn my attention to scientific software. In other words, what’s involved in turning scientific code into something that anyone can use?

My attempt at answering this question is based on my experiences as an Associate Editor with the Journal of Open Research Software. I’m focusing on Python since (a) most new scientific software in the weather/ocean/climate sciences is written in that language, and (b) it’s the language I’m most familiar with.


First off, you’ll need to create a repository on a site like GitHub or Bitbucket to host your (version controlled) software. As well as providing the means to make your code available to the community, these sites have features that help with things like community discussion and software release management. One of the first things you’ll need to include in your repository is a software license. Jake VanderPlas has an excellent post on why you need a license and how to pick one.

Packaging / installation

If you want people to use your software, you need to make it as easy as possible for them to install it. In Python, this means packaging the code in such a way that it can be made available via the Python Package Index (PyPI). If your code and all the libraries it depends on are written purely in Python, then this is all you need to do. People will simply be able to “pip install” your software from the command line.

If your software has non-Python dependencies (e.g. netCDF libraries), then it’s a good idea to make sure that it can also be installed via conda. Using recipes that developers (i.e. you, in this case) submit to conda-forge, this popular package manager installs software and all its dependencies at once. I’ve talked extensively about conda in a previous post.


While it might seem like the documentation pages for your favourite Python libraries were painstakingly typed by hand, they were almost certainly created using software that automatically takes all the information from the docstrings in your code and formats them nicely for display on the web. In most cases, people use Sphinx to generate the documentation and Read the Docs to publish it (here’s a nice description of that whole process).
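To give a sense of what those docstrings look like, here is a minimal sketch written in the NumPy docstring style, which Sphinx can parse via its napoleon extension. The function itself (`anomaly`) is a made-up example, not taken from any particular library.

```python
def anomaly(values, climatology):
    """Subtract a climatological mean from each value.

    Parameters
    ----------
    values : list of float
        The data values (e.g. monthly mean temperatures).
    climatology : float
        The long-term mean to subtract.

    Returns
    -------
    list of float
        The anomaly corresponding to each input value.

    """
    return [value - climatology for value in values]


# The docstring is stored on the function, which is where
# documentation tools pick it up from.
print(anomaly.__doc__)
```

Because the documentation lives next to the code, updating one without the other becomes much harder to do by accident.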


In providing assistance to users, software projects will typically use a combination of encouraging people to submit issues on their GitHub/Bitbucket page (for technical questions that will possibly require a change to the code) and platforms like Google Groups and/or Gitter (a chat client provided by GitLab) for more general questions about how to use the software.

The bonus of GitHub issues, Google Groups and Gitter is that anyone can view the questions and answers, not just the lead developers of the software. This means that random people from the community can chime in with answers (reducing your workload) and it also helps reduce the incidence of getting the same question from many people.


If you want users (and your future self) to trust that your code actually works, you’ll need to develop a suite of tests using one of the many testing libraries available in Python. You can then use a platform like Travis CI to automatically run those tests each time you change your code, to make sure you haven’t broken anything. Many people add a little code coverage badge to the README file in their code repository using Coveralls, to indicate how much of the code is covered by the tests.
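As a rough illustration, a test can be as simple as a function containing an assert statement. The sketch below follows the naming convention used by pytest (one of those testing libraries), which collects functions whose names start with `test_`. The `celsius_to_kelvin` function is a hypothetical stand-in for your own code.

```python
def celsius_to_kelvin(temperature):
    """Convert a temperature from degrees Celsius to Kelvin."""
    if temperature < -273.15:
        raise ValueError("temperature is below absolute zero")
    return temperature + 273.15


def test_freezing_point():
    # A known answer: 0 degrees Celsius is 273.15 Kelvin.
    assert celsius_to_kelvin(0.0) == 273.15


def test_rejects_impossible_value():
    # Physically impossible input should raise an error, not return a number.
    try:
        celsius_to_kelvin(-300.0)
    except ValueError:
        return
    raise AssertionError("expected a ValueError")
```

If these functions were saved in a file called something like `test_conversions.py`, running `pytest` in that directory would find and run them.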

Academic publishing

To make sure you get the academic credit you deserve for the hard work associated with releasing and maintaining scientific software, it’s important to publish an academic article about your software (i.e. so that people can cite it in the methods sections of their papers). If there isn’t an existing journal dedicated to the type of software you’ve written (e.g. Geoscientific Model Development), then the Journal of Open Research Software or Journal of Open Source Software are good options.


This is obviously a very broad overview of what’s involved in packaging and releasing scientific software. Depending on where you sit on the scientific code / scientific software spectrum, not all of the things listed above will be necessary. For instance, if you’re writing code that only needs to be used by a group of 5 people working on the same computer system, hosting on GitHub, testing using Travis CI and the use of GitHub issues and Gitter for discussion might be useful, but perhaps not packaging with PyPI or a journal paper with the Journal of Open Research Software.

A great resource for more detailed advice is the Software Sustainability Institute (their online guides are particularly useful). It’s also worth checking out the gold standards in the weather/ocean/climate space. In terms of individual researchers releasing their own software, this would be the eofs and windspharm packages from Andrew Dawson. Packages like MetPy (UCAR / Unidata), Py-ART (ARM Climate Research Facility) and Iris / Cartopy (MetOffice) are good examples of what can be achieved with some institutional support.

October 20, 2017 / Damien Irving

Talk Python To Me

I’m a big fan of the Talk Python To Me podcast, so I was very excited to be invited on the show this week to record an episode about how Python is used in climate science!

If you like the podcast, the episodes with Jonah Duckles from Software Carpentry and Travis Oliphant from Continuum Analytics are super interesting. I’ve also added some new entries to my list of weather/climate science podcasts, so there’s plenty out there to listen to! 🙂

August 16, 2017 / Damien Irving

A PyAOS masterclass

Immediately prior to the joint AMOS / ICSHMO conference in Sydney next February, I’ll be running a one-day Python masterclass. The class will be targeted at people who are already using Python in the atmosphere and ocean sciences (i.e. “PyAOS”). They don’t necessarily need to be highly proficient in Python, but a strong familiarity with the syntax and basic constructs such as loops, lists and conditionals (i.e. if statements) is required.

In thinking about what topics to cover at the masterclass, I’ve decided to focus on a suite of programming concepts and best practices that aren’t so easy to glean from a quick Google search. In other words, we aren’t going to spend time learning the specifics of how to create a beautiful plot using the iris library or calculate a monthly anomaly timeseries using xarray (although people will be exposed to some of this stuff along the way), because those are things that people can easily look up for themselves (e.g. here and here, in this case).

The basic rationale behind this decision is that the main difference between being an “adequate” and “advanced” Python programmer is not how familiar you are with the specifics of any particular library. Instead, it is your ability to (a) write readable, reusable and testable code, (b) construct workflows that are reproducible, and (c) understand the scope, strengths and limitations of the scientific Python ecosystem. The latter is important not only for picking the right library for the right job, but also for understanding what is (and isn’t) possible in your data analysis.

Here’s an overview of what I’m thinking of teaching at the moment. It’s basically a selection of short tutorials that I’ve written for other audiences over the past few years and lessons taken directly from Software Carpentry. I’m hoping the class will be something of a two-way conversation between myself and the attendees, as I certainly don’t profess to know everything there is to know in this area. Comments are very welcome!

  1. Managing your Python environment with Conda
  2. Interacting with Python: A quick overview of the various Interactive Development Environments and the Jupyter notebook
  3. Introduction to functions and modular code
  4. Version control using Git
  5. A practical data analysis example (which covers data management, data provenance and writing Python scripts that can be executed from the command line)
  6. Debugging and profiling
  7. Testing and defensive programming
  8. Tour of the PyAOS software stack
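
To give a flavour of item 5 in the list above, here is a minimal sketch of a script that takes its inputs from the command line and builds a time-stamped record of exactly what was run (a simple form of data provenance). The argument names and the format of the record are my own assumptions for illustration, not the actual lesson content.

```python
import argparse
import datetime


def build_parser():
    """Define the command-line interface (all argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Calculate a seasonal mean.")
    parser.add_argument("infile", help="input data file")
    parser.add_argument("outfile", help="output data file")
    parser.add_argument("--season", default="annual",
                        choices=["annual", "DJF", "MAM", "JJA", "SON"],
                        help="season to average over")
    return parser


def provenance_record(argv):
    """Build a time-stamped record of the command that was run."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    return f"{timestamp}: python {' '.join(argv)}"


def main(argv):
    """Parse the arguments and report what would be done."""
    args = build_parser().parse_args(argv[1:])
    record = provenance_record(argv)
    # A real script would read args.infile, do the analysis, and write
    # args.outfile with the record stored in its metadata.
    print(f"{args.infile} -> {args.outfile} ({args.season})")
    print(record)
    return args, record
```

Calling `main(sys.argv)` from the usual `if __name__ == "__main__":` guard would turn this into a script that can be executed from the command line.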

If time permits (and it probably won’t), I’d finish with a lesson on workflow management using Make.

As the AMOS / ICSHMO conference gets closer, I’ll be collating the content above into a complete one-day course for AOS scientists hosted by Data Carpentry. This is a sibling organisation to Software Carpentry that focuses on discipline specific lessons, and I’m hoping that by hosting the materials with them others can use the lessons and collaborate with me on their development into the future.

May 8, 2017 / Damien Irving

A vision for CMIP6 in Australia

Most climate researchers would be well aware that phase 6 of the Coupled Model Intercomparison Project (CMIP6) is now underway. The experiments have been designed, the modelling groups are gearing up to run them, and data should begin to come online sometime next year (see this special issue of Geoscientific Model Development for project details). As is always the case with a new iteration of CMIP, this one is going to be bigger and better than the last. By better I mean cooler experiments and improved model documentation (via the shiny new Earth System Documentation website), and by bigger I mean more data. At around 3 Petabytes in total size, CMIP5 was already so big that it was impractical for most individual research institutions to host their own copy. In Australia, the major climate research institutions (e.g. Bureau of Meteorology, CSIRO, ARC Centre of Excellence for Climate System Science – ARCCSS) got around this problem by enlisting the help of the National Computational Infrastructure (NCI) in Canberra. A similar arrangement is currently being planned for CMIP6, so I wanted to share my views (as someone who has spent a large part of the last decade wrangling CMIP3 and CMIP5 data) on what is required to help Australian climate researchers analyse that data with a minimum of fuss.

(Note: In focusing solely on researcher-related issues, I’m obviously ignoring vitally important technical issues related to data storage and funding issues etc. Assuming all that gets sorted, this post looks at how the researcher experience might be improved.)


1. A place to analyse the data

In addition to its sheer size, it’s important to note that the CMIP6 dataset will be in flux for many years as modelling groups begin to contribute data (and then revise and re-issue erroneous data) from 2018 onwards. For both these reasons, it’s not practical for individual researchers and/or institutions to be creating their own duplicate copies of the dataset. Recognising this issue (which is not unique to the CMIP projects), NCI have built a whole computational infrastructure directly on top of their data library, so that researchers can do their data processing without having to copy/move data anywhere. This computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have their Virtual Desktop Infrastructure. These virtual desktops have more grunt than your personal laptop or desktop computer (4 CPUs, 20 GB RAM, 66 GB storage) and come with a whole bunch of data exploration tools pre-installed. Better still, they are isolated from the rest of the system in the sense that unlike when you’re working on Raijin (or any other shared supercomputer), you don’t have to submit processes that will take longer than 15 or so minutes to the queuing system. I’ve found the virtual desktops to be ideal for analysing CMIP5 data (I do all my CMIP5 data analysis on them, including large full-depth ocean data processing) and can’t see any reason why they wouldn’t be equally suitable for CMIP6.


2. A way to locate and download data

Once you’ve logged into a virtual desktop, you need to be able to (a) locate the CMIP data of interest that’s already been downloaded to the NCI data library, and (b) find out if there’s data of interest available elsewhere on the international Earth System Grid. In the case of CMIP5, Paola Petrelli (with help from the rest of the Computational Modelling Support team at the ARCCSS) has developed an excellent package called ARCCSSive that does both these things. For data located elsewhere on the grid, it also gives you the option of automatically sending a request to Paola for the data to be downloaded to the NCI data library. (They also have a great help channel on Slack if you get stuck and have questions.)

Developing and maintaining a package like ARCCSSive is no trivial task, particularly as the Earth System Grid Federation (ESGF) continually shift the goalposts by tweaking and changing the way the data is made available. In my opinion, one of the highest priority tasks for CMIP6 would be to develop and maintain an ARCCSSive-like tool that researchers can use for data lookup and download requests.


3. A way to systematically report and handle errors in the data

Before a data file is submitted to a CMIP project, it is supposed to have undergone a series of checks to ensure that the data values are reasonable (e.g. nothing crazy like a negative rainfall rate) and that the metadata meets community agreed standards. Despite these checks, data errors and metadata inconsistencies regularly slip through the cracks and many hours of research time are spent weeding out and correcting these issues. For CMIP5, there is a process (I think) for notifying the relevant modelling group (via the ESGF maybe?) of an error you’ve found, but it will be many months (if ever) before a file gets corrected and re-issued. For easy-to-fix errors, researchers will therefore often generate a fixed file (which is only available in their personal directories on the NCI system) and then move on with their analysis.
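
To make the idea of these checks concrete, here is a minimal sketch of the kind of test a researcher might run on a rainfall array before trusting it. The function name and the list-of-problems return value are hypothetical. I am also assuming the precipitation flux units of kg m-2 s-1, and this is in no way the actual CMIP quality control procedure.

```python
import numpy as np


def check_rainfall(values, units, expected_units="kg m-2 s-1"):
    """Return a list describing any problems found (empty if none)."""
    problems = []
    data = np.asarray(values, dtype=float)
    valid = data[~np.isnan(data)]  # ignore missing values

    if valid.size == 0:
        problems.append("no valid data values")
    elif (valid < 0).any():
        count = int((valid < 0).sum())
        problems.append(f"{count} negative rainfall value(s)")

    if units != expected_units:
        problems.append(f"units are '{units}', expected '{expected_units}'")

    return problems


# Made-up values: one is negative and the units are non-standard.
print(check_rainfall([0.0, 1e-5, -2e-5], "mm/day"))
```

Returning a list of problems (rather than stopping at the first one) means every issue with a file can be reported in one go.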

The obvious problem with this sequence is that the original file hasn’t been flagged as erroneous (and no details of how to fix it archived), which means the next researcher who comes along will experience the same problem all over again. The big improvement I think we can make between CMIP5 and CMIP6 is a community effort to flag erroneous files, share suggested fixes and ultimately provide temporary corrected data files until the originals are re-issued. This is something the Australian community has talked about for CMIP5, but the farthest we got was a wiki that is not widely used. (Paola has also added warning/errata functionality to the ARCCSSive package so that users can filter out bad data.)

In an ideal world, the ESGF would coordinate this effort. I’m imagining a GitHub page where CMIP6 users from around the world could flag data errors and for simple cases submit code that fixes the problem. A group of global maintainers could then review these submissions, run accepted code on problematic data files and provide a “corrected” data collection for download. As part of the ESGF, the NCI could push for the launch of such an initiative. If it turns out that the ESGF is unwilling or unable, NCI could facilitate a similar process just for Australia (i.e. community fixes for the CMIP data that’s available in the NCI data library).


4. Community maintained code for common tasks

Many Australian researchers perform the same CMIP data analysis tasks (e.g. calculate the Nino 3.4 index from sea surface temperature data or the annual mean surface temperature over Australia), which means there’s a fairly large duplication of effort across the community. To try and tackle this problem, computing support staff from the Bureau of Meteorology and CSIRO launched the CWSLab workflow tool, which was an attempt to get the climate community to share and collaboratively develop code for these common tasks. I actually took a one-month break during my PhD to work on that project and even waxed poetic about it in a previous post. I still love the idea in principle (and commend the BoM and CSIRO for making their code openly available), but upon reflection I feel like it’s a little ahead of its time. The broader climate community is still coming to grips with the idea of managing its personal code with a version control system; it’s a pretty big leap to utilising and contributing to an open source community project on GitHub, and that’s before we even get into the complexities associated with customising the VisTrails workflow management system used by the CWSLab workflow tool. I’d much prefer to see us aim to get a simple community error handling process off the ground first, and once the culture of code sharing and community contribution is established the CWSLab workflow tool could be revisited.
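
As an example of the kind of common task I mean, here is a rough NumPy sketch of a Nino 3.4 calculation: the area-weighted mean sea surface temperature anomaly over the region 5S-5N, 170W-120W. The function name and the assumed array layout (time, latitude, longitude, with longitudes running from 0 to 360) are my own choices for illustration. This is not code from the CWSLab workflow tool.

```python
import numpy as np


def nino34_index(sst_anomaly, lat, lon):
    """Area-weighted mean SST anomaly over the Nino 3.4 region.

    sst_anomaly : array with shape (time, lat, lon)
    lat : 1D array of latitudes (degrees north)
    lon : 1D array of longitudes (degrees east, 0 to 360)
    """
    sst_anomaly = np.asarray(sst_anomaly, dtype=float)
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)

    lat_mask = (lat >= -5) & (lat <= 5)
    lon_mask = (lon >= 190) & (lon <= 240)  # 170W to 120W
    box = sst_anomaly[:, lat_mask, :][:, :, lon_mask]

    # Grid cells shrink towards the poles, so weight by cos(latitude).
    weights = np.cos(np.deg2rad(lat[lat_mask]))
    zonal_mean = box.mean(axis=2)
    return np.average(zonal_mean, axis=1, weights=weights)


# Synthetic data on a one degree grid: the anomaly at time step t is t everywhere.
lat = np.arange(-89.5, 90, 1.0)
lon = np.arange(0.5, 360, 1.0)
sst_anomaly = np.arange(3.0)[:, None, None] * np.ones((3, lat.size, lon.size))
print(nino34_index(sst_anomaly, lat, lon))
```

Even a short function like this involves decisions (the longitude convention, how to weight the average) that every researcher currently makes on their own, which is exactly why sharing it makes sense.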


In summary, as we look towards CMIP6 in Australia, here’s how things look from the perspective of a scientist who’s been wrangling CMIP data for years:

  1. The NCI virtual desktops are ready to go and fit for purpose
  2. The ARCCSS software for locating and downloading CMIP5 data is fantastic. Developing and maintaining a similar tool for CMIP6 should be a high priority.
  3. The ESGF (or failing that, NCI) could lead a community-wide effort to identify and fix bogus CMIP data files
  4. A community maintained code repository for common data processing tasks (i.e. the CWSLab workflow tool) is an idea that is probably ahead of its time
April 11, 2017 / Damien Irving

Attention scientists: Frustrated with politics? Pick a party and get involved.

The March for Science is coming up on 22 April, so I’m taking a quick detour from my regular focus on research best practice. I’ve been invited to speak at the march in Hobart, Australia, so I thought I’d share what I’m going to say…

In today’s world of alternative facts and hyper-partisan public debate, there are growing calls for scientists to get involved in politics. This might take the form of speaking out on your area of expertise, participating in a non-partisan advocacy group and/or getting involved with a political party. If you think the latter sounds like the least attractive option of the three, you’re not alone. Membership of political parties has been in decline for years, to the point where many sporting clubs have more members. While this might sound like a good reason not to join a political party, I’ve found that it means your involvement can have a bigger impact than ever before.

A little over twelve months ago, I moved to Hobart to take up a postdoctoral fellowship. As part of a new start in a new town, I decided to get actively involved with the Tasmanian Greens. Fast forward a year and I’m now the Convenor of the Denison Branch of the Party. Bob Brown (the father of the environment movement in Australia) started his political career as a Member for Denison in the Tasmanian Parliament and our current representative (Cassy O’Connor MP) is the leader of the Tasmanian Greens, so it’s been a fascinating and humbling experience so far.

Upon taking the plunge into politics, the first thing that struck me was the overwhelming reliance on volunteers. The Tasmanian Greens have very few staff, which means there is an infinite number of ways for volunteers to get involved. If your motivation lies in changing party policy in your area of expertise, you can take a lead role in re-writing that policy and campaigning for the support of the membership. If you’re happy with party policy and want to help achieve outcomes, your professional skills can definitely be put to good use. My data science skills have been in particularly high demand, and I’m now busily involved in managing our database of members and supporters. Besides this practical contribution, the experience has also been great for my mental wellbeing. Rather than simply despair at the current state of politics (which most often means ranting to like-minded friends and followers on social media), I now have an outlet for actively improving the situation.

If you’re a scientist (or simply someone who cares about the importance of knowledge, evidence and objectivity in the political process) and aren’t currently involved with a political party, I’d highly recommend giving it a go. Any party would benefit from the unique knowledge and skills you bring to the table. As with most volunteer experiences, you’ll also get out a whole lot more than you put in.

There are going to be over 400 marches around the world, so check the map and get along to the one nearest you (or better still, contact the organiser and offer to speak).

February 15, 2017 / Damien Irving

The research police

You know who I’m talking about. I’m sure every research community has them. Those annoying do-gooders who constantly advocate for things to be done the right way. When you’re trying to take a shortcut, it’s their nagging voice in the back of your mind. You appreciate that what they’re saying is important, but with so much work and so little time, you don’t always want to hear it. Since I’m fond of creating lists on this blog, here’s my research police of the weather, ocean and climate sciences:



Dan Wilks is widely regarded as a statistics guru in the atmospheric sciences. He is the author of the most clearly written statistics textbook I’ve ever come across, as well as great articles such as this recent essay in BAMS, which is sure to make you feel bad if you’ve ever plotted significance stippling.


Data visualisation

Ed Hawkins’ climate spiral visualisation received worldwide media coverage in 2016 (and even featured in the opening ceremony of the Rio Olympics). He makes the list of research police due to his end the rainbow campaign, which advocates for the use of more appropriate colour scales in climate science.



David Schultz is the Chief Editor of Monthly Weather Review and has authored well over 100 research articles, but is probably best known as the “Eloquent Science guy.” His book and blog are a must read for anyone wanting to improve their academic writing, reviewing and speaking.



Unfortunately I’m going to have to self-nominate here, as I’ve been a strong advocate for publishing reproducible computational results for a number of years now (see related post and BAMS essay). To help researchers do this, I’ve taught at over 20 Software Carpentry workshops and I’m the lead author of their climate-specific lesson materials.


If I’ve missed any other research police, please let me know in comments!

January 11, 2017 / Damien Irving

Need help with reproducible research? These organisations have got you covered.

The reproducibility crisis in modern research is a multi-faceted problem. If you’re working in the life sciences, for instance, experimental design and poor statistical power are big issues. For the weather, ocean and climate sciences, the big issue is code and software availability. We don’t document the details of the code and software used to analyse and visualise our data, which means it’s impossible to interrogate our methods and reproduce our results.

(For the purposes of this post, research “software” is something that has been packaged and released for use by the wider community, whereas research “code” is something written just for personal use. For instance, I might have written some code to perform and plot an EOF analysis, which calls and executes functions from the eofs software package that is maintained by Andrew Dawson at Oxford University.)

Unbeknown to most weather, ocean and climate scientists, there are a number of groups out there that want to help you make your work more reproducible. Here’s a list of the key players and what they’re up to…


Software Sustainability Institute (SSI)

The SSI is the go-to organisation for people who write and maintain scientific software. They provide training and support, advocate for formal career paths for scientific software developers and manage the Journal of Open Research Software, where you can publish the details of your software so that people can cite your work. They focus mainly on researchers in the UK, so it’s my hope that organisations like SSI will start popping up in other countries around the world.



The OntoSoft project in the US has a bit of overlap with the SSI (e.g. they’re working on “software commons” infrastructure where people can submit their geoscientific software so that it can be searched and discovered by others), but in addition their Geoscientific Paper of the Future (GPF) initiative has been looking at the broader issue of how researchers should go about publishing the details of the digital aspects of their research (i.e. data, code, software and provenance/workflow). In a special GPF issue of Earth and Space Science, researchers from a variety of geoscience disciplines share their experiences in trying to document their digital research methods. The lead paper from that issue gives a fantastic overview of the options available to researchers. (My own work in this area gives a slightly more practical overview but in general covers many of the same ideas.)


Software Carpentry

The global network of volunteer Software Carpentry instructors run hundreds of two-day workshops around the world each year, teaching the skills needed to write reusable, testable and ultimately reproducible code (i.e. to do the things suggested by the GPF). Their teaching materials have been developed and refined for more than a decade and every instructor undergoes formal training, which means you won’t find a better learning experience anywhere. To get a workshop happening at your own institution, you simply need to submit a request at their website. They’ll then assist with finding local instructors and all the other logistics that go along with running a workshop. A sibling organisation called Data Carpentry has recently been launched, so it’s also worth checking to see if their more discipline-specific, data-centric lessons would be a better fit.


Mozilla Science Lab

Once you’ve walked out of a two-day Software Carpentry workshop, it can be hard to find ongoing support for your coding. The best form of support usually comes from an engaged and well connected local community, so the Mozilla Science Lab assists researchers in forming and maintaining in-person study groups. If there isn’t already a study group in your area, I’d highly recommend their study group handbook. It has a bunch of useful advice and resources for getting one started, plus they periodically run online orientation courses to go through the handbook content in detail.


Hopefully one or more of those organisations will be useful in your attempts to make your work more reproducible – please let me know in comments if there are other groups/resources that I’ve missed!


October 4, 2016 / Damien Irving

The weather/climate Python stack

It would be an understatement to say that Python has exploded onto the data science scene in recent years. PyCon and SciPy conferences are held somewhere in the world every few months now, at which loads of new and/or improved data science libraries are showcased to the community. When the videos from these conferences are made available online (which is almost immediately), I’m always filled with a mixture of joy and dread. The ongoing rapid development of new libraries means that data scientists are (hopefully) continually able to do more and more cool things with less and less time and effort, but at the same time it can be difficult to figure out how they all relate to one another. To assist in making sense of this constantly changing landscape, this post summarises the current state of the weather and climate Python software “stack” (i.e. the collection of libraries used for data analysis and visualisation). My focus is on libraries that are widely used and that have good (and likely long-term) support, but I’m happy to hear of others that you think I might have missed!


The weather/climate Python stack.



The dashed box in the diagram represents the core of the stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has a bunch of built-in functions for reading and writing common data formats like .csv, but if your data is stored in netCDF format then the default library for getting data into/out of those files is netCDF4.

Once you’ve read your data in, you’re probably going to want to do some statistical analysis. The NumPy library has some built-in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default.
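
Here is a small sketch of the NumPy side of that workflow: reading some .csv data and calculating a few simple statistics. The temperature values are made up, and the data is read from a string (rather than a file on disk) purely so that the example is self-contained.

```python
import io

import numpy as np

# Made-up data standing in for the contents of a .csv file.
csv_text = """year,temperature
2013,14.2
2014,14.6
2015,15.1
2016,15.3
"""

# Skip the header row and read the remaining rows into a 2D array.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
temperature = data[:, 1]

print("maximum:", temperature.max())
print("mean:", temperature.mean())
print("standard deviation:", temperature.std())
```

Note that the temperature column had to be picked out with a numeric index (`data[:, 1]`), which is one of the things the libraries discussed below improve on.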

The NumPy library doesn’t come with any plotting capability, so if you want to visualise your NumPy data arrays then the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for any simple (e.g. bar charts, contour plots, line graphs), static (e.g. .png, .eps, .pdf) plots. The cartopy library provides additional functionality for common map projections, while Bokeh allows for the creation of interactive plots where you can zoom and scroll.

While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible – they can’t do everything like the core stack can – but they can do common tasks with far less effort…


Generic additions

Let’s first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data['Fred', 'height'], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
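
In actual pandas code, a label-based lookup like this goes through the `.loc` accessor. Here is a minimal sketch using a made-up medical dataset (the names and values are invented for illustration):

```python
import pandas as pd

# A tiny made-up dataset with labelled rows and columns.
data = pd.DataFrame(
    {"height": [180, 165], "weight": [80, 60]},
    index=["Fred", "Mary"],
)

# Label-based lookup: no need to know where Fred sits in the table.
fred_height = data.loc["Fred", "height"]

# The equivalent position-based lookup requires remembering the indexes.
same_value = data.iloc[0, 0]

print(fred_height, same_value)
```

The label-based version keeps working if rows or columns are later added or reordered, whereas the position-based version silently returns something else.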

One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.) With the recent announcement of funding for the Pangeo project, the future is bright in terms of further development of xarray for the purposes of big data analysis in the geosciences.
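
As a rough sketch of what that looks like in practice, the example below builds a small three-dimensional array (time, latitude, longitude) and then selects a single value by its coordinate labels. The data values are just a sequence of numbers and the variable name is a placeholder.

```python
import numpy as np
import pandas as pd
import xarray as xr

lat = [-30.0, -20.0, -10.0]   # degrees north (negative = south)
lon = [40.0, 50.0, 60.0]      # degrees east
time = pd.date_range("2015-04-26", periods=3)

# Placeholder data: the numbers 0 to 26 arranged as (time, lat, lon).
values = np.arange(27.0).reshape(3, 3, 3)

temperature = xr.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=["time", "lat", "lon"],
    name="temperature",
)

# Select by label: 27 April 2015 at 20 South, 50 East.
point = temperature.sel(time="2015-04-27", lat=-20.0, lon=50.0)

# The same selection with NumPy requires knowing the numeric indexes.
same_point = values[1, 1, 1]

print(float(point), same_point)
```

Passing `method="nearest"` to `sel` is handy when the requested latitude or longitude doesn’t fall exactly on a grid point.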


Discipline-specific additions

While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the Met Office have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.

In terms of choosing between xarray and Iris, some people like the slightly more weather/climate-centric experience offered by Iris, while others don’t like the restrictions that places on their work and prefer the generic xarray experience (e.g. to use Iris your netCDF data files have to be CF compliant or close to it). Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.


Simplifying data exploration

While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), there’s still a fair bit of messing around involved in tweaking the various aspects of a plot (e.g. colour schemes, plot size, labels, map projections, etc). This tweaking burden is an issue across all data science fields and programming languages, so developers of the latest generation of visualisation tools are moving towards something called declarative visualisation. The basic concept is that the user simply has to describe the characteristics of their data, and then the software figures out the optimal way to visualise it (i.e. it makes all the tweaking decisions for you).

The two major Python libraries in the declarative visualisation space are HoloViews and Altair. The former (which has been around much longer) uses matplotlib or Bokeh under the hood, which means it allows for the generation of static or interactive plots. Since HoloViews doesn’t have support for geographic plots, GeoViews has been created on top of it (which incorporates cartopy and can handle Iris or xarray data arrays).


Sub-discipline-specific libraries

So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Continuum Analytics supports conda, Bokeh and HoloViews/GeoViews), large institutions (e.g. the Met Office supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of weather and climate science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries (which are able to handle data arrays from NumPy, Iris or xarray). Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy. It would be impossible to list all the sub-discipline-specific libraries in this post, but the PyAOS community is an excellent resource if you’re trying to find out what’s available in your area of research.


Installing the stack

While the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific / number crunching libraries are written (at least partly) in faster languages like C, because speed is important when data arrays get really large. Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite scientific Python libraries working together used to be problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released a package manager called conda, which is able to handle non-Python dependencies. The documentation for almost all modern scientific Python packages will suggest that you use conda for installation.


Navigating the stack

All of the additional libraries discussed in this post essentially exist to hide the complexity of the core libraries (in software engineering this is known as abstraction). Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib. GeoViews was built to hide some of the complexity of Iris, cartopy and Bokeh. So if you want to start exploring your data, start at the top right of the stack and move your way down and left as required. If GeoViews doesn’t have quite the right functions for a particular plot that you want to create, drop down a level and use some Iris and cartopy functions. If Iris doesn’t have any functions for a statistical procedure that you want to apply, go back down another level and use SciPy. By starting at the top right and working your way back, you’ll ensure that you never re-invent the wheel. Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4) for extracting the metadata contained within a netCDF file, for instance, only to find that Iris automatically keeps this information upon reading a file. In this way, a solid working knowledge of the scientific Python stack can save you a lot of time and effort.


June 16, 2016 / Damien Irving

How to write a reproducible paper

As mentioned in a previous call for volunteers, I dedicated part of my PhD to proposing a solution to the reproducibility crisis in modern computational research. In a nutshell, the crisis has arisen because most papers do not make the data and code underpinning their key findings available, which means it is impossible to replicate and verify the results. A good amount of progress has been made with respect to documenting and publishing data in recent years, so I specifically focused on software/code. I looked at many aspects of the issue including the reasons why people don’t publish their code, computational best practices and journal publishing standards, much of which is covered in an essay I published with the Bulletin of the American Meteorological Society. That essay is an interesting read if you’ve got the time (in my humble opinion!), but for this post I wanted to cut to the chase and outline how one might go about writing a reproducible paper.

On the surface, the reproducible papers I wrote as part of my PhD (i.e. as a kind of proof of concept; see here and here) look similar to any other paper. The only difference is a short computation section placed within the traditional methods section of the paper. That computation section begins with a brief, high-level summary of the major software packages that were used, with citations provided to any papers dedicated to documenting that software. Authors of scientific software are increasingly publishing overviews of their software in journals like the Journal of Open Research Software and Journal of Open Source Software, so it’s important to give them the academic credit they deserve.

Following this high-level summary, the computation section points the reader to three key supplementary items:

  1. A more detailed description of the software used
  2. A copy of any code written by the authors to produce the key results
  3. A description of the data processing steps taken in producing each key result (i.e. a step-by-step account of how the software and code were actually used)

I’ll look at each of these components in turn, considering both the bare minimum you’d need to do in order to be reproducible and the extra steps you could take to make things easier for the reader.


1. Software description

While the broad software overview provided in the computation section is a great way to give academic credit to those who write scientific software, it doesn’t provide sufficient detail to recreate the software environment used in the study. In order to provide this level of detail, the bare minimum you’d need to do is follow the advice of the Software Sustainability Institute. They suggest documenting the name, version number, release date, institution and DOI or URL of each software package, which could be included in a supplementary text file.
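For the Python packages in your environment, part of that list can be generated rather than typed by hand. The sketch below is one possible approach, using the standard library’s importlib.metadata module (Python 3.8 or later). The function name describe_packages and the example package list are hypothetical, and the release date, institution and DOI are not stored in package metadata, so they would still need to be added manually.

```python
from importlib import metadata


def describe_packages(names):
    """Return one line per package giving its name, version and home page."""
    lines = []
    for name in names:
        try:
            version = metadata.version(name)
            package_metadata = metadata.metadata(name)
        except metadata.PackageNotFoundError:
            lines.append(f"{name}: not installed")
            continue
        # Not every package records a home page in its metadata
        url = package_metadata.get("Home-page") or "URL not recorded"
        lines.append(f"{name} {version} ({url})")
    return lines


# Hypothetical package list; replace with the packages used in your study
for line in describe_packages(["numpy", "xarray", "matplotlib"]):
    print(line)
```

Redirecting the printed output to a text file gives a starting point for the supplementary software description.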

While such a list means your environment is now technically reproducible, you’ve left it up to the reader to figure out how to get all those software packages and libraries installed and playing together nicely. In some cases this is fine (e.g. it might be easy enough for a reader to install the handful of MATLAB toolboxes you used), but in other cases you might want to save the reader (and your future self) the pain of software installation by making use of a tool that can automatically install a specified software environment. The simplest of these is conda, which I discussed in detail in a previous post. It is primarily used for the management of Python packages, but can be used for other software as well. I install my complete environment with conda, which includes non-Python command line utilities like the Climate Data Operators, and then make that environment openly available on my channel at Anaconda Cloud. Beyond conda there are more complex tools like Docker and Nix, which can literally install your entire environment (down to the precise operating system) on a different machine. There’s lots of debate (e.g. here) about the potential and suitability of these tools as a solution to reproducible research, but it’s fair to say that their complexity puts them out of reach for most weather and climate scientists.


2. Code

The next supplementary item you’ll need to provide is a copy of the code you wrote to execute those software packages. For a very simple analysis that might consist of a single script for each key result (e.g. each figure), but it’s more likely to consist of a whole library/collection of code containing many interconnected scripts. The bare minimum you’d need to do to make your paper reproducible is to make an instantaneous snapshot of that library (i.e. at the time of paper submission or acceptance) available as supplementary material.

As with the software description, this bare minimum ensures your paper is reproducible, but it leaves a few problems for both you and the reader. The first is that in order to provide an instantaneous snapshot, you’d need to make sure that all your results were produced with the latest version of your code library. In many cases this isn’t practical (e.g. Figure 3 might have been generated five months ago and you don’t want to re-run the whole time consuming process), so you’ll probably want to manage your code library with a version control system like Git, Subversion or Mercurial, so you can easily access previous versions. If you’re using a version control system you might as well hook it up to an external hosting service like GitHub or Bitbucket, so you’ve got your code backed up elsewhere. If you make your GitHub or Bitbucket repository publicly accessible then readers can view the very latest version of your code (in case you’ve made any improvements since publishing the paper), as well as submit proposed updates or bug fixes via the useful interface (which includes commenting, chat and code viewing features) that those websites provide.


3. Data processing steps

A code library and software description on their own are not much use to a reader; they also need to know how that code was used in generating the results presented. The simplest way to do this is to make your scripts executable at the command line, so you can then keep a record of the series of command line entries required to produce a given result. Two of the most well known data analysis tools in the weather and climate sciences – the netCDF Operators (NCO) and Climate Data Operators (CDO) – do exactly this, storing that record in the global attributes of the output netCDF file. I’ve written a Software Carpentry lesson showing how to generate these records yourself, including keeping track of the corresponding version control revision number, so you know exactly which version of the code was executed.
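A minimal sketch of such a record is given below. It is not the code from the Software Carpentry lesson, just one way the idea could be implemented with the standard library; the function names are hypothetical and the entry format (time stamp, command, short Git hash) is an assumption loosely modelled on the history attribute written by NCO and CDO.

```python
import datetime
import subprocess
import sys


def get_git_hash():
    """Return the current Git commit hash, or 'unknown' if it cannot be found."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def new_history_entry(argv=None, git_hash=None):
    """Build a one-line record of the command that was run."""
    argv = sys.argv if argv is None else argv
    git_hash = get_git_hash() if git_hash is None else git_hash
    time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    command = " ".join([sys.executable] + list(argv))
    return f"{time_stamp}: {command} (Git hash: {git_hash[:7]})"


def update_history(existing_history, entry):
    """Put the new entry first, so the most recent command is listed at the top."""
    if existing_history:
        return entry + "\n" + existing_history
    return entry


# Example: record the command used to run the current script
print(update_history("", new_history_entry()))
```

The resulting string can be written to the history global attribute of an output netCDF file, or to a log file saved alongside a figure.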

As before, while these bare minimum log files ensure that your workflow is reproducible, they are not particularly comprehensible. Manually recreating workflows from these log files would be a tedious and time consuming process, even for just moderately complex analyses. To make things a little easier for the reader (and your future self), it’s a good idea to include a README file in your code library explaining the sequence of commands required to produce common/key results. You might also provide a Makefile that automatically builds and executes common workflows (Software Carpentry have a nice lesson on that too). Beyond that the options get more complex, with workflow management packages like VisTrails providing a graphical interface that allows users to drag and drop the various components of their workflow.


Where to put these supplementary materials?

Following the steps above, you’ll be left with a text file (or environment.yml file exported from a conda environment) describing your software environment, a copy of your version controlled code library and various log files, README files and/or make files that describe your data processing steps. While you will have hopefully made some of these items openly available via Anaconda Cloud and GitHub, those sites don’t guarantee persistent long term storage (e.g. if you changed the name of your GitHub repo any URLs in your paper would be broken). You therefore need to host your supplementary files with the journal, your home institution (e.g. the Max Planck Institute for Meteorology provides a persistent digital storage facility for staff and students) or a site like Figshare or Zenodo. The latter sites have been specifically set up for archiving the “long tail” of research papers (e.g. supplementary figures, tables, code and data) and issue a DOI to those materials, meaning they guarantee a certain level of security, longevity and access.



In order to ensure that your research is reproducible, you need to add a short computation section to your papers. That section should cite the major software packages used in your work, before linking to three key supplementary items: (1) a description of your software environment, (2) a copy of your code library and (3) details of the data processing steps taken in producing each key result. The bare minimum you’d need to do for these supplementary items is summarised in the table below, along with extension options that will make life easier for both the reader and your future self.

If you can think of other extension options to include in this summary, please let me know in the comments below!


| | Minimum | Extension |
|---|---|---|
| Software description | Document the name, version number, release date, institution and DOI or URL of each software package | Provide a conda environment.yml file; use Docker / Nix |
| Code library | Provide a copy of your code library | Version control that library and host it in a publicly accessible code repository on GitHub or Bitbucket |
| Processing steps | Provide a separate log file for each key result | Include a README file and possibly Makefile in code library; provide output (e.g. a flowchart) from a workflow management system like VisTrails |

April 13, 2016 / Damien Irving

Keeping up with Continuum

I’m going to spend the next few hundred characters gushing over a for-profit company called Continuum Analytics. I know that seems a little weird for a blog that devotes much of its content to open science, but stick with me. It turns out that if you want to keep up with the latest developments in data science, then you need to be on top of what this company is doing.

If you’ve heard the name Continuum Analytics before, it’s probably in relation to a widely used Python distribution called Anaconda. In a nutshell, Travis Oliphant (who was the primary creator of NumPy) and his team at Continuum developed Anaconda, gave it away for free to the world, and then built a thriving business around it. Continuum makes its money by providing training, consultation and support to paying customers who use Anaconda (and who are engaged in data science/analytics more generally), in much the same way that Red Hat provides support to customers using Linux.

The great thing about companies like Red Hat and Continuum is that because their business fundamentally depends on open source software, they contribute a great deal back to the open source community. If you’ve ever been to a SciPy conference (something I would highly recommend), you would have noticed that there are always a few presentations from Continuum staff, whose primary job appears to be to simply work on the coolest open source projects going around. What’s more, the company seems to have a knack for supporting projects that make life much, much easier for regular data scientists (i.e. people who know how to analyse data in Python, but for whom things like system administration and web programming are out of reach). For instance, the projects they support (see the full list here) can help you install software without having to know anything about system admin (conda), create interactive web visualisations without knowing Javascript (bokeh), process data arrays larger than the available RAM without knowing anything about multi-core parallel processing (dask) and even speed up your code without having to resort to a low level language (numba).

Of these examples, the most important achievement (in my opinion) is the conda package manager, which I’ve talked about previously. Once you’ve installed either Anaconda (which comes with 75 of the most popular Python data science libraries already installed) or Miniconda (which essentially just comes with conda and nothing else), you can then use conda to install pretty much any library you’d like with one simple command line entry. That’s right. If you want pandas, just type conda install pandas and it will be there, along with its dependencies, playing nicely with all your other libraries. If you decide you’d like to access pandas from the jupyter notebook, just type conda install jupyter and you’re done. There are about 330 libraries available directly like this and because they are maintained by the Continuum team, they are guaranteed to work.

While this is all really nice, other Python distributions like Canopy also come with a package manager for installing widely used libraries. What sets conda apart is the ease with which the wider community can contribute. If you’ve written a library that you’d like people to be able to install easily, you can write an associated installation package and post it at Anaconda Cloud. For instance, Andrew Dawson (a climate scientist at Oxford) has written eofs, a Python library for doing EOF analysis. Rather than have users of his software mess around installing the dependencies for eofs, he has posted a conda package for eofs at his channel on Anaconda Cloud. Just type conda install -c eofs and you’re done; it will install eofs and all its dependencies for you. Some users (e.g. the US Integrated Ocean Observing System) even go a step further and post packages for a wide variety of Python libraries that are relevant to the work they do. This vast archive of community contributed conda packages means there isn’t a single library I use in my daily work that isn’t available via either conda install or Anaconda Cloud. In fact, a problem I often face is that there is more than one installation package for a particular library (i.e. which one do I use? And if I get an error, where should I ask for assistance?). To solve this problem, conda-forge has recently been launched. The idea is that it will house the lone instance of every community contributed package, in order to (a) avoid duplication of effort, and (b) make it clear where questions (and suggested updates / bug fixes) should be directed.

The final mind-blowing feature of conda is the ease with which you can manage different environments. Rather than lump all your Python libraries in together, it can be nice to have a clean and completely separate environment for each discrete aspect of the work you do (e.g. I have separate environments for my ocean data analysis, atmosphere data analysis and for testing new libraries). This will sound familiar to anyone who has used virtualenv, but again the value of conda environments is the ease with which the community can share. As an example, I’ve shared the details of my ocean data analysis environment (right down to the precise version of every single Python library). I started by exporting the details of the environment by typing conda env export -n ocean-environment -f blog-example, before posting it to my channel at Anaconda Cloud (conda env upload -f blog-example). Anyone can now come along and recreate that environment on their own computer by typing conda env create damienirving/blog-example (and then source activate blog-example to get it running). This is obviously huge for the reproducibility of my work, so for my next paper I’ll be posting a corresponding conda environment to Anaconda Cloud.

If you want to know more about Continuum, I highly recommend this Talk Python To Me podcast with Travis Oliphant.