October 4, 2016 / Damien Irving

The weather/climate Python stack

It would be an understatement to say that Python has exploded onto the data science scene in recent years. PyCon and SciPy conferences are held somewhere in the world every few months now, at which loads of new and/or improved data science libraries are showcased to the community. When the videos from these conferences are made available online (which is almost immediately at pyvideo.org), I’m always filled with a mixture of joy and dread. The ongoing rapid development of new libraries means that data scientists are (hopefully) continually able to do more and more cool things with less and less time and effort, but at the same time it can be difficult to figure out how they all relate to one another. To assist in making sense of this constantly changing landscape, this post summarises the current state of the weather and climate Python software “stack” (i.e. the collection of libraries used for data analysis and visualisation). My focus is on libraries that are widely used and that have good (and likely long-term) support, but I’m happy to hear of others that you think I might have missed!


The weather/climate Python stack.

 

Core

The dashed box in the diagram represents the core of the stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has a bunch of built-in functions for reading and writing common data formats like .csv, but if your data is stored in netCDF format then the default library for getting data into/out of those files is netCDF4.
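To give a feel for what that looks like in practice, here’s a minimal sketch of reading a variable from a netCDF file into a NumPy array (the file and variable names are made up for the example):

    import netCDF4

    # open a (hypothetical) netCDF file and pull one variable out
    dataset = netCDF4.Dataset('tas_monthly.nc')
    tas = dataset.variables['tas'][:]   # [:] reads the data as a (masked) NumPy array
    print(type(tas), tas.shape)
    dataset.close()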

Once you’ve read your data in, you’re probably going to want to do some statistical analysis. The NumPy library has some built-in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default.
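As a rough illustration of that division of labour, here’s a small sketch (the data are just random numbers): the simple statistics come straight from NumPy, while the interpolation comes from SciPy:

    import numpy as np
    from scipy import interpolate

    data = np.random.rand(100)
    print(data.mean(), data.std(), data.max())   # simple statistics: built into NumPy

    # more complex analysis, e.g. cubic interpolation, comes from SciPy
    x = np.linspace(0, 10, 11)
    y = np.sin(x)
    f = interpolate.interp1d(x, y, kind='cubic')
    print(f(2.5))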

The NumPy library doesn’t come with any plotting capability, so if you want to visualise your NumPy data arrays then the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for simple, static plots (e.g. bar charts, contour plots, line graphs) saved to common formats (.png, .eps, .pdf). The cartopy library provides additional functionality for common map projections, while Bokeh allows for the creation of interactive plots where you can zoom and scroll.
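For instance, a bare-bones global map might look something like the following sketch (the data are random numbers, purely for illustration): matplotlib does the plotting and cartopy supplies the map projection:

    import numpy as np
    import matplotlib.pyplot as plt
    import cartopy.crs as ccrs

    lons = np.arange(0, 360, 10)
    lats = np.arange(-90, 91, 10)
    data = np.random.rand(len(lats), len(lons))   # fake data for illustration

    ax = plt.axes(projection=ccrs.PlateCarree())  # cartopy provides the projection
    ax.coastlines()
    ax.contourf(lons, lats, data, transform=ccrs.PlateCarree())
    plt.savefig('example_map.png')                # static output (.png, .eps, .pdf, ...)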

While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible – they can’t do everything like the core stack can – but they can do common tasks with far less effort…

 

Generic additions

Let’s first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data.loc['Fred', 'height'], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
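Here’s a minimal sketch of that labelled-array idea, using a made-up medical dataset:

    import pandas as pd

    data = pd.DataFrame({'height': [178, 165], 'weight': [75, 62]},
                        index=['Fred', 'Mary'])   # made-up data

    print(data.loc['Fred', 'height'])   # select by label, not numeric index
    print(data.describe())              # common summary statistics in one line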

One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to n-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
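A short sketch of what that buys you (the file name and coordinate names are examples only; they depend on how your netCDF file is structured):

    import xarray as xr

    dset = xr.open_dataset('ua_monthly.nc')   # netCDF input handled via netCDF4
    ua = dset['ua']

    # refer to array elements by their actual coordinate values
    point = ua.sel(lat=-20, lon=50, lev=500, method='nearest')

    # reduce over a dimension by name rather than by axis number
    ua_time_mean = ua.mean(dim='time')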

 

Discipline-specific additions

While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the Met Office have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.
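As a sketch of the Iris experience (the file name and variable are invented, and a CF-compliant file is assumed), an area-weighted global mean might look something like this:

    import iris
    import iris.analysis.cartography

    cube = iris.load_cube('tas_monthly.nc', 'air_temperature')   # example names only

    # weather/climate-aware operations, e.g. an area-weighted global mean
    cube.coord('latitude').guess_bounds()
    cube.coord('longitude').guess_bounds()
    weights = iris.analysis.cartography.area_weights(cube)
    global_mean = cube.collapsed(['latitude', 'longitude'],
                                 iris.analysis.MEAN, weights=weights)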

In terms of choosing between xarray and Iris, some people like the slightly more weather/climate-centric experience offered by Iris, while others don’t like the restrictions that approach places on their work and prefer the generic xarray experience (e.g. to use Iris your netCDF data files have to be CF compliant or close to it). Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.

 

Simplifying data exploration

While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), there’s still a fair bit of messing around involved in tweaking the various aspects of a plot (colour schemes, plot size, labels, map projections, etc.). This tweaking burden is an issue across all data science fields and programming languages, so developers of the latest generation of visualisation tools are moving towards something called declarative visualisation. The basic concept is that the user simply has to describe the characteristics of their data, and then the software figures out the optimal way to visualise it (i.e. it makes all the tweaking decisions for you).

The two major Python libraries in the declarative visualisation space are HoloViews and Altair. The former (which has been around much longer) uses matplotlib or Bokeh under the hood, which means it allows for the generation of static or interactive plots. Since HoloViews doesn’t have support for geographic plots, GeoViews has been created on top of it (which incorporates cartopy and can handle Iris or xarray data arrays).
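To give a feel for the declarative style, here’s a tiny HoloViews sketch (the dimension names are arbitrary): you declare what the data is, and HoloViews chooses sensible defaults for how to draw it, using Bokeh or matplotlib underneath:

    import numpy as np
    import holoviews as hv
    hv.extension('bokeh')   # or 'matplotlib' for static output

    xs = np.linspace(0, 10, 100)
    curve = hv.Curve((xs, np.sin(xs)), kdims=['time'], vdims=['amplitude'])
    curve   # in a Jupyter notebook this renders as an interactive plot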

 

Sub-discipline-specific libraries

So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Continuum Analytics supports conda, Bokeh and HoloViews/GeoViews), large institutions (e.g. the Met Office supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of weather and climate science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries (which are able to handle data arrays from NumPy, Iris or xarray). Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy. It would be impossible to list all the sub-discipline-specific libraries in this post, but the PyAOS community is an excellent resource if you’re trying to find out what’s available in your area of research.
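As one example of what these libraries look like in use, here’s a sketch of an EOF analysis with eofs (the input is just a random array standing in for a time/lat/lon anomaly field; the real thing would come from NumPy, Iris or xarray):

    import numpy as np
    from eofs.standard import Eof   # eofs also has Iris and xarray interfaces, as noted above

    anomalies = np.random.rand(120, 20, 30)   # fake (time, lat, lon) anomaly field

    solver = Eof(anomalies)
    leading_eof = solver.eofs(neofs=1)   # leading spatial pattern
    leading_pc = solver.pcs(npcs=1)      # corresponding time series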

 

Installing the stack

While the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific / number crunching libraries are written (at least partly) in faster languages like C, because speed is important when data arrays get really large. Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite scientific Python libraries working together used to be problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released a package manager called conda, which is able to handle non-Python dependencies. The documentation for almost all modern scientific Python packages will suggest that you use conda for installation.
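For instance, once conda itself is installed, something like conda install -c conda-forge numpy scipy matplotlib netcdf4 xarray iris cartopy will typically pull in a consistent set of the libraries discussed in this post, non-Python dependencies included (the conda-forge channel hosts community-maintained builds of most of them).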

 

Navigating the stack

All of the additional libraries discussed in this post essentially exist to hide the complexity of the core libraries (in software engineering this is known as abstraction). Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib. GeoViews was built to hide some of the complexity of Iris, cartopy and Bokeh. So if you want to start exploring your data, start at the top right of the stack and move your way down and left as required. If GeoViews doesn’t have quite the right functions for a particular plot that you want to create, drop down a level and use some Iris and cartopy functions. If Iris doesn’t have any functions for a statistical procedure that you want to apply, go back down another level and use SciPy. By starting at the top right and working your way back, you’ll ensure that you never re-invent the wheel. Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4) for extracting the metadata contained within a netCDF file, for instance, only to find that Iris automatically keeps this information upon reading a file. In this way, a solid working knowledge of the scientific Python stack can save you a lot of time and effort.

 


8 Comments

  1. Johan Swanljung / Oct 4 2016 17:57

    Great post. Just nitpicking, really, but cartopy doesn’t build on basemap; it is a complete replacement, so really nothing to the right in your stack picture depends on it. Since I started using cartopy I haven’t touched basemap. Iris includes fantastic cartopy integration (not surprising since they are, as you say, developed in parallel) but cartopy is a big improvement on basemap even without Iris. I would also say that conda and pip exist in parallel, not on different levels of the stack. Conda is a separate system (although they work well together) and handles both python and non-python dependencies. I don’t find myself using pip very often anymore, but it is nice that it works in the rare cases where a package is not yet available through conda.

    • Damien Irving / Oct 5 2016 08:56

      Hi, Johan. Thank you for this comment – I had just assumed Cartopy was built on basemap! So really my core should read: pip, netCDF4, numpy, scipy, matplotlib, cartopy, bokeh. There’s no reason any weather/climate scientist would use basemap, as it doesn’t offer any functionality that cartopy doesn’t have. Is that correct?

      • Johan Swanljung / Oct 5 2016 17:35

        I can’t swear that cartopy covers every use case that basemap did, but I haven’t run into any limitations. Cartopy is much simpler to use, is actively developed, and doesn’t suffer from the infamous dateline problem (http://stackoverflow.com/questions/17423692/plotting-at-boundaries-using-matplotlib-basemap). Iris integration is great, but xarray and geoviews also base their plotting on cartopy. So yeah, no reason to use basemap unless your old code requires it.

        Have you been using geoviews? I really like the idea and it looks like it will be very useful for things I’m doing, but it’s a very young project and had some dealbreaking bugs when I first played with it.

  2. Johan Swanljung / Oct 4 2016 18:05

    I’ve found your posts on workflow automation and reproducibility really inspiring, by the way. Are you aware of the Lancet package? https://ioam.github.io/lancet/ It’s a workflow/reproducibility tool developed by some of the same people that are doing holoviews/geoviews.

  3. Yan Li / Feb 25 2017 08:54

    This article is great and extremely helpful! I learned a lot of cool new stuff from it since I last read your toolbox. Thanks! Please keep updating it.

  4. Yan Li / Apr 21 2017 01:30

    What do you think about python 2.x and 3.x? Do you think we should use 3.x?

    • Damien Irving / Apr 21 2017 09:52

      All those libraries listed in my stack are Python 3 compatible, so I’d definitely suggest that you should be using Python 3. If you need to convert some of your existing code from Python 2 to Python 3, the 2to3 program that comes installed with Python 3 by default is super useful:
      https://docs.python.org/3.0/library/2to3.html

Trackbacks

  1. A PyAOS masterclass | Dr Climate
