The weather/climate Python stack
It would be an understatement to say that Python has exploded onto the data science scene in recent years. PyCon and SciPy conferences are held somewhere in the world every few months now, at which loads of new and/or improved data science libraries are showcased to the community. When the videos from these conferences are made available online (which is almost immediately at pyvideo.org), I’m always filled with a mixture of joy and dread. The ongoing rapid development of new libraries means that data scientists are (hopefully) continually able to do more and more cool things with less and less time and effort, but at the same time it can be difficult to figure out how they all relate to one another. To assist in making sense of this constantly changing landscape, this post summarises the current state of the weather and climate Python software “stack” (i.e. the collection of libraries used for data analysis and visualisation). My focus is on libraries that are widely used and that have good (and likely long-term) support, but I’m happy to hear of others that you think I might have missed!
The dashed box in the diagram represents the core of the stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has a bunch of built-in functions for reading and writing common data formats like .csv, but if your data is stored in netCDF format then the default library for getting data into/out of those files is netCDF4.
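To make this concrete, here’s a minimal sketch of reading array data with NumPy. A real workflow would point np.loadtxt at a .csv file on disk; here io.StringIO stands in for one so the example is self-contained.

```python
import io
import numpy as np

# io.StringIO stands in for a .csv file on disk
csv_data = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
data = np.loadtxt(csv_data, delimiter=",")

print(data.shape)   # (2, 3)
print(data.mean())  # 3.5

# For netCDF files, netCDF4.Dataset("myfile.nc") plays the analogous role.
```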
Once you’ve read your data in, you’re probably going to want to do some statistical analysis. The NumPy library has some built-in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default.
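A small sketch of that division of labour, with toy values: NumPy for the simple statistics, SciPy for the interpolation.

```python
import numpy as np
from scipy import interpolate

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2  # values sampled at the x points

# Simple statistics come straight from NumPy
print(y.mean(), y.std())  # 3.5 3.5

# More complex analysis (here, linear interpolation) comes from SciPy
f = interpolate.interp1d(x, y)
print(f(1.5))  # halfway between y=1.0 and y=4.0, so 2.5
```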
The NumPy library doesn’t come with any plotting capability, so if you want to visualise your NumPy data arrays then the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for any simple (e.g. bar charts, contour plots, line graphs), static (e.g. .png, .eps, .pdf) plots. The cartopy library provides additional functionality for common map projections, while Bokeh allows for the creation of interactive plots where you can zoom and scroll.
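A minimal matplotlib sketch of the static-plot workflow described above, using the Agg backend so it runs without a display; the filename is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine.png")  # .eps or .pdf work the same way
```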
While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their highly flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These additional libraries aren’t as flexible – they can’t do everything like the core stack can – but they can do common tasks with far less effort…
Let’s first consider the generic additional libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means Fred’s height could be obtained from a medical dataset by asking for data.loc['Fred', 'height'], rather than having to remember the numeric index corresponding to that person and characteristic. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
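Here’s that labelled-indexing idea as a runnable sketch; the names and measurements are invented for illustration.

```python
import pandas as pd

# A toy "medical dataset" with labelled rows and columns
medical = pd.DataFrame(
    {"height": [180, 165], "weight": [80, 60]},
    index=["Fred", "Wilma"],
)

# Select by label rather than by numeric position
print(medical.loc["Fred", "height"])  # 180
```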
One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to x-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. (As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
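The same idea in xarray, sketched with an invented toy array: elements are selected by their actual coordinate values rather than numeric indices.

```python
import numpy as np
import xarray as xr

# A toy 2 x 3 temperature field with labelled latitude/longitude coordinates
temperature = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=["lat", "lon"],
    coords={"lat": [-20, -10], "lon": [40, 50, 60]},
    name="temperature",
)

# Select by coordinate value (20 South, 50 East), not numeric index
value = temperature.sel(lat=-20, lon=50)
print(float(value))  # 1.0

# xarray also wraps netCDF I/O, e.g. temperature.to_netcdf("temperature.nc")
```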
While the xarray library is a good option for those working in the weather and climate sciences (especially those dealing with large multi-dimensional arrays from model simulations), the team of software developers at the MetOffice have taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added lots of useful functions specific to weather/climate science.
In terms of choosing between xarray and Iris, some people like the slightly more weather/climate-centric experience offered by Iris, while others don’t like the restrictions that places on their work and prefer the generic xarray experience (e.g. to use Iris your netCDF data files have to be CF compliant or close to it). Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.
Simplifying data exploration
While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), making minor tweaks to a plot or iterating over multiple time steps is still rather cumbersome. In an attempt to overcome this issue, a library called HoloViews was recently released. By using matplotlib and Bokeh under the hood, it allows for the generation of static or interactive plots where tweaking and iterating are super easy (especially in the Jupyter Notebook, which is where more and more people are doing their data exploration these days). Since HoloViews doesn’t have support for geographic plots, GeoViews has been created on top of it (which incorporates cartopy and can handle Iris or xarray data arrays).
So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Continuum Analytics supports conda, Bokeh and HoloViews/GeoViews), large institutions (e.g. the MetOffice supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of weather and climate science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries (which are able to handle data arrays from NumPy, Iris or xarray). Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy. It would be impossible to list all the sub-discipline-specific libraries in this post, but the PyAOS community is an excellent resource if you’re trying to find out what’s available in your area of research.
Installing the stack
While the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific/number-crunching libraries are written (at least partly) in faster languages like C, because speed is important when data arrays get really large. Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite scientific Python libraries working together used to be problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released a package manager called conda, which is able to handle non-Python dependencies. The documentation for almost all modern scientific Python packages will suggest that you use conda for installation.
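A typical conda workflow looks something like the following; the environment name is invented, and the exact channel and package list will depend on what you need.

```shell
# Create an isolated environment and install some of the stack into it
conda create --name myenv python=3
conda activate myenv
conda install -c conda-forge numpy scipy matplotlib xarray netcdf4 cartopy
```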
Navigating the stack
All of the additional libraries discussed in this post essentially exist to hide the complexity of the core libraries (in software engineering this is known as abstraction). Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib. GeoViews was built to hide some of the complexity of Iris, cartopy and Bokeh. So if you want to start exploring your data, start at the top right of the stack and move your way down and left as required. If GeoViews doesn’t have quite the right functions for a particular plot that you want to create, drop down a level and use some Iris and cartopy functions. If Iris doesn’t have any functions for a statistical procedure that you want to apply, go back down another level and use SciPy. By starting at the top right and working your way back, you’ll ensure that you never re-invent the wheel. Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4) for extracting the metadata contained within a netCDF file, for instance, only to find that Iris automatically keeps this information upon reading a file. In this way, a solid working knowledge of the scientific Python stack can save you a lot of time and effort.