May 15, 2014 / Damien Irving

A vision for data analysis in the weather and climate sciences

I’ve been a Software Carpentry instructor for a little over a year now, which is to say that I put aside a couple of days every now and then to teach basic software skills to scientists. The two-day “bootcamps” that we run are the brainchild of Greg Wilson, who has been teaching programming to scientists for over 15 years. I wasn’t there to see it in person, but a couple of weeks ago Greg gave a great talk at PyCon about the lessons he’s learned along the way. As you might imagine, his recorded talk and accompanying paper are chock-full of wonderful insights into how we might improve computational standards in the science community. In articulating my vision for data analysis in the weather and climate sciences, I wanted to focus on what Greg listed as his number one lesson:

“Most scientists think of programming as a tax they have to pay in order to do their science.”

In other words, most scientists could have majored in computer science if they wanted to, but instead pursued their passion in ecology, biology, genetics or meteorology. Among many other things, this means that scientists (a) don’t know anything about system administration, and (b) have no desire and/or time to learn. If it isn’t easy for them to get open source software installed and running on their machine, they’re not going to spend hours trawling terse developer discussion lists online. They’re either going to switch to a proprietary package like Matlab that is easy to install or, worse still, give up on or modify the analysis they were planning to do. In short, if we want to foster a culture where scientists regularly participate in open and collaborative software development (which I’m sure is the evil master plan of the Mozilla Science Lab, which funds Software Carpentry), then as a first step we must solve the software installation problem.

I’m certainly not the first person to make this observation, and for general data analysis a company called Continuum Analytics has already solved the problem. Their (free) Anaconda product bundles together pretty much all the Python packages you could ever need for general data management, analysis, and visualisation. You literally just download the executable that matches your operating system and then, hey presto, everything is installed and working on your machine. Anaconda also comes with pip, which is the Python package manager that most people use to simply and easily install additional packages.

That’s all well and good for general data analysis, but in fields like the weather and climate sciences some of our work is highly specialised. It would be incredibly inefficient if every climate scientist had to take the general Python numerical library (numpy) and build upon it to write their own modules for calculating climatologies or seasonal anomalies, so there are community-developed packages like UV-CDAT that do just that (the Climate Data Analysis Tools or CDAT part in particular includes a library called cdutil for calculating seasonal anomalies; see this post for an overview). In my own work there are also a couple of community-developed packages called windspharm and eofs that I use to calculate wind-related quantities (e.g. streamfunction, velocity potential) and to perform empirical orthogonal function analyses. These have been assimilated into the UV-CDAT framework, which means that when I run an interactive IPython session or execute a Python script, I’m able to import windspharm and eofs as well as the cdutil library. This makes my life much easier, because these packages are not available via pip and I’d have no idea how to install CDAT, windspharm and eofs myself such that they were all available from the same IPython session.
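
To give a feel for what this buys me, here’s a minimal sketch of the kind of script I can run from a single UV-CDAT Python session (the file and variable names are hypothetical, and I’m assuming the standard cdutil, windspharm and eofs interfaces):

```python
import cdms2
import cdutil
from windspharm.cdms import VectorWind
from eofs.cdms import Eof

# Open a (hypothetical) netCDF file of monthly winds with cdms2,
# the data access library that ships with CDAT
fin = cdms2.open('monthly_winds.nc')
u = fin('ua')  # zonal wind
v = fin('va')  # meridional wind

# cdutil needs time bounds in order to calculate seasonal statistics
cdutil.setTimeBoundsMonthly(u)

# The DJF climatology of the zonal wind is then a one-liner
u_djf = cdutil.DJF.climatology(u)

# windspharm accepts cdms2 variables directly, so the streamfunction
# and velocity potential are just as easy...
w = VectorWind(u, v)
sf, vp = w.sfvp()

# ...as is an empirical orthogonal function analysis with eofs
solver = Eof(sf)
eof1 = solver.eofs(neofs=1)
```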

At this point you can probably see where I’m going with this… UV-CDAT has the potential to become the Anaconda of the weather and climate sciences. The reason I preface this statement with ‘potential’ is that it’s not quite there yet. For example, the UK Met Office recently developed a package called Iris, which has built on the general Python plotting library (matplotlib) to make it much easier to create common weather and climate science plots like Hovmoller diagrams and geographic maps. Since it hasn’t been assimilated into the UV-CDAT framework, I have it installed separately on my machine. This means I cannot write a script that calculates the seasonal climatology using cdutil and then plots the output using Iris. I create all sorts of wacky workarounds to cater for installation issues like this, but these ultimately cost me time and make it very difficult for me to make my code openly available with any publications I produce (i.e. I guess you can make a data processing workflow openly available when it depends on multiple separate Python installations, but it’s hardly user-friendly!).
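
To illustrate what I’m missing out on, here’s the sort of thing Iris makes easy (again just a sketch, with a hypothetical file name):

```python
import iris
import iris.quickplot as qplt
import matplotlib.pyplot as plt

# Load a (hypothetical) netCDF file of surface temperature as an Iris cube
cube = iris.load_cube('surface_temperature.nc')

# quickplot wraps matplotlib so that a labelled, georeferenced map
# takes only a couple of lines
qplt.contourf(cube[0])  # first time step
plt.gca().coastlines()
plt.show()
```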

The UV-CDAT development team is certainly not to blame in this situation. On the contrary, they should be applauded for producing a package that has the potential to solve the installation issue in the weather and climate sciences. I also know that they are working to include more user-contributed packages (e.g. aoslib), and it would be reasonable to expect that they would wait to see if Iris becomes popular before going to the effort of assimilating it. In my mind, the key to UV-CDAT realising its potential is probably related to the fact that up until now it’s been much easier for them to get funding to develop the Ultrascale Visualisation (UV) part of the project than the scripting environment. The UV part is a flashy graphical user interface for 3D data visualisation and has certainly made a substantial contribution to the processing of large datasets in the weather and climate sciences; however, it’s only one piece of the puzzle. My vision (or hope) is that funding bodies begin to recognise that (a) the software installation issue is one of the major roadblocks to progress in the weather and climate sciences, and (b) UV-CDAT, much like Anaconda in the general data analysis sphere, is the vehicle that the community needs to rally around in order to overcome it. With ongoing financial support for the assimilation of new packages (including Iris in the first instance) and for updating/improving the documentation and user support associated with the scripting environment, UV-CDAT could play a vitally important role in enabling a culture of regular and widespread community-scale software development in the weather and climate sciences.


15 Comments

  1. Scott Wales / May 15 2014 16:58

    It would be interesting to see tools like Docker and Vagrant get uptake in the scientific computing world – they can create self-contained virtual machines with specific packages installed, in effect creating a reproducible development environment that you can then share with others simply by passing around a configuration file.

    I think it would be more beneficial for the community to work towards making these packages easier to install through pip, rather than integrating them with UV-CDAT while they remain unavailable to Anaconda users. I remember Iris being a particular pain to install with all of its dependencies (hopefully that’s improved; I haven’t looked at it in a while).

    • Damien Irving / May 16 2014 09:01

      Hey Scott. I like your Docker and Vagrant idea, so I’ve posted it on the Mozilla Science Lab forum to see what kind of uptake it’s had (if any).

      I also like your suggestion of working towards having all these packages available via pip. I’m assuming, however, that if someone manually typed `pip install` for each of the 125+ libraries included in Anaconda, they wouldn’t all play nicely together (i.e. the value of Anaconda is that they’ve sorted out those issues). That is probably also true of all the packages available in UV-CDAT, which is why I’m such an advocate for an Anaconda for the weather and climate sciences. I could be wrong about this though (i.e. maybe all the packages would play nicely together when installed with pip), so I’ve asked Stack Overflow about it.

  2. Stephan Hoyer / May 15 2014 17:29

    Hi Damien,

    Nice blog post! I totally agree with your vision for data analysis in the weather/climate sciences. In particular, thanks for mentioning my package (xray).

    I’m confused though about one of your main pain points — needing separate Python installations for CDAT, Iris and xray. Iris and xray, at least, are designed to function as Python libraries, so you can install them alongside any other Python packages you need on whatever base Python install you like (e.g., Anaconda, Enthought’s Canopy, Python from python.org, the python install that comes bundled with your OS or whatever).

    From the way you describe it, it sounds like CDAT provides its own installation of Python. Is it really not possible to either install CDAT packages on top of a different Python distribution or, alternatively, to install additional packages on top of CDAT’s Python? That would really surprise me.

    At the very least, pandas (and xray) are on the Python Package Index (PyPI), so you should be able to install them with easy_install or pip (preferred). (Iris has some binary dependencies which make it more complicated to get set up.) CDAT would be a very poor Python distribution if it didn’t come with at least one of those tools. I’m guessing this is possible but just not very obvious or well documented.

    If you already have an Anaconda installation with Pandas, you should be able to get xray installed on top of that by just typing “pip install xray”.
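
    To illustrate, here’s a minimal sketch of what that gives you (the file and variable names are hypothetical):

    ```python
    import xray

    # Open a (hypothetical) netCDF file as a Dataset of labelled arrays
    ds = xray.open_dataset('example.nc')

    # Label-based operations, e.g. a monthly climatology, read almost
    # like prose
    clim = ds['tas'].groupby('time.month').mean('time')
    ```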

    Cheers,
    Stephan

    • Stephan Hoyer / May 15 2014 17:37

      To followup on your main point, I don’t think incorporating more packages like Iris or Pandas directly into CDAT is the right approach. Instead, CDAT should more clearly expose its extensibility, so users can more easily install packages themselves. I would even suggest that CDAT should consider breaking itself up so that it’s easier to get particular submodules of interest installed.

      Lots of smaller, modular packages is generally considered a happier way forward from a software engineering perspective.

      • Damien Irving / May 16 2014 09:16

        There used to be a cut-down version of CDAT known as CDAT-lite; however, it’s a very old package and I’m not sure that it has kept up with the latest edits and bug fixes to the CDAT code base at https://github.com/UV-CDAT/uvcdat

    • Damien Irving / May 16 2014 09:14

      Hey Stephan. Thanks for your comments – much appreciated. I’d had issues in the past when using `/usr/local/uvcdat/1.4.0/bin/pip install pandas`, but with one of the more recent versions of UV-CDAT it seems to have worked (`/usr/local/uvcdat/1.4.0/bin/pip install xray` worked too, so I’ll definitely be having a play with that). I’ve edited the text of the post accordingly (which means I removed the reference to xray, but I’m thinking of writing a separate post about that anyway) and also started a Stack Overflow question to see whether packages installed with pip always play nicely together. Unfortunately, Iris is not available via pip and neither is CDAT…

  3. Kevin Havener / May 15 2014 22:50

    UV-CDAT has a long way to go as far as installability is concerned. I did finally get 1.5.1 to install and work under Ubuntu 14.04. Still working on Fedora (Rawhide, in my case). Their install help page is marginally useful, listing possible dependency problems with packages that are unidentifiable between Linux distros: they use Ubuntu/Debian .deb package names that have no apparent equivalents in the Red Hat ecosystem. Hell, some of the dependencies listed don’t even exist in the Ubuntu/Debian ecosystem. So I punted to their support forum. That page is broken, too. Most of the web pages linked to from the UV-CDAT home page are broken or useless. So I subscribed to their mailing list and ended up here.

    Anaconda is brilliant. Here is what I would suggest they do. Instead of rolling their own Python distro, use Anaconda and create a uvcdat environment within the overall Anaconda install that has the UV-CDAT-specific modules and versions of external packages. A user would download Anaconda, installing the packages/versions that come by default, then set up a virtual Python environment within Anaconda to satisfy UV-CDAT dependencies (note: virtual environments are a Python capability, not strictly an Anaconda capability, so theoretically you could have a scientist set up a virtual environment within whatever Python is available). Then have the UV-CDAT driver invoke that instance of Python. I also have a pretty good idea that we can expect a similarly managed R capability from Continuum Analytics real soon now. I think UV-CDAT needs to talk to Continuum if they haven’t already.

    For another good example of scientific software integration, look at Sage (www.sagemath.org). I think that project integrates even more disparate open source software than UV-CDAT does. All its dependencies are self-contained (except that it needs POSIX) and I’ve never had a problem installing it. It’s a big install, but disk space is cheap. They also benefit from the fact that the maths community is larger than ours, so the community development seems to go better. I’m sure the UV-CDAT potential user/developer community is much smaller.

    • Damien Irving / May 16 2014 09:20

      Hi Kevin. Thanks for your comment. I agree that the installability and documentation of UV-CDAT have a long way to go. In part, I think my post was a frustrated plea for someone to fund improvements in these areas. Your idea of a UV-CDAT environment in Anaconda is an interesting one… I’ll point the UV-CDAT discussion list at your comment and see how they respond.

    • Stephan Hoyer / May 16 2014 10:53

      I agree, it would be awesome if UV-CDAT were simply a conda package, ideally with each submodule available on its own (and as part of the whole thing). It is actually not hard at all to make your own conda packages, and Continuum even has a service for hosting them. See: https://binstar.org/

  4. El Niño / May 16 2014 08:33

    Hi Damien,
    perhaps you can enlighten me here: why isn’t CDAT better integrated in the pylab (numpy/scipy/matplotlib) stack? The real value of CDAT to me is in the cdutil and cdms packages, which take the pain out of accessing climate data in all of our quirky (e.g. GrADS) formats and do some simple processing (e.g. monthly anomalies, seasonal averages). For all the *real* data analysis (spectral, EOF, maximum covariance analysis, regression, etc.), one is either S.O.L. or has to reinvent the wheel within CDAT, while there is a worldwide wealth of expertise and code associated with SciPy that remains untapped by much of the climate community. Wouldn’t it make more sense for CDAT to better fit the pylab stack, or at least give ready access to it, so people who really want to do some sophisticated analysis (which I do in Matlab at the moment) don’t have to find workarounds to deal with CDAT’s avoidance of numpy arrays?
    I have to confess that I know very little of the constraints under which CDAT is developed, and whether this is even possible.
    Thanks for a great post, BTW!

    • Damien Irving / May 16 2014 09:29

      Hi. I hope I didn’t give the impression that CDAT isn’t integrated with the numpy/scipy/matplotlib stack, because it certainly is (see the quick sketch below). Check out my post, “A beginner’s guide to scripting with UV-CDAT”. It’s just not integrated with Iris (and, I’m sure, some other useful packages that people have developed that aren’t part of the core SciPy stack or available via pip). My post is basically a shout out for more funding to improve the documentation for the UV-CDAT scripting environment (i.e. so that people know that the SciPy stack is available), to make the installation process easier, and to make sure there is continual support for assimilating great new packages like Iris when they pop up. Does that make sense?
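
      For example, a minimal sketch of handing cdms2 data straight to the SciPy stack (the file and variable names are hypothetical, and I’m assuming the standard cdms2 and scipy.signal interfaces):

      ```python
      import cdms2
      import numpy
      from scipy import signal

      # Read a (hypothetical) monthly sea surface temperature time series
      # with cdms2, the data access library that ships with CDAT
      f = cdms2.open('nino34.nc')
      sst = f('sst', squeeze=1)

      # cdms2 variables are numpy masked arrays underneath, so the SciPy
      # stack can operate on them (here, a simple power spectrum)
      data = numpy.asarray(sst)
      freqs, power = signal.periodogram(data)
      ```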

      • El Niño / May 16 2014 13:56

        Hi Damien, sure that makes sense. I do agree it is possible to use the SciPy stack in association with UV-CDAT; it’s just a little less intuitive than I think it should be. But I admit to being one of those tax-avoiding climate scientists who are after tools to do some science, not spend a life fixing computer issues. I am very excited about the open source movement afoot in our field, of which Python is, I think, the crux. This is essential for reproducible research, and you have a point when you say that it should not require 3 separate installations to reproduce the results of one paper…

  5. Damien Irving / May 17 2014 12:34

    Here’s an additional comment that I received via email:

    You have missed the critical point, in my view, about where climate data analysis is headed, and that is to very large datasets. I have personally generated 400TB of model output this year and post-processed about a quarter of that. And I did not, nor could I, do that on my workstation. Rather, because of data transfer limitations, nearly all of my analysis had to be performed at the supercomputing center where the data was generated. I used UV-CDAT exclusively for that task, but I did not install it on the system. Rather, that is done by professional software engineers, so to zeroth order it does not matter to me how it is installed. Much more critical to me is how the analysis programs perform in parallel. And in that regard, we have a long way to go. Much of climate data analysis should, in principle, be embarrassingly parallel. However, realized parallel efficiency in the 1000-processor range is less than 10 percent for reasons unknown.

    CMIP6 will generate much more data than CMIP5 (Dean has estimated it, but I don’t recall the number of petabytes), so this aspect of our analysis workflow will be critical to a comprehensive analysis of the coming generation of climate models.

    • Damien Irving / May 17 2014 13:14

      I agree that many climate scientists (like yourself) spend their days processing large volumes of data on supercomputing facilities that provide substantial system admin support. For these people, improvements in parallel processing (which I understand is something the UV-CDAT project has been working hard on) are of primary importance, while installation issues are really only of secondary concern. In outlining a broad vision for data analysis in the weather and climate sciences, you’re therefore right in suggesting that I really should have mentioned parallel processing (and I certainly wasn’t suggesting that UV-CDAT shouldn’t continue to focus on it going forward).

      There is, however, another subset of climate scientists (like myself) who do not have access to substantial system admin support from a supercomputing center and who are not dealing with massive datasets. These scientists might be processing reanalysis datasets or observations from stations and radiosondes, which can be stored and analysed on smaller, department-owned computers (which have at best non-specialised system admin support and at worst none at all) or even their own personal laptop. For these people, solving the installation problem is of primary concern (i.e. they are the people who regularly contact the UV-CDAT support email address asking for assistance), while parallel computing is really only a secondary concern (but still relevant, since laptops these days have multiple CPUs and even GPUs that could be used to speed things up; see the sketch below). I guess it’s this subset of scientists that my post focused on.
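
      For what it’s worth, even the embarrassingly parallel cases can be tackled on a laptop with Python’s standard multiprocessing module. A rough sketch (the per-file analysis function and file list are hypothetical placeholders):

      ```python
      from multiprocessing import Pool

      def process_file(fname):
          """Stand-in for a per-file analysis (e.g. computing a
          climatology); the real work is hypothetical here."""
          return fname  # placeholder result

      if __name__ == '__main__':
          infiles = ['data_01.nc', 'data_02.nc']  # hypothetical file list
          pool = Pool(processes=4)  # one worker per CPU core
          results = pool.map(process_file, infiles)
          pool.close()
          pool.join()
      ```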

