Software installation explained
Software installation has got to be one of the most frustrating, confusing and intimidating things that research scientists have to deal with. In fact, I’ve waxed poetic in the past (see this post) about the need to solve the software installation problem. Not only is it a vitally important issue for the sanity of everyday research scientists, but it’s also critically important to the open science movement. What’s the point of having everyone share their code and data, if nobody can successfully install the software that code depends on? This post is my attempt to summarise the current state of play regarding software installation in the weather and climate sciences. Things are far from perfect, but there are some encouraging things happening in this space.
There are four major ways in which you might go about installing a certain software package. From easiest to hardest, they go as follows:
1. Download an installer
This is the dream scenario. Upon navigating to the website of the software package you’re after, you discover a downloads page which detects your operating system and presents you with a link to download the appropriate installer (sometimes called a “setup program”). You run the installer on your machine, clicking yes to agree to the terms and conditions and checking the box to include a shortcut on your desktop, and hey presto the software works as advertised. If you’re using a proprietary package like MATLAB or IDL then this has probably been your experience. It takes many developer hours to create, maintain and support software installers, so this is where (some of) your license fees are going. Free software that is very widely used (e.g. Git) is also often available via an installer; in most other cases, though, you get what you pay for when it comes to software installation…
2. Use a package manager
In the absence of an installer, your next best bet is to see whether the software you’re after is available via a package manager. Every Linux distribution comes with a package manager (e.g. apt on Debian and Ubuntu, which powers graphical front ends like the Ubuntu Software Centre, or yum on Red Hat and CentOS), while there are a range of different managers available for Mac (e.g. Homebrew) and Windows (e.g. OneGet will come standard with Windows 10). The great thing about these managers is that they handle all the software dependencies associated with an install. For instance, if the command line tool you’re installing allows for the manipulation of netCDF files, then chances are that tool depends on the relevant netCDF libraries being installed on your machine too. Package managers are smart enough to figure this out, and will install all the dependencies along the way. They will also alert you to software updates (and install them for you if you like), which means in many cases a package manager install might even be preferable to downloading an installer.
The only downside to package managers is that there is often a time lag between when a new version of a software package is released and when it gets updated on the manager. If you want the “bleeding edge” version of a particular software package or if that package isn’t available via a package manager (only fairly widely used packages make it to that stage), then you slide further down the list to option 3…
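At the command line, the package manager route looks something like the transcript below. This is an illustrative sketch rather than something to run verbatim: NCO happens to be packaged as `nco` on the major managers, but always check your manager’s package index for the exact name, and note that the apt commands assume a Debian/Ubuntu system.

```shell
# Debian/Ubuntu (the same packages the Ubuntu Software Centre installs)
sudo apt-get install nco     # apt pulls in the netCDF libraries automatically
apt-cache depends nco        # list the dependencies it resolved

# Mac with Homebrew
brew install nco
brew deps nco                # the Homebrew equivalent of the dependency listing
```

The `depends`/`deps` commands are a handy way to see exactly what a package manager is doing on your behalf.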
3. Install from binaries
We are now beyond the point of just clicking a button and having the install happen before our eyes, so we need to learn a little more about software installation to figure out what’s going on. At the core of any software is the source code, which is usually a bunch of text files (e.g. .c, .cpp and .h files in the case of software written in C/C++). In order to run that software, the source code must first be fed through a compiler. Compiling generates binaries: executable programs and libraries (on Windows these typically have .exe and .dll extensions, while on Linux and Mac executables usually have no extension at all and libraries end in .so or .dylib). To relieve users of the burden of having to compile the code themselves, software developers will often collect up all the relevant binaries in a zip file (or tarball) and make them available on the web (e.g. on a website like SourceForge). You then just have to unzip those binaries in an appropriate location on your machine. This sounds easy enough in theory, but in order to get the software working correctly there’s often an extra step – you essentially have to do the work of a package manager and install the software dependencies as well. This is occasionally impossible and almost always difficult.
(Note that an installer is basically just a zip file full of binaries that can unzip itself and copy the binaries to the right places on your computer.)
4. Install from the source code
If you’re feeling particularly brave and/or need the very latest version of a software package (e.g. perhaps a beta-version that hasn’t even been formally released yet), you can often download the source code from a site like GitHub. You now have to do the compilation step yourself, so there’s an added degree of difficulty. It turns out that even super experienced programmers avoid source code installs unless they absolutely have to.
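On Linux and Mac, the classic sequence for a source install looks something like the sketch below. Note that `some-org/some-package` is a placeholder, and the exact steps vary from project to project, so always read the project’s own README or INSTALL file first.

```shell
# Grab the source code (placeholder URL)
git clone https://github.com/some-org/some-package.git
cd some-package

# The traditional dance: check for dependencies, compile, install
./configure
make
sudo make install
```

The `./configure` step is where missing dependencies usually announce themselves, and chasing them down one by one is what makes source installs so time consuming.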
Ok, so that’s a nice high level summary of the software installation hierarchy, but how does it actually play out in reality? To demonstrate, consider my personal software requirements (see this post for details):
- NCO for simple manipulation of netCDF files
- CDO for simple data analysis tasks on netCDF files
- Python for more complex data analysis tasks
- UV-CDAT for quickly viewing the contents of netCDF files
This is how the installation of each of these packages plays out on a modern Ubuntu, Mac and Windows machine (I keep a running log of my software installation troubles and solutions here if you’re interested):
NCO & CDO
NCO and CDO are available via both the Ubuntu Software Centre and Homebrew, so installation on Ubuntu and Mac is a breeze (although there are a few bugs with the Homebrew install for CDO). Things are a little more difficult on Windows: there are binaries available for both, but the CDO binaries don’t appear to be particularly well supported.
Python
Getting the Python standard library (i.e. the core libraries that come with any Python installation) working on your machine is a pretty trivial task these days. In fact, it comes pre-installed on Ubuntu and Mac. Until recently, what wasn’t so easy was getting all the extra libraries relevant to the weather and climate sciences playing along nicely with the standard library. The problem stems from the fact that while the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific / number crunching libraries are written (at least partly) in faster languages like C (because speed is important when data arrays get really large). Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite Python libraries working together was problematic (to say the least).
To help people through this installation nightmare, Continuum Analytics have released (for free) Anaconda, which bundles together around 200 of the most popular Python libraries for science, maths, engineering and data analysis. What’s more, if you need a library that isn’t part of the core 200 and can’t be installed easily with pip, then they have developed their own package manager called conda (see here and here for some great background posts about conda). People can write conda packages for their favourite Python libraries (which is apparently a fairly simple task for experienced developers) and post them on anaconda.org, and those conda packages can be used to install the libraries (and all their dependencies) on your own machine.
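The pip versus conda difference plays out like this at the command line (netCDF4 is just a convenient example of a Python library with compiled dependencies, and the pip behaviour described reflects the state of play at the time of writing):

```shell
pip install netCDF4     # may fail unless the C netCDF/HDF5 libraries are already on your machine
conda install netCDF4   # conda installs the compiled C libraries along with the Python wrapper
```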
In terms of my personal Python install, the main extra libraries I care about are iris and cartopy (for plotting), xarray (for climate data analysis), eofs (for EOF analysis) and windspharm (for wind related quantities like the streamfunction and velocity potential). There are Linux, Mac and Windows flavoured conda packages for all four at the fantastic IOOS channel at anaconda.org, so installing them is as simple as entering something like this at the command line:
conda install -c http://conda.anaconda.org/ioos xarray
The availability of these packages for all three operating systems is something that has only happened very recently and won’t necessarily be the case for less widely used packages. The pattern I’ve noticed is that Linux packages tend to appear first, followed by Mac packages soon after. Widely used packages eventually get Windows packages as well, but in many cases this can take a while (if it happens at all).
UV-CDAT
UV-CDAT has binaries available for Ubuntu and Mac, in addition to binaries for the dependencies (which is very nice of them). There are no binaries for Windows at this stage.
If you’re struggling when it comes to software installation, rest assured you definitely aren’t alone. The software installation problem is a source of frustration for all of us and is a key roadblock on the path to open science, so it’s great that solutions like anaconda.org are starting to pop up. In the meantime (i.e. while you’re waiting for a silver bullet solution), probably the best thing you can do is have a serious think about your operating system. I don’t like to take sides when it comes to programming languages, tools or operating systems, but the reality (as borne out in the example above) is that developers work on Linux machines, which means they first and foremost make their software installable on Linux machines. Macs are an afterthought that they do often eventually get around to (because Mac OS X is Unix-based under the hood, so it’s not too hard), while Windows is an after-afterthought that often never gets addressed (because Windows isn’t Unix-based and is therefore often too hard) unless you’re dealing with a proprietary package that can afford the time and effort. If you want to make your life easy when it comes to scientific computing in the weather and climate sciences, you should therefore seriously consider working on a Linux machine, or at least on a Mac as a compromise.