September 4, 2015 / Damien Irving

Managing your data

If you’re working on a project that involves collecting (e.g. from a network of weather stations) or generating (e.g. running a model) data, then it’s likely that one of the first things you did was develop a data management plan. Many funding agencies (e.g. the National Science Foundation) actually formally require this, and such plans usually involve outlining your practices for collecting, organising, backing up, and storing the data you’ll be generating.

What many people don’t realise is that even if you aren’t collecting or generating your own data (e.g. you might simply download a reanalysis or CMIP5 dataset), you should still start your project by developing a data management plan. That plan obviously doesn’t need to cover everything a data collection/generation project does (e.g. you don’t need to think about archiving the data at a site like Figshare), but there are a few key things all data analysis projects need to consider, regardless of whether the original data were collected or generated in-house.
 
1. Data Reference Syntax

The first thing to define is your Data Reference Syntax (DRS) – a convention for naming your files. As an example, let’s look at a file from the data archive managed by Australia’s Integrated Marine Observing System (IMOS).

.../thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz

That’s a lot of information to take in, so let’s focus on the structure of the file directory first:

.../thredds/dodsC/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/

From this we can deduce, without even inspecting the contents of the file, that we have data from the IMOS project that is run by the eMarine Information Infrastructure (eMII). It was collected in 2012 at the Turquoise Coast, Western Australia (TURQ) site of the Australian Coastal Ocean Radar Network (ACORN), which is a network of high frequency radars that measure the ocean surface current. The data type has a sub-DRS of its own, which tells us that the data represents the 1-hourly average surface current for a single month (October 2012), and that it is archived on a regularly spaced spatial grid and has not been quality controlled. The file is located in the “demos” directory, as it has been generated for the purpose of providing an example for users at the very helpful Australian Ocean Data Network user code library.

Just in case the file gets separated from this informative directory structure, much of the information is repeated in the file name itself, along with some more detailed information about the start and end time of the data, and the last time the file was modified:

<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz

In the first instance this level of detail seems like a bit of overkill, but consider the scope of the IMOS data archive. It is the final resting place for data collected by the entire national array of oceanographic observing equipment in Australia, which monitors the open oceans and coastal marine environment covering physical, chemical and biological variables. Since the data are so well labelled, locating all monthly timescale ACORN data from the Turquoise Coast and Rottnest Shelf sites (which represents hundreds of files) would be as simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc

While it’s unlikely that your research will ever involve cataloging data from such a large observational network, it’s still a very good idea to develop your own personal DRS for the data you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures, as these can be very hard to change later on. The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
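
As a concrete (if simplified) illustration, here’s a short Python sketch of the same idea using only the standard library’s glob module; the directory layout and file naming pattern are hypothetical, not part of any official convention:

import glob

# Hypothetical personal DRS:
#   <data-dir>/<variable>_<dataset>_<level>_<timescale>_<grid>.nc
# e.g. data/va_ERAInterim_500hPa_daily_native.nc

# Because the fields always appear in the same order, wildcards can
# pick out any subset of files without inspecting their contents.
daily_files = sorted(glob.glob('data/*_ERAInterim_*_daily_*.nc'))
for fname in daily_files:
    print(fname)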
 
2. Data provenance

In defining my own DRS, I added some extra fields to cater for the intermediate files that typically get created throughout the data analysis process. For instance, I added a field to indicate the temporal aspects of the data (e.g. to indicate if the data are an anomaly relative to some base period) and another for the spatial aspects (e.g. to indicate whether the data have been re-gridded). While keeping track of this information via the DRS is a nice thing to do (it definitely helps with bash wildcards and visual identification of files), more detailed information needs to be recorded for the data to be truly reproducible. A good approach to recording such information is the procedure followed by the Climate Data Operators (CDO) and NetCDF Operators (NCO). Whenever an NCO or CDO utility (e.g. ncks, ncatted, cdo mergetime) is executed at the command line, a time stamp followed by a copy of the command line entry is automatically appended to the global attributes of the output netCDF file, thus maintaining a complete history of the data processing steps. Here’s an example:

Tue Jun 30 07:35:49 2015: cdo runmean,30 va_ERAInterim_500hPa_daily_native.nc va_ERAInterim_500hPa_030day-runmean_native.nc

You might be thinking, “this is all well and good, but what about data processing steps that don’t use NCO, CDO or even netCDF files?” It turns out that if you write a script (e.g. in Python, R or whatever language you’re using) that can be executed from the command line, then it only takes an extra few lines of code to parse the associated command line entry and append that information to the global attributes of a netCDF file (or a corresponding metadata text file if dealing with file formats that don’t carry their metadata with them). To learn how to do this using Python, check out the Software Carpentry lesson on Data Management in the Ocean, Weather and Climate Sciences.
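
To give a rough idea of what that looks like in practice, here’s a minimal sketch (not the code from the lesson itself) for a Python script that reads and writes netCDF files with xarray; the file handling and processing step are illustrative only:

import sys
import datetime
import xarray as xr

infile, outfile = sys.argv[1], sys.argv[2]
dset = xr.open_dataset(infile)

# ... data processing steps would go here ...

# Prepend a time stamp and the command line entry to the global history attribute
time_stamp = datetime.datetime.now().strftime('%a %b %d %H:%M:%S %Y')
entry = '{}: {}'.format(time_stamp, ' '.join(sys.argv))
old_history = dset.attrs.get('history', '')
dset.attrs['history'] = (entry + '\n' + old_history) if old_history else entry

dset.to_netcdf(outfile)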
 
3. Backup

Once you’ve defined your DRS and have implemented the NCO/CDO approach to data provenance, the final thing to think about is backing up your data. This is something I’ve discussed in detail in a previous post, but the crux of the story is that if your starting point files (i.e. the data files required at the very first step of your data processing) can be easily downloaded (e.g. reanalysis or CMIP5 data), then you probably don’t need your local copy to be backed up. All of your code should be version controlled and backed up via an external hosting service like GitHub or Bitbucket, so you can simply re-download the data and re-run your analysis scripts if disaster strikes. If, on the other hand, you generated your starting point files from scratch (e.g. you collected weather observations or ran a model that would take months to re-run), then backup is absolutely critical and should be covered by your data management plan.

 

June 3, 2015 / Damien Irving

The CWSLab workflow tool: an experiment in community code development

Give anyone working in the climate sciences half a chance and they’ll chew your ear off about CMIP5. It’s the largest climate modelling project ever conducted and formed the basis for much of the IPCC Fifth Assessment Report, so everyone has an opinion on which are the best models, the level of confidence we should attach to projections derived from the models, etc, etc. What they probably won’t tell you about is the profound impact that CMIP5 has had on climate data processing and management. In the lead up to CMIP5 (2010/11), I was working at CSIRO in a support scientist role. When I think back on that time, I refer to it as The Great Data Duplication Panic. In analysing output from CMIP3 and earlier modelling projects, scientists simply downloaded data onto their local server (or even personal computer) and did their own analysis in isolation. At the CSIRO Aspendale campus alone there must have been a dozen copies of the CMIP3 dataset floating around. Given the sheer size of the CMIP5 archive (~3 petabytes!), we recognised very quickly that this kind of data duplication just wasn’t going to fly.

Support scientists at CSIRO and the Bureau of Meteorology were particularly panicked about two types of data duplication: download duplication (i.e. duplication of the original dataset) and processing duplication (e.g. duplication of similarly processed data such as a common horizontal regridding or extraction of the Australian region). It was out of this panic that the Climate and Weather Science Laboratory (CWSLab) was born (although it wasn’t called that back then).

Download duplication

The download duplication problem has essentially been addressed by two major components of the CWSLab project. The NCI data library stores a variety of local and international climate and weather datasets (including CMIP5), while the NCI computational infrastructure is built directly on top of that library so you can do your data processing in situ (i.e. as opposed to downloading the data to your own machine). The computational infrastructure consists of Raijin (a powerful supercomputer) and the NCI High Performance Cloud for super complex and/or data-intensive tasks, while for everyday work they have the CWS Virtual Desktops. These virtual desktops have more grunt than your personal laptop or desktop (4 CPUs, 20 GB RAM, 66 GB storage) and were deemed the best way to provide scientists with remote access to data exploration tools like MATLAB and UV-CDAT that involve a graphical user interface.

While solving the download duplication problem has been a fantastic achievement, it was aided by the fact that the solution didn’t require everyday climate scientists to change their behaviour in any appreciable way. They simply login to a machine at NCI rather than their local server and proceed with their data analysis as per normal. The processing duplication problem on the other hand will require a change in behaviour and may therefore be more difficult to solve…

Processing duplication

The CWSLab answer to the processing duplication problem is the CWSLab workflow tool, which can be run from the CWS Virtual Desktop. The tool is a plugin/add-on to the VisTrails workflow and provenance management system (see this previous post for a detailed discussion of workflow automation) and allows you to build, run and capture metadata for analyses involving the execution of multiple command line programs (see this example Nino 3.4 workflow). The code associated with the VisTrails plugin is hosted in three separate public GitHub repositories:

  • cwsl-ctools: A collection of command line programs used in performing common climate data analysis tasks. The programs can be written in any programming language, they just have to be able to parse the command line.
  • cwsl-mas: The source code for the plugin. In essence, it contains a wrapper for each of the command line programs in the cwsl-ctools repo which tells VisTrails how to implement that program.
  • cwsl-workflows: A collection of example workflows that use the VisTrails plugin.

The CWSLab workflow tool writes output files using a standardised data reference syntax, which is how it’s able to solve the processing duplication problem. For instance, if someone has already regridded the ACCESS1-0 model to a 1 by 1 degree global grid, the system will be able to find that file rather than re-creating/duplicating it.
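
The logic behind that re-use is easy to sketch. The snippet below is not CWSLab code, just a hypothetical illustration of how a standardised file name makes previously processed output discoverable (the path template is invented):

import os

def regridded_path(variable, model, grid):
    """Hypothetical DRS for regridded monthly output."""
    return '/data/processed/{v}_{m}_monthly_{g}.nc'.format(v=variable, m=model, g=grid)

target = regridded_path('tas', 'ACCESS1-0', '1x1')
if os.path.exists(target):
    print('Re-using existing file:', target)
else:
    print('File not found, launching the regridding step for:', target)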

A community model for help and code development

Unlike the NCI infrastructure and data library which have dedicated staff, the group of support scientists behind the VisTrails plugin have very small and infrequent time allocations on the project. This means that if the workflow tool is to succeed in the long run, all scientists who do climate model analysis at NCI will need to pitch in on both code development and requests for help.

Fortunately, GitHub is perfectly set up to accommodate both tasks. Scientists can "fork" a copy of the cwsl code repositories to their own GitHub account, make any changes to the code that they’d like to see implemented (e.g. a new script for performing linear regression), and then submit a "pull request" to the central cwsl repo. The community can then view the proposed changes and discuss them before finally accepting or rejecting. Similarly, instead of a help desk, requests for assistance are posted to the cwsl-mas chat room on Gitter. These rooms are a relatively new feature linked to GitHub code repositories and are specifically designed for chatting about code. People post questions, and anyone in the community who knows the answer can post a reply. If the question is too long/complex for the chat room, it can be posted as an issue on the relevant GitHub repo for further community discussion.

Multiple birds with one stone

By adopting a community approach, the workflow tool addresses a number of other issues besides data duplication.

  • Code review. Software developers review each other’s code all the time, but scientists rarely do. The Mozilla Science Lab have now run two iterations of their Code Review for Scientists project to figure out when and how scientific code should be reviewed, and their findings are pretty clear. Code review at the completion of a project (e.g. when you submit a paper to a journal) is fairly pointless, because the reviewer hasn’t been intimately involved in the code development process (i.e. they can make cosmetic suggestions but nothing of substance). Instead, code review needs to happen throughout a scientific research project. The pull request system used by the CWSLab workflow tool allows for this kind of ongoing review.
  • Code duplication. Any scientist that is new to climate model data analysis has to spend a few weeks (at least) writing code to do basic data input/output and processing. The cwsl-ctools repo means they no longer need to reinvent the wheel – they have access to a high quality (i.e. lots of people have reviewed it) code repository for all those common and mundane data analysis tasks.
  • Reproducible research. The reproducibility crisis in computational research has been a topic of conversation in the editorial pages of Nature and Science for a number of years now, yet very few papers in today’s climate science journals include sufficient documentation (i.e. details of the software and code used) for readers to reproduce key results. The CWSLab workflow tool automatically captures detailed metadata about a given workflow (right down to the precise version of the code that was executed; see here for details) and therefore makes the generation of such documentation easy.

Conclusion

The CWSLab workflow tool is an ambitious and progressive initiative that will require a shift in the status quo if it is to succeed. Researchers will need to overcome the urge to develop code in isolation and the embarrassment associated with sharing their code. They’ll also have to learn new skills like version control with git and GitHub and how to write scripts that can parse the command line. These things are not impossible (e.g. Software Carpentry teaches command line programs and version control in a single afternoon) and the benefits are clear, so here’s hoping it takes off!

April 6, 2015 / Damien Irving

Workflow automation

In previous posts (e.g. What’s in your bag?) I’ve discussed the various tools I use for data analysis. I use NCO for making simple edits to the attributes of netCDF files, CDO for routine calculations on netCDF files and a whole range of Python libraries for doing more complicated analysis and visualisation. In years gone by I’ve also included NCL and Fortran in the mix. Such diversity is pretty common (i.e. almost nobody uses a single programming language or tool for all their analysis), so this post is my attempt at an overview of workflow automation. In other words, how should you go about tying together the various tools you use to produce a coherent, repeatable data analysis pipeline?

The first thing to note is that the community has not converged on a single best method for workflow automation. Instead, there appear to be three broad options, depending on the complexity of your workflow and the details of the tools involved:

  1. Write programs that act like any other command line tool and then combine them with a shell script or build manager
  2. Use an off-the-shelf workflow management system
  3. Write down the processing steps in a lab notebook and re-execute them manually

Let’s consider these approaches one by one:

 

1. Command line

Despite the fact that its commands are infuriatingly terse and cryptic, the Unix shell has been around longer than most of its users have been alive. It has survived so long because of the ease with which (a) repetitive tasks can be automated and (b) existing programs can be combined in new ways. Given that NCO and CDO are command line tools (i.e. you’re probably going to be using the command line anyway), it could be argued that the command line is the most natural home for workflow automation in the weather and climate sciences. For instance, in order to integrate my Python scripts with the rest of my workflow, I use the argparse library to make those scripts act like any other command line program. They can be executed from the command line, ingest arguments and options from the command line, combine themselves with other command line programs via pipes and filters, and output help information just like a regular command line program.
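
As a rough sketch of what such a script looks like (the variable names, options and processing step here are illustrative, not taken from my actual code):

import argparse
import xarray as xr

parser = argparse.ArgumentParser(description='Apply a running mean to a netCDF variable.')
parser.add_argument('infile', help='input netCDF file')
parser.add_argument('variable', help='variable name')
parser.add_argument('outfile', help='output netCDF file')
parser.add_argument('--window', type=int, default=30,
                    help='running mean window in time steps [default: 30]')
args = parser.parse_args()

# Assumes the input data have a "time" dimension
dset = xr.open_dataset(args.infile)
dset[args.variable] = dset[args.variable].rolling(time=args.window, center=True).mean()
dset.to_netcdf(args.outfile)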

Armed with my collection of command line-native Python scripts, the easiest way to link multiple processing steps is to store them in a shell script. For instance, I could execute the following hypothetical workflow by storing all the steps (i.e. command line entries) in a shell script called run-streamfunction.sh.

  1. Edit the “units” attribute of the original zonal and meridional wind data netCDF files (NCO)
  2. Calculate the streamfunction from the zonal and meridional wind (calc_streamfunction.py)
  3. Calculate the streamfunction anomaly by subtracting the climatological mean at each timestep (CDO)
  4. Apply a 30 day running mean to the streamfunction anomaly data (CDO)
  5. Plot the average streamfunction anomaly for a time period of interest (plot_streamfunction.py)

This would be a perfectly valid approach if I were dealing with a small dataset, but let’s say I wanted to process 6-hourly data from the JRA-55 reanalysis dataset over the period 1958-2014 for the entire globe. In this situation, the calc_streamfunction.py script I wrote would take days to run on my department’s server, so I’d rather not execute every single step in run-streamfunction.sh every time I change the time period used for the final plot. What I need is a build manager – a smarter version of run-streamfunction that can figure out whether previous steps have already been executed and whether they need to be updated.

The most widely used build manager on Unix and its derivatives is called Make. Like the Unix shell it is old, cryptic and idiosyncratic, but it’s also fast, free and well-documented, which means it has stood the test of time. I started using Make to manage my workflows about a year ago and it has revolutionised the way I work. I like it because of the documentation and also the fact that it’s available no matter what machine I’m on, however there are other options (e.g. doit, makeflow, snakemake, ruffus) if you’d like something a little less cryptic.
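
To give a flavour of what a build manager buys you, here’s a minimal doit sketch for part of the hypothetical streamfunction workflow above (the data file names and script arguments are invented); a task is only re-run when one of its file dependencies has changed:

# dodo.py -- run the pipeline by typing "doit" at the command line

def task_streamfunction():
    """Calculate the streamfunction from the zonal and meridional wind."""
    return {
        'file_dep': ['ua_ERAInterim_500hPa_daily_native.nc',
                     'va_ERAInterim_500hPa_daily_native.nc',
                     'calc_streamfunction.py'],
        'targets': ['sf_ERAInterim_500hPa_daily_native.nc'],
        'actions': ['python calc_streamfunction.py ua_ERAInterim_500hPa_daily_native.nc '
                    'va_ERAInterim_500hPa_daily_native.nc sf_ERAInterim_500hPa_daily_native.nc'],
    }

def task_plot():
    """Plot the average streamfunction anomaly for the period of interest."""
    return {
        'file_dep': ['sf_ERAInterim_500hPa_daily_native.nc', 'plot_streamfunction.py'],
        'targets': ['sf_anomaly_plot.png'],
        'actions': ['python plot_streamfunction.py sf_ERAInterim_500hPa_daily_native.nc sf_anomaly_plot.png'],
    }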

To learn how to apply the command line approach to your own workflow automation, check out the Software Carpentry lessons on the Unix shell and on automation with Make.

 

2. Workflow management systems

The command line discussion above suggests the use of shell scripts for automating small, simple data processing pipelines, and build managers like Make and doit for pipelines that are either slightly more complicated or have steps that you’d rather not repeat unnecessarily (e.g. steps that take many hours/days to run). For many weather and climate scientists (myself included), this is as far as you’ll need to go. Make and doit have all the functionality you’ll ever really need for automatically executing a data analysis pipeline, and by following the process laid out in the data management lesson linked to above you’ll be able to document that pipeline (i.e. produce a record of the provenance of your data).

But what if you’re working on a project that is too big and complicated for a simple Makefile or two? The management of uber-complex workflows such as those associated with running coupled climate models or processing the whole CMIP5 data archive can benefit greatly from specialised workflow management systems like VisTrails, pyRDM, Sumatra or Pegasus. These systems can do things like manage resource allocation for parallel computing tasks, execute steps that aren’t run from the command line, automatically publish data to a repository like Figshare and produce nice flowcharts and web interfaces to visualise the entire workflow.

I’ve never used one of these systems, so I’d love to hear from anyone who has. In particular, I’m curious to know whether such tools could be used for smaller/simpler workflows, or whether the overhead associated with setting up and learning the system cancels out any benefit over simpler options like Make and doit.

 

3. The semi-manual approach

While writing command line programs is a relatively simple and natural thing to do in Python, it’s not the type of workflow that is typically adopted by users of more self-contained environments like MATLAB and IDL. From my limited experience/exposure, it appears that users of these environments tend not to link the data processing that happens within MATLAB or IDL with processing that happens outside of it. For instance, they might pre-process their data using NCO or CDO at the command line, before feeding the resulting data files into MATLAB or IDL to perform additional data processing and visualisation. This break in the workflow implies that some manual intervention is required to check whether previous processing steps need to be executed and to initiate the next step in the process (i.e. something that Make or doit would do automatically). Manual checking and initiation is not particularly problematic for smaller pipelines, but can be error prone (and time consuming) as workflows get larger and more complex.

Since I’m by no means an expert in MATLAB or IDL, I’d love to hear how regular users of those tools manage their workflows.

January 31, 2015 / Damien Irving

Plugging into the computational best practice discussion

From the time we write our first literature review as a graduate or honours year student, we are taught the importance of plugging into the conversation around our research topic of interest. We subscribe to journal email alerts, set up automated searches on databases like Web of Science, join the departmental journal reading club and attend relevant sessions at conferences. If we want to get a job one day, we also figure out pretty quickly which email lists and job boards to keep an eye on (see this post for an overview). A discussion that people tend not to be so engaged with, however, is that around computational best practices. Modern weather and climate scientists spend a lot of time writing and debugging code, but discussions around the best tools and tricks for doing this are a little harder to find. As such, here’s my attempt at a consolidated list of the best places to tune in.

Online:

  • Twitter is an absolute gold mine when it comes to quality discussions about computational best practice. Start by following accounts like @swcarpentry, @datacarpentry, @mozillascience, @victoriastodden and @openscience and you’ll soon identify the other key people to follow.
  • Nature has recently started a Toolbox section on its website, which features regular articles about scientific software, apps and online tools. (I recently featured in a question and answer piece about the software used in the weather and climate sciences.)
  • The Mozilla Science Lab Forum hosts all sorts of discussions about open science and computational research.
  • This blog! (and also the PyAOS blog if you’re into Python)

Offline (i.e. in person!):

  • Two-day workshops hosted by Software Carpentry or its new sister organisation Data Carpentry are the perfect place to get up-to-date with the latest tips and tricks. Check their websites for upcoming workshops and if there isn’t one in your region, email them to tee one up for your home department, institution or conference. If you enjoy the experience, stay involved as a volunteer helper and/or instructor and you’ll always be a part of the computational conversation.
  • Don’t tell my colleagues, but I’ve always found scientific computing conferences like SciPy, PyCon or the Research Bazaar conference to be way more useful than the regular academic conferences I usually go to. (Check out my reflections on PyCon Australia here)
  • Local open science and/or data science meetups are really starting to grow in popularity. For instance, the Research Bazaar project hosts a weekly “Hacky Hour” at a bar on campus at the University of Melbourne, while a bunch of scientists have got together to form Data Science Hobart in Tasmania. If such a meetup doesn’t exist in your area, get a bunch of colleagues together and start one up!
October 30, 2014 / Damien Irving

Software installation explained

Software installation has got to be one of the most frustrating, confusing and intimidating things that research scientists have to deal with. In fact, I’ve waxed poetic in the past (see this post) about the need to solve the software installation problem. Not only is it a vitally important issue for the sanity of everyday research scientists, but it’s also critically important to the open science movement. What’s the point of having everyone share their code and data, if nobody can successfully install the software that code depends on? This post is my attempt to summarise the current state of play regarding software installation in the weather and climate sciences. Things are far from perfect, but there are some encouraging things happening in this space.

 

In Theory…

There are four major ways in which you might go about installing a certain software package. From easiest to hardest, they go as follows:

1. Download an installer

This is the dream scenario. Upon navigating to the website of the software package you’re after, you discover a downloads page which detects your operating system and presents you with a link to download the appropriate installer (sometimes called a “setup program”). You run the installer on your machine, clicking yes to agree to the terms and conditions and checking the box to include a shortcut on your desktop, and hey presto the software works as advertised. If you’re using a proprietary package like MATLAB or IDL then this has probably been your experience. It takes many developer hours to create, maintain and support software installers, so this is where (some of) your license fees are going. Free software that is very widely used (e.g. Git) is also often available via an installer, however in most cases you get what you pay for when it comes to software installation…

2. Use a package manager

In the absence of an installer, your next best bet is to see whether the software you’re after is available via a package manager. Most Linux distributions come with a package manager (e.g. the apt-based Ubuntu Software Centre), while there are a range of different managers available for Mac (e.g. Homebrew) and Windows (e.g. OneGet will come standard with Windows 10). The great thing about these managers is that they handle all the software dependencies associated with an install. For instance, if the command line tool you’re installing allows for the manipulation of netCDF files, then chances are that tool depends on the relevant netCDF libraries being installed on your machine too. Package managers are smart enough to figure this out, and will install all the dependencies along the way. They will also alert you to software updates (and install them for you if you like), which means in many cases a package manager install might even be preferable to downloading an installer.

The only downside to package managers is that there is often a time lag between when a new version of a software package is released and when it gets updated on the manager.  If you want the “bleeding edge” version of a particular software package or if that package isn’t available via a package manager (only fairly widely used packages make it to that stage), then you slide further down the list to option 3…

3. Install from binaries

We are now beyond the point of just clicking a button and having the install happen before our eyes, so we need to learn a little more about software installation to figure out what’s going on. At the core of any software is the source code, which is usually a bunch of text files (e.g. .c, .cpp and .h files in the case of software written in C/C++). In order to run that source code, you must first feed it through a compiler. Compiling generates binaries (on Windows these are typically .exe or .dll files). To relieve users of the burden of having to compile the code themselves, software developers will often collect up all the relevant binaries in a zip file (or tarball) and make them available on the web (e.g. on a website like SourceForge). You then just have to unzip those binaries in an appropriate location on your machine. This sounds easy enough in theory, but in order to get the software working correctly there’s often an extra step – you essentially have to do the work of a package manager and install the software dependencies as well. This is occasionally impossible and almost always difficult.

(Note that an installer is basically just a zip file full of binaries that can unzip itself and copy the binaries to the right places on your computer.)

4. Install from the source code

If you’re feeling particularly brave and/or need the very latest version of a software package (e.g. perhaps a beta-version that hasn’t even been formally released yet), you can often download the source code from a site like GitHub. You now have to do the compilation step yourself, so there’s an added degree of difficulty. It turns out that even super experienced programmers avoid source code installs unless they absolutely have to.

 

In Practice…

Ok, so that’s a nice high level summary of the software installation hierarchy, but how does it actually play out in reality? To demonstrate, consider my personal software requirements (see this post for details):

  • NCO for simple manipulation of netCDF files
  • CDO for simple data analysis tasks on netCDF files
  • Python for more complex data analysis tasks
  • UV-CDAT for quickly viewing the contents of netCDF files

This is how the installation of each of these packages plays out on a modern Ubuntu, Mac and Windows machine (I keep a running log of my software installation troubles and solutions here if you’re interested):

NCO & CDO

NCO and CDO are available via both the Ubuntu Software Centre and Homebrew, so installation on Ubuntu and Mac is a breeze (although there are a few bugs with the Homebrew install for CDO). Things are a little more difficult for Windows. There are binaries available for both, however it doesn’t appear that the CDO binaries are particularly well supported.

Python

Getting the Python standard library (i.e. the core libraries that come with any Python installation) working on your machine is a pretty trivial task these days. In fact, it comes pre-installed on Ubuntu and Mac. Until recently, what wasn’t so easy was getting all the extra libraries relevant to the weather and climate sciences playing along nicely with the standard library. The problem stems from the fact that while the default Python package installer (pip) is great at installing libraries that are written purely in Python, many scientific / number crunching libraries are written (at least partly) in faster languages like C (because speed is important when data arrays get really large). Since pip doesn’t install dependencies like the core C or netCDF libraries, getting all your favourite Python libraries working together was problematic (to say the least). To help people through this installation nightmare, Continuum Analytics have released (for free) Anaconda, which bundles together around 200 of the most popular Python libraries for science, maths, engineering and data analysis. What’s more, if you need a library that isn’t part of the core 200 and can’t be installed easily with pip, then they have developed their own package manager called conda (see here and here for some great background posts about conda). People can write conda packages for their favourite Python libraries (which is apparently a fairly simple task for experienced developers) and post them on anaconda.org, and those conda packages can be used to install the libraries (and all their dependencies) on your own machine.

In terms of my personal Python install, the main extra libraries I care about are iris and cartopy (for plotting), xarray (for climate data analysis), eofs (for EOF analysis) and windspharm (for wind related quantities like the streamfunction and velocity potential). There are Linux, Mac and Windows flavoured conda packages for all four at the fantastic IOOS channel at anaconda.org, so installing them is as simple as entering something like this at the command line:

conda install -c http://conda.anaconda.org/ioos xarray

The availability of these packages for all three operating systems is something that has only happened very recently and won’t necessarily be the case for less widely used packages. The pattern I’ve noticed is that Linux packages tend to appear first, followed by Mac packages soon after. Widely used packages eventually get Windows packages as well, but in many cases this can take a while (if it happens at all).

UV-CDAT

UV-CDAT has binaries available for Ubuntu and Mac, in addition to binaries for the dependencies (which is very nice of them). There are no binaries for Windows at this stage.

 

In Conclusion…

If you’re struggling when it comes to software installation, rest assured you definitely aren’t alone. The software installation problem is a source of frustration for all of us and is a key roadblock on the path to open science, so it’s great that solutions like anaconda.org are starting to pop up. In the meantime (i.e. while you’re waiting for a silver bullet solution), probably the best thing you can do is have a serious think about your operating system. I don’t like to take sides when it comes to programming languages, tools or operating systems, but the reality (as borne out in the example above) is that developers work on Linux machines, which means they first and foremost make their software installable on Linux machines. Macs are an afterthought that they do often eventually get around to (because Mac OS X is Unix-based, so it’s usually not too hard), while Windows is an after-afterthought that often never gets addressed (because Windows is not Unix-based and is therefore often too hard) unless you’re dealing with a proprietary package that can afford the time and effort. If you want to make your life easy when it comes to scientific computing in the weather and climate sciences, you should therefore seriously consider working on a Linux machine, or at least on a Mac as a compromise.

August 28, 2014 / Damien Irving

Speeding up your code

In today’s modern world of big data and high resolution numerical models, it’s pretty easy to write a data analysis script that would take days/weeks (or even longer) to run on your personal (or departmental) computer. With buzz words like high performance computing, cloud computing, vectorisation, supercomputing and parallel programming floating around, what’s not so easy is figuring out the best course of action for speeding up that code. This post is my attempt to make sense of all the options…

 

Step 1: Vectorisation

The first thing to do with any slow script is to use a profiling tool to locate exactly which part(s) of the code are taking so long to run. All programming languages have profilers, and they’re normally pretty simple to use. If your code is written in a high-level language like Python, R or Matlab, then the bottleneck is most likely a loop of some sort (e.g. a "for" or "while" loop). These languages are designed such that the associated code is relatively concise and easy for humans to read (which speeds up the code development process), but the trade-off is that they’re relatively slow for computers to run, especially when it comes to looping.
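
In Python, for instance, the built-in cProfile and pstats modules are enough to find the hot spots; here’s a minimal sketch with a placeholder function standing in for your analysis code:

import cProfile
import pstats

def analyse_data():
    """Placeholder for the slow analysis you want to profile."""
    total = 0.0
    for i in range(10**6):
        total += i ** 0.5
    return total

cProfile.run('analyse_data()', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)   # show the ten most expensive calls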

If a slow loop is at fault, then your first course of action should be to see if that loop can be vectorised. For instance, let’s say you’ve got some temperature data on a time/latitude/longitude grid, and you want to convert all the values from Kelvin to Celsius. You could loop through the three dimensional grid and subtract 273.15 at each point, however if your data came from a high resolution global climate model then this could take a while. Instead, you should take advantage of the fact that your high-level programming language almost certainly supports vectorised operations (a.k.a. array programming). This means there will be a way to apply the same operation to an entire array of data at once, without looping through each element one by one. In Python, the NumPy extension supports array operations. Under the hood NumPy is actually written in a low-level language called C, but as a user you never see or interact with that C code (which is lucky because low-level code is terse and not so human friendly). You simply benefit from the speed of C (and the years of work that have gone into optimising array operations), with the convenience of coding in Python.
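
Here’s the Kelvin-to-Celsius example written both ways, as a rough sketch (random numbers stand in for real model output):

import numpy as np

# Fake temperature data in Kelvin on a (time, lat, lon) grid
temps_k = 270 + 30 * np.random.random((365, 180, 360))

# Slow: visit every element one by one
temps_c_loop = np.empty(temps_k.shape)
for t in range(temps_k.shape[0]):
    for y in range(temps_k.shape[1]):
        for x in range(temps_k.shape[2]):
            temps_c_loop[t, y, x] = temps_k[t, y, x] - 273.15

# Fast: vectorised, the subtraction is applied to the whole array at once
temps_c = temps_k - 273.15

assert np.allclose(temps_c, temps_c_loop)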

If you get creative – and there are lots of useful NumPy routines (and their equivalents in other languages) out there to help with this – almost any loop can be vectorised. If for some reason your loop isn’t amenable to vectorisation (e.g. you might be making quite complicated decisions at each grid point using lots of “if” statements), then another option would be to re-write that section of code in a low-level language like C or Fortran. Most high-level languages allow you to interact with functions written in C or Fortran, so you can then incorporate that function into your code.

 

Step 2: Exploit the full capabilities of your hardware

Ok, so let’s say you’ve vectorised all the loops that you can, you’ve written any other slow loops in a low-level language, and your code is still too slow (or perhaps you can’t be bothered learning a low-level language and have skipped that option, which is totally understandable). The next thing to try is parallel programming.

One of the motivations for parallel programming has been the diminishing gains in single-core performance with each new generation of central processing unit (CPU). In response, computer makers have introduced multi-core processors that contain more than one processing core. Most desktop computers, laptops, and even tablets and smart phones have two or more CPU cores these days (e.g. devices are usually advertised as "dual-core" or "quad-core"). In addition to multi-core CPUs, Graphics Processing Units (GPUs) have also become more powerful in recent years, and are increasingly being used not just for drawing graphics to the screen, but for general purpose computation.

By default, your Python, R or Matlab code will run on a single CPU. The level of complexity involved in farming that task out to multiple processors (i.e. multiple CPUs and perhaps even a GPU) depends on the nature of the code itself. If the code is “embarrassingly parallel” (yep, that’s the term they use) then the process is usually no more complicated than renaming your loop. In Matlab, for instance, you simply click a little icon to launch some “workers” (i.e. multiple CPUs) and then change the word “for” in your code to “parfor.” Simple as that.

A problem is embarrassingly parallel so long as there exists no dependency (or need for communication) between the parallel tasks. For instance, let’s say you’ve got a loop that calculates a particular statistical quantity for an array of temperature data from a European climate model, then an American model, Japanese model and finally an Australian model. You could easily farm that task out to the four CPUs on your quad-core laptop, because there is no dependency between each task – the calculation of the statistic for one model doesn’t require any information from the same calculation performed for the other models.
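
In Python, the standard library’s multiprocessing module covers this situation; here’s a minimal sketch, with a placeholder statistic and invented file names:

from multiprocessing import Pool
import numpy as np

def calc_statistic(model_file):
    """Placeholder: load a model's temperature data and return a summary statistic."""
    data = np.random.random((1000, 145, 192))   # stand-in for data read from model_file
    return model_file, data.mean()

model_files = ['tas_EuropeanModel.nc', 'tas_AmericanModel.nc',
               'tas_JapaneseModel.nc', 'tas_AustralianModel.nc']

if __name__ == '__main__':
    with Pool(processes=4) as pool:             # one worker per core on a quad-core machine
        results = pool.map(calc_statistic, model_files)
    for fname, value in results:
        print(fname, value)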

 

Step 3: Consider more and/or better hardware

Ok, so you’ve vectorised your code, re-written other slow loops in a low-level language (or not), and farmed off any embarrassingly parallel parts of the code to multiple processors on your machine. If your code is still too slow, then you’ve essentially reached the limit of your hardware (whether that be your personal laptop/desktop or the server in your local department/organisation) and you’ll need to consider running your code elsewhere. As a researcher your choices for elsewhere are either a supercomputing facility or cloud computing service, and for both there will probably be some process (and perhaps a fee) involved in applying for time/space.

In the case of cloud computing, you’re basically given access to lots of remote computers (in fact, you’ll probably have no idea where those computers are physically located) that are connected via a network (i.e. the internet). In many cases these computers are no better or more advanced than your personal laptop, however instead of being limited to one quad-core machine (for instance) you can now have lots of them. It’s not hard to imagine that this can be very useful for embarrassingly parallel problems like the one described earlier. There are about 50 climate models in the CMIP5 archive (i.e. many more than just a single European, American, Japanese and Australian model), but these could all be analysed at once with access to a dozen quad-core machines. There are tools like the Matlab Distributed Compute Service to help deal with the complexities associated with running code across multiple machines at once (i.e. a cluster), so it’s really not much more difficult than using multiple cores on your own laptop.

The one thing we haven’t considered so far is parallel computing for non-embarrassing problems like running a complex global climate model (as opposed to analysing the output). The calculation of the surface temperature at each latitude/longitude point as time progresses, for instance, depends on the temperature, humidity and sunshine on the days prior and also the temperature, humidity and sunshine at the surrounding grid-points. These are known as distributed computing problems, because each process happening in parallel needs to communicate/share information with other processes happening at the same time. Cloud computing isn’t great in this case, because the processors involved aren’t typically very close to one another and communication across the network is relatively slow. In a supercomputer on the other hand, all the processors are very close to one another and communication is very fast. There’s a real science behind distributed computing, and it typically requires a total re-think of the way your code and problem is structured/framed (i.e. you can’t just replace “for” with “parfor”). To cut a long story short, if your problem isn’t embarrassingly parallel and distributed computing is the answer to your problem, there won’t be a simple tool to help you out. You’re going to need professional assistance.

(Note: This isn’t to say that supercomputers aren’t good for embarrassingly parallel problems too. A supercomputer has thousands and thousands of cores – more than enough to solve most problems fast. It’s just that in theory you have access to more cores in cloud computing, because you can just keep adding more machines. If you’re dealing with the volumes of data that Amazon or Google do, then this is an important distinction between cloud and super-computing.)

 

Concluding remarks

If you’re a typical weather/climate scientist, then well vectorised code written in a high-level language will run fast enough for pretty much any research problem you’d ever want to tackle. In other words, for most use cases speed depends most critically on how the code is written, not what language it’s written in or how/where it’s run. However, if you do find yourself tackling more complex problems, it’s important to be aware of the options available for speeding up your code and the level of complexity involved. Farming out an embarrassingly parallel problem to the four CPUs on your local machine is probably worth the small amount of time and effort involved in setting it up, whereas applying for access to cloud computing before you’ve exhausted the options of vectorisation and local parallel computing would probably not be a wise investment of your time, particularly if the speed increases aren’t going to be significant.

May 15, 2014 / Damien Irving

A vision for data analysis in the weather and climate sciences

I’ve been a Software Carpentry instructor for a little over a year now, which is to say that I put aside a couple of days every now and then to teach basic software skills to scientists. The two-day “bootcamps” that we run are the brainchild of Greg Wilson, who has been teaching programming to scientists for over 15 years. I wasn’t there to see it in person, but a couple of weeks ago Greg gave a great talk at PyCon about the lessons he’s learned along the way. As you might imagine, his recorded talk and accompanying paper are chock-full of wonderful insights into how we might improve computational standards in the science community. In articulating my vision for data analysis in the weather and climate sciences, I wanted to focus on what Greg listed as his number one lesson:

“Most scientists think of programming as a tax they have to pay in order to do their science.”

In other words, most scientists could have majored in computer science if they wanted to, but instead pursued their passion in ecology, biology, genetics or meteorology. Among many other things, this means that scientists (a) don’t know anything about system admin, and (b) have no desire and/or time to learn. If it isn’t easy for them to get open source software installed and running on their machine, they’re not going to spend hours trawling terse developer discussion lists online. They’re either going to switch to a proprietary package like Matlab that is easy to install, or worse still give up on or modify the analysis they were planning to do. In short, if we want to foster a culture where scientists regularly participate in open and collaborative software development (which I’m sure is the evil master plan of the Mozilla Science Lab, which funds Software Carpentry), then as a first step we must solve the software installation problem.

I’m certainly not the first person to make this observation, and for general data analysis a company called Continuum Analytics has already solved the problem. Their (free) Anaconda product bundles together pretty much all the Python packages you could ever need for general data management, analysis, and visualisation. You literally just download the executable that matches your operating system and then, hey presto, everything is installed and working on your machine. Anaconda also comes with pip, which is the Python package manager that most people use to simply and easily install additional packages.

That’s all well and good for general data analysis, but in fields like the weather and climate sciences some of our work is highly specialised. It would be incredibly inefficient if every climate scientist had to take the general Python numerical library (numpy) and build upon it to write their own modules for calculating climatologies or seasonal anomalies, so there are community developed packages like UV-CDAT that do just that (the Climate Data Analysis Tools or CDAT part in particular includes a library called cdutil for calculating seasonal anomalies; see this post for an overview). In my own work there are also a couple of community developed packages called windspharm and eofs that I use to calculate wind related quantities (e.g. streamfunction, velocity potential) and to perform empirical orthogonal function analyses. These have been assimilated into the UV-CDAT framework, which means that when I run an interactive IPython session or execute a Python script, I’m able to import windspharm and eofs as well as the cdutil library. This makes my life much easier because these packages are not available via pip and I’d have no idea how to install CDAT, windspharm and eofs myself such that they were all available from the same IPython session.

At this point you can probably see where I’m going with this… UV-CDAT has the potential to become the Anaconda of the weather and climate sciences. The reason I preface this statement with ‘potential’ is that it’s not quite there yet. For example, the UK Met Office recently developed a package called Iris, which has built on the general Python plotting library (matplotlib) to make it much easier to create common weather and climate science plots like Hovmoller diagrams and geographic maps. Since it hasn’t been assimilated into the UV-CDAT framework, I have it installed separately on my machine. This means I cannot write a script that calculates the seasonal climatology using cdutil and then plots the output using Iris. I create all sorts of wacky work-arounds to cater for installation issues like this, but these ultimately cost me time and make it very difficult for me to make my code openly available with any publications I produce (i.e. I guess you can make a data processing workflow openly available when it depends on multiple separate Python installations, but it’s hardly user friendly!).

The UV-CDAT development team is certainly not to blame in this situation. On the contrary, they should be applauded for producing a package that has the potential to solve the installation issue in the weather and climate sciences. I also know that they are working to include more user contributed packages (e.g. aoslib), and it would be reasonable to expect that they would wait to see if Iris becomes popular before going to the effort of assimilating it. In my mind the key to UV-CDAT realising its potential is probably related to the fact that up until now it’s been much easier for them to get funding to develop the Ultrascale Visualisation (UV) part of the project, as opposed to the scripting environment. It’s a flashy graphical user interface for 3D data visualisation and has certainly made a substantial contribution to the processing of large datasets in the weather and climate sciences, however it’s only one piece of the puzzle. My vision (or hope) is that funding bodies begin to recognise that (a) the software installation issue is one of the major roadblocks to progress in the weather and climate sciences, and (b) UV-CDAT, much like Anaconda in the general data analysis sphere, is the vehicle that the community needs to rally around in order to overcome it. With ongoing financial support for the assimilation of new packages (including Iris in the first instance) and for updating/improving the documentation and user-support associated with the scripting environment, UV-CDAT could play a vitally important role in enabling a culture of regular and widespread community-scale software development in the weather and climate sciences.

May 1, 2014 / Damien Irving

What’s in your bag?

This post has been superseded by the toolbox page.

Warning: For those who aren’t into sports, try and push through the first paragraph – there’s a useful analogy I’m working towards, I promise!

If you’re a golfer, or if you know someone who is, you’ll be aware that there is a lot of equipment involved. The 14 clubs in a typical golf bag each have a very specific function. Some are for hitting the ball low and far, others high and short, out of sand, off a tee, on the putting green, etc. A common preoccupation among golf players involves checking out what clubs other people have in their bag. Which brand do they use? What grip thickness, shaft flexibility and club head size have they gone for? Do they think they impart more backspin on the ball with their new sand wedge, or their old one?

This phenomenon certainly isn’t restricted to golf players. Cyclists pore over each other’s bikes, while hikers can talk about their camping gear for hours. Likewise, when you attend a workshop or even a social event in the weather and climate sciences, conversation invariably turns to your research setup. What programming language, operating system and reference management software do you use? Are you a Microsoft Word or LaTeX person? What program do you use to edit your graphics? While this might seem like idle chit-chat, I’ve been alerted to some very useful software packages through conversations exactly like this. As such, I wanted to share my research setup:

Operating system: Ubuntu (desktop), OS X (laptop)

Data analysis and visualisation:

  • Simple manipulation of netCDF files (e.g. changing metadata/attributes, renaming variables): NCO
  • Simple data analysis tasks on netCDF files (e.g. temporal and spatial averaging): CDO
  • Quick viewing of netCDF files: UV-CDAT
  • Data analysis: xray, which is built on top of Pandas and NumPy
  • Plotting: Iris / Cartopy (geographic maps) and Seaborn (regular histograms, line graphs, etc), which are built on top of matplotlib and basemap
  • Workflow automation: Unix Shell, Make

Code development: IPython notebook

Source code editing: NEdit

Version control: Git (version control software), GitHub (hosting service)

Simple data processing: OpenOffice Calc

Word processing: OpenOffice Writer for short documents and LaTeX for papers and theses. I used to use the Texmaker editor for writing LaTeX documents, but increasingly I’m using an online editor called Authorea instead because it’s great for collaborative LaTeX editing (see my post about it here).

Reference management: Mendeley

Presentations: Microsoft PowerPoint, which I then make available via Speaker Deck

Graphics editing (including conference posters): Inkscape

Cloud storage/backup: SpiderOak

 

Golf players also tend to have a favourite and least favourite club in their bag, and always have their eye on the latest clubs for sale in the local pro shop. Similarly, with respect to my research setup:

  • I would love to find an alternative to using PowerPoint for presentations. The rest of my setup consists of free and open source software, but I haven’t been able to kick the PowerPoint habit. There are a number of LaTeX packages for creating presentations, as well as OpenOffice Impress and Prezi, so I really have no excuse.
  • I would like to make the move to vi for my source code editing. It allows you to edit code at the command line (which means it doesn’t require a pop-up window) and is available with all Linux distributions, so you can never get caught out when using a new/foreign computer.
  • Like many scientific Python programmers, I’ve also got my eye on Julia. High-level programming languages like Python and MATLAB are great because they’re easy for humans to read (which speeds up the code development process), but the trade-off is that they run much slower than low-level compiled languages like C or Fortran. Julia is a new language that’s the best of both worlds: a high-level (i.e. pretty), compiled (i.e. fast) language.

 

What’s in your bag at the moment, and what have you got your eye on in the pro shop?

April 20, 2014 / Damien Irving

Authorea: the future of scientific writing?

It’s fair to say that LaTeX has gained widespread acceptance as the tool of choice for writing scientific scholarly articles (if you need convincing, see here, here and here). In comparison to a typical “what you see is what you get” (WYSIWYG) editor like Microsoft Word or Apache OpenOffice, the most radical aspect of LaTeX is that you don’t immediately see how your document will be typeset. Documents are instead prepared by writing a plain text input file that includes markup commands to specify the formatting, before invoking the LaTeX program to (typically) generate a final PDF document.

Since the LaTeX software and associated text editors like Texmaker are free to download, most scientists do all their document preparation on their own computer. While this is a perfectly valid workflow, it fails to take advantage of the fact that we now live in a web-enabled and highly interconnected world. As Alberto Pepe noted in his presentation at the I Annotate 2013 conference in San Francisco, today’s scientists are doing 21st century science, writing up using 20th century writing tools (e.g. Microsoft Word, LaTeX), then locking that text away in a 17th century format (i.e. the PDF of a journal paper today has much the same format and accessibility as a scanned copy of a journal article from hundreds of years ago).

In an attempt to bring scientific writing into the 21st century, a number of online LaTeX editors have begun to appear in recent years. An obvious advantage to online editing is that you don’t need to have LaTeX installed on your machine, however since installation is both free and relatively straightforward, this hardly represents a compelling reason for scientists to change the way they write. Instead, it’s the opportunity for collaboration and sharing that has the science community so excited about online LaTeX editing.

Online LaTeX editing is an area that many people are trying to innovate in at the moment (I came across a number of "sorry we’re shutting down because we couldn’t make any money" posts in researching this article), however the two editors that have gained the most traction are ShareLaTeX and writeLaTeX. In a nutshell, these editors are to LaTeX what Google Docs is to WYSIWYG editors like Word and OpenOffice. External collaborators can view and edit the document and there are comment and chat features for discussing changes. They also provide a kind of WYSIWYG functionality, as the PDF output can be generated alongside the text editor as you type. While this is an exciting step forward, the end result for a document on ShareLaTeX and writeLaTeX is still a PDF that locks away the text in that familiar 17th century (not to mention proprietary) format. To fully exploit the advantages of the web, a different model is clearly needed.

Alberto Pepe and his co-founder Nathan Jenkins might just be on the way to establishing that new 21st century model for scientific writing and publishing. At the most basic level, their new website Authorea offers most* of the features that ShareLaTeX and writeLaTeX do in terms of collaborative editing and commenting. In fact, their referencing (just put in a DOI and it figures out the rest), backup (you can link to your GitHub account** instead of just DropBox), PDF / Microsoft Word export (pick a journal and it will format accordingly) and IPython notebook (you can include the code that was used to generate your figures – see here for an example) functionality represents a step forward over their competitors. At a higher level, Authorea provides a way forward from that 17th century publishing format. While it does allow authors to export to PDF (or Word), the most novel thing about Authorea is that it compiles your LaTeX text (or Markdown, which may be the future of scholarly writing – see here) to HTML. Instead of locking your text away in a proprietary format, Authorea makes it available on the web for anyone to view and comment on.

One of the most exciting advances in scientific publishing in the last few years has been the rise of pre-print servers like arXiv. Since there is such a long time lag between when authors submit a manuscript and when the work is finally published, scientists are posting their work (in PDF format) to a pre-print server as soon as it’s done, so that the wider community can read their work while it’s being reviewed. In a recent interview, Nathan revealed that his "crazy long-term dream" is to fundamentally change the way that pre-publishing works. His hope is that people will pre-publish with Authorea instead, whereby their article is available in HTML format and people can comment directly on different sections of the text. This model is essentially a mix between old-school PDF journal articles and new-age blog posts, which is why there’s so much buzz around Authorea at the moment (e.g. see reviews on AppStorm and AppVita).

I recently created an account for my PhD thesis, so why not give Authorea a try for your next journal paper or manuscript?

 
*It should be noted that with writeLaTeX and ShareLaTeX you basically get complete access to all of the markup commands and packages that LaTeX offers. Authorea makes some minor compromises in order to compile to HTML, which are outlined in this handy cheat sheet. In other words, they cater to the needs of the vast majority of LaTeX users, but if you’re in the advanced minority of users who require highly specific figure or output formatting options then writeLaTeX or ShareLaTeX might be a better option.

**ShareLaTeX has also just introduced a way to link with GitHub, however GitHub is the only option. In principle, you can sync Authorea with any external hosting service that is compatible with git (e.g. GitHub, BitBucket, SourceForge, etc).