April 6, 2015 / Damien Irving

Workflow automation

In previous posts (e.g. What’s in your bag?) I’ve discussed the various tools I use for data analysis. I use NCO for making simple edits to the attributes of netCDF files, CDO for routine calculations on netCDF files, and a whole range of Python libraries for more complicated analysis and visualisation. In years gone by I’ve also included NCL and Fortran in the mix. Such diversity is pretty common (i.e. almost nobody uses a single programming language or tool for all their analysis), so this post is my attempt at an overview of workflow automation. In other words, how should you go about tying together the various tools you use to produce a coherent, repeatable data analysis pipeline?

The first thing to note is that the community has not converged on a single best method for workflow automation. Instead, there appear to be three broad options, depending on the complexity of your workflow and the details of the tools involved:

  1. Write programs that act like any other command line tool and then combine them with a shell script or build manager
  2. Use an off-the-shelf workflow management system
  3. Write down the processing steps in a lab notebook and re-execute them manually

Let’s consider these approaches one by one:

 

1. Command line

Although its commands are infuriatingly terse and cryptic, the Unix shell has been around longer than most of its users have been alive. It has survived so long because of the ease with which (a) repetitive tasks can be automated and (b) existing programs can be combined in new ways. Given that NCO and CDO are command line tools (i.e. you’re probably going to be using the command line anyway), it could be argued that the command line is the most natural home for workflow automation in the weather and climate sciences. For instance, in order to integrate my Python scripts with the rest of my workflow, I use the argparse library to make those scripts act like any other command line program. They can be executed from the command line, ingest arguments and options from the command line, be combined with other command line programs via pipes and filters, and output help information just like any other command line utility.
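As a rough illustration, here’s a minimal sketch of what a command line-native script can look like (the overall structure is the important part; the argument names and file handling are hypothetical, not lifted from my actual calc_streamfunction.py):

    # calc_streamfunction.py (illustrative sketch only)
    import argparse

    def main(inargs):
        """Calculate the streamfunction and write it to a netCDF file."""
        # The real calculation would go here: read inargs.uwind_file and
        # inargs.vwind_file, compute the streamfunction, write inargs.outfile
        print('Would write the streamfunction to ' + inargs.outfile)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser(
            description='Calculate the streamfunction from the zonal and meridional wind')
        parser.add_argument('uwind_file', type=str, help='Input zonal wind netCDF file')
        parser.add_argument('vwind_file', type=str, help='Input meridional wind netCDF file')
        parser.add_argument('outfile', type=str, help='Output netCDF file')
        main(parser.parse_args())

Because argparse handles the argument parsing, running python calc_streamfunction.py -h prints the usage and help information automatically, and the input and output files are supplied at the command line just as they would be for NCO or CDO.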

Armed with a collection of command line-native Python scripts, I find the easiest way to link multiple processing steps is to store them in a shell script. For instance, I could execute the following hypothetical workflow by storing all the steps (i.e. the command line entries) in a shell script called run-streamfunction.sh:

  1. Edit the “units” attribute of the original zonal and meridional wind data netCDF files (NCO)
  2. Calculate the streamfunction from the zonal and meridional wind (calc_streamfunction.py)
  3. Calculate the streamfunction anomaly by subtracting the climatological mean at each timestep (CDO)
  4. Apply a 30 day running mean to the streamfunction anomaly data (CDO)
  5. Plot the average streamfunction anomaly for a time period of interest (plot_streamfunction.py)

This would be a perfectly valid approach if I were dealing with a small dataset, but let’s say I wanted to process 6-hourly data from the JRA-55 reanalysis over the period 1958-2014 for the entire globe. In that situation the calc_streamfunction.py script I wrote would take days to run on my department’s server, so I’d rather not execute every single step in run-streamfunction.sh every time I change the time period used for the final plot. What I need is a build manager: a smarter version of run-streamfunction.sh that can figure out whether previous steps have already been executed and whether they need to be updated.

The most widely used build manager on Unix and its derivatives is called Make. Like the Unix shell it is old, cryptic and idiosyncratic, but it’s also fast, free and well-documented, which means it has stood the test of time. I started using Make to manage my workflows about a year ago and it has revolutionised the way I work. I like it because of the documentation and the fact that it’s available no matter what machine I’m on; however, there are other options (e.g. doit, makeflow, snakemake, ruffus) if you’d like something a little less cryptic.
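To give a flavour of how a build manager thinks, here’s a small doit sketch (the file names and commands are hypothetical). Each task lists the files it depends on and the files it produces, and doit only re-runs a task if its target is missing or its dependencies have changed:

    # dodo.py (illustrative sketch only; run by typing "doit" in this directory)

    def task_calc_streamfunction():
        """Expensive step: calculate the streamfunction from the winds."""
        return {
            'file_dep': ['ua.nc', 'va.nc', 'calc_streamfunction.py'],
            'targets': ['sf.nc'],
            'actions': ['python calc_streamfunction.py ua.nc va.nc sf.nc'],
        }

    def task_plot_streamfunction():
        """Cheap step: plot the average anomaly for a time period of interest."""
        return {
            'file_dep': ['sf.nc', 'plot_streamfunction.py'],
            'targets': ['sf_plot.png'],
            'actions': ['python plot_streamfunction.py sf.nc sf_plot.png'],
        }

With this in place, editing plot_streamfunction.py and re-running doit would regenerate the plot but leave the expensive streamfunction calculation alone, because none of its dependencies have changed. A Makefile expresses exactly the same targets-and-dependencies idea, just in its own (more cryptic) syntax.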

To learn how to apply the command line approach to your own workflow automation, check out the Software Carpentry lessons on the Unix shell, Make and data management.

 

2. Workflow management systems

The command line discussion above suggests the use of shell scripts for automating small, simple data processing pipelines, and build managers like Make and doit for pipelines that are either slightly more complicated or have steps that you’d rather not repeat unnecessarily (e.g. steps that take many hours/days to run). For many weather and climate scientists (myself included), this is as far as you’ll need to go. Make and doit have all the functionality you’ll ever really need for automatically executing a data analysis pipeline, and by following the process laid out in the data management lesson linked to above you’ll be able to document that pipeline (i.e. produce a record of the provenance of your data).

But what if you’re working on a project that is too big and complicated for a simple Makefile or two? The management of uber-complex workflows such as those associated with running coupled climate models or processing the whole CMIP5 data archive can benefit greatly from specialised workflow management systems like VisTrails, pyRDM, Sumatra or Pegasus. These systems can do things like manage resource allocation for parallel computing tasks, execute steps that aren’t run from the command line, automatically publish data to a repository like Figshare and produce nice flowcharts and web interfaces to visualise the entire workflow.

I’ve never used one of these systems, so I’d love to hear from anyone who has. In particular, I’m curious to know whether such tools could be used for smaller/simpler workflows, or whether the overhead associated with setting up and learning the system cancels out any benefit over simpler options like Make and doit.

 

3. The semi-manual approach

While writing command line programs is a relatively simple and natural thing to do in Python, it’s not the type of workflow that is typically adopted by users of more self-contained environments like MATLAB and IDL. From my limited experience/exposure, it appears that users of these environments tend not to link the data processing that happens within MATLAB or IDL with processing that happens outside of it. For instance, they might pre-process their data using NCO or CDO at the command line, before feeding the resulting data files into MATLAB or IDL to perform additional data processing and visualisation. This break in the workflow implies that some manual intervention is required to check whether previous processing steps need to be executed and to initiate the next step in the process (i.e. something that Make or doit would do automatically). Manual checking and initiation is not particularly problematic for smaller pipelines, but can be error prone (and time consuming) as workflows get larger and more complex.

Since I’m by no means an expert in MATLAB or IDL, I’d love to hear how regular users of those tools manage their workflows.


3 Comments

  1. Jason / Oct 14 2015 06:23

    There is so much more for me to learn!
    I’ve tried to implement some sort of command line approach using the argparse module in Python. I wanted to structure my functional Python scripts so that they could be chained at the command line through pipes, with the final plotting script generating a plot at the end of the chain. I didn’t figure out a way to pipe a numpy array, so the workaround was to save the intermediate result to a tmp location and pipe the file path to the next script. Probably not the best way, but it worked.

    However, I stopped going further along that route after creating a few toy scripts which do basically what NCO and CDO are designed to do. The problem I found is that it is more difficult to debug these chains, particularly when more complicated analyses are involved. I feel more secure when everything is done in Python and I can break at any point and print something to the screen to make sure nothing crazy is going on, whereas chaining things up in a shell script feels like firing a missile.

    • Darren / Oct 27 2015 08:01

      The philosophy behind command line pipelines is that each piece of the pipeline is tested (and is easily testable) individually and has clearly defined input/output characteristics, so when it is used in a pipeline it is easy to reason about and trust that the full pipeline is correct. It’s kind of like setting a breakpoint and dumping to the screen in Python, where each “breakpoint” is the end of one of your pipeline utilities and the “dump to screen” is the default behaviour when you run that program by itself.

Trackbacks

  1. The CWSLab workflow tool: an experiment in community code development | Dr Climate
