Python and Software Carpentry: A 3-year journey
I was recently involved in hosting a Software Carpentry boot camp for a group of 70 weather/climate scientists. For the benefit of those who’ve never been, a boot camp is basically a two-day crash course in the core skills needed to be productive in a small research team: basic programming skills, version control, testing, and relational databases. The feedback I received immediately after the event suggested that everyone left on Friday afternoon with an understanding of, and real enthusiasm for, the practices and tools that will improve their scientific computing (the feedback also suggested that the coffee was terrible and the wifi sucked, but let’s not dwell on the negatives…).
While this feedback was very encouraging, I think we’ve all been to workshops where we’ve learned something cool and gotten all enthusiastic, only to return to our desk on Monday morning and simply continue doing the same things we’ve always done. To combat this problem, we had hoped to finish the workshop with a presentation from an actual weather/climate scientist. We figured that people would be more inclined to change the way they do things if one of their peers (me, in this case) demonstrated how they have incorporated the lessons of the boot camp into their daily work.
Unfortunately, the second-last session of the boot camp ran overtime, and I didn’t get a chance to give my presentation. What I had planned to present was a summary of my journey over the past few years, in learning Python and incorporating the principles of the Software Carpentry course into my daily work. In essence, that journey can be broken down into 4 key development stages:
Stage 1: Getting started
It was a work colleague who introduced me to Python a little over 3 years ago. Importantly, he introduced me to the Climate Data Analysis Tools (CDAT) package, which basically contains all the Python modules that you could ever need as a weather/climate scientist. Functions for reading/writing netCDF files, statistical analysis, data visualisation, etc, etc… the latest CDAT installation has it all. Everything you need to know about getting started with Python (and CDAT) in the weather/climate sciences can be found at this previous post.
Stage 2: Parsing the command line
Once you’ve successfully written a script, it’s very common to want to re-run that script, with just a slight variation. For instance, you might want to perform the same task but with a different input file, x-axis label or parameter in your statistical model. When I first started programming, I naively achieved these minor variations by editing my script directly. I soon realised, however, that it’s more desirable to simply ask the user to specify the input file or x-axis label (or any other option you care to dream up) at the command line, so that the code doesn’t need to be modified. This is known as ‘parsing’ the command line.
In Python, this can be achieved using the optparse module or the newer argparse module. An example of optparse in action is shown below – for more examples, feel free to browse the source code at my Bitbucket page.
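Here’s a minimal sketch of the idea (the option names here are illustrative, not taken from my actual scripts):

```python
from optparse import OptionParser

# Define two illustrative options: an x-axis label and an output file name.
# optparse derives the attribute name (e.g. options.xlabel) from the long option.
parser = OptionParser(usage="usage: %prog [options] infile")
parser.add_option("-x", "--xlabel", default="Time",
                  help="label for the x-axis [default: %default]")
parser.add_option("-o", "--outfile", default="plot.png",
                  help="name of the output file [default: %default]")

# Parse an example command line: myscript.py -x Year input.nc
(options, args) = parser.parse_args(["-x", "Year", "input.nc"])
print(options.xlabel)   # the user-supplied label ("Year")
print(args)             # leftover positional arguments, here the input file
```

In a real script you would call `parser.parse_args()` with no arguments, so that it reads the options straight from the command line; running the script with `-h` then prints an automatically generated help message for free.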
Stage 3: Writing your own modules
One of the golden rules of programming, as emphasised at any Software Carpentry boot camp, is that you want to avoid code duplication. Any time a similar or identical section of code is used more than once, the chance of bugs skyrockets. For instance, I recently realised that many of my scripts begin with a similar section of code, where I read in, and perform a simple manipulation of, data from a netCDF file (e.g. I might extract a time period or spatial region of interest, or calculate the climatology or seasonal mean). Each script also usually ends with similar code, whereby an output netCDF file is written.
This duplication wasn’t a problem to begin with; however, I began to run into problems as my knowledge of the cdms2 CDAT module improved (i.e. the module with functions for dealing with netCDF files). I would frequently learn about a cool new cdms2 function, update the script I was working on accordingly, but forget to update all my existing scripts with the same new function. Over time, a massive disparity began to emerge between the quality of code I had written recently and code I hadn’t looked at for months.
To get around this problem, I wrote my own netCDF input/output module (search “netcdf_io.py” at my Bitbucket page). I now simply “import netcdf_io” into all my scripts, and I never have to worry about code duplication again. So if you find yourself cutting and pasting a lot of code, it’s time to think about writing a module or two!
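To give a flavour of what this looks like in practice, here’s a hypothetical stand-in for one of the helpers that might live in such a module (the real netcdf_io.py wraps cdms2 calls; a plain-Python placeholder is shown here for brevity):

```python
# A stand-in for a shared helper that would live in a file like netcdf_io.py.
# The function name and signature are illustrative, not the actual module API.

def select_period(times, values, start, end):
    """Keep only the (time, value) pairs falling within [start, end]."""
    return [(t, v) for t, v in zip(times, values) if start <= t <= end]

# Every script can then "import netcdf_io" and reuse this one implementation,
# instead of cutting and pasting the same logic at the top of each file.
subset = select_period([1979, 1980, 1981, 1982], [2.1, 2.4, 2.2, 2.6], 1980, 1981)
```

The payoff is that when you learn a better way to do something, you fix it in one place and every script picks up the improvement the next time it runs.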
Stage 4: Full disclosure
I recently wrote an article for the Bulletin of the Australian Meteorological and Oceanographic Society (Irving, 2012), where I discussed the fact that computational science is moving rapidly towards an era of increased transparency. The journal Nature, for example, will soon require authors to submit their source code along with their manuscript. In order to try and keep pace with (or even get ahead of) this transparency revolution, I’m trying to make my science as open as possible. For starters, all my source code is under version control and publicly available on Bitbucket. For all research papers that I write during my PhD, I also intend to create a RunMyCode.org companion page where all my data and source code will be posted (see a detailed post on this topic here).
I would eventually like to get to the point where I could contribute code to projects like CDAT, just like Andrew Dawson (climate science postdoc at the University of Oxford) has done with his eof2 and windspharm modules. In a way I see this as the ultimate test of a Software Carpentry boot camp graduate, as it requires all the skills and knowledge taught over the two days.