October 10, 2013 / Damien Irving

Testing your code

The Climate Institute recently published a series of short interviews, where they asked a bunch of climate scientists about what keeps them up at night. As you would expect, most of the answers were about sea level rise, heat waves and other projected changes to the climate system. One respondent, however, jokingly remarked that the possibility of bugs in her code was also a major source of stress. While this comment was made in jest, there is no doubt that one of the biggest fears for any scientist is to have someone find an error in their published computational results.

The highest-profile case of buggy scientific code occurred back in 2007, involving a biologist at the University of California (Miller, 2007). The bug in question did nothing more than inadvertently flip a single column of data, but its identification resulted in the retraction of five published papers, three of which appeared in Science. These things happen in the weather/climate sciences as well, with a recent Nature article (Ince et al., 2012) highlighting a case back in 2009, where a bug was found to be responsible for substantial errors in the widely used HadCRUT surface temperature dataset. In fact, while the climate science community was exonerated of any wrongdoing in the wake of the ‘Climategate’ email hacking scandal, a Nature commentary on scientific programming standards rightly pointed out that the email affair was a warning to all scientists to get their houses in order (Merali, 2010):

“To all scientists out there, ask yourselves what you would do if, tomorrow, some Republican senator trains the spotlight on you and decides to turn you into a political football. Could your code stand up to attack?”

Given the ramifications of publishing erroneous computational results, you’d think scientists would be the most conscientious code testers going around. Unfortunately, this could not be further from the truth. The development process followed by many scientists (myself included until very recently) goes something like this: write a chunk of code, check (in a rather ad hoc fashion) whether the output ‘looks’ the way you might expect, then move on to the next chunk. The lack of any real system to this approach is problematic for a number of reasons:

  1. In six months’ time you won’t be able to remember if (or how thoroughly) you tested the code
  2. Upon altering a particular section of the code, you have no way of quickly checking if all the other dependent sections still work properly
  3. There’s a good chance that you haven’t even come close to testing the full range of possible modes of failure

Since most scientists spend their days meticulously testing and re-testing research hypotheses, it seems a little odd that their code testing would be so lax. Perhaps time pressures have something to do with it, but in the long run code testing actually saves time, so that’s probably not it. Instead, I think the problem arises because most scientists are self-taught programmers. They spend little (if any) time interacting with professional programmers, which means they are simply unaware of the standard testing practices used in the software development industry. This post is an attempt to summarise those practices, as they relate to scientific computing.


To test or not to test?

This section has been adapted from the ‘Software Quality’ chapter of the Software Carpentry Instructors Guide.

It goes without saying that a complex program requires a much higher investment in testing than a simple one. A short script that is only going to be used once, to produce a single figure, probably doesn’t need separate testing: its output is either correct or not. On the other hand, consider a hypothetical script you’re writing to perform spatial interpolation. None of your favourite off-the-shelf packages (e.g. CDO) provides a function for this particular type of interpolation – a bummer, since well-established packages have already been tested for you – so you’re writing it from scratch. You’re also going to be executing this code thousands of times to process the output from the global climate model you’re running. It’s pretty obvious that this code will require thorough testing.

Once the decision to test has been made, it’s important to understand that testing can only do so much. Suppose you are testing a function that compares two 7-digit phone numbers. There are 10⁷ such numbers, which means that there are 10¹⁴ possible test cases. At a million tests per second, it would take over three years to run them all. And that’s only one simple function: exhaustively testing a real program with hundreds or thousands of functions, each taking half a dozen arguments, would take many times longer than the expected lifetime of the universe. And how would you actually write 10¹⁴ tests? More importantly, how would you check that the tests themselves were all correct?
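For the record, the back-of-the-envelope arithmetic here is easy to verify:

```python
# Exhaustively comparing every pair of 7-digit phone numbers:
pairs = 10 ** 14                 # 10^7 numbers, so 10^14 ordered pairs
tests_per_second = 10 ** 6       # an optimistic million tests per second

seconds = pairs / tests_per_second          # 10^8 seconds
years = seconds / (60 * 60 * 24 * 365)      # roughly 3.2 years

print(f"{seconds:.0f} seconds, or about {years:.1f} years")
```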

In reality, all that testing can do is show that there might be a problem in a piece of code. If testing doesn’t find a failure, there could still be bugs lurking that just weren’t picked up. And if testing says there is a problem, it could well be a problem with the test rather than the program.

So why test? Because it’s one of those things that shouldn’t work in theory, but is surprisingly effective in practice. It’s just like mathematics: any theorem proof might contain a flaw that just hasn’t been noticed yet, but somehow we manage to make progress.


Introducing the unit test…

The core of code testing is the unit test, which checks the correctness of a single unit of software (typically a single function or method). There are excellent off-the-shelf unit testing libraries available for pretty much all programming languages, with some languages like Python offering multiple different options (e.g. see the Hitchhiker’s Guide to Python for a run-down). Making sense of all the options can be a little confusing, however the basic premise for most of them goes something like this:

  1. Write a bunch of tests (i.e. small functions) that each culminate in an assertion (a true/false statement)
  2. Store these test functions in a file that is completely separate from the code you are testing (e.g. if you’re testing a particular module, store the tests in a separate file named after that module)
  3. Instruct the unit testing library to execute all the tests in the file. It will then report their success/failure to the screen.
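To make the premise concrete, here’s a minimal sketch in Python – the function and test names are invented for illustration, and I’ve squeezed the code and its tests into one listing so it stands alone (in practice they’d live in separate files, and a library like pytest would discover and run the test functions for you):

```python
# The unit under test -- in practice this would live in its own module.
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5.0 / 9.0


# The tests -- each is a small function culminating in an assertion.
# In practice these would live in a separate test file, and you'd run
# them all at once with something like:  pytest test_conversion.py
def test_freezing_point():
    assert fahrenheit_to_celsius(32.0) == 0.0

def test_boiling_point():
    assert fahrenheit_to_celsius(212.0) == 100.0
```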

You can see that the test file now contains a complete history of the testing you’ve done. If you make a change to the code, or come back six months later and can’t remember if you’ve tested it, you can simply re-run the unit testing library over the test file to check that the code isn’t broken. The former is commonly known as regression testing – the practice of re-running pre-existing tests after changes have been made to the code, in order to make sure it hasn’t regressed.


What should we be testing?

Now that we know how to unit test, we need to consider what to test. In other words, how do we validate and verify a program? Verification basically asks whether our program is free of bugs, while validation considers if we are implementing the right model or building the right thing. The latter obviously depends on the specific science you’re doing, and typically isn’t something a formal unit test would be written for. For verification testing on the other hand, there are a few generic types of unit test:

  1. Tests for success: Does the code give the correct answer?
  2. Tests for failure: Not only should a function succeed when given good input, it should also fail (in the way you would expect it to fail) when given bad input.
  3. Tests for sanity: Often a unit of code will contain a set of reciprocal functions (e.g. where one converts A to B and the other converts B to A). In these cases, it is useful to create a ‘sanity check’ to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering any other sort of bug.
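Here’s a sketch of all three test types in one place, using an invented pair of reciprocal temperature-conversion functions (I’ve used a plain try/except for the failure test rather than any particular library’s helper, so the example stands alone):

```python
import math

def celsius_to_kelvin(temp_c):
    """Convert Celsius to Kelvin, rejecting physically impossible input."""
    if temp_c < -273.15:
        raise ValueError("temperature below absolute zero")
    return temp_c + 273.15

def kelvin_to_celsius(temp_k):
    """The reciprocal conversion."""
    return temp_k - 273.15


def test_success():
    # 1. Test for success: does the code give the correct answer?
    assert celsius_to_kelvin(0.0) == 273.15

def test_failure():
    # 2. Test for failure: bad input should fail in the expected way
    try:
        celsius_to_kelvin(-300.0)
    except ValueError:
        pass  # it failed exactly as expected
    else:
        raise AssertionError("bad input should have raised ValueError")

def test_sanity():
    # 3. Sanity check: converting A to B and back to A shouldn't
    #    lose precision or incur rounding errors
    assert math.isclose(kelvin_to_celsius(celsius_to_kelvin(21.5)), 21.5)
```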

As well as writing tests for all the individual units (or functions) that are the building blocks of your code, it’s also important to write tests for those functions further up the food chain (i.e. the functions that join together all the smaller functions in order to achieve the task at hand). This process of checking if all the units work properly together is called integration testing.

From personal experience, I’ve found that integration tests don’t always naturally culminate in a simple assertion. Instead, they sometimes culminate in a plot that requires visual inspection. This kind of makes sense when you think about it, since a plot is often the end product of an analysis. However, since the checking of a plot can’t be automated like assertion testing, you don’t want to find yourself writing too many tests like this.
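One way to keep an integration test automatable is to assert on the numbers behind the plot rather than on the plot itself. A sketch, with invented stand-in functions for a typical read–process–smooth pipeline:

```python
import statistics

# Three small "units", each of which might have its own unit tests ...
def read_data():
    """Stand-in for reading a temperature time series from file."""
    return [14.2, 14.9, 15.1, 14.7, 15.3]

def to_anomalies(series):
    """Subtract the mean to get anomalies."""
    mean = statistics.mean(series)
    return [value - mean for value in series]

def smooth(series, window=3):
    """Simple running mean (shrinking the window at the edges)."""
    half = window // 2
    return [statistics.mean(series[max(0, i - half):i + half + 1])
            for i in range(len(series))]


# ... and an integration test that runs them all together, checking
# properties of the final numbers rather than eyeballing a plot of them.
def test_pipeline():
    data = read_data()
    result = smooth(to_anomalies(data))
    # The pipeline should preserve the length of the series ...
    assert len(result) == len(data)
    # ... and a running mean can't push values outside the anomaly range
    anomalies = to_anomalies(data)
    assert min(anomalies) <= min(result) and max(result) <= max(anomalies)
```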

So now we know that we should be writing tests for success, failure, sanity, and integration. The next question is, how many tests should we be writing? As previously discussed, for our 7-digit phone number function there were a staggering 10¹⁴ possible tests for success, which is obviously a little impractical. Fortunately, the answer is common sense: of all the possible tests out there, we only want those that are most likely to give us useful information that we don’t already have. We should therefore try to choose tests that are as different from each other as possible, so that we force the code we’re testing to execute in all the different ways it can. Another way of thinking about this is that we should try to find boundary cases. If a function works for zero, one, and a million values, it will probably work for all values in between.
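Sticking with the boundary-case idea, here’s what that looks like for an invented averaging function: rather than thousands of near-identical inputs, we test the empty series, a single value, and a very long series.

```python
def series_mean(values):
    """Mean of a series; a hypothetical unit under test."""
    if not values:
        raise ValueError("cannot average an empty series")
    return sum(values) / len(values)


# Boundary case 1: zero values (should fail in the expected way)
def test_empty():
    try:
        series_mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("empty input should have raised ValueError")

# Boundary case 2: a single value
def test_single_value():
    assert series_mean([3.5]) == 3.5

# Boundary case 3: a million values
def test_large_input():
    assert series_mean([1.0] * 10 ** 6) == 1.0
```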


Code coverage

Besides “unit test,” the next most common phrase you’ll hear programmers throwing around in relation to testing is “code coverage.” Rather confusingly, this can refer to the coverage of the unit tests they’ve written, or the coverage of the application code itself. With respect to the former, the code coverage of a set of tests refers to the percentage of the application code those tests exercise. It is often used as a rough indication of how well tested a piece of software is – if the coverage of a set of tests is less than 100%, then some lines of code aren’t being tested at all. However, even 100% coverage doesn’t guarantee that the code has been completely tested: you can have situations where every line is exercised at least once, yet there are paths through the code that are never taken (i.e. there is a difference between line coverage and path coverage). So while code coverage isn’t sufficient to show that everything has been tested, it’s still useful. Most languages have tools to measure the coverage of your tests – the simplest tool in Python, for instance, is coverage.py.
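The line-versus-path distinction is easy to demonstrate. In the invented function below, two tests are enough to execute every line (100% line coverage), yet the path where both branches fire together is never exercised:

```python
def adjust(value, clip=False, offset=0.0):
    """Hypothetical helper with two independent branches."""
    if clip:
        value = max(value, 0.0)      # branch A
    if offset:
        value = value + offset       # branch B
    return value


# These two tests together execute every line of adjust() ...
assert adjust(-1.0, clip=True) == 0.0      # exercises branch A only
assert adjust(1.0, offset=2.0) == 3.0      # exercises branch B only

# ... but the path through *both* branches (clip=True AND offset set)
# is never taken, so a bug in that combination would slip through
# despite a coverage report of 100%.
```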

The coverage of the actual code itself (as opposed to the coverage of the tests), considers whether all the functions and variables have been used. In other words, coverage of less than 100% suggests that you’ve created an unused function or imported an unused module. As you might have guessed, there are tools to check for this type of code coverage as well. For Python programming I use Pylint, which was discussed in a previous post.


Test driven development (TDD)

TDD refers to the concept of writing your tests before writing the actual application code. This may seem backward, but the process of writing the tests helps to clarify the purpose of the code in your own mind (i.e. it serves as a design aid), and also helps ensure that tests actually get written. Some programmers swear by TDD and follow it very closely, while others are not so enthusiastic. Unfortunately the literature on the topic doesn’t provide much clarity either – a meta-study in 2012 didn’t find TDD to have a significant impact on programmer productivity, however that might just be because we don’t really know how to measure programmer productivity. Whether you’re convinced by the TDD advocates or not, it has enough support to suggest that (a) it’s probably something you should try at least once, and (b) testing, whether it be before, during, or after the fact, should be a fundamental cornerstone of what you’re doing.
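For the curious, the TDD loop is simply: write a failing test, write just enough code to make it pass, then repeat. A compressed sketch, using the standard North American wind chill formula as the thing being built (the function name is invented):

```python
# Step 1: write the test first.  It describes what the (not yet
# written) function should do; run now, it fails with a NameError.
def test_wind_chill_is_below_air_temp():
    assert wind_chill(temp_c=5.0, wind_kmh=30.0) < 5.0


# Step 2: write just enough code to make the test pass.  The formula
# is the standard North American wind chill index.
def wind_chill(temp_c, wind_kmh):
    v = wind_kmh ** 0.16
    return 13.12 + 0.6215 * temp_c - 11.37 * v + 0.3965 * temp_c * v


# Step 3: re-run the test (it now passes), then repeat the loop for
# the next piece of behaviour.
test_wind_chill_is_below_air_temp()
```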



The sad fact is that bugs are part and parcel of programming, much like road accidents are part and parcel of driving a car. Even if you always stick to the speed limit and drive ultra-conservatively, there’s no guarantee that you won’t one day have an accident. Similarly, even if you follow the advice of this post to a tee, there’s no guarantee that you won’t one day publish a computation-related retraction (although hopefully not for an article in Science!). You will, however, greatly reduce the risks. You’ll also find collaboration much easier because let’s face it, nobody likes getting in a car with an erratic, lead-footed driver.



Leave a Comment
  1. Raniere Silva / Oct 24 2013 10:46

    Congratulations for the post. It is the best abstract of testing code I ever read.

  2. David Ketcheson / Nov 24 2013 06:30

    Excellent post!

    I just have a tiny complaint: I wouldn’t compare bugs with auto accidents. Most drivers go a long time before having their first accident (if they ever have one). But every programmer starts off by writing many bugs. And continues to do so frequently throughout his/her career.

    • Damien Irving / Dec 19 2013 08:12

      Good point – I probably need a better analogy… any ideas?

  3. Damien Irving / Dec 19 2013 08:13

    A nice summary of the practicalities of doing unit testing in Python can be found here:

  4. Damien Irving / Sep 22 2016 14:46

    Katy Huff has written an entire Software Carpentry lesson on testing and continuous integration with Python:


