August 28, 2014 / Damien Irving

Speeding up your code

In today’s modern world of big data and high resolution numerical models, it’s pretty easy to write a data analysis script that would take days/weeks (or even longer) to run on your personal (or departmental) computer. With buzz words like high performance computing, cloud computing, vectorisation, supercomputing and parallel programming floating around, what’s not so easy is figuring out the best course of action for speeding up that code. This post is my attempt to make sense of all the options…

 

Step 1: Vectorisation

The first thing to do with any slow script is to use a profiling tool to locate exactly which parts of the code are taking so long to run. Most programming languages have profilers, and they’re normally pretty simple to use. If your code is written in a high-level language like Python, R or Matlab, then the bottleneck is most likely a loop of some sort (e.g. a “for” or “while” loop). These languages are designed so that the associated code is relatively concise and easy for humans to read (which speeds up the code development process), but the trade-off is that they’re relatively slow for computers to run, especially when it comes to looping.
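
In Python, for instance, the standard library ships with a profiler called cProfile that reports how much time is spent in each function call. Here’s a minimal sketch, where analysis() is just a stand-in for your own slow code:

```python
import cProfile
import pstats

def analysis():
    """Stand-in for the slow analysis code you want to investigate."""
    total = 0.0
    for i in range(1000000):
        total += i ** 0.5
    return total

# Run the function under the profiler and save the timings to a file
cProfile.run('analysis()', 'profile.stats')

# Print the ten most expensive calls, sorted by cumulative time
pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(10)
```

You can also profile a whole script from the command line with `python -m cProfile -s cumulative myscript.py`.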

If a slow loop is at fault, then your first course of action should be to see if that loop can be vectorised. For instance, let’s say you’ve got some temperature data on a time/latitude/longitude grid, and you want to convert all the values from Kelvin to Celsius. You could loop through the three-dimensional grid and subtract 273.15 at each point; however, if your data came from a high resolution global climate model then this could take a while. Instead, you should take advantage of the fact that your high-level programming language almost certainly supports vectorised operations (also known as array programming). This means there will be a way to apply the same operation to an entire array of data at once, without looping through each element one by one. In Python, the NumPy library supports array operations. Under the hood NumPy is largely written in a low-level language called C, but as a user you never see or interact with that C code (which is lucky, because low-level code is terse and not so human friendly). You simply get the speed of C (and the years of work that have gone into optimising its array operations) with the convenience of coding in Python.
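
A minimal sketch of the Kelvin-to-Celsius example in NumPy (the array size here is just a small stand-in for real model output):

```python
import numpy as np

# Fake temperature data in Kelvin on a (time, latitude, longitude) grid
temperature_k = 273.15 + 30 * np.random.random((10, 180, 360))

# Slow: loop over every element one by one
temperature_c = np.empty_like(temperature_k)
for t in range(temperature_k.shape[0]):
    for y in range(temperature_k.shape[1]):
        for x in range(temperature_k.shape[2]):
            temperature_c[t, y, x] = temperature_k[t, y, x] - 273.15

# Fast: apply the same operation to the entire array at once
temperature_c = temperature_k - 273.15
```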

If you get creative – and there are lots of useful NumPy routines (and their equivalents in other languages) out there to help with this – almost any loop can be vectorised. If for some reason your loop isn’t amenable to vectorisation (e.g. you might be making quite complicated decisions at each grid point using lots of “if” statements), then another option would be to re-write that section of code in a low-level language like C or Fortran. Most high-level languages allow you to interact with functions written in C or Fortran, so you can then incorporate that function into your code.
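
To give a flavour of what that creativity can look like, routines like numpy.where can often replace element-wise “if” statements. A small sketch (the fake data and the 30-degree threshold are just made up for illustration):

```python
import numpy as np

# Fake daily maximum temperatures in Celsius
temperature_c = 50 * np.random.random((10, 180, 360)) - 5

# Loop version: flag "hot" grid points one element at a time
hot = np.zeros(temperature_c.shape)
for index, value in np.ndenumerate(temperature_c):
    if value > 30.0:
        hot[index] = 1

# Vectorised version: the whole conditional decision in one call
hot = np.where(temperature_c > 30.0, 1, 0)
```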

 

Step 2: Exploit the full capabilities of your hardware

Ok, so let’s say you’ve vectorised all the loops that you can, you’ve re-written any other slow loops in a low-level language (or perhaps you can’t be bothered learning a low-level language and have skipped that option, which is totally understandable), and your code is still too slow. The next thing to try is parallel programming.

One of the motivations for parallel programming has been the diminishing gains in single-core performance with each new generation of Central Processing Unit (CPU). In response, computer makers have introduced multi-core processors, which contain more than one processing core. Most desktop computers, laptops, and even tablets and smart phones have two or more CPU cores these days (devices are usually advertised as “dual-core” or “quad-core”). In addition to multi-core CPUs, Graphics Processing Units (GPUs) have also become much more powerful in recent years, and are increasingly being used not just for drawing graphics to the screen but for general purpose computation.

By default, your Python, R or Matlab code will run on a single CPU. The level of complexity involved in farming that task out to multiple processors (i.e. multiple CPUs and perhaps even a GPU) depends on the nature of the code itself. If the code is “embarrassingly parallel” (yep, that’s the term they use) then the process is usually no more complicated than renaming your loop. In Matlab, for instance, you simply click a little icon to launch some “workers” (i.e. multiple CPUs) and then change the word “for” in your code to “parfor.” Simple as that.

A problem is embarrassingly parallel so long as there exists no dependency (or need for communication) between the parallel tasks. For instance, let’s say you’ve got a loop that calculates a particular statistical quantity for an array of temperature data from a European climate model, then an American model, Japanese model and finally an Australian model. You could easily farm that task out to the four CPUs on your quad-core laptop, because there is no dependency between each task – the calculation of the statistic for one model doesn’t require any information from the same calculation performed for the other models.
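
In Python, the rough equivalent of “parfor” for this kind of problem is the multiprocessing module in the standard library. A minimal sketch, where calc_statistic() is a hypothetical stand-in for whatever you’re actually calculating:

```python
import multiprocessing

import numpy as np

def calc_statistic(model_name):
    """Calculate the statistic of interest for one model.

    Stand-in implementation: in reality you'd read that model's
    temperature data from file instead of generating random numbers.
    """
    temperature = np.random.random((365, 180, 360))
    return model_name, float(temperature.mean())

if __name__ == '__main__':
    models = ['European', 'American', 'Japanese', 'Australian']
    # Farm one model out to each of the four cores
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(calc_statistic, models)
    print(results)
```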

 

Step 3: Consider more and/or better hardware

Ok, so you’ve vectorised your code, re-written other slow loops in a low-level language (or not), and farmed off any embarrassingly parallel parts of the code to multiple processors on your machine. If your code is still too slow, then you’ve essentially reached the limit of your hardware (whether that be your personal laptop/desktop or the server in your local department/organisation) and you’ll need to consider running your code elsewhere. As a researcher your choices for elsewhere are either a supercomputing facility or cloud computing service, and for both there will probably be some process (and perhaps a fee) involved in applying for time/space.

In the case of cloud computing, you’re basically given access to lots of remote computers (in fact, you’ll probably have no idea where those computers are physically located) that are connected via a network (i.e. the internet). In many cases these computers are no better or more advanced than your personal laptop, however instead of being limited to one quad-core machine (for instance) you can now have lots of them. It’s not hard to imagine that this can be very useful for embarrassingly parallel problems like the one described earlier. There are about 50 climate models in the CMIP5 archive (i.e. many more than just a single European, American, Japanese and Australian model), but they could all be analysed at once with access to a dozen quad-core machines. There are tools like the MATLAB Distributed Computing Server to help deal with the complexities associated with running code across multiple machines at once (i.e. a cluster), so it’s really not much more difficult than using multiple cores on your own laptop.

The one thing we haven’t considered so far is parallel computing for non-embarrassing problems, like running a complex global climate model (as opposed to analysing its output). The calculation of the surface temperature at each latitude/longitude point as time progresses, for instance, depends on the temperature, humidity and sunshine on the days prior, and also on those same quantities at the surrounding grid points. These are known as distributed computing problems, because each process happening in parallel needs to communicate/share information with the other processes happening at the same time. Cloud computing isn’t great in this case, because the processors involved aren’t typically very close to one another and communication across the network is relatively slow. In a supercomputer, on the other hand, all the processors are very close to one another and communication is very fast. There’s a real science behind distributed computing, and it typically requires a total re-think of the way your code and problem are structured/framed (i.e. you can’t just replace “for” with “parfor”). To cut a long story short, if your problem isn’t embarrassingly parallel and distributed computing is the answer, there won’t be a simple tool to help you out. You’re going to need professional assistance.
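
For what it’s worth, the standard tool for this kind of inter-process communication on supercomputers is MPI, which Python can access via the mpi4py package. The toy sketch below only hints at the neighbour-to-neighbour communication involved; decomposing a real model is far more involved, which is exactly why professional assistance is the way to go:

```python
# Toy halo-exchange sketch; run with something like: mpirun -n 4 python sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process owns one latitude strip of a (fake) global temperature grid
local_strip = np.random.random((45, 360)) + rank

# To update the points on its boundary, each process needs the edge row
# owned by its neighbour, so that row has to be communicated explicitly
if rank > 0:
    comm.send(local_strip[0, :], dest=rank - 1, tag=11)
if rank < size - 1:
    neighbour_edge = comm.recv(source=rank + 1, tag=11)
    # ... the update at the strip boundary would use neighbour_edge here ...
```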

(Note: This isn’t to say that supercomputers aren’t good for embarrassingly parallel problems too. A supercomputer has thousands and thousands of cores – more than enough to solve most problems fast. It’s just that in theory you have access to more cores in cloud computing, because you can just keep adding more machines. If you’re dealing with the volumes of data that Amazon or Google do, then this is an important distinction between cloud and super-computing.)

 

Concluding remarks

If you’re a typical weather/climate scientist, then well vectorised code written in a high-level language will run fast enough for pretty much any research problem you’d ever want to tackle. In other words, for most use cases speed depends most critically on how the code is written, not what language it’s written in or how/where it’s run. However, if you do find yourself tackling more complex problems, it’s important to be aware of the options available for speeding up your code and the level of complexity involved. Farming out an embarrassingly parallel problem to the four CPUs on your local machine is probably worth the small amount of time and effort involved in setting it up, whereas applying for access to cloud computing before you’ve exhausted the options of vectorisation and local parallel computing would probably not be a wise investment of your time, particularly if the speed increases aren’t going to be significant.


4 Comments

  1. Damien Irving / Aug 28 2014 12:13

    Additional note: In focusing on speed, one issue I haven’t considered in this post is that of memory. In many cases people analysing model data are limited by the fact that their input data array (e.g. 4D [time/latitude/longitude/level] over the entire globe) is too large for their computer to hold in memory. If your code runs fast enough, then the best solution is to simply break up the task into smaller temporal or spatial chunks. Supercomputing is also a useful solution in this case, because you’ll then be working on a machine with more RAM (you may also be able to get access to a machine with more RAM via cloud computing).
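
    A small sketch of that chunking idea using the netCDF4 library (the file name, variable name and chunk size are all hypothetical):

    ```python
    import numpy as np
    from netCDF4 import Dataset

    dataset = Dataset('temperature_data.nc')  # hypothetical input file
    temperature = dataset.variables['ta']     # hypothetical variable name
    ntimes = temperature.shape[0]
    chunk_size = 100                          # time steps held in memory at once

    # Accumulate a global mean without ever loading the full array
    total, count = 0.0, 0
    for start in range(0, ntimes, chunk_size):
        chunk = temperature[start:start + chunk_size, ...]  # only this slice is read
        total += chunk.sum()
        count += chunk.size
    print('Global mean temperature:', total / count)
    ```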

  2. hypergeometric / Aug 28 2014 13:24

    Nice post. However, the optimizations mentioned will produce execution times which are constant multiples of what they were previously. That can be significant.

    However, much higher reductions in computation time can be had by reformulating the problem so it is of a lower order of computational complexity, and this demands rethinking the problem at the level of the problem.

    Even more so, for those of us who do Bayesian hierarchical models, the time to completion of a run depends upon when the Gibbs sampler or MCMC engine converges, and that depends on how well-suited the model is for the data and process being studied. Nothing serves as well as a model innovation which speeds convergence. That said, it is important not to overly compromise. If the facts of the problem say a certain model is appropriate, the analyst needs to demand the model retains certain features which logic demands, however expensive computations are. Nevertheless, if a dataset is so computationally incompatible with a model, and the model is known to be physically realistic, that raises questions regarding the quality of the data, and perhaps those need to be either addressed or modeled as well as the physical or sampling model.

  3. Drew Whitehouse / Sep 9 2014 11:57

    Great article Damien. It’s probably worth pointing out that the term “vectorisation” in the context of supercomputing is a slightly different concept from what you’ve described here. It involves structuring code so that it is amenable to vectorisation on specialised hardware, e.g. SSE instructions on x86 processors, though more traditionally it was for the vector units on supercomputers. See http://en.wikipedia.org/wiki/Vector_processor

  4. hypergeometric / Sep 10 2014 02:39

    Reblogged this on Hypergeometric and commented:
    Some of the actual as opposed to imagined needs for concurrent computing.
