April 16, 2013 / Damien Irving

Backing up your work

One of the servers in my department crashed and subsequently died last week. As I didn’t personally use that server for any of my work, I first became aware of the crash via a generic department-wide email. My initial reaction to the news was fairly low key. I figured there would be some short term inconvenience for users of the server as they re-downloaded and re-processed their lost data files, but no harm done in the long run. I mean, everyone backs up their source code, right?

Wrong. At morning tea, horror stories began to emerge. Some users hadn’t backed up their code for months, while others didn’t have a single backup copy. I sat there drinking my coffee, dumbfounded. My backup habits border on obsessive-compulsive. Programming is such a tedious and time-consuming task that I can’t stand the thought of repeating even a single minute of it. I always back up at the end of the day, often back up before going to lunch, and sometimes even do a quick backup before taking a toilet break! Many people offered the excuse that they had incorrectly assumed that the IT staff were conducting some kind of regular backup. Even if this were true (as it is in many workplaces), relying on the IT guys to do your backups is kind of like giving your passport to your Mum for safekeeping. As a responsible adult (or research scientist) you should really take personal responsibility for looking after critically important documents (or code).

I suspect that one reason why people don’t regularly back up their work is that they aren’t aware of how easy it is these days. While everyone’s backup needs will be slightly different, here’s my take on backing up code, data and other general files.

 
1. Code

I’ve spoken previously about the fundamental importance of having all your code under version control. I won’t go into the details again here, but version control systems like svn or git basically allow you to keep a complete revision history of your code, so that you can retrieve any previous version at any time. In the context of avoiding disaster if/when your local server crashes, the most important thing is to have a copy of your svn or git code repository stored somewhere other than that server. That could be as simple as a USB drive that you keep on your key chain, but there are also many free online hosting services out there (e.g. Bitbucket, GitHub); see my previous post for details. Unlike your USB drive, these online services make it really easy to collaborate with others and share code, with additional features like wiki pages and issue/bug tracking systems.

The reason I can be so obsessive-compulsive about my backups is that version control systems like svn and git make it ridiculously quick and easy. For example, to back up my work at the end of the day, I simply type three commands: ‘git add’ and ‘git commit’ to commit my changes to the repository, then ‘git push’ to sync those changes with my externally hosted Bitbucket repository. If I arrived at work the next day to find that the local server had died, I’d simply type ‘git clone’ to download a fresh copy of my code repository from Bitbucket (‘git pull’ only updates a copy you already have). Problem solved.
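For anyone who would rather script the end-of-day routine than type it, here is a minimal Python sketch of the add/commit/push sequence. The function names `backup_commands` and `run_backup` are made up for illustration, and the `--all` flag (stage every change in the working tree) is an assumption about what you want staged. It also assumes the current branch already has an upstream branch configured on the hosting service, so that a bare ‘git push’ knows where to go.

```python
import subprocess
from pathlib import Path


def backup_commands(message):
    """Return the add/commit/push commands, in the order they should run.

    Assumption: "--all" stages every change in the working tree.
    """
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push"],  # assumes an upstream branch is already configured
    ]


def run_backup(repo_dir, message):
    """Run the three commands inside repo_dir, stopping at the first failure.

    check=True raises subprocess.CalledProcessError on a non-zero exit,
    which includes the case where there is nothing new to commit.
    """
    for command in backup_commands(message):
        subprocess.run(command, cwd=Path(repo_dir), check=True)
```

Calling `run_backup("/path/to/repo", "End of day backup")` (the path is a placeholder) would then do the same thing as typing the three commands by hand.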

In addition to source code, it’s also ideal to have an externally backed up revision history of manuscripts that you’re working on (e.g. a thesis or journal paper). There’s nothing worse than editing/deleting a paragraph in the editing process, only to realise a week later that you actually liked it better the way it was. Since version control systems like svn and git were originally designed to store code, they are primarily set up to handle text files. This is great news if you use LaTeX for your word processing, but what about if you wanted to track a Microsoft Word/Excel/PowerPoint or OpenOffice document? It turns out that in most cases this can be done (just Google it to find out how) – you normally just need to adjust some settings.

All other files that you deal with were probably either created from the source code / manuscripts that you’ve got under version control, downloaded from an external source, or didn’t take very long to create (excluding original data files, which are discussed in Section 3). As such, you don’t really need a full revision history of these files. In fact, strictly speaking it isn’t critical to back them up at all. However, it would be a pain to lose them, so a simple backup of the latest version of these files would be nice…

 
2. Other files

For backing up the latest version of important files, you could simply copy them across to that USB drive on your key chain. Alternatively (or additionally), what many people do is use an online cloud storage service. These services are often set up to make tasks like sharing files between multiple people, or syncing files across multiple personal devices (e.g. your laptop, desktop and iPad), really easy. There are many of these services out there, so it’s probably best to do a Google search to find the one that best suits your needs. Dropbox is probably the best known, while I personally use one called SpiderOak. You can normally get a few gigabytes of free storage upon signing up, and through random extra storage offers posted on Twitter and Facebook, I’ve managed to get my free storage up to 9 GB. This is more than enough space for all my images, PDF and Microsoft Word/Excel/PowerPoint files, but it certainly isn’t enough to store a complete dataset or two…
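If you go the USB drive route, the “copy the latest version across” step is easy to automate. The following is only a sketch: the function name `copy_latest` and the default file patterns are assumptions chosen for illustration, and it assumes the backup folder sits outside the folder being backed up.

```python
import shutil
from pathlib import Path


def copy_latest(source_dir, backup_dir, patterns=("*.pdf", "*.docx", "*.png")):
    """Copy files matching patterns into backup_dir, keeping relative paths.

    A file is copied only if the backup copy is missing or older than the
    source. Returns the relative paths of the files that were copied.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    copied = []
    for pattern in patterns:
        for src in sorted(source_dir.rglob(pattern)):
            if not src.is_file():
                continue
            rel = src.relative_to(source_dir)
            dest = backup_dir / rel
            if dest.exists() and dest.stat().st_mtime >= src.stat().st_mtime:
                continue  # the backup is already up to date
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also copies the modification time
            copied.append(rel)
    return copied
```

Note that this keeps only the most recent copy of each file, which is the point of this section: anything that needs a full revision history belongs under version control instead.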

 
3. Data

In the weather/climate sciences, the starting point for any analysis is typically a collection of data files. If you’re running an atmospheric model, then these files might contain the sea surface temperature data required at the lower boundary of the model. If you’re studying the variability of the climate system, they might represent the output of a climate model or observations taken from a weather station. Whatever the case may be, these “starting point files” are brand new out of the package – you haven’t manipulated the values in them at all.

If your starting point files can be easily downloaded (e.g. reanalysis or CMIP5 data), then it’s probably not critical that you have a personal backup. Since you can simply download another copy if your computer crashes, all you really need is a record of your previous downloads (i.e. what version of the dataset you downloaded and from where). If, on the other hand, you created your starting point files from scratch (e.g. you took weather observations and stored them in a netCDF file), then a backup is absolutely critical. If the files are small, then you can probably use the same backup approach as for other general files (see Section 2). Otherwise, you might need to either speak to IT support about getting some extra storage space on a disk that is backed up, or purchase an external hard drive.
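One lightweight way to keep that record of previous downloads is a small log file with one entry per downloaded file. This is just a sketch: the function names, the field names and the JSON-lines layout are all assumptions made for illustration, and the SHA-256 checksum is an extra that the approach above doesn’t strictly require (it lets you confirm that a re-downloaded file is identical to the one you originally used).

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_download(log_path, data_file, dataset, version, source_url):
    """Append one JSON line describing a downloaded file to log_path."""
    entry = {
        "file": Path(data_file).name,
        "dataset": dataset,
        "version": version,
        "source": source_url,
        "downloaded": date.today().isoformat(),
        "sha256": sha256sum(data_file),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

If the log file lives inside your code repository, it gets backed up along with everything else each time you commit and push.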

Any non-starting point data files can be re-created from your externally backed up source code, so you technically don’t need to back these up. However, if they took months to create (e.g. a high resolution coupled global climate model simulation), then you might want to treat them the same as starting point data.

 
If anyone has any other tips or tricks they use for backing up their work, feel free to post a comment!


4 Comments

  1. Stewart Allen / Apr 16 2013 10:24

    Version tracking with MS Office tools, especially using their .*x formats (e.g. .docx), is… not straightforward, it would seem. Any suggestions?

    Actually, I’ve not thought about this until your post, but ‘versioning’ (rather than ‘tracking changes’) really is a massive feature hole in MS office tools….

  2. Luke Garde / Apr 16 2013 22:08

    Hi Stewart. Agree with your comments. Did a quick search and found http://magnetsvn.com/. I have not used this software. Might be worth a look?

  3. drclimate / Apr 17 2013 17:05

    Hey Stewart. I’m a LaTeX user, so I actually wasn’t even aware that version control systems like svn or git could track Microsoft Office files until I started writing this post. From my research, I came to the same general conclusion that you have. It seems that git, for instance, is pretty good with older Microsoft versions (e.g. you can simply alter the settings to track .doc files; see http://git-scm.com/book/en/Customizing-Git-Git-Attributes), but possibly hasn’t yet caught up with the newer Microsoft versions (e.g. .docx). Luke’s suggestion looks promising, but it isn’t free, so I’d be inclined to wait until .docx can just be handled in the same way that .doc can.

Trackbacks

  1. Managing your data | Dr Climate
