September 4, 2015 / Damien Irving

Managing your data

If you’re working on a project that involves collecting data (e.g. from a network of weather stations) or generating it (e.g. by running a model), then it’s likely that one of the first things you did was develop a data management plan. Many funding agencies (e.g. the National Science Foundation) formally require one, and such plans usually outline your practices for collecting, organising, backing up, and storing the data you’ll be generating.

What many people don’t realise is that even if you aren’t collecting or generating your own data (e.g. you might simply download a reanalysis or CMIP5 dataset), you should still start your project by developing a data management plan. That plan obviously doesn’t need to cover everything a data collection/generation project does (e.g. you don’t need to think about archiving the data at a site like Figshare), but there are a few key things all data analysis projects need to consider, regardless of whether they collected or generated the original data.
1. Data Reference Syntax

The first thing to define is your Data Reference Syntax (DRS) – a convention for naming your files. As an example, let’s look at a file from the data archive managed by Australia’s Integrated Marine Observing System (IMOS).
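The original post showed the file’s full path as an image; the following is a reconstruction based on the AODN demo archive described below, so the exact details may differ slightly:

```
http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz
```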


That’s a lot of information to take in, so let’s focus on the structure of the file directory first:
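The directory portion of the path (reconstructed from the description below; exact details may differ slightly):

```
IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/
```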


From this we can deduce, without even inspecting the contents of the file, that we have data from the IMOS project that is run by the eMarine Information Infrastructure (eMII). It was collected in 2012 at the Turquoise Coast, Western Australia (TURQ) site of the Australian Coastal Ocean Radar Network (ACORN), which is a network of high-frequency radars that measure the ocean surface current. The data type has a sub-DRS of its own, which tells us that the data represent the 1-hourly average surface current for a single month (October 2012), and that they are archived on a regularly spaced spatial grid and have not been quality controlled. The file is located in the “demos” directory, as it has been generated for the purpose of providing an example for users at the very helpful Australian Ocean Data Network user code library.

Just in case the file gets separated from this informative directory structure, much of the information is repeated in the file name itself, along with some more detailed information about the start and end time of the data, and the last time the file was modified:
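The file name itself (reconstructed from the description above; exact details may differ slightly) packs in the start time, end time, and creation time alongside the project, network, site, and data-type fields:

```
IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc
```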


In the first instance this level of detail seems like a bit of overkill, but consider the scope of the IMOS data archive. It is the final resting place for data collected by the entire national array of oceanographic observing equipment in Australia, which monitors the open oceans and coastal marine environment covering physical, chemical and biological variables. Since the data are so well labelled, locating all monthly timescale ACORN data from the Turquoise Coast and Rottnest Shelf sites (which represents hundreds of files) would be as simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc

While it’s unlikely that your research will ever involve cataloguing data from such a large observational network, it’s still a very good idea to develop your own personal DRS for the data you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures, as these can be very hard to change later on. The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
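As a hypothetical illustration (these field names are invented for this sketch, not taken from IMOS or the original post), a personal DRS for a climate data analysis project might look something like:

```
<variable>_<dataset>_<timescale>_<grid>.nc

tas_ERAInterim_monthly-anom-wrt-1981-2010_native.nc
pr_ACCESS1-0_daily_regrid-1x1.nc
```

With a convention like this in place, a wildcard pattern such as `ls tas_*_monthly*_native.nc` picks out every matching file in one go.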
2. Data provenance

In defining my own DRS, I added some extra fields to cater for the intermediary files that typically get created throughout the data analysis process. For instance, I added a field to indicate the temporal aspects of the data (e.g. to indicate if the data are an anomaly relative to some base period) and another for the spatial aspects (e.g. to indicate whether the data have been re-gridded). While keeping track of this information via the DRS is a nice thing to do (it definitely helps with bash wildcards and visual identification of files), more detailed information needs to be recorded for the data to be truly reproducible. A good approach to recording such information is the procedure followed by the Climate Data Operators (CDO) and NetCDF Operators (NCO). Whenever an NCO or CDO utility (e.g. ncks, ncatted, cdo mergetime) is executed at the command line, a time stamp followed by a copy of the command line entry is automatically appended to the global attributes of the output netCDF file, thus maintaining a complete history of the data processing steps. Here’s an example:

Tue Jun 30 07:35:49 2015: cdo runmean,30

You might be thinking, “this is all well and good, but what about data processing steps that don’t use NCO, CDO or even netCDF files?” It turns out that if you write a script (e.g. in Python, R or whatever language you’re using) that can be executed from the command line, then it only takes an extra few lines of code to parse the associated command line entry and append that information to the global attributes of a netCDF file (or a corresponding metadata text file if dealing with file formats that don’t carry their metadata with them). To learn how to do this using Python, check out the Software Carpentry lesson on Data Management in the Ocean, Weather and Climate Sciences.
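To sketch the idea in Python (a minimal example using a sidecar metadata text file; the function names here are invented for illustration, not taken from the Software Carpentry lesson — for netCDF output you would append the same entry to the global “history” attribute instead):

```python
import datetime
import os
import sys


def make_history_entry(argv=None):
    """Build a CDO/NCO-style history line: a time stamp
    followed by a copy of the command line entry."""
    argv = sys.argv if argv is None else argv
    time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    return "{}: {}".format(time_stamp, " ".join(argv))


def append_history(metadata_file, entry):
    """Prepend the new entry to a sidecar metadata text file,
    so the most recent processing step appears first
    (mirroring how CDO/NCO maintain the history attribute)."""
    old_history = ""
    if os.path.exists(metadata_file):
        with open(metadata_file) as reader:
            old_history = reader.read()
    with open(metadata_file, "w") as writer:
        writer.write(entry + "\n" + old_history)


# Typical use at the bottom of an analysis script:
#     append_history("output_metadata.txt", make_history_entry())
```

Running `python calc_anomaly.py --window 30` would then record a line like `Tue Jun 30 07:35:49 2015: calc_anomaly.py --window 30` at the top of the metadata file.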
3. Backup

Once you’ve defined your DRS and implemented the NCO/CDO approach to data provenance, the final thing to think about is backing up your data. This is something I’ve discussed in detail in a previous post, but the crux of the story is that if your starting point files (i.e. the data files required at the very first step of your data processing) can be easily downloaded (e.g. reanalysis or CMIP5 data), then you probably don’t need your local copy to be backed up. All of your code should be version controlled and backed up via an external hosting service like GitHub or Bitbucket, so you can simply re-download the data and re-run your analysis scripts if disaster strikes. If, on the other hand, you generated your starting point files from scratch (e.g. you collected weather observations or ran a model that would take months to re-run), then backup is absolutely critical and should be a central part of your data management plan.



One Comment

  1. Damien Irving / Nov 29 2016 11:05

    Ivan Hanigan has some nice comments on this post over on his blog, where he suggests ordering the fields in your DRS from those that will change only occasionally (e.g. the name of the variable) through to those that change all the time (e.g. the sampling time period).
