Research data management

Considerations

Having a well-considered plan for the structure and organisation of your research data will improve its management, access, and re-use. The organisation of data is especially important in team projects where more than one person will be accessing and analysing the data.

The key methods and approaches to consider are:

  1. File naming
  2. Folders and directory structures
  3. File versioning
  4. File formats

File names

Digital file names are important for identifying and finding a digital file. Researchers should develop and communicate a clear system of naming files so everyone involved in the research project understands and can appropriately apply the file naming rules to create and locate files.

The most important things to remember about file naming are:

Descriptiveness - Good naming conventions should provide useful cues to the content and status of a file, including its version.

Consistent application - Select an appropriate naming convention as early as possible and follow it throughout the research to get the most benefit from your file naming system.

The following examples highlight basic principles of file naming.

Good file names:

20191004 Registry of participants - Survey.doc

ROBERTSON Thomas Logan - 2011 Interview.mp4

  • File names are concise and meaningful
  • Sentence case including a capital letter for names and proper nouns
  • Full name with family name in UPPER CASE
  • Date format: yyyymmdd. When referring to year only, always use four digits
  • Terms are separated with a dash, and other punctuation is avoided.
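
The conventions above can be applied mechanically. The following is a minimal Python sketch, not an official tool: the function name build_file_name and its parameters are hypothetical, chosen only for illustration. It joins descriptive terms with a dash and, when a date is supplied, prefixes it in yyyymmdd format.

```python
from datetime import date


def build_file_name(parts, extension, when=None):
    """Join descriptive parts with ' - ' and prefix an optional yyyymmdd date."""
    name = " - ".join(part.strip() for part in parts)
    if when is not None:
        # Format the date as yyyymmdd, e.g. 20191004
        name = f"{when:%Y%m%d} {name}"
    return f"{name}.{extension}"


print(build_file_name(["Registry of participants", "Survey"], "doc", date(2019, 10, 4)))
print(build_file_name(["ROBERTSON Thomas Logan", "2011 Interview"], "mp4"))
```

Run as written, this reproduces the two good file names shown above. Capitalisation and the choice of terms are still up to the person naming the file.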

Bad file names:

final assignment.txt
Document13.docx
crt doc scan.pdf
Lit review, bib., chpt2-4, rev, cvr page, appendices.docx
output NVB>3.0.xml

  • Bad length: too short to be meaningful or too long to read easily. 25 to 50 characters is generally adequate
  • Does not describe the subject or topic of the document’s contents
  • Uses ambiguous abbreviations - temp could mean template, temporary etc. Use the full word instead
  • Uses special characters. Do not use the following characters within the document name: . , ; : = \ / * ? " < >
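
Some of these problems can be detected automatically. The sketch below is one possible check, written under stated assumptions: the function name check_file_name is hypothetical, the 50-character limit is applied to the whole name including the extension, and the dot that introduces the extension is not counted as a forbidden character. Whether a name is descriptive, or an abbreviation is ambiguous, still needs a human reader.

```python
# Characters the guidance says to avoid within the document name
FORBIDDEN = set('.,;:=\\/*?"<>')
MAX_LENGTH = 50  # assumption: upper end of the suggested 25 to 50 characters


def check_file_name(file_name):
    """Return a list of problems found in file_name (empty if none)."""
    problems = []
    stem, dot, _extension = file_name.rpartition(".")
    if not dot:
        stem = file_name  # no extension present, so check the whole name
    if len(file_name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    found = sorted(set(stem) & FORBIDDEN)
    if found:
        problems.append("forbidden characters: " + " ".join(found))
    return problems


for name in ["20191004 Registry of participants - Survey.doc", "output NVB>3.0.xml"]:
    print(name, "->", check_file_name(name))
```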

Consider the advice in the Document Naming Guidelines [PDF, 258kB] from Curtin Information Management & Archives and how it applies to your own research. (This document is for Curtin staff only and requires you to log in to the Curtin staff portal first.)

Folders and file directories

Like file naming, systems to organise folder and file directories require coherency and consistency.

Coherency - Anyone using the folders should be aware that there is a system and what it means.

Consistency - Anyone using the folders should be consistent in creating folder names in line with the system, but also in keeping the relevant files in the appropriate folders.

This makes it easy to locate, organise and navigate files, and to understand the context of all files and versions.

Other concepts to consider include:

  • File hierarchy refers to the number of levels or sub-folders in the directory. It is usually best to keep to a maximum depth of 4 folders
  • Folder direction determines how folders are nested and which way is most useful (e.g. Results/2012 or 2012/Results)
  • Ambiguity or overlapping categories, especially at the top-level, can cause confusion
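
A folder structure can be checked against a depth limit with a short script. The following Python sketch is illustrative only: the function name folders_deeper_than is hypothetical, and depth is counted as the number of folder levels below the project's top-level folder, which is one reasonable reading of the guidance above.

```python
from pathlib import Path


def folders_deeper_than(root, max_depth=4):
    """Return folders nested more than max_depth levels below root."""
    root = Path(root)
    too_deep = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # len(relative.parts) is the number of levels below root
        if path.is_dir() and len(relative.parts) > max_depth:
            too_deep.append(relative)
    return sorted(too_deep)
```

For example, calling folders_deeper_than("my_project"), where my_project is a placeholder for your own top-level folder, would list every folder that sits five or more levels down.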

View the UK Data Service Organising Data resource for a brief outline on structuring and organising data.

Versioning

Over the duration of a research project, a dataset will undergo many changes. They may be as simple as adding more sets of findings, or as major as the addition of a new dimension or type of measurement.

Keeping close track of the changes made to your dataset is important for two reasons:

  1. These changes are likely to affect the conclusions derived from the dataset. To maintain integrity about where your conclusions came from, it’s important to know which version of your data you’re working with.
  2. The changes you make might ultimately not be useful. If this happens, you may need to go back to an earlier version of the dataset.

One of the tools used to address these needs is versioning. This refers to a system of keeping the old versions of a file and tracking the changes made in each version.

The most basic forms of versioning are manual systems. These usually contain two important elements:

  1. The user adds a sequential number to the file name to indicate which version of the file it is
  2. A change table in each document where versions, dates, authors and details of changes to the file are recorded
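
The two elements of a manual system can be sketched in Python as follows. This is an illustration under assumptions rather than a prescribed scheme: the " vNN" suffix style, the function names next_version_name and change_table_row, and the column names of the change table are all hypothetical choices.

```python
import re
from datetime import date
from pathlib import PurePath

# Matches an optional ' v<digits>' suffix at the end of the name (before the extension)
VERSION_PATTERN = re.compile(r"^(?P<base>.*?)(?: v(?P<number>\d+))?$")


def next_version_name(file_name):
    """Return file_name with its ' vNN' suffix increased by one.

    A name without a version suffix is given ' v01'.
    """
    path = PurePath(file_name)
    match = VERSION_PATTERN.match(path.stem)
    base, number = match.group("base"), match.group("number")
    if number is None:
        new_number = "01"
    else:
        # Keep the same zero-padded width, e.g. 03 -> 04, 09 -> 10
        new_number = f"{int(number) + 1:0{len(number)}d}"
    return f"{base} v{new_number}{path.suffix}"


def change_table_row(version, author, details, when):
    """Build one row of a change table as a dictionary."""
    return {
        "version": version,
        "date": when.isoformat(),
        "author": author,
        "details": details,
    }


print(next_version_name("Survey results v03.csv"))
```

Each time a new version is saved, a row built with change_table_row can be added to the change table so that the file name and the recorded history stay in step.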

These are outlined in the ARDC Data Versioning guide and on the UK Data Service Version Control and Authenticity page, which describes best practice for version control and provides examples of file versions.

While a manual system can work for many research projects, it can become difficult to use once your needs become complex or multiple people begin working on the same dataset, as explained in the video below. In these cases you should consider using version control software such as Git, the best known and most widely used software of this type. Git is a free and open source distributed version control system designed to handle everything from small to large projects.

File formats

By working with file formats that are widely used, interchangeable and well suited to long-term preservation, you will improve the impact and reach of your research outputs. Choosing good formats will improve the accessibility of your research and make it easier for you and for future researchers to use or reuse it on a wide range of computer systems, regardless of the software packages available.

When performing research it’s often necessary to use specialised and proprietary file formats. This may be for many reasons: your method of data analysis, the hardware used, the software available to you, or the need to meet discipline-specific standards. Regardless of these constraints, it’s still important to make a conscious and informed decision when choosing file formats.

At a minimum you should consider:

  • Proprietary or open formats and whether specific software is required to access your data
  • Maintaining data integrity by using lossless formats to ensure no useful data is lost to future researchers
  • The risk of file format obsolescence which may be the result of imminent or future software or hardware upgrades or practice shifts

At later stages of your research, such as when publishing traditional research outputs or making your data publicly available, you should consider transferring your data to a file format that can be used by people who may not have access to the exact suite of software you have. The UK Data Service Recommended File Formats table can help you choose a file format best suited to long-term accessibility.
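
As an illustration of transferring data to a more widely readable format, the Python sketch below writes tabular records to a UTF-8 CSV file using only the standard library. CSV is used here purely as an example of a plain-text format that does not depend on specific software; check the recommended formats table for the format that suits your type of data. The function name export_rows_to_csv and the sample column names are hypothetical.

```python
import csv


def export_rows_to_csv(rows, csv_path):
    """Write a list of dictionaries to a UTF-8 CSV file with a header row."""
    field_names = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(rows)
```

For example, export_rows_to_csv([{"participant": "P01", "score": 12}], "results.csv") would create a file with a header line followed by one data row.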