FAIRytale. Common themes in data management emerge at the MRC's open data workshop

A personal story about working with FAIR data

In October 2022, I hosted an all-day seminar at the MRC Biostatistics Unit to give a community of like-minded researchers a chance to discuss and explain the implications of open-data practice for their research [1]. The day was a resounding success, with a number of excellent talks and conversations. One thing I found particularly interesting – as a member of the research support team – was how often the same themes re-emerged despite the range of research interests represented. This feeling has only strengthened during my time as a fellow.

While all the workshop attendees were from fields broadly considered ‘the life sciences’, the event had an interdisciplinary feel. Even so, across the various applications, the data needs (and challenges) remained remarkably consistent. Many attendees had heard of the FAIR principles [2][3][4] before, but those who were new to the concepts were surprised by how well FAIR captures the challenges they faced. The dominant feeling over our four sessions was that researchers often get half or most of the way with their open-data efforts but then underestimate the importance of the final steps.

One ‘final step’ that was often missed was providing long-term hosting solutions for data, and several conversations revolved around how important this is. We heard several horror stories of important data sets that were (in theory) ‘available on request from the corresponding author’ but in practice could not be accessed, sometimes as little as six months after the publication date. Similarly, several attendees had navigated to self-hosted websites that gave 404 errors within a year of being mentioned in published research. While this might seem disheartening, it is something of a solved problem: Figshare [5] and Zenodo [6] provide hosting platforms that are built to keep data available for the long term.

The presence (or absence) of appropriate metadata was a second key finding of our workshop. This was particularly important to researchers using genetic data, as minor differences in conventions or naming could cost several hours when trying to run the data through a new algorithm. Unfortunately, it can be difficult to establish one-size-fits-all metadata principles, as the required information tends to be very field-specific. In an ideal world, everyone would provide ‘as much metadata as possible’, but several colleagues were keen to point out that this is often impractical and leads to crude full data dumps in which useful information is drowned out by large numbers of files of little to no value.

The workshop had a huge range of perspectives, and I found the day fascinating. I hope this FAIRy Tale has helped to illustrate that doing the basics really well can be a huge source of value for both you and future researchers. In particular, it’s essential to use a long-term, stable hosting platform for as much of your research data as possible and to provide accurate, complete and (ideally) concise metadata so that a new researcher can get up to speed with your data as quickly as possible.

Further reading

  1. Open data. https://www.ukri.org/manage-your-award/publishing-your-research-findings/making-your-research-data-open/
  2. FAIR principles. https://www.nature.com/articles/sdata201618
  3. FAIR Pointers Carpentries course. https://elixir-uk-dash.github.io/FAIR-Pointers/
  4. FAIR in (biological) practice Carpentries course. https://carpentries-incubator.github.io/fair-bio-practice/
  5. Figshare. https://figshare.com/
  6. Zenodo. https://zenodo.org/