Librarians and the failure of data interoperability. (Hint: it’s our fault)

Here’s an interesting case that makes a good example to learn from.  A paper was retracted because the data it relied on was embargoed – that is to say, the original creators of the data had restricted its dissemination, and when they saw it had been used they complained, quite rightly.

None of the original or subsequent authors was at fault.  Why?

This would make a great essay question if you were doing the (probably forthcoming) issues in research data management paper for your MLIS.

Here’s what happened.

The data was created, and then uploaded to a database with an embargo – no one could use it for anything for a set length of time.  Why was it a good idea to upload it in the first place?  Because the researchers realised that if they published from the data they would be responsible for making it available, as more and more funders and publishers demand that the data be there for review.  It’s a good thing: it means that the research is less likely to have been fabricated, and anyone replicating the research can compare the raw data to see if there are interesting differences.  Making things public and available has the effect of keeping honest people honest.

So, we have the data in a database, but there is data with it saying ‘don’t use this, it’s mine, I paid for it, but you can use it after x date, and if you are wanting to check it, I’ll give it to you privately’.

Here comes the next bit – interoperability.  This is a really important concept that describes how well computers can use the same data, no matter what operating system or programs they are running.  It’s fundamental to the internet working.  It is defined by documents that say that systems ‘must’ do some things, ‘should’ do others, and ‘may’ do more.  The more a system complies with these standards, the more interoperable it is.

So, why is this important?  Well, the data was put into one system, with the embargo data, and then it was picked up by another system without the embargo information.  The second set of authors saw it on the second system, and used it, entirely unaware of the embargo data.
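
To make that failure concrete, here is a minimal Python sketch of how a harvesting system could lose an embargo without anyone doing anything obviously wrong.  The field names (`rights`, `embargo_until`) and both functions are hypothetical, invented for illustration; they do not describe the actual systems involved in this case.

```python
from datetime import date

# Hypothetical record as held by the first system; all field names are assumptions.
source_record = {
    "title": "Example dataset",
    "creator": "Original research group",
    "rights": "Embargoed; contact creators for private review",
    "embargo_until": date(2030, 1, 1),
}

# The only fields the (hypothetical) second system has in its schema.
KNOWN_FIELDS = {"title", "creator"}

# Fields that should never be dropped silently.
RIGHTS_FIELDS = {"rights", "embargo_until"}


def naive_harvest(record):
    """Copy only the fields the target schema knows, silently dropping the rest."""
    return {key: value for key, value in record.items() if key in KNOWN_FIELDS}


def careful_harvest(record):
    """Refuse to copy a record if rights information would be lost on the way."""
    lost = (set(record) - KNOWN_FIELDS) & RIGHTS_FIELDS
    if lost:
        raise ValueError(f"refusing to harvest: would drop {sorted(lost)}")
    return naive_harvest(record)
```

With `naive_harvest` the copy looks perfectly healthy but carries no trace of the embargo, which is roughly the position the second set of authors found themselves in.  `careful_harvest` shows one possible alternative: fail loudly rather than pass on data stripped of its conditions.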

It’s not that the second system failed in its duty to conform to standards, it’s that there are no standards. The Research Data Alliance is trying to develop standards like this, but until they are adopted, they are just aspirations.  It is up to us to make sure systems are interoperable.  Researchers will just freak out and not make their data available if their very reasonable requests on how it is used are not respected.  They won’t worry about data repository metadata interoperability, they’ll just think it’s dodgy, and prone to really awful failure.

In this case, the authors of the retracted paper couldn’t have known they were not honouring the original creators’ wishes. Well, they could have, if they had done 10 minutes of research, but researchers are not expected to be rights experts – that task is ours, as librarians.

So, how could we avoid this happening in the first place?

  1. README files.  This is the 80/20 solution.  If the data is combined with an explicit statement of licensing conditions (including embargoes) then at least that’s best efforts covered.  A tab in Excel, a zipped set of files containing the README, whatever.  When we teach best practice this needs to come first and foremost!  I’ve seen people poo-poo the Data/Library Carpentry ‘Data in Spreadsheets’ lesson, but it’s almost the most important!
  2. Make sure our systems understand the licensing layer – and that means formal licences (thank you Creative Commons! At least that work is done) and informal ones like embargoes. And anything else that crops up.  Imagine (and I’d love to see this) a condition that encourages any product of the data to be published as OA, much as some software licences do.  (No, this isn’t CC-SA.)
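
As a rough illustration of both points, here is a small Python sketch that writes the licence and embargo into README text and checks an embargo date before reuse.  The function names, the README layout and the example contact address are all assumptions made up for this post, not an existing standard or tool, and I have assumed that reuse is allowed on the embargo date itself.

```python
from datetime import date


def build_readme(title, licence, embargo_until=None, contact=None):
    """Return README text stating the licence and any embargo in plain words."""
    lines = [f"Dataset: {title}", f"Licence: {licence}"]
    if embargo_until is not None:
        lines.append(f"Embargo: do not reuse before {embargo_until.isoformat()}")
    if contact:
        lines.append(f"Contact for private review: {contact}")
    return "\n".join(lines) + "\n"


def may_reuse(embargo_until, today):
    """True when there is no embargo, or the embargo date has been reached.

    Assumption: reuse is allowed on the embargo date itself.
    """
    return embargo_until is None or today >= embargo_until


# Hypothetical example values.
readme = build_readme(
    "Example dataset",
    "CC BY 4.0",
    embargo_until=date(2030, 1, 1),
    contact="data-owner@example.org",
)
print(readme)
print(may_reuse(date(2030, 1, 1), today=date(2029, 6, 1)))
```

The README covers the human reader; a check like `may_reuse` is the sort of thing a repository would need to run itself if the embargo is to survive as more than a polite note.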

How else can we robustly respect authors’ wishes for their data?
