I’ve just started a small assignment for the OPF to investigate the options for a new file format registry, part of the toolbox needed for long-term preservation of digital material by archives, libraries and other memory institutions. This initiative was kicked off and sponsored by the National Archives of the Netherlands, and is now in progress under the auspices of the OPF.
Most readers of this blog will know very well what a file format registry is and why you need one. But for people new to the world of digital preservation, I’ll very briefly explain: in order to make sure you can still access all your files in 10, 50 or 100 years, you first need to know what kinds of file formats you’ve got and what tools are available to work with them. For institutions with a responsibility to look after our records of government and cultural history, this is a high priority. So step 1 is to systematically track file formats and information about them.
Registries of this kind already exist, notably the UK National Archives’ PRONOM system, the GDFR system from Harvard University Library and the PLANETS Core Registry (which was itself closely based on PRONOM).
So why do we need another one? The field of digital preservation research is still only a decade or two old and many lessons are still being learned: the first generation of registries have done a great job in many respects but have also highlighted new requirements.
An important issue arises from the large amount of ongoing research effort required to keep on top of the wide range of file formats in use, with new types of digital material and new software appearing all the time. This is not a job that any one institution can afford to do by itself, so sharing of information is essential, between archives, libraries, universities, software vendors and individual experts. Also, the information you need is a mixture of facts and policy choices. The specification for PDF 1.4 may not be open for argument, but choices on how to manage PDF files over the long term and what tools to use may vary from one organization to another.
The problem is in many ways one of distributed web publishing, with the need for unambiguous shared identifiers, so everyone knows when they are talking about the same thing. The information to be stored about file formats is complex and a precisely defined shared vocabulary for format descriptions is essential for effective information sharing. So it’s a very natural fit for Linked Data. That’s one of my main professional interests and laying out how it could be usefully applied to this problem is one of my tasks.
But the first priority is to set out the issues we need to tackle. Over the next couple of months, I’ll be pulling together an outline of the concept of how such a distributed registry could work and aiming to narrow that down to an initial set of requirements.
For an idea of where that is going see the report “A New Registry for Digital Preservation: Outline Proposal for Discussion” (204 kB PDF). I’d certainly welcome input from the well-informed and opinionated (in a good way :-) ) readers of the OPF blogs. Please begin your rants in the comments to this post!
It will also be one of the topics of discussion at the forthcoming OPF workshop and hackathon in Amsterdam. I hope to talk to many of you about it there. Registrations close on Friday 5 November, so if you’re not registered yet, be quick!
My rant…
I cannot resist an invitation to a good rant. :-) Here goes…
Although I think linked data is a very good model for publishing and consuming this data, is it really the best form in which to author it? I think the most difficult part of building a format registry is collecting, collating and validating the actual format data, but a lot of the effort seems to be focused on how we consume the data, not how we create it. Are there any suitable tools for collaboratively creating this information directly, in that form? Wouldn’t some documents in a distributed version-control system be a simpler ‘master format’ for managing the data, the provenance metadata, and the editorial workflow?
Andy’s rant
Hi Andy – a good rant is always cathartic :-)
But not sure who you are arguing with! How the data is created is certainly an important part of the problem and I’d be keen to discuss that with you and anyone else who has an opinion.
For communicating it, agreed identifiers and vocabulary (which roughly correspond to a data model) are important, and that’s where the linked data stuff can play a part.
The hard part of authoring seems to me to be generating the knowledge. Getting it into the appropriate format is fairly easy in comparison.
One aspect of collaboration may be addressable with the idea of distributed publishing – you don’t need to join in to a single agreed collaboration system, you can just publish your stuff and other people can use it (and merge with other sources) or not. Of course it will also make sense in many cases to collaborate more closely and I’m open to suggestions on suitable collaborative authoring schemes.
At the simplest level, RDF can be encoded easily in plain text files, so a text editor and GitHub are a viable option. A richer and more helpful environment would probably be better, but I think we need to decide what sort of information we are going to create before worrying too much about what tools we are going to use to do it.
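To make that concrete: a handful of facts about a format can be written down as RDF triples in Turtle using nothing but plain string handling. The sketch below reuses the PRONOM PUID for PDF 1.4 as a shared identifier; the `ex:` property names are made up for illustration, not an agreed vocabulary.

```python
# Sketch: a few facts about PDF 1.4 as RDF triples in Turtle syntax,
# built with plain string handling only -- no RDF library required.
# The ex: vocabulary is invented for illustration; a real registry
# would need to agree on these terms first.

facts = {
    "ex:name": '"Portable Document Format"',
    "ex:version": '"1.4"',
    "ex:mimeType": '"application/pdf"',
}

# The shared, unambiguous identifier is the crucial part: here we reuse
# the PRONOM PUID for PDF 1.4 as the subject URI.
subject = "<http://www.nationalarchives.gov.uk/PRONOM/fmt/18>"

lines = ["@prefix ex: <http://example.org/format-registry#> ."]
for predicate, obj in facts.items():
    lines.append(f"{subject} {predicate} {obj} .")

turtle = "\n".join(lines)
print(turtle)
```

A file like this can live in a Git repository like any other text file, with the version history providing a crude but serviceable record of provenance.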
Using other systems/formats for storing working data would also seem fine, as long as there is a reliable automatable way of converting that to the agreed interchange format.
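As an example of the kind of reliable, automatable conversion meant here: suppose the working data is kept as a simple CSV file (one row per format). A few lines of Python can turn each row into triples in the interchange format. The column names, the `ex:` vocabulary and the PRONOM-style subject URIs are all assumptions for the sake of the sketch.

```python
import csv
import io

# Hypothetical working data: one row per format, as might be kept
# in a spreadsheet exported to CSV.
working_data = """puid,name,mime_type
fmt/18,PDF 1.4,application/pdf
fmt/19,PDF 1.5,application/pdf
"""

def csv_to_turtle(csv_text: str) -> str:
    """Convert working CSV rows into Turtle triples in an
    (assumed, illustrative) interchange vocabulary."""
    out = ["@prefix ex: <http://example.org/format-registry#> ."]
    for row in csv.DictReader(io.StringIO(csv_text)):
        subj = f"<http://www.nationalarchives.gov.uk/PRONOM/{row['puid']}>"
        out.append(f'{subj} ex:name "{row["name"]}" .')
        out.append(f'{subj} ex:mimeType "{row["mime_type"]}" .')
    return "\n".join(out)

print(csv_to_turtle(working_data))
```

The point is less this particular script than the principle: as long as the conversion is deterministic and scripted, contributors can keep whatever working format suits them.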
Cheers
Bill
Andy keeps on ranting on…
I think the way we design the data model is all about the social process: the shared aim and roadmap, the collaborative tools that grease the wheels, the validation and governance structure, etc. I fear RDF/OWL/SKOS will get in the way because so few of us really understand these things well (including me!). You don’t need them to define a data model, and as you imply, plain XML or N3 triples in text files on GitHub would be rather difficult to work with. But I think we need to be able to prototype the thing quickly and easily, because I think we’ll only get the data model right if we try to fit lots of data into it and iterate the design.
As well as this, I’m not sure about the need for a distributed registry at all (as opposed to a distributed collaboration that builds a centralised registry), because I’m not convinced mixing ‘facts about formats’ and ‘policies concerning formats’ in a single registry is either wise or necessary. If you can keep the policy statements in (e.g.) policy documents and out of the format registry, the barrier to sharing format information is much lower, as are the requirements for trusting that information.
Furthermore, I disagree with the assumption behind your statement that:
“…you don’t need to join in to a single agreed collaboration system, you can just publish your stuff and other people can use it…”
I think that the hard part of all this is the data model, and if we’re not all following the same one, then we can’t just mix and match the data from different sources without an awful lot of overhead. Making it work is not easy, and I’m not yet convinced that all of that would be easier or more useful than a collaboratively-built central repository that focuses on fun facts about formats.
I’m looking forward to discussing these things further at the hackathon!
data models etc
I definitely agree that the data model is the key part of this, and that the process of creating, sharing and consuming the data will have an influence on the data model.
As Rob Sharpe has pointed out on more than one occasion, we need to think about the governance model in order to come up with a sensible design. The combination of governance and trust will probably be the key factors in deciding between different centralised or distributed approaches.
And I agree too that we shouldn’t get prematurely bogged down in ontologies and RDF.
My reluctance to start talking about data creation and editing tools is that I’m worried that (a) we’ll pick a ‘nearly appropriate’ existing tool that will constrain our options for doing the job right, or (b) people will rush off and start building web applications, instead of getting the data model and governance processes right.