Creating a Community Owned digital Preservation Tool Registry (COPTR)

Because every DP project needs an over-the-top logo/acronym combination.

Last year I blogged about my frustrations related to digital preservation tool registries. Rather than pooling all of our knowledge in one place and creating a valuable community resource, we’ve spread our knowledge about tools thinly across the web. Instead of seeing collaboration between organisations working in digital preservation, we’re actually seeing competition! Virtually every organisation involved in the field promotes its own registry or tool list. This is a ridiculous state of affairs. As I observed at IPRES last year in my least eloquent but most frequently quoted moment, it’s a big fail for our community.

Two weeks ago I presented a proposal for the creation of a community owned tool registry to the latest workshop on Aligning National Approaches to Digital Preservation, graciously hosted by the lovely people at IDCC. I’m pleased to say that the proposal was one of four key areas prioritised for further action, and I’m now leading some initial activities to take things forward with backing from ANADP (note that a full report from ANADP on the workshop outcomes will be available here shortly).

However, I’d like to get even broader support for this community proposal from everyone who has their own registry or tool list, whether it’s a quick blog post or a full-on registry. If that applies to you or your organisation, then I’d like you to participate in the following way:

Provide your requirements for a community tool registry (the call for requirements will appear shortly)
Merge your own tool registry data with the new community registry
Link to, expose (a view onto) and promote the community registry from your website
Delete your own registry and agree not to set up any new project-owned registries/lists
Contribute any effort you can spare for adding new tools to the community registry over time

Exactly where the new registry will be hosted and maintained is yet to be decided (quite possibly a “neutral” URL/location. Whatever meets our requirements!). This will require some practical work to establish but is certainly not insurmountable. The key issue is to get buy-in from the community. As I note in the proposal, we already have support in principle from the Library of Congress, the Digital Curation Centre and the Open Planets Foundation. This is a great start, but for this to be a success we need a lot more organisations to get involved.

Over the next couple of weeks I’ll be putting together an outline and roadmap as an initial talking point for comment and requirements and sharing it via this blog. So this is my call to arms for COPTR: a Community Owned digital Preservation Tool Registry. Who would like to voice their support and commitment, create a valuable tool registry for us all, and kick off some vital community collaboration in the process?

Submitted by Paul Wheatley on 30 January 2013 – 3:50pm

Comments

Couple of Coptr Cuestions

Hi Paul,

This looks a useful registry to keep up to date and centralised.

I have a couple of questions for you that will help me to understand what you are proposing.

(1) From a distance, it seems like the records for each tool could be held by the PRONOM Software list (http://www.nationalarchives.gov.uk/PRONOM/Software/proSoftwareSearch.aspx?status=listReport), or something very similar in construction. If the idea is to reduce replication and record set diversity, would it be viable to host the tool set in PRONOM rather than in another new location? There are probably a few tweaks that would need to be made to the base data model and of course there would need to be some buy-in from TNA, but these things are surmountable, and it does seem to me that PRONOM is the natural “home” for this kind of thing…

(2) Further to the first question, it would be useful to know what sort of data you are looking to hold about each tool. Are you interested in holding a descriptive listing of tools (e.g. name, capability, website) or are you looking to include binaries – either of the tool itself, or of any supporting documentation (e.g. FAQ, install notes, quick start guide etc.)? Is there also any desire to hold implementation examples and/or user experience information?

Best,

Jay

Submitted by Jay Gattuso on 3 February 2013 – 10:58pm

A pragmatic approach

Hi Jay,

Thanks for your post and questions!

I’m not sure I’d agree that the issues you raise are surmountable, having had some experience of working with PRONOM as part of the Planets Project. In principle, it would be great to add to PRONOM, and it pains me to have to suggest creating yet another tool registry when the problem is that we have way too many tool registries already! However, I don’t think PRONOM meets the requirements we have for a successful registry that many organisations can seriously buy into.

There are a number of problems. Updates to PRONOM have to be made via a bottleneck of human validators at the TNA. In other words, every submission must be checked, validated and committed by them. Others have pointed out their frustration at failing to get updates into PRONOM. I don’t want to speak for the TNA, but I believe that their current strategy is to focus primarily on file format signatures and DROID. It’s great to have David Clipsham concentrating on this important DP work, and this is to be applauded. But updates of format or tool information in PRONOM do not seem to be a current priority. To be fair to TNA, they can’t do all of this stuff on their own! No organisation in this field has the resource to manually build/validate a registry of this kind of information. There are also issues with the design of PRONOM itself. The PRONOM interface does not provide very useful facilities for browsing the tool data, and this is critical for a useful tools registry. Finally, PRONOM is of course very much a TNA-owned and branded site. This might perhaps be a barrier to buy-in from other organisations.

I think a wiki-based registry would address most of the issues I’ve raised here. The critical need is to make it easy for any contributor to get new information into the registry. As File Format November demonstrated for format registries, the wiki approach can be very successful in supporting this.

As regards data, the focus will be on basic information about each tool and users’ experiences with applying the tools. This will support users in finding appropriate tools for their needs. That’s the use case this registry will address. The registry will not hold binaries or documentation, which would introduce all sorts of additional complexity and overhead. It’s important to be pragmatic. Too much complexity or ambition will, I think, result in an empty registry that the community is unable or unwilling to contribute to, as we have previously seen to an extent with PRONOM and UDFR.

This is the very brief outline I created for the ANADP proposal, of what the registry will do:

Provide descriptions of tools, links to source code and executables, and links to experiences in using the tools (so others can learn where and when the tools could be best applied)
Be wiki-based, allowing anyone in the community to contribute to it and maintain it
Have tags for each tool, allowing different views onto the data to be tailored for organisations with different needs/foci
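
To make the outline a little more concrete, here is a minimal sketch of what a tagged tool entry and a tag-based view might look like. It is an illustration only and not the COPTR design: the `ToolEntry` fields, the `view_by_tag` helper, the tool names and the example.org links are all hypothetical, and a real wiki would handle this through its own page and category mechanisms.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToolEntry:
    """One registry record: a description plus links, never binaries."""
    name: str
    description: str
    source_url: str = ""  # link to source code or executables
    experience_urls: List[str] = field(default_factory=list)  # write-ups of use
    tags: Set[str] = field(default_factory=set)


def view_by_tag(entries: List[ToolEntry], tag: str) -> List[str]:
    """Return the names of all tools carrying the given tag, sorted by name."""
    return sorted(entry.name for entry in entries if tag in entry.tags)


# Hypothetical entries, invented purely for this example.
REGISTRY = [
    ToolEntry(
        name="example-msg-extractor",
        description="Pulls attachments out of .msg files.",
        source_url="https://example.org/msg-extractor",
        tags={"email", "extraction"},
    ),
    ToolEntry(
        name="example-format-identifier",
        description="Identifies the format of a file.",
        tags={"identification"},
    ),
    ToolEntry(
        name="example-attachment-lister",
        description="Lists the attachments in a mailbox.",
        tags={"email"},
    ),
]

if __name__ == "__main__":
    # An organisation interested only in email tools sees just those entries.
    print(view_by_tag(REGISTRY, "email"))
```

The point of the tags is that one shared set of entries can serve several audiences: each organisation picks the tags it cares about and gets its own view, rather than maintaining its own list.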

I should have a rough COPTR demonstrator ready very soon, which will make some of these points a little clearer and will hopefully act as a strawman for comment and feedback.

So how does that sound, and do you think that’s a sensible focus?

Cheers

Paul

Submitted by Paul Wheatley on 4 February 2013 – 2:00pm

I agree with Paul

Just to add, while I’d very much love PRONOM to be a complete all-in-one registry that suits everybody’s needs, it isn’t possible given current resource to push PRONOM down the route Jay suggests at this stage.

I would hope, however, that a Linked Data PRONOM would be a step in the right direction, so we could focus on what we do, other registry providers could focus on their interests, and together we could chain together something approaching a ‘complete’ registry.

David

Submitted by David Clipsham on 5 February 2013 – 3:50pm

Follow up comments

Hi Both,

Paul, thanks for the clarifications – it’s very clear that you see the ‘wiki’ / community contributions aspect of Coptr as being central to achieving this aim, and I really don’t disagree with that view.

It would be very cool to have a constrained data model at the centre of the concept – picking up on David’s comment about the chaining of things together, it would be hugely desirable to have a fixed record structure that could be linked to / exported out with a good degree of confidence in the long term consistency of the individual record structure.

It’s probably very early in the development to be considering the long term ‘locking down’ of data structures, but I think it would be a good thing to have registered as a long term intention of Coptr to help aid sustainable buy-in / use. Related, I would love to see more about the various user actors (people user classes & system based users) that you envisage as being the consumers of Coptr data, what views into the data you expect to be providing (e.g. html, xml, sparql) and what level of granularity you want to support (simple overview of each tool, full view of all tool data, classes of related tools, etc.).

I’m really looking forward to seeing the concept demonstrator, I think this will be a really useful resource once it’s up and running!

David, noted! Given Paul’s reply I can see how there are a number of features that he wants that PRONOM could not support.

That said, there is a risk of some overlap between Coptr and the PRONOM software list – is there a plan to tie these things together in any way, or does TNA see Coptr as being a distinctly different beast? (I can see, for example, that a boundary between the two registries is that PRONOM is describing software that is solely used to render / ‘perform on’ files, and Coptr is looking at software that functions on sets of files in a descriptive capacity, which is somewhat different to a file performance.)

I’m very interested in your comments about chaining registry entities together to form a ‘complete’ registry… in my view, this would certainly have a hugely positive impact on the sector.

Submitted by Jay Gattuso on 6 February 2013 – 8:39pm

The registry chain

Hi Jay,

I think it is entirely fair and accurate to suggest that The National Archives’ current focus for PRONOM is very much geared towards it being a file format identification/description registry.

It is true that PRONOM was originally built with the concept in mind of being an all-encompassing registry that would describe software tools in detail and cover similar ground to that which Paul anticipates COPTR will cover. However, over the years we have come to realise that this is simply too big a task for an organisation like ourselves.

Our focus has therefore shifted firmly towards the file format space, because this sits naturally with our own internal priorities and practice.

We welcome other organisations joining the registry space and very much hope that COPTR will grow to fill many of the gaps that we cannot and we are prepared to help in any way we can.

I believe the best-case scenario would be to have a handful of registries that focus or specialise on different aspects of digital preservation needs, that can cross-reference and offer natural links between each other.

To achieve this goal I think that Linked Data/Semantic Web technology would be the optimum paradigm, and I would certainly recommend that COPTR considers this approach.

I also think that having distinct, geographically and institutionally separated registries offers a greater degree of robustness than having a sole registry that ‘does everything’ and that everybody relies on.

I think some degree of overlap between registries is entirely inevitable, but I would say too much information is better than too little.

David

Submitted by David Clipsham on 7 February 2013 – 1:16pm

Simple registry->Community->Data->Fancy stuff. In that order.

David and Jay,

Thanks again for the thought-provoking comments. Lots of useful stuff to digest there! I’ll see if I can respond to some of the key points. (And David: I’m glad I managed to get the TNA position about right!)

I think, as Jay touched on, the use cases and purpose of COPTR are at the root of considerations on the approach and technology that the registry should use. The starting point for wanting to do something in this sphere came from seeing so much duplication in this field. Seeing precious DP resources spent on developing tools that are already out there. Two years ago we started running mashup events that brought together preservers and techies in order to solve concrete preservation problems, and realised that almost all the challenges we looked at could be solved using existing, open source tools. Open source tools written by people outside this community. Tools that were often not well known by many in this community. Where our mashup events had a real advantage was using the knowledge of 30 people in the room to find the right tool to solve the problem. Someone in the room would know about that obscure bit of open source for pulling attachments out of .msg files… The problem is, what do you do when you’re solving a DP problem and don’t have all that expertise to hand? How do you find the right tool? Give up and write it yourself? This happens…

As I’ve blogged about, loads of people have put together lists of useful preservation tools. The purpose of almost all of these registries and tool lists is to help users manually browse for tools to solve particular preservation problems. Not to do anything more advanced than that. The problem is, as I have described previously, these lists don’t do the job very well, because there are hundreds of them and none of them have very many tools listed in them.

So this is the problem area that COPTR is seeking to solve. It’s a very simple target. You might even say it requires no more than a crude solution. But given the background I’ve described, I believe there is a clear need for an effective solution in this area.

Moving on to the topic of machine readable data, or taking a complete linked data approach: I’m unclear what use cases these solutions would address right now. They don’t solve the use case I have outlined above. And this is why I don’t think they are important *now*, for COPTR. Later on, when we understand the other tool/format registry use cases better, I’m sure we’ll need to head in this direction.

So for example, one new use case might be to identify a file as a particular file type and then interrogate a registry for an appropriate rendering tool. I don’t know anything about this use case. I don’t know who needs it, and I don’t see any evidence of a pressing need to satisfy it now. The scope, focus and type of data held in a registry that solves this use case would all most likely be different to COPTR.

So this is where I see comments from both of you on an evolving registry ecosystem (as I think Bill called it) coming in. I liked this line from David: “I believe the best-case scenario would be to have a handful of registries that focus or specialise on different aspects of digital preservation needs, that can cross-reference and offer natural links between each other.” Yes, beautiful, but this is something we should gradually move towards, not design from the top down, now.

COPTR is starting off simply in order to tackle a simple use case, but it’s important to look ahead and make sure it can fit into a more complex future and be complementary with other registry data sources. But no more than that. I don’t want to start designing a specific data structure for COPTR to meet what I think users of another registry will want out of COPTR in 5 years’ time. This registry ecosystem does not yet exist. Jay, I very much agree with your comments on trying to maintain a consistent data structure. As you say, I definitely don’t want to “lock it down” now, as I’m hoping we will understand a lot more about what we want from the registry as we get people using it. And we’ll need to adapt along the way as we learn. But with a reasonably tight structure, and plenty of wiki output options, there are of course other opportunities to start exploring other uses of data in COPTR…

And this is another fundamental point about the approach. First we need users and data, then we can build funkier stuff at a later date. As Andy Jackson said recently on twitter (after I’d corrected his typos): “Build it and they will come = FAIL. Bring them together and they will build what they need = WIN.” The critical thing is to bring the community’s effort together in one place and first address the “hundreds of tool lists, none of them are any good” problem.

Jay, you touched on scope. The scope for COPTR is likely to be very wide: tools that fulfil any of the functions in whichever DP functional model (e.g. OAIS) or lifecycle model (e.g. DCC) takes your fancy.

Jay asked me two more specific questions that I’ve touched on but not answered clearly. Firstly, user actors. This is pretty straightforward: DP practitioners, data stewards and developers who are faced with a DP challenge and need a tool to meet that challenge. Google might turn up some tools, but it’s often hard to find the right search terms. Through tagged and categorised lists of tools, these users will be able to easily discover tools that might meet their needs and then also read about and share users’ experiences with those tools. That’s it. Nothing more complex. Nothing about view paths, automatically launching renderers or automated characterisation workflows or anything fancy like that.

This pretty much answers the second question on views onto the data. Views need to be optimised for human browsers, not for machines. So it’s the presentation of the data that helps human users find what they’re after. Again, I don’t see a use case or clear need for anything else *at the moment*.

By far the most significant challenge to progress is getting the community buy-in, and that’s what I want to focus on with COPTR.

I’m hoping this isn’t too controversial, and I hope I haven’t said anything that worries you guys unduly? If so, please tell me *:-)

Paul

Submitted by Paul Wheatley on 7 February 2013 – 2:29pm