Format Registry Challenge, Part Three

Somewhat later than planned, I present here my final installment on the Format Registry Challenge.

During this “experiment” I have spent some time becoming acquainted with the Git version control system, working with both Windows msysgit and the Eclipse Git plugin EGit. One result is that my prototype registry source code (and associated war archive) is available on GitHub:

https://github.com/rcking/formatregistry

I have included instructions for a quick local deployment of the war archive here and I would be interested in any feedback and/or bug reports.

During the past few weeks, I was not able to work intensively on the code, but I did find time to improve the “look and feel” of the application through stylesheets and tabs. I also debugged the format editor, although I did not include input validation.

Most importantly, I added some import/export features. The application can now add or update records by uploading a Fido “formats.xml” conforming to the Fido 0.93 XSD. If the record defines a new format (that is, it has a previously unknown Pronom ID), a new format record is created. The question I had to answer here was: what should happen if the format already exists? I decided to introduce a new property for Fido signatures called “Prioritize”. Newly imported signatures are assumed to have priority over the old ones, although no signature is erased. No checks are made regarding duplication, so if you import the same formats.xml twice, you will have duplicate signatures. However, only one set of signatures will have “Prioritize” set to true after import.

This is significant for the export functionality of the application. If I choose to download the contents of the registry as Fido formats, only those signatures with “Prioritize=true” will be included in the export.
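To make the import and export rules described above concrete, here is a minimal sketch in Python. It is not the code from my Java prototype: the class names, the `prioritize` field, the in-memory dictionary and the example PUID and pattern are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    pattern: str             # regular expression of the kind Fido uses
    prioritize: bool = True  # hypothetical field mirroring "Prioritize"


@dataclass
class FormatRecord:
    puid: str
    name: str
    signatures: list = field(default_factory=list)


def import_format(registry, puid, name, patterns):
    """Add or update one format record in the registry dictionary."""
    record = registry.get(puid)
    if record is None:
        # Previously unknown Pronom ID: create a new format record.
        record = FormatRecord(puid, name)
        registry[puid] = record
    else:
        # Known format: keep the old signatures, but demote them.
        for sig in record.signatures:
            sig.prioritize = False
    # No duplicate check, so importing the same data twice doubles the list.
    record.signatures.extend(Signature(p) for p in patterns)


def export_fido(registry):
    """Return, per PUID, only the patterns of prioritized signatures."""
    return {
        puid: [s.pattern for s in rec.signatures if s.prioritize]
        for puid, rec in registry.items()
    }


registry = {}
# Importing the same (made-up) format twice, as in the duplicate case.
import_format(registry, "o-fmt/1", "Example format", [r"\Aexample"])
import_format(registry, "o-fmt/1", "Example format", [r"\Aexample"])
print(len(registry["o-fmt/1"].signatures))  # both copies are stored
print(export_fido(registry))                # only the newer copy is exported
```

The point of the design is that an import never destroys information: older signatures remain in the record and could be re-prioritized by hand in the editor.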

I also experimented with an export to Pronom formats. As I documented in my previous posts, I extended the original PRONOMReport class to include Fido signatures, and to make use of Java Enumerations for certain attributes. I now provide a simple XSLT transformation on all registry XMLs that strips away these changes, leaving Pronom-compliant XML. It is not clear to me, however, whether such files could be used to update the central Pronom Registry. Given the number of first-class objects that appear to be in that registry, implied by the number of unique identifiers in the serialization, I assume that this kind of import will not work. I would however argue that this kind of compatibility between any two registries would be a desirable feature, in particular if it can be obtained so inexpensively.
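The idea of stripping the extensions can be illustrated in a few lines. Note that this sketch uses Python's ElementTree rather than XSLT, and that the element name `FidoSignature` and the sample document are assumptions, not the actual names in my schema. It also ignores XML namespaces and does not reverse the enumeration changes.

```python
import xml.etree.ElementTree as ET

# Hypothetical name of the extension element; the real schema may differ.
EXTENSION_TAGS = {"FidoSignature"}


def strip_extensions(xml_text):
    """Return the XML text with all extension elements removed."""
    root = ET.fromstring(xml_text)
    # Copy the element lists first, because the tree is modified in the loop.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in EXTENSION_TAGS:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<FileFormat><FormatName>Example</FormatName>"
    "<FidoSignature><Pattern>abc</Pattern></FidoSignature></FileFormat>"
)
print(strip_extensions(sample))
# <FileFormat><FormatName>Example</FormatName></FileFormat>
```

With namespaced documents, the tags in `EXTENSION_TAGS` would have to be written in ElementTree's `{namespace-uri}LocalName` form.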

Finally, I have noticed that Git provides an interesting model for how multiple registry applications could work together. Namely, I can clone a central format XML repository to my local machine and configure my Format Registry application to use this directory as its storage backend. Now I can search and edit the latest formats (and generate the Fido signatures as well). I can add to and edit these to my heart's content, and commit the changes to my local Git repository. When I feel I have made significant improvements to the corpus (or even just one format), I can push the content back to the central repository. I have created just such a repository here:

https://github.com/rcking/formatxmlrepository

Of course the difficulty in this approach is in defining an agreed-upon schema for the format descriptors, one that could be used by many different applications.

This also leads to the additional problem of unique format IDs, or Pronom Unique Identifiers (PUIDs). In order to be useful identification labels, and to support interoperability between applications, PUIDs should remain unique world-wide. For my registry application, I invented a new “PUID namespace”: o-fmt/, where “o” stands for “open”. But this does not get around the fact that we would need a central body to assign or distribute unique identifiers.

I note that applications using the signature files have a choice: they could, for example, consider the “o-” formats to be too unreliable for their field (or the “x-” formats, for that matter). But I have a feeling that an application like Fido that takes all format types into consideration (and allows a much broader audience to generate regular-expression signatures) will quickly take over the market. I also believe that we will find that the community contributions, while not being 100% accurate, will nevertheless pay for themselves many times over through the resulting expanded coverage.
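The choice an application has could look something like the following sketch. The function names and the sample identifiers are assumptions for illustration; the only thing taken from the text above is that the namespace is the part of the PUID before the slash.

```python
def puid_namespace(puid):
    """Return the part of a PUID before the slash, e.g. 'o-fmt'."""
    namespace, sep, identifier = puid.partition("/")
    if not sep or not namespace or not identifier:
        raise ValueError(f"not a PUID: {puid!r}")
    return namespace


def trusted_only(puids, trusted=("fmt",)):
    """Keep only PUIDs whose namespace the application chooses to trust."""
    return [p for p in puids if puid_namespace(p) in trusted]


# Illustrative identifiers only.
candidates = ["fmt/18", "x-fmt/111", "o-fmt/1"]
print(trusted_only(candidates))
# ['fmt/18']
print(trusted_only(candidates, trusted=("fmt", "x-fmt", "o-fmt")))
# ['fmt/18', 'x-fmt/111', 'o-fmt/1']
```

A conservative tool would keep the default, while a tool aiming for broad coverage would pass all three namespaces.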

I will conclude this experiment by noting that Andy Jackson’s approach, using a modified Drupal front-end, is clearly superior. Not only did he avoid costly development time for the format editing pages, but he also gets user management, authorization and faceted search for free. I just hope he was able to glean some useful approaches or ideas from my posts. For me, the experiment was highly educational.
