File format registry report released

A couple of months ago I reported on this blog that the OPF was beginning a project to investigate options for a new approach to file format registries. We’ve just released the second report of this activity: “A New Registry for Digital Preservation: Conceptual Overview” (330kB PDF). It explains our vision of a ‘registry ecosystem’ that will enable organisations worldwide to contribute and share information about file formats, while maintaining the ability to make independent and local decisions about preservation policies.

Please take a look and tell us what you think – leave a comment here or send me an email ([email protected]).

The document explains the reasons why we think a new approach to file format registries is required and outlines the main requirements for such a registry.  Our plan for the rest of 2011 is to start putting some of this into action: see the ‘Planning’ section of the report for details.

Most importantly, we’d like to ask for your input and contributions to this process:

  • please review the document and give us your feedback to help us refine our plans

  • consider whether the registry ecosystem we propose would be a good match to the needs of your institution

  • let us know if you would like to participate in making this idea a reality and what you could contribute.

We’re hoping for the usual insightful and thought-provoking comments from the readers of the OPF blog!

    3 comments

    Dirk von Suchodoletz wrote 26 weeks 3 days ago

    Practical Issues with Currently Available File Format Software

    To get a better and more thorough impression of the activities and challenges a national archive faces, I undertook a few weeks of research with Archives New Zealand, the National Archives of New Zealand. The original idea was to dig into emulation and emulation workflows, but pretty quickly another topic emerged – the file format/rendering software detection problem. Thus it is great to see some activities to solve this problem!

    At the beginning of my research trip we conducted a small survey of the digital files Archives New Zealand holds that had already been copied from old media, in order to get a number of suitable digital objects with which to test emulated original environments. It produced a few hundred objects, which were run through the DROID detector (run on Windows XP). Some files were (re)checked with the Linux file command. Working on that primary set of example files, we discovered a number of issues. For the older files in particular, PRONOM/DROID and Linux file fail pretty badly. This is particularly important for the older files, as there is often less information or metadata on those files than on more recent ones. The detection tools do not come up with reasonable results for older WordStar, WordPerfect, MS-Word, … and files we found from old databases.

    A very interesting case was a set of files from the late 1980s. They had names like CAT23BB.DAT. It was not possible to get any meaningful results from PRONOM and Linux’s file. They were originally sitting on old DEC-structured 5.25″ disks and could not be read with a standard PC floppy drive (but fortunately someone had already copied them to a shared storage device, otherwise we would not have been aware of them in the first round). Only a dubious sheet with some short notes on it (within the floppy disk box) gave a hint about the data format: it had a scrawled note on top of other text that could just be read as saying “dataflex database”. Of course this application is not available here and was not archived with the files. There are ODBC importers available for DataFlex, but the set of files that we had was missing some important structural files that were required to open the database itself. This case of digital archaeology is still to be solved; we are attempting to acquire a copy of DataFlex via eBay to see if this will help.

    In the next round, files from one of the Crown Research Institutes of New Zealand were cursorily reviewed. Besides offering a collection of document files similar to that in the archive, it contained some more peculiar files, such as programs written in Turbo Pascal and Perl that formed part of research projects and theses.

    In a further investigation the holdings of the archive were checked for material which had not yet been copied off the original media. This turned up material on 3.5″, 5.25″ and 8″ floppy disks and ZIP disks. The content has not yet been evaluated (and this is not easily possible for the old 8″ disks – by the way, does any institution hold such a device?).

    peterVG wrote 23 weeks 1 day ago

    In sync with Archivematica project design/requirements

    Having participated in some of the informal discussions about this at iPres2010 I am very happy to see how this report has turned out. Nicely done. The conceptual overview is very much in sync with the Archivematica project’s requirements and design for interacting with external registries. In particular:

    • recognition (for whatever reason – good or bad) that there will likely be multiple online registries for different functions and possibly duplicate registries performing the same function. Archivematica will require interaction with three types of registries: file format identification, file format policies, file format risk assessment. 
    • separation of factual information from institutional/project policy
    • technical: registry API, peer-to-peer architecture, local caching, linked data

    In the Archivematica project we are especially interested in the ability to share our default project format policies (http://archivematica.org/preservation) and any institutional policy customizations made by Archivematica users together with policies from the digital curation community at large. It is difficult without a lot of cumbersome research to get a decent sense of community consensus/trends in relation to ‘best practice’ preservation and access formats. A community-based file format policy registry should make this easier.

    The primary registry requirements on our 2011 development roadmap are:

    • publish default Archivematica file format policies in a structured form to an online registry. This needs to include (1) source format identifier(s) (2) preservation file format to which the source format will be normalized (3) access file format (4) transcoding tool identifier (5) transcoding tool command & parameters
    • publish any customizations of the default Archivematica policies made by local users to the same registry
    • pull any default Archivematica format policy registry changes/updates into Archivematica installations where users can make a decision to accept or reject format policy changes/updates
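    As a rough illustration of what “a structured form” for the five items above might look like, here is a minimal Python sketch. The class name, field names, the origin field and all example values are assumptions made for illustration only; they are not the Archivematica data model or any registry’s actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class FormatPolicy:
    """One format policy record (hypothetical layout, not a real schema)."""
    source_format_ids: List[str]   # (1) identifier(s) of the source format
    preservation_format: str       # (2) normalization target for preservation
    access_format: str             # (3) format used for access copies
    tool_id: str                   # (4) identifier of the transcoding tool
    tool_command: str              # (5) tool command and parameters
    origin: str = "default"        # assumed marker: default vs. local customization


def to_json(policy: FormatPolicy) -> str:
    """Serialize a policy so it could be published to a registry."""
    return json.dumps(asdict(policy), sort_keys=True, indent=2)


def from_json(text: str) -> FormatPolicy:
    """Rebuild a policy from the JSON produced by to_json."""
    return FormatPolicy(**json.loads(text))


# Placeholder values only; none of these identifiers refer to real registry entries.
policy = FormatPolicy(
    source_format_ids=["example:source-format"],
    preservation_format="example:preservation-format",
    access_format="example:access-format",
    tool_id="example:transcoder",
    tool_command="transcoder --in {input} --out {output}",
)

round_tripped = from_json(to_json(policy))
```

    A local customization could then be published as the same kind of record with a different origin value, which would let an installation compare it with the default before accepting or rejecting an update.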

    We’d much rather implement these requirements under a wider community banner (e.g. OPF) than maintain our own registry.

    Given the OPF 2011 plans for further work on the registry we might be able to collaborate (at least give further feedback) on the registry data model work you are planning as well as to test read/write of your API via Archivematica.

    Cheers,

    –peter

    Peter Van Garderen. http://archivematica.org project manager.

    P.S. For our initial stages of adding registry interaction to Archivematica, format policy changes/updates are expected to arise from manual analysis and decisions (e.g. availability of new open-source tool that allows us to add a previously unsupported preservation format, changes in community ‘consensus’ about the risk/sustainability of preservation format x). In a second phase we would focus on integration with format risk assessment registries that allow for more comprehensive and systematic risk alerts and/or recommended format policy changes.

    Barbara Sierman wrote 22 weeks 1 day ago

    KB-NL feedback on OPF registry

    OPF/NA wrote a proposal for a new Registry for Digital Preservation and asked for feedback. The KB-NL gained a lot of experience with this material while setting up the development and implementation of the Preservation Manager, developed by IBM (which is currently not in use by the KB as it is not compatible with our e-Depot DIAS version). The following remarks are mainly based on this experience, and we thought that by sharing them we might add some interesting points to the discussion.




    1.  One of the things we discovered was that we lacked at that time the technical skills to describe, for example, the technical configuration (hardware, software) necessary to render a certain file format. It seems that OPF is solving this by combining information from various sources. However, it will still be important that every organisation has staff who are able to make the right decisions and can check the information in the registries. A system for qualifying the information in the registry might be a welcome addition to the OPF proposal.


    2.  At the KB we found it very difficult to decide the level of granularity of the description: for example, do you describe the whole technical environment with all the additions like mouse, video card etc., or do you start at a certain level on top of a basic environment? This point is not discussed in the OPF proposal, but it will need a solution. This solution will also have consequences for the assignment of unique identifiers to formats, software, hardware etc.


    3.  In the KB-Preservation Manager it was possible to give information per file format, but we found it a shortcoming that it was not possible to describe a “combined” environment necessary for rendering complex objects (combinations of several file formats, each with different hardware/software requirements). How will the OPF proposal handle this?


    4.  We have been thinking about testing the software/hardware environment necessary to render a certain file format in a controlled environment, to test whether the description was “complete”, but we concluded that this was not an easy task, even though we used a controlled environment named the Reference Work Station (a standard configuration).


    5.  We concluded that sometimes the description of the environment needed to render a certain file format is not enough. Sometimes it is not only a specific object with its file format, but the collection of which the object is part, that has characteristics requiring extra or specific software or hardware. An example is a collection that has sound in it, but that requires a very high quality soundcard and speakers to faithfully render that sound. High-level information about the rendering of a file format might just not be enough in this case.


    6.  If we want the connected registries to work properly, then we cannot avoid making a set of rules on how to behave, for example never to delete information on a certain file format; otherwise the registries themselves will not be trustworthy.


    7.  As far as I know, at the moment there are not that many registries with the kind of information we need. But there are many people in organisations doing research on different aspects of many file formats, mainly for their own organisation. Could this proposal not be combined with a proposal for how to get these researchers to collaborate, for example by making a list of top priority file formats to tackle? The KB did a lot of work on the JPEG2000 format; based on that research even the ISO standard will be adapted and errors will be corrected (Johan van der Knijff will soon publish an article on this).


    8.  Related to this is the following: in my opinion you will need information in the registry that enables you to make decisions (straightforward information such as whether the format allows the use of passwords), but on the other hand you might need more descriptive information which explains certain features of a format. I missed this distinction.


    9.  The proposal is not so clear about the facilities an organisation needs to have in place before being able to start with the OPF Registry. Will there be an opportunity to install a local instance of the registry?


    10.  One of the aspects that is not discussed, and that was very heavily discussed in UDFR, is the governance of the information. In my opinion, if this registry does not have a certain set of rules and instructions about how to create and maintain information, it is doomed to become a registry one might consult, but without the status of a professional source of information. As I said, these kinds of registries also need to be trustworthy.