MUPPET: MUlti Pass file Properties Extraction Tool

This tool needs some explanation of how it came about. At Nationaal Archief we were faced with various bottlenecks at ingest for our digital repository (which we call e-Depot). Characterization was one of them, and when the OPF released the first prototype of FIDO we happily jumped on board for its development. Seeing the potential for significant speed increases, Nationaal Archief put in a substantial amount of work, freeing me for development of FIDO, which led to a wrapped Java version. At that point we were faced with the question of whether we would replace DROID with FIDO in our e-Depot, and we paused a moment, for various reasons.

The first was that we identified Java wrappers as a bottleneck in themselves, which made us favour a command-line approach. Secondly, at that time Johan van der Knijff of the KB was doing his excellent comparison report on DROID, FIDO, the Unix FILE utility, FITS and JHOVE2, giving us more insight into the matter.

Thirdly, by that time Dave Tarrant (University of Southampton) had released OPF REF (Open Planets Foundation Result Evaluation Framework), a PHP interface for hooking up characterization command-line tools in order to compare their results. After that we started experimenting with a commercial tool called File Investigator from Forensic Innovations, which looks very promising for future deployment in our e-Depot.

At some point Maurice van den Dobbelsteen proposed to take the best of the above and do characterization as a multi-pass process: start with FIDO to profit from its amazing speed, then run DROID on what FIDO didn’t tackle, and then invoke other tools for more granularity, all in one workflow. After refining the idea we worked out the following concept of MUPPET (Multi Pass File Properties Extraction Tool).
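The multi-pass idea described above can be sketched as a small driver loop: each pass either identifies a file or gives up, and only the leftovers flow to the next (slower, more thorough) pass. The pass implementations below are placeholders; in MUPPET these would wrap real tools such as FIDO and DROID.

```python
def multi_pass_identify(files, passes):
    """Run identification passes in order; stop per-file at first hit."""
    results = {f: None for f in files}
    remaining = list(files)
    for name, identify in passes:
        still_unknown = []
        for f in remaining:
            fmt = identify(f)
            if fmt is not None:
                results[f] = (name, fmt)   # record which pass succeeded
            else:
                still_unknown.append(f)
        remaining = still_unknown          # only leftovers go to the next pass
    return results

# Stub passes standing in for real tools (illustrative only:
# a FIDO-like fast pass, then a DROID-like slower pass).
def fast_pass(path):
    return "fmt/18" if path.endswith(".pdf") else None

def slow_pass(path):
    return "fmt/111" if path.endswith(".bin") else None

results = multi_pass_identify(
    ["a.pdf", "b.bin", "c.xyz"],
    [("fido", fast_pass), ("droid", slow_pass)],
)
```

Files no pass can identify stay marked as unidentified, so they can be routed to a manual queue.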

You will find an overview of MUPPET and proposed screen shots in this proposal document. We estimate the creation of a prototype of the API and an initial GUI will take approximately 80 hours, depending on individual vs. team effort.

We are happy that this relates to the “What do we mean by format?” discussion here on the OPF blogs, where Rob Zirnstein commented on a layered approach to the identification process. This is exactly the idea that MUPPET will operationalise. Feedback welcome!

Attachment: MUPPET concept pitch v1 (49.64 KB)

4 comments

gmcgath wrote 3 weeks 12 hours ago

MUPPET and FITS

MUPPET seems to fill the same niche as FITS. What were the reasons for going with a new proposal rather than enhancing FITS?

mauricederooij wrote 2 weeks 3 days ago

Re: FITS

Hi Gary,

First of all, FITS is a good tool, but it simply does not fit all our needs at Nationaal Archief.

A goal of MUPPET is making life easier for “content owners”. As we wrote in the MUPPET concept, when they want to perform a multi-pass analysis with different tools they need to invoke the tools separately and combine the results by hand. Most of the time this is not an easy task and requires expert technical knowledge. FITS is a good step towards this goal, but it is still a command-line tool.

Another need is the need for speed. We have found that wrapping mechanisms that require a lot of file handling in Java tend to be slow. With MUPPET we want, as far as possible, to use the operating system for file handling and process management: a very simple scripting approach that invokes the tools just as one would in a command-line environment, collects their output, presents it nicely to the user, and saves it so that it can also be used in our other systems.
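The “let the OS do the work” approach above amounts to treating each tool as a plain command line and capturing what it prints. A minimal sketch, using a portable stand-in command (a real setup would invoke e.g. FIDO or DROID here):

```python
import subprocess
import sys

def run_tool(name, cmd):
    """Invoke one tool as an ordinary subprocess and collect its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "tool": name,                      # which tool produced this record
        "returncode": proc.returncode,     # non-zero signals a tool failure
        "output": proc.stdout.strip(),     # raw output, saved for reuse
    }

# Illustration only: a stand-in command that just prints a MIME type.
result = run_tool("demo", [sys.executable, "-c", "print('application/pdf')"])
```

Because each record keeps the tool name alongside its raw output, the results can be stored once and reused by other systems without re-running the tools.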

Another obstacle for us is the amount of work needed to add, disable and update tools. If one is not highly involved in a project such as FITS, it is hard to update the tools or signature files. We also experienced this in our e-Depot, where adding new tools and updating tools and signature files requires code changes. With MUPPET we want to tackle this by providing an API through which one can enable, disable or replace tools “on the fly”. In fact, adding a tool would only require copying a script and changing some parameters, such as the invocation and how the output is formatted, and you are good to go.

The output of FITS also does not fit our needs. We need a way to save the results in order to generate “pretty” reports. FITS’s approach of one output file per analysed file needs extensive reworking afterwards to generate pie charts (for our managers) and the like.
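The kind of aggregation needed for such reports is simple once the per-file results are collected in one place: count how many files fall into each format and feed those counts to a chart. A minimal sketch, with illustrative records:

```python
from collections import Counter

# Hypothetical per-file results, as a multi-pass run might produce them.
records = [
    {"file": "a.pdf", "format": "PDF 1.4"},
    {"file": "b.pdf", "format": "PDF 1.4"},
    {"file": "c.doc", "format": "MS Word"},
]

def format_counts(records):
    """Count how many files were identified as each format."""
    return Counter(r["format"] for r in records)

summary = format_counts(records)
```

A `Counter` like this maps directly onto pie-chart slices, with no per-file reworking.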

Furthermore I have some technical questions/suggestions which I will send to you personally.

gmcgath wrote 2 weeks 3 days ago

Re: FITS

Thanks for the explanation. I’m the wrong person to ask most technical questions about FITS, since I wrote only some of the metadata schema code, but I’ll be glad to pass them along.

andy jackson wrote 1 week 3 days ago

Similar ideas and tools…

We’ve also identified a similar need for our curators. However, in our case, it’s less important to be able to cross-compare the output from different tools (which is not something I’m sure we really need a GUI app for); instead, we want to combine the output from tools in the best way the developers currently know how, in order to present a summary to the users which identifies basic problems and risks (e.g. ‘content doesn’t match extension’, ‘file is password-protected’, etc.). This nebulous plan has the nickname ‘RAT’, for Risk Audit Tool.
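The kind of risk rules such a summary might apply can be sketched as simple checks over a per-file record; the rule set and field names below are hypothetical, using the two example risks from the text:

```python
def audit(record):
    """Return human-readable risk flags for one file's record."""
    risks = []
    # Flag files whose identified content disagrees with their extension.
    if record.get("detected_ext") and record["detected_ext"] != record["ext"]:
        risks.append("content doesn't match extension")
    # Flag encrypted files, which curators may not be able to open later.
    if record.get("encrypted"):
        risks.append("file is password-protected")
    return risks

risks = audit({"ext": "doc", "detected_ext": "pdf", "encrypted": True})
```

Each rule only makes a statement the underlying tools can actually support, in line with the “only say what we are fairly sure about” caveat below.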

Here are some links to some relevant work in this area:

  • See NARA’s File Analyser, which has a GUI and leverages the excellent Apache Tika to do the characterisation.
  • See David Tarrant’s OPF File Scanner (opffs), which does not have a GUI but creates a summary spreadsheet. It also leverages Debian packages to help manage plugins.
  • This EAP AQuA issue and related issues.

So, to return to the proposal, I’m not sure it is sensible to group Content Owners and Engineers together as a single audience. I would argue these audiences need separate tools:

  • Engineers can run multiple tools and compare the outputs using scripts; they don’t need a GUI, and we need to focus their effort on helping them make the existing tools better.
  • Content Owners don’t want to be forced to reconcile the outputs from the different tools, but need a simple GUI that builds an accessible summary of their content, where the Engineers have selected the best results from a range of tools (taking care only to make statements about the parts we are fairly sure about!). They may want to specify ‘deep’ versus ‘shallow’ scanning, but that’s about it.

Also, note that the hard part is probably normalising the outputs from the different tools and combining/reconciling their results. Both FITS and JHOVE2 (and Tika and NZME, actually) have ways of making new modules and combining the results, and I would seek to re-use existing code where possible. Note that most of the magic FITS does is in XSLT files that post-process the output of the tools; it’s just that the way those outputs are bound means adding or upgrading tools is awkward.

One of the ideas behind OPFFS is to leverage Debian’s package management to make tool management easier. David has set up an initial DROID package, so that will upgrade itself, and tools like FITS could be designed to rely on that package rather than doing the packaging themselves. This would make life much easier! Similarly, setting up a package for Fido would be pretty easy, and then that could be added as an option.

All of which suggests that one way forward would be to take the FITS modules and turn them into OPFFS modules, moving to Debian package management instead of manual package management. This would rather depend on whether the audience uses Windows or not, but frankly, installing the odd Debian VM is probably going to work out a lot cheaper than implementing our own package management system!


