Algorithms and signatures

PDF Eh? – Another Hackathon Tale

During the past couple of weeks, there have been some thoughtful and well-informed discussions about Fido, Droid, Pronom, and file format identification in the comment stream of this blog.  They make interesting reading.

In a recent comment, Shaun Zevin raises some points about the algorithmic complexity of the Droid and Fido pattern matching.

From the algorithmic perspective, DROID should be faster than FIDO.The Boyer-Moore-Horspool algorithm used by DROID has a linear O(N) complexity in average where N is the length of the examined string i.e. the identified file.

Perhaps my one dubious claim to insight in this matter is that I took a course with Boyer and Moore entitled ‘Recursion and Induction’ while a PhD student at the University of Texas at Austin.  In addition to being excellent and influential scientists, they were also superb teachers.

Sean’s observation is pretty close to my understanding, but the algorithmic claims are about the worst-case behaviour.  That is, the worst-case behaviour for the Boyer-Moore-Horspool (BMH) algorithm on a restricted subset of regular expressions is linear, whereas the worst-case behaviour for matching full regular expressions is exponential.

One way to think about the worst-case behaviour is to imagine a demon that comes up with the most evil possible pattern and the nastiest possible input.  In the case of the BMH pattern language and algorithm, the worst the demon can do is force the algorithm to take time that is linear in the size of the input.  In the case of full regular expressions, the demon can do much worse and make the algorithm take time exponential in the size of the input – and this can be a very long time!

In our case, however, we do not have a demon. We have the angelic Pronom, who restricts patterns to the restricted subset that the BMH algorithm can handle.  These patterns do not appear to elicit the worst-case behaviour from the regular expression matcher.

Furthermore, the regular expression implementation has been highly optimised over many years. In contrast, the core Droid matching code has barely changed since version 2. In the case of the Python implementation, for example, the regular expression matcher is written in C.  This means that the coefficient will be much lower for the regular expression module.  In addition, because the regular expression language can require expensive matching, the implementations do some special casing.  For all I know, it checks to see if the pattern is in the subset that BMH handles, and uses the BMH algorithm in that case!

Sean also points to what looks like a great alternate implementation of the regular expression matching engine that has some excellent performance characteristics.  It is fairly new – and I bet that if it is perceived as stable and efficient, it will be integrated into Python, re-implemented in Java, become widely available, and be used in Fido 2!

…version 38 of a DROID signature file has 44 variable position byte sequences which are killing DROID performance. ..

Sean notes that the Pronom signatures are not highly tuned and that the ‘variable’ signatures that may apply anywhere in the file kill performance.  This is absolutely the case.  As I previously highlighted, content is king here.  As a community, we need to put some real effort into improving and extending the signature set.

Another problem is that PRONOM signatures are clearly underoptimized. … ID 40 from DROID signature file. It has a fixed position byte sequence with 1024 maximum offset which is unique in PRONOM and is 45 bytes long. 3C21444F43545950452068746D6C205055424C494320222D2F2F5733432F2F445444205848544D4C20312E31

This appears to be the pattern for fmt/103, or XHTML.  The Fido regular expression that corresponds to the byte sequence is:

.{0,1024}\\<\\!DOCTYPE\\ html\\ PUBLIC\\ \\”\\-\\/\\/W3C\\/\\/DTD\\ XHTML\\ 1\\.1

Sean points out that there is another pattern for this same signature.  The Fido pattern for it is:

<html\\ xmlns\\=\\”http\\:\\/\\/www\\.w3\\.org\\/1999\\/xhtml\\”.*\\<title\\>.*\\<\\/title\\>

He suggests that the second pattern is not needed.  I must say that the two wild-cards – matching any number of bytes –before <title> and up to the following </title> – can make this pattern pretty inefficient in some cases.  I’ve not seen this one as a problem in practice, but the underlying point is spot-on.  We need to review the signatures and tune them.  Some others, such as the signature for fmt/134 are much more expensive.

It is also essential for us to gather a rich document corpus for regression testing and benchmarking.  The Planets project made a start in this direction, but there is much more work to be done.  The OPF has some excellent ideas here – and we’ll be hearing more about them soon.

25
reads

Leave a Reply

Join the conversation