As a part of the SCAPE project, I’m currently heavily involved in the evaluation of various file format identification tools. The overall aim of this work is to determine which tools are suitable candidates for inclusion in the SCAPE architecture. In addition, we’re also trying to get a better idea of each tool’s specific strengths and weaknesses, which will hopefully serve as useful input to the developers community. We’re actually planning to publish the first results of this work on the OPF blog some time soon, so you may want to keep your eyes peeled for that.
Identification using byte signatures
In this blog entry I will focus on one particular area in which most identification tools appear to be struggling: the identification of XML files. Most identification tools try to establish a file’s format by looking for characteristic byte sequences, or ‘signatures’. Examples of tools that use this approach are DROID, Fido and the Unix File tool. Signature-based identification works well for most binary formats, but for text-based formats the results are often less reliable. This also applies to XML. Signature-based tools typically identify XML by the presence of an XML declaration, which, in its simplest form, looks like this:
<?xml version="1.0"?>
The problem is that not all XML files actually contain an XML declaration, because its use is not mandatory. The XML specification merely states that XML documents should begin with the declaration; the word “should” (rather than “must”) means that it is recommended, not required. As a result, a file that omits the declaration can still be perfectly well-formed XML.
However, if (part of) the declaration is used as a signature, this means that any files that don’t have the declaration will not be identified as XML by any of the above tools. This is exactly what happened in our tests for DROID, Fido and the Unix File tool. DROID and Fido simply leave such files unidentified, whereas the Unix File tool identifies them as ‘plain text’ (which, of course, is correct at a lower level, but not very helpful). Unfortunately, such files are pretty common in practice.
Using an XML parser to identify XML
A different approach to identify these files would be to run them through an XML parser. If a parser can make sense of a file’s contents this means it is well-formed (but not necessarily valid!) XML. In all other cases, it’s something else.
I ended up writing some Python code to see how this would work in practice. I first created two re-usable Python functions that check any given file for well-formedness using Python’s highly performant ‘expat’ parser (based on original code by Farhad Fouladi). I then wrote a simple command-line application around it, which is called “isXMLDemo.py”. The demo can be used to analyse one file at a time, or, alternatively, all files in a directory tree. The output is a formatted text file that contains, for each analysed file, the identification result (which is either “isXML” or “noXML”).
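For illustration, here is a minimal sketch of what such a well-formedness check looks like using Python's built-in expat bindings. The function name and details are my own; this is not the actual code from “isXMLDemo.py”:

```python
import xml.parsers.expat

def is_well_formed_xml(data):
    """Return True if 'data' (bytes) parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the final chunk of input
        return True
    except xml.parsers.expat.ExpatError:
        return False

# A well-formed document *without* an XML declaration still passes:
print(is_well_formed_xml(b"<root><child/></root>"))  # True
# A malformed tag fails:
print(is_well_formed_xml(b"<root><child></root>"))   # False
```

Note that this only checks syntax: no schemas or DTDs are fetched, which is exactly why the check stays fast.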
I was surprised at how fast the XML parsing actually is. To give an indication, I used “isXMLDemo.py” to analyse a 1.15 GB dataset that contains 11,892 file objects. I ran this experiment under Microsoft Windows XP Professional on a PC with a 3 GHz Intel processor and 1 GB RAM. The total time needed to analyse all files was about 90 seconds, which corresponds to an average throughput of about 131 files per second. The test dataset contains a large number of metadata files in XML format that do not contain an XML declaration. As a result, neither DROID 6, Fido nor the Unix File tool is able to correctly identify these files. With my script all of these files were correctly identified, except for one; upon closer inspection, this file turned out to contain a malformed XML tag.
XML parsing in Fido?
Since the core functions that do the actual XML parsing are completely reusable, it would probably be fairly easy to incorporate this kind of identification into Fido. This would obviously have some impact on Fido’s performance, but not by very much. XML parsing could also be offered as an option. In that case, the decision on whether to parse or not to parse is up to the user.
An obvious limitation of this approach is that it will not identify XML that is not well-formed. Also, it makes the line between identification and validation somewhat blurry, but in practical terms that shouldn’t be a real problem. Finally, one could argue that knowing that a file contains XML is not very informative at all, since it is merely a container for something else. This was the subject of an earlier blog post by Asger Blekinge. However, even then, identifying the container is a necessary first step, and one that the current tools don’t seem to be too good at yet.
Demo
For those who want to do some tests for themselves, I have attached the demo script to this post. The ZIP file contains the Python script with its documentation in PDF format. If you end up with any interesting results, or if you have any other thoughts on this: please report back in the comments!
Johan van der Knijff
KB / National Library of the Netherlands
Not just about containers…
I do like this approach. In particular, I think actually running a proper parser is probably the only way you could identify CSV files with any kind of confidence.
Parsing and validating probably the best way
Parsing and validating XML is probably the best way. Searching for an identifying string alone (e.g. <?xml) would not be sufficient. Three examples why:
1. Imagine a text document describing how to create an XML document, with an example in the text itself. It would be falsely recognized as XML.
2. Looking for the above-mentioned identifying string at the top of an XML file using a strict regular expression that explicitly starts looking from the first byte would fail if there was a comment above it (e.g. <!-- this is an XML file -->).
3. What if the file contains a Byte Order Mark, which is mandatory for UTF-16 and UTF-32 XML (and optional for UTF-8), and you explicitly start searching at byte 1? Again: your pretty signature would fail! Of course you could enhance your expression to take BOMs into account, but it still fails as per example 2.
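The BOM case is easy to demonstrate. In this sketch (my own example, not taken from any of the tools discussed), a naive signature anchored at byte 0 misses a UTF-8 file that starts with a Byte Order Mark, while an XML parser handles it without trouble:

```python
import re
import xml.parsers.expat

data = b'\xef\xbb\xbf<?xml version="1.0"?><root/>'  # UTF-8 BOM first

# A strict signature anchored at the first byte misses the file...
print(re.match(rb"<\?xml", data) is not None)  # False

# ...while expat simply skips the BOM and parses without error:
parser = xml.parsers.expat.ParserCreate()
parser.Parse(data, True)
```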
Validation not needed, check for well formedness is enough
You’re right; your first example is actually something I had thought of myself as well. I suppose you could get around the situations described under 2 and 3 by only looking at an (arbitrary) number of leading bytes, but that still wouldn’t work for all possible cases.
One thing though: you explicitly mention validating. In my original blog post I deliberately avoided the word validation and limited things to a well-formedness check. The reason for this is simple: to truly validate an XML file you also need to check conformance to any schemas and DTDs that are referred to, which means that these external references all need to be downloaded. This will instantly make things slow (besides, XML that is not valid is still XML).
For identification a simple check for well-formedness is enough, and there’s no need to go beyond this.
Re: well-formedness check
I agree that I might have been too harsh using the word “validating”.
You are right: a well-formedness check should be enough if the parser decides it is an XML file.
Even well-formed is too strict for me!
I’d personally be happy to drop the well-formedness constraint and use a forgiving parser like Tidy. If a file is only one open tag away from being valid XML, I’d like to get ‘XML plus one error’ rather than ‘plain text’.
Xml or Not is not interesting
Hi
You might be familiar with my somewhat older post
http://www.openplanetsfoundation.org/blogs/2011-02-17-new-direction-file-characterisation
The basic problem and idea is that XML is so widely used that being told a file is XML is akin to being told that it is text. If we went one step further and could identify the kind of XML from the namespaces, we would be able to tell the user a whole lot more.
Since your python script is parsing the file anyhow, I do not think it would be a great difficulty to have it report on the contents of the file. The root element namespace, if any, should be easy to extract.
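To show how little extra work this would take, here is a sketch (my own code, not part of the demo script) that pulls the root element's namespace out with Python's standard ElementTree module, which stores a namespaced tag as '{uri}localname':

```python
import xml.etree.ElementTree as ET

def root_namespace(data):
    """Return the namespace URI of the root element, or None."""
    root = ET.fromstring(data)
    # ElementTree encodes a namespaced tag as '{uri}localname'
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None

print(root_namespace(b'<mets xmlns="http://www.loc.gov/METS/"/>'))
# http://www.loc.gov/METS/
print(root_namespace(b"<root/>"))  # None
```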
You’re right, but slightly different scope
Hi Asger,
You’re right, and I was actually referring to that (and your earlier blog post on this) in the final section.
However, this was all somewhat beyond the scope of my post, as the main issue is simply that the most commonly used signature-based tools (DROID, Fido, Unix File, etc.) aren't even able to tell you the container format (i.e. XML) in many cases. This might seem completely trivial, but knowing the container format is still a prerequisite for any more sophisticated characterisation, such as extracting namespace info.
The reason for the script is merely to illustrate that this is something that, should there be a need for this, could be added to Fido without much difficulty and with minimal loss of performance. I also completely agree with you that extending the code to extract namespace info should be pretty straightforward as well. I deliberately didn’t go into this in any detail because this was already covered by your earlier blog post.