When is a PDF not a PDF? Format identification in focus.

In this post I’ll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I’ll restrict myself to a single issue and be as light on technical details as possible. I hope I’ll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.

Assumptions

I’m considering format identification in its simplest role as first contact with a file that little, if anything, is known about. In these circumstances the aim is to identify the format as quickly and accurately as possible then pass the file to format specific tools for deeper analysis.

I’ll also restrict the approach to magic number identification rather than trust the file extension, more on this a little later.

Software and data

I performed the tests using the selected govdocs corpora (that’s a large download BTW) that I mentioned in my last post. I chose four format identification tools to test:

the fine free file utility (also known simply as file),
DROID,
FIDO, and
Apache Tika.

I used the most up-to-date versions I could, but will spare the details until I publish the results in full.

So is this a PDF?

There was plenty of disagreement between the results from the different tools, and I’ll be showing these in more detail at our upcoming PDF Event. For now I’ll focus on a single issue: there is a set of files that FIDO and DROID don’t identify as PDFs but that file and Tika do. I’ve attached one example to this post. Google Chrome won’t open it but my Ubuntu-based document viewer does. It’s a three-page PDF about Rumen Microbiology, and a PDF was obviously what the creator intended. I’ve not systematically tested multiple readers yet, but LibreOffice won’t open it while Ubuntu’s print preview will. Feel free to try the reader of your choice and comment.

What’s happening here?

It appears we have a malformed PDF, and this is indeed the case. The disagreement, though, is caused by a difference in the way that the tools go about identifying PDFs in the first place. This is where it gets a little dull but bear with me. All of these tools use “magic” or “signature” based identification. This means that they look for (hopefully) unique strings of characters in specific positions in the file to work out the format. Here’s the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is look for the string %PDF- (the value) at the start of the file (offset="0") and if it’s there identify this as a PDF. The attached file indeed starts:

%PDF-1.2

meaning it’s a PDF version 1.2. Now we can have a look at the DROID signature (version 77) for the PDF 1.2 sig:

<InternalSignature ID="125" Specificity="Specific">
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1"
                 SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>255044462D312E32</Sequence>
      <DefaultShift>9</DefaultShift>
      <Shift Byte="25">8</Shift>
      <Shift Byte="2D">4</Shift>
      <Shift Byte="2E">2</Shift>
      <Shift Byte="31">3</Shift>
      <Shift Byte="32">1</Shift>
      <Shift Byte="44">6</Shift>
      <Shift Byte="46">5</Shift>
      <Shift Byte="50">7</Shift>
    </SubSequence>
  </ByteSequence>
  <ByteSequence Reference="EOFoffset">
    <SubSequence MinFragLength="0" Position="1"
                 SubSeqMaxOffset="1024" SubSeqMinOffset="0">
      <Sequence>2525454F46</Sequence>
      <DefaultShift>-6</DefaultShift>
      <Shift Byte="25">-1</Shift>
      <Shift Byte="45">-3</Shift>
      <Shift Byte="46">-5</Shift>
      <Shift Byte="4F">-4</Shift>
    </SubSequence>
  </ByteSequence>
</InternalSignature>

This is a little more complex than Tika’s signature, but what it says is that a matching file should start with the string %PDF-1.2, which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning-of-file offset. Crucially, this signature adds another condition: that the file contains the string %%EOF (the hex sequence 2525454F46) within 1024 bytes of the end of the file. There are two things that are different here.

The change to the start condition, i.e. Tika’s “%PDF-” vs. DROID’s “%PDF-1.2”, is there to support DROID’s capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf MIME type, and has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. This difference is NOT the cause of the problem. The disagreement between the results is caused by DROID’s requirement for an end-of-file marker, %%EOF. A hex search of our PDF confirms that it doesn’t contain an %%EOF marker.
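To make the difference concrete, here is a minimal Python sketch of the two checks. It is not how Tika or DROID are actually implemented: the function names and the in-memory sample bytes are made up for illustration, and the trailer window is treated as simply the last 1024 bytes, which may differ from DROID’s exact offset arithmetic by a few bytes.

```python
# Hex sequences copied from the DROID signature quoted above.
BOF_SEQUENCE = bytes.fromhex("255044462D312E32")  # decodes to b"%PDF-1.2"
EOF_SEQUENCE = bytes.fromhex("2525454F46")        # decodes to b"%%EOF"
EOF_WINDOW = 1024  # assumption: search only the final 1024 bytes


def tika_style_is_pdf(data: bytes) -> bool:
    """Header-only check: the bytes %PDF- at offset 0."""
    return data.startswith(b"%PDF-")


def droid_style_is_pdf_1_2(data: bytes) -> bool:
    """Header plus trailer check, mirroring the two ByteSequence elements."""
    starts_ok = data.startswith(BOF_SEQUENCE)
    ends_ok = EOF_SEQUENCE in data[-EOF_WINDOW:]
    return starts_ok and ends_ok


# Hypothetical samples: a tiny well-terminated file and a truncated one.
complete = b"%PDF-1.2\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
truncated = b"%PDF-1.2\n1 0 obj\n<< >>\nendobj\n"

for name, sample in (("complete", complete), ("truncated", truncated)):
    print(name, tika_style_is_pdf(sample), droid_style_is_pdf_1_2(sample))
```

The truncated sample passes the header-only check and fails the header-plus-trailer check, which is the same split seen between the tools on the attached file.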

So who’s right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker,
%%EOF. (See implementation note 15 in Appendix H.)

The referenced implementation note reads:

3.4.4, “File Trailer”
15. Acrobat viewers require only that the %%EOF marker appear somewhere
within the last 1024 bytes of the file.

So DROID’s signature is indeed to the letter of the law plus amendments. It’s really a matter of context when using the tools. Does DROID’s signature introduce an element of format validation to the identification process? In a way yes, but understanding what’s happening and making an informed decision is what really matters.
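As a rough illustration of the gap between the letter of the reference and the implementation note, the sketch below classifies a byte string by where its %%EOF marker sits. The function name and the three labels are hypothetical, and it is a toy check under those assumptions rather than a validator.

```python
EOF_MARKER = b"%%EOF"


def eof_marker_status(data: bytes, window: int = 1024) -> str:
    """Classify where the %%EOF marker sits relative to the end of the data."""
    lines = data.splitlines()
    # Strict reading of the reference: the last line contains only the marker.
    if lines and lines[-1] == EOF_MARKER:
        return "last-line"
    # Lenient reading (implementation note 15): marker somewhere in the tail.
    if EOF_MARKER in data[-window:]:
        return "within-window"
    # Either no marker at all, or it sits earlier than the tail window.
    return "absent-or-too-early"


print(eof_marker_status(b"%PDF-1.3\n%%EOF\n"))
print(eof_marker_status(b"%PDF-1.3\n%%EOF\ntrailing junk"))
print(eof_marker_status(b"%PDF-1.3\n1 0 obj\n"))
```

A policy that only accepts "last-line" is stricter than Acrobat, one that accepts "within-window" matches the implementation note, and one that ignores the result altogether corresponds to header-only identification.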

What’s next?

I’ll be putting some more detailed results onto GitHub along with a VM demonstrator. I’ll tweet and add a short post when this is finished, though it may have to wait until next week.

Preservation Topics:

Identification

Attachment	Size
It looks like a PDF to me….	44.06 KB

Submitted by Carl Wilson on 21 August 2014 – 10:40am

Comments

I think DROID really shouldn’t be doing this …

As you already say, by looking at the EOF marker DROID really adds an element of format validation. I don’t think DROID should be doing this, as in this case it undermines its core functionality. E.g. if someone submits me a truncated PDF file (in which case the EOF marker is missing), I think it’s still helpful to be able to establish that the format is PDF. This aside, lots of formats have specific EOF markers, so including them for PDF and not including them for other formats seems a bit arbitrary to me.

Submitted by Johan van der Knijff on 21 August 2014 – 2:58pm

On the other hand…

… it is not too bad to know that something is wrong here, as this PDF file causes some of my tools to crash. Maybe it’s not right not to IDENTIFY it as a PDF, but it might be most adequate to state that this is indeed a PDF file, but a defective or somehow incomplete one, which might cause problems. (Wasn’t there the nice term “malformed”? I think this is a great example of a malformed PDF file).

Best, Yvonne

Submitted by Yvonne Friese on 22 August 2014 – 8:38am

I stand to be corrected but

I stand to be corrected but TNA’s use case for DROID/PRONOM involves that moderate amount of validation, probably so they can reject files that are broken (and request another copy?). You can argue that not ingesting invalid files is a perfectly valid scenario, if you are in the position to do it.

Just remembered this: “While DROID is not intended as a validation tool, we find this to suit our purposes and to highlight potential issues long before a formal ‘ingest’ process kicks off.” -> https://groups.google.com/forum/#!topic/droid-list/sUCwaO1k1kk

Submitted by William Palmer on 22 August 2014 – 8:40am

Ideal identification

The 1.7 (ISO) version of the specification tightens this up a bit (see here), but for me DROID is doing the wrong thing. It should not be essentially validating the PDF trailer beyond that required by the pre-1.7 specification.

However, even if the trailer was standardised, and even if all usable PDF documents implemented the standard correctly, I would still rather that accidentally truncated PDFs that snuck into the web archive were identified as such, rather than identification failing completely. This is why I combine Apache Tika and DROID when profiling our web archives.

Submitted by Andy Jackson on 21 August 2014 – 3:33pm

What is a file format?

This provides a great example for the discussion/question and answer session going on at the Digital Preservation Q & A site here.

Submitted by Euan Cochrane on 21 August 2014 – 5:21pm

286177.pdf

Hi,

I have run some tests for fun. PDF24 (based on Ghostscript) purports to be able to fix it, but the PDF generated with it only contains a blank page.

iText just runs into an exception: com.itextpdf.text.exceptions.InvalidPdfException: Rebuild failed: trailer not found.; Original message: PDF startxref not found.

Acrobat won’t open it anyway. Notepad++ of course does, but the only thing I can tell is the PDF version (1.2).

That’s interesting, I’ll keep this file sample in my “difficult PDF files-folder”.

Talk to you later, Yvonne

Submitted by Yvonne Friese on 22 August 2014 – 8:38am

Acrobat test

Hi Yvonne,

It seems that opening a PDF in Acrobat is the real test and if it doesn’t open in that then the file is probably broken in some way… I know from issue reports I have submitted to Apache PDFBox that some PDF files in the Govdocs1 corpus are broken, and this may well be another one.

Regards,

Will

Submitted by William Palmer on 22 August 2014 – 8:53am

Open in Acrobat isn’t scalable, is it?

Hi,

yes, the true and real test would be to just try to open the PDF file in Acrobat.

But.

How do I automate this with my more than 70,000 PDF files? Right now I have some easy-to-automate tests like looking for the “%PDF” tag, trying to open with the iText PdfReader, checking for encryption and stuff.

I would think that the iText PdfReader and the Acrobat Reader more or less would open the same files and reject others, but have never tested this large-scale.

Best, Yvonne

Submitted by Yvonne Friese on 22 August 2014 – 9:02am

It’s a bit more complicated than that …

I’ve encountered broken PDFs that display (more or less) normally in GSView while looking like garbage in Acrobat. Another thing is that, more generally speaking, Acrobat is quite forgiving when it comes to a number of common malformations. One example: I’ve come across PDFs where the file header (i.e. the %PDF-1.2 bit) is preceded by garbage bytes, and Acrobat quietly ignores those, whereas GSView and several other viewers will choke on them. So you cannot really make generalisations about PDFs being broken/not broken based on the behavior of only one viewer…

Submitted by Johan van der Knijff on 25 August 2014 – 10:24am

Office

I think we agree, but I’m going to add this:

This situation is somewhat analogous to the Microsoft Office formats; Microsoft are the originator of the standards and the reference implementation. Files that MS Office won’t open may open elsewhere (LibreOffice etc), but then how do you know if they are “valid” or whether they have opened correctly without anything missing?

If those PDFs will open in GSView, but not others/Acrobat, can you really know if there is anything not being displayed? But at that point it’s probably more to do with data recovery. Did the file that displayed garbage in Acrobat cause any messages/prompts to be displayed when it was opened and had it been assessed as broken by Preflight etc?

One (recently Googled and unconfirmed) reason for the additional bytes before %PDF could be this: “PDF files are suppose to start with the sequence “%PDF-X.Y”; however, some programs, email programs are notorious, will add a header, such as Mac Binary. Acrobat looks in the first 1024 bytes for the %PDF sequence. Other applications only support %PDF at the beginning of the file.” (from http://stackoverflow.com/a/1456625).

Given the age and distribution of the Acrobat software, plus the fact that it comes from Adobe, I’d give more weight to “does it open in Acrobat” (without any error/warning messages), than any other viewer, although that test is not scalable. But if the broken PDFs will even partially open in other viewers then that’s great, especially if the PDF may be the only copy in existence.

Submitted by William Palmer on 26 August 2014 – 9:34am

Acrobat and scalability

Hi Will, Yes, I absolutely agree that for most practical purposes Acrobat is much more important than any other viewer. As for the (lack of) scalability of that test, it’s worth adding that Adobe publishes an Adobe PDF Library SDK:

http://www.adobe.com/devnet/pdf/library.html

Those libraries are identical to the ones used by Acrobat, which means it’s probably not too difficult to use that SDK to create a scalable equivalent of the manual “open-in-Acrobat-and-see-what-happens” test. In fact I would be surprised if someone out there hasn’t done this already.

Submitted by Johan van der Knijff on 26 August 2014 – 11:49am

Excited!

Now I am excited, thank you a lot!

edit: Of course I cannot just download the library/search for it via Maven, connect it and embed it in my tools. Would have been cool, though…

Submitted by Yvonne Friese on 26 August 2014 – 12:19pm

PDF Policy

Wow, the more I learn about PDF problems, the more difficult it gets!

You just gave me a very good reason for formulating a file format policy about PDF in terms of digital curation.

We do have one at the moment, but I think this has room for improvement.

Best, Yvonne

Submitted by Yvonne Friese on 25 August 2014 – 10:55am