Over the last few weeks I’ve been working on the design of a workflow that the KB is planning to use for the migration of a collection of (mostly old) TIFF images to JP2. One major risk of such a migration is that hardware failures during the migration process may result in corrupted images. For instance, one could imagine a brief network or power interruption that occurs while an image is being written to disk. In that case data may be missing from the written file. Ideally we would be able to detect such errors using format validation tools such as JHOVE. Some time ago Paul Wheatley reported that the BL at some point were dealing with corrupted, incomplete JP2 files that were nevertheless deemed “well-formed and valid” by JHOVE. So I started doing some experiments in which I deliberately butchered up some images, and subsequently checked to what extent existing tools would detect this.
I started out by removing some trailing bytes from a lossily compressed JP2 image. As it turned out, I could remove most of the image code stream (reducing the original 2 MB image to a mere 4 kilobytes!), but JHOVE would still say the file was “well-formed and valid”. I was also able to open and render these files with viewer applications such as Adobe Photoshop, Kakadu’s viewer and IrfanView. The behaviour of the viewer apps isn’t really a surprise, since the ability to render an image without having to load the entire code stream is actually one of the features that make JPEG 2000 so interesting for many access applications. JHOVE’s behaviour was a bit more surprising, and perhaps slightly worrying.
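The byte-chopping itself only takes a few lines of Python. A sketch of the experiment (the function name and file names are invented for illustration):

```python
# Reproduce the truncation experiment: copy a JP2 and chop off
# everything after the first few kilobytes, then feed the result
# to JHOVE or a viewer and see what happens.
import shutil


def truncate_copy(src, dst, keep_bytes):
    """Copy src to dst, keeping only the first keep_bytes bytes."""
    shutil.copyfile(src, dst)
    with open(dst, "r+b") as f:
        f.truncate(keep_bytes)


# e.g. reduce a 2 MB image to 4 KB:
# truncate_copy("balloon.jp2", "balloon_truncated.jp2", 4096)
```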
This made me wonder about a way to detect incomplete code streams in JP2 files. A quick glance at the standard revealed that image code streams should always be terminated by a two-byte end-of-codestream marker (0xFFD9). As this is something that is straightforward to check, I fired up Python and ended up writing a very simple JP2 file structure checker. Since the image code stream in a JP2 does not have to be located at the end of the file (even though it usually is), it is necessary to do a superficial parse of JP2’s ‘box’ structure (which is documented here). So I thought I might as well include an additional check that verifies whether the JP2 contains all required boxes.
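A minimal sketch of this kind of check (not the actual jp2StructCheck code, and without the bounds checking a real tool would need) might look like this, using the box format from JPEG 2000 Part 1: each top-level box starts with a 4-byte big-endian length and a 4-byte type field.

```python
import struct

EOC = b"\xff\xd9"  # end-of-codestream marker


def check_jp2(data):
    """Superficially parse the top-level JP2 box structure.
    Returns (all required boxes present, codestream ends on EOC).
    Illustrative sketch only: no bounds checking on truncated headers."""
    pos, types, eoc_ok = 0, set(), False
    while pos < len(data):
        lbox, tbox = struct.unpack(">I4s", data[pos:pos + 8])
        if lbox == 1:                    # extended (64-bit) length follows
            length = struct.unpack(">Q", data[pos + 8:pos + 16])[0]
        elif lbox == 0:                  # box runs to the end of the file
            length = len(data) - pos
        else:
            length = lbox
        types.add(tbox)
        if tbox == b"jp2c":              # Contiguous Codestream box
            eoc_ok = data[pos + length - 2:pos + length] == EOC
        pos += length
    required = {b"jP  ", b"ftyp", b"jp2h", b"jp2c"}
    return required <= types, eoc_ok
```

Run on a truncated file, the EOC check fails even though all box types are still found, which is exactly the kind of corruption JHOVE let through.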
In brief, when jp2StructCheck analyses a file, it first parses the top-level box structure and collects the unique identifiers (or marker codes) of all boxes. If it encounters the box that contains the code stream, it checks whether the code stream is terminated by a valid end-of-codestream marker. Finally, it checks whether the file contains all required top-level boxes. These are:

- the JPEG 2000 Signature box
- the File Type box
- the JP2 Header box
- the Contiguous Codestream box
In order to test the box checking mechanism I did some additional image butchering, where I deliberately changed the tags of existing boxes so that they wouldn’t be recognised. When I subsequently ran these images through JHOVE, this revealed some additional surprises. For instance, after changing the markers of the Contiguous Codestream box or even the JP2 Header box (which effectively makes them unrecognisable), JHOVE would still report these images as “well-formed and valid” (although in the case of the missing JP2 Header box JHOVE did report an error).
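This kind of butchering is easy to reproduce. A crude sketch (the function name is made up; note that a naive byte search could in principle also match the four type bytes somewhere inside the codestream data, so a careful tool would locate the box by parsing the structure first):

```python
def corrupt_box_type(data, box_type):
    """Return a copy of data with the first occurrence of the given
    4-byte box type field overwritten, making that box unrecognisable.
    Crude: searches raw bytes rather than parsing the box structure."""
    pos = data.find(box_type)
    if pos < 0:
        raise ValueError("box type not found")
    return data[:pos] + b"XXXX" + data[pos + 4:]


# e.g. make the JP2 Header box unrecognisable:
# bad = corrupt_box_type(jp2_bytes, b"jp2h")
```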
It is important to note here that jp2StructCheck only checks the top-level boxes. In the case of a superbox (a box that contains child boxes), it does not recurse into the child boxes. For example, it does not check whether a JP2 Header box (which is a superbox) contains a Colour Specification box (which is required by the standard). So the scope of the tool is limited to a rather superficial check of the general file structure. It is not a JP2 validator, and it is certainly not a replacement for JHOVE (which performs a more in-depth analysis)! Its main aim is to detect certain types of file corruption that may occur as a result of hardware failure (e.g. network interruptions) during the creation of an image.
In addition, the fact that a code stream is terminated by an end-of-codestream marker is no guarantee that the code stream is complete. For instance, if due to some hardware failure a part in the middle of the codestream is not written, jp2StructCheck will not detect this! It may be possible to improve the level of error detection by including additional codestream markers in the analysis. This is something I might look into at some later point.
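To give an idea of what checking additional codestream markers could look like: the marker segments defined in JPEG 2000 Part 1 make it possible to walk the codestream structurally, jumping over each tile-part using the Psot length field in its SOT marker segment. A hedged sketch (the function name is invented, and this is far from a full codestream parse):

```python
import struct

SOC, SOT, EOC = 0xFF4F, 0xFF90, 0xFFD9  # JPEG 2000 Part 1 marker codes


def walk_codestream(cs):
    """Rough structural walk of a JPEG 2000 codestream: skip the
    main-header marker segments, jump over tile-parts using Psot,
    and require that the walk ends exactly on an EOC marker."""
    if len(cs) < 2 or struct.unpack(">H", cs[:2])[0] != SOC:
        return False
    pos = 2
    while pos + 2 <= len(cs):
        marker = struct.unpack(">H", cs[pos:pos + 2])[0]
        if marker == EOC:
            return pos + 2 == len(cs)
        if marker == SOT:
            # Psot (total tile-part length, incl. the SOT segment)
            # sits 6 bytes into the segment
            psot = struct.unpack(">I", cs[pos + 6:pos + 10])[0]
            pos += psot
        else:
            # ordinary marker segment: 2-byte marker + length field
            # (the length includes its own 2 bytes)
            seglen = struct.unpack(">H", cs[pos + 2:pos + 4])[0]
            pos += 2 + seglen
    return False
```

Unlike a bare EOC check, this would at least notice when a declared tile-part length points past the end of the file.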
I created a Github repository that contains the source code of jp2StructCheck, some documentation, and a small data set with some test images.
As some people may not want to install Python on their system, I also created a binary distribution that should work on most Windows systems.
The documentation (in PDF format) is here.
Finally, use this link to download the test images.
I’m curious to hear if anyone finds jp2StructCheck useful at all, so please feel free to use the comment fields below for your feedback (including reports on any bugs that may exist).
By now jp2StructCheck has evolved into a full-fledged JP2 validator. See this more recent blog post for details.
Johan van der Knijff
KB / National Library of the Netherlands
Comments
Full JP2 parse?
Excellent work Johan, we will definitely be testing this out in the next week or two.
You carefully noted the limitations of your tool. Do you think there is a genuine need to do a more thorough parse of the complete JP2 structure?
If there was such a need, what would be the best way of meeting it? I’m guessing this would require considerably more development and testing time to extend your tool to do this? Is there potential to leverage an existing tool to do this checking, despite their current tolerance as you noted in your post?
Re: Full JP2 parse?
Whether a more thorough parse is needed really depends on what you’re after. The scope of my tool is mainly limited to production situations where you have some prior knowledge on the quality and characteristics of the JP2s that are produced by your encoder (based on prior analyses with e.g. JHOVE and ExifTool). In that case the main issue is to detect things like incomplete images, which typically result from system failures.
Extending my tool to a full JP2 parse/validation would basically replicate what’s already being done in JHOVE, and I’m not too sure if that’s the best way to go. A more sensible approach may be to add the functionality of my tool to JHOVE (or JHOVE2?). That should be pretty straightforward anyway.
Having said that, I’m not entirely sure about the current status of JHOVE, and it’s not clear to me where JHOVE2 is going with its announced JPEG 2000 module either.
On a side note, the general file structure of a JP2 image is pretty straightforward, and adding some more sophisticated checks to my tool is certainly doable (and wouldn’t even be too time consuming).
But the question is whether we really need yet another tool, or whether we should concentrate our efforts on improving existing ones.
JHOVE2
The status of JHOVE2 is that it is coming to the end of its current funding, but the project members are committed to basic maintenance and code management. Having spoken to them at the Rome event, it was clear that they would be willing to fold modifications from third parties into future releases. In fact, I think they were keen to work in that way, and that this was one of the reasons for moving to BitBucket.
I’m personally very keen that we should contribute to JHOVE2, and encourage future ‘deep characterisation’ development efforts to be focussed there. It’s not perfect, and needs some modularisation/refactoring as well as new features and modules, but I’ve experimented with it and found it pretty good to work with. We’ll need to engage with them frequently and directly, find out about their plans (e.g. PDF and JP2), agree roadmaps and so on. We’ll also have to engage with some of the other projects, e.g. JHOVE, to find out where they are headed. This will take some effort, but I think we need to work that way to make things more sustainable. I’ll post a follow up blog about this soon.
On first impressions, I would guess that porting the JHOVE JP2 module over to JHOVE2 and adding your new functionality should not be that onerous a task. Not trivial, but only a few developer-weeks rather than months, I’d hope.
By the way ..
I’m really curious if this works for the incomplete images that you presented at the SCAPE kickoff. If it’s simply a matter of a truncated code stream it should do the trick, but maybe there’s some more complex corruption going on …
Test test test…
Yes I suspect what you have developed will pick up the errors we saw, but it will need some thorough testing. We will shortly have the opportunity to do some runs over quite a large dataset that includes some truncated images and we will report back what we find. I think answering my earlier questions needs to be data driven if possible and hopefully some of our testing can inform our next developments.
I’m also in the process of writing up your tool on the SCAPE wiki. I’ll send you a link when I’ve finished. For you non-SCAPErs, this is currently in a private area but we hope to make it (and details of the other preservation issues and solutions SCAPE is working on) available in the next month or two.
Having second thoughts on this
I was giving your comments some more thought over the weekend, and started wondering how much additional work would be involved in extending the tool to do a full-on JP2 validation. Having thought about this for a bit, my preliminary estimate is: not very much at all! Off the top of my head, I think what’s basically needed is this:
In addition we would need a simple output handler that reports the results of the analysis in some format that is both human and machine readable (obviously XML).
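A minimal sketch of what such an output handler could look like, using Python’s standard library (the element and test names here are invented for illustration, not the tool’s actual output format):

```python
import xml.etree.ElementTree as ET


def results_to_xml(filename, tests):
    """Serialise a dict of test names -> booleans to a small XML report.
    Element names are made up; a real tool would follow an agreed schema."""
    root = ET.Element("jp2StructCheck")
    ET.SubElement(root, "fileName").text = filename
    tests_elt = ET.SubElement(root, "tests")
    for name, passed in tests.items():
        ET.SubElement(tests_elt, name).text = str(passed)
    # file passes only if every individual test passed
    ET.SubElement(root, "isValid").text = str(all(tests.values()))
    return ET.tostring(root, encoding="unicode")
```

Being plain XML, the result stays readable for humans while remaining trivial to post-process in a workflow.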
With the above changes I think we would have something that does a pretty solid validation. My estimate would be that most of this could be done in a matter of just a few weeks.
This actually might be worth having a go at, provided that I can reserve a couple of weeks to work on it, and that anyone is interested at all in using it at some point!
I think this is definitely something we should discuss next week during your visit to The Hague.
*) In anticipation of upcoming changes to JPEG 2000 Part 1
JHOVE
Well-formedness vs validity
Hi Gary,
thanks for responding to this. If JHOVE only checks the signatures and structure of the file (at least in the case of JP2), this basically means that it only checks for well-formedness, and not validity.
I think what makes things slightly confusing is that if JHOVE encounters a JP2 image with a missing JP2 header or codestream box, it will say the file is well-formed and valid, whereas in reality such files are only well formed (but not valid!). So I think what JHOVE does here makes perfect sense, apart from making any statements about validity (which it doesn’t really check to begin with!).
BTW I fully understand the decision not to check the entire code stream. In fact my tool doesn’t do any code stream validation either -it only checks if it is terminated by an end-of-codestream marker (basically a very rough well-formedness check on the code stream).
Cheers,
Johan
Well-formed versus valid
It’s probably worth pointing out that the distinction between well-formedness and validity is only really tightly defined for XML. The JHOVE2 design dropped it as a generic distinction, as not all formats support it.
However, this does not mean that we could not meaningfully define the distinction between them. For me, for a JP2, validity would imply conformance to a particular JP2 profile (by analogy with XML Schema), specifying, for example, the presence of a well-formed codestream!