EPUB for archival preservation

Over the last few years, the EPUB format has gained widespread popularity in the consumer market. The KB has been approached by a number of publishers that wish to use EPUB for delivering some of their electronic publications. Surprisingly little information is available on the format's suitability for archival preservation, apart from the Library of Congress's Sustainability of Digital Formats web pages, which contain entries on EPUB 2 and EPUB 3.

So, the KB's Departments of Collection and Collection Care requested a more detailed investigation of EPUB's preservation credentials. More specifically, answers were needed to the following questions:

  • What are the main characteristics of EPUB?

  • What functionality does EPUB provide, and is this sufficient for representing e.g. content with sophisticated layout and typography requirements?

  • How well is EPUB supported by software tools that are used in (pre-)ingest workflows?

  • How suitable is EPUB for archival preservation? What are the main risks?

EPUB for archival preservation

The report EPUB for archival preservation is a first attempt at answering these questions as well as possible. It starts out with a simple example that illustrates the general structure of an EPUB file, followed by a more in-depth discussion of specific aspects of the format. It then covers functionality-related aspects such as layout, appearance and multimedia support, and the main differences between EPUB 2 and EPUB 3.

Support by characterisation tools is important for processing EPUB files in an operational workflow, so a brief review (and some preliminary tests) of relevant identification, validation and feature extraction tools is included as well.
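
As a rough illustration of what the identification step involves, the sketch below checks the container convention that an EPUB is a ZIP archive whose first entry is a file named mimetype containing the string application/epub+zip. The function name looks_like_epub is made up for this example, the check is more lenient than a real validator (it tolerates surrounding whitespace and relies on the order of the ZIP central directory), and it is not how any of the tools reviewed in the report actually work.

```python
import zipfile

# Media type that the 'mimetype' entry of an EPUB container holds.
EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(source):
    """Return True if `source` (a path or a binary file object) is a ZIP
    archive whose first listed entry is 'mimetype' with the EPUB media type.

    Hypothetical helper for illustration only: a quick identification
    heuristic, not a validator.
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as zf:
        entries = zf.infolist()
        # The 'mimetype' entry is expected to come first in the archive.
        if not entries or entries[0].filename != "mimetype":
            return False
        # Lenient comparison: surrounding whitespace is ignored here.
        return zf.read(entries[0]).strip() == EPUB_MIMETYPE
```

Calling looks_like_epub("book.epub") on a local file (the file name is just a placeholder) returns True or False. A positive result only says that the file claims to be an EPUB; whether it is a valid one is a separate question for a validator.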

To assess the overall suitability of EPUB for preservation, the format was evaluated against a set of widely used criteria (mainly from The National Archives and Library of Congress). The final chapter wraps up the main conclusions, and makes a number of recommendations.

Community input

Since it appears that not much has been published on EPUB within an archival preservation context so far, we would really appreciate hearing your thoughts on the report. Is anything important missing? Did I overlook any relevant tools? Is there anything in particular that you strongly disagree with? Please use the comment fields below to let us know!

In addition, the final chapter contains two subsections with Community Recommendations and Tool Recommendations. These are all things we can do as a community to simplify the use of EPUB in archival settings. Please consider getting involved if you feel you could make a contribution.

Update: revised version of report (July 2012)

Based on the feedback I received on the original report (i.e. the June 18 edition), I have just released a revised edition. The main changes are:

  • Re-ran tests for Unix File with a more recent (5.11) version of this tool

  • Included tests of Apache Tika (identification + feature extraction)

  • Included tests of FlightCrew (validation)

Update: follow-up to the report (May 2013)

A substantial number of EPUB-related developments have taken place since this report was published, and as a result some of its findings and conclusions have become outdated. This applies in particular to the observations on EPUB 3, and the support of EPUB by characterisation tools. See this follow-up blog post for a more up-to-date review of these subjects.

Link to report

EPUB for archival preservation, KB / National Library of the Netherlands


Johan van der Knijff
KB / National Library of the Netherlands

4 Comments

  1. andy jackson
    June 25, 2012 @ 12:31 pm CEST

    Hi Johan,

    This report is great! I did have one comment to make on section 7.1 (Ubiquity, support and interoperability) – I read this…

    “as of 2011 (EPUB) is the most widely supported vendor‐independent XML‐based e‐book format”

    …and I have to point out that while strictly true, this glosses over an important issue. The eBook market is currently dominated by proprietary formats (Mobi, iBooks, etc.) and although EPUB is the most popular vendor-independent format, it is a relatively minor player overall. Like the ‘browser wars’ of years past, we are in the middle of the ‘eBook wars’, and while I am hopeful that the industry will agree to adopt some future version of some format derived from EPUB, I fear this might take as long as it took to get from the BLINK tag to HTML5.

  2. johan
    June 20, 2012 @ 9:49 am CEST

    This comment field is mainly to remind myself of a few things I might look into for the final version of the report:

    • Alternative EPUB validator: http://code.google.com/p/flightcrew/. Will do some tests on that later.
    • Misty De Meo wrote via Twitter that File’s libmagic 5.11 detects EPUB files as “application/epub+zip”. Will check this out as well.

  3. johan
    June 19, 2012 @ 4:45 pm CEST

    Hi Peter,

    Thanks for your comment! Good to hear about Tika, might add that to a later version. As for metadata extraction, the one tool that I tested also only looked at the Package Document, which isn’t particularly helpful for detecting DRM, encryption etc. However, all these things are really quite simple to implement because you just need to check for the presence of a number of particular resources at some fixed locations.

    OK, have a train to catch now, more later!

    Johan

  4. pmay
    June 19, 2012 @ 4:01 pm CEST

    Hi Johan,

    Excellent report, thanks.  I’ve just done a quick experiment with Apache Tika, running over the 26 IDPF EPUB files to see its characterisation performance – it managed to identify all 26 as “application/epub+zip”.

    As for its metadata extraction capabilities, it seems able to extract metadata from the Package Document (the <metadata> element), however I suspect that’s about it. It did also throw up a few errors about “Composite properties not including other composite properties” when parsing some files, so it’s not perfect!

    Cheers
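
Johan's remark in the comments above, that detecting DRM or encryption mostly comes down to checking for particular resources at fixed locations, could be sketched roughly as follows. The EPUB container format reserves the optional files META-INF/encryption.xml, META-INF/rights.xml and META-INF/signatures.xml. The function name find_flagged_resources and the labels are invented for this example, and the mere presence of a file is only a hint: encryption.xml, for instance, is also used for font obfuscation, so a hit would still need a closer look.

```python
import zipfile

# Optional container resources at fixed locations, with a short
# human-readable label for each (the labels are illustrative only).
FLAGGED_RESOURCES = {
    "META-INF/encryption.xml": "encryption",
    "META-INF/rights.xml": "rights management",
    "META-INF/signatures.xml": "digital signatures",
}


def find_flagged_resources(source):
    """Return a dict mapping each flagged resource found in the EPUB
    at `source` (a path or a binary file object) to its label.

    Hypothetical helper: it reports presence only and does not parse
    the files, so it cannot tell DRM apart from e.g. font obfuscation.
    """
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
    return {
        name: label
        for name, label in FLAGGED_RESOURCES.items()
        if name in names
    }
```

An empty result means none of the three files is present. Anything else marks the file for closer inspection in a (pre-)ingest workflow.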
