Many office suites and other applications allow information to be embedded in a document via a link to another file. Linked spreadsheets are common amongst data-intensive agencies, and large documents are often managed by linking multiple office documents to form a single final product.
Currently we have only anecdotal evidence as to the prevalence of linked files in the digital universe. It would be very useful to understand the scale of the issue and to measure the prevalence of linked files in the material that we ingest. Archives New Zealand and Victoria University have recently initiated a project that we hope will go some way towards achieving this.
A student from the School of Engineering and Computer Science at Victoria University has recently started work on a summer project at Archives New Zealand. The student, Niklas Rehfeld, is funded through a summer scholarship jointly provided by Archives New Zealand and Victoria University.
Over the next 10 weeks Niklas will be working on a project to investigate linked files and build a tool to identify them. Specifically, the aim of this project is to develop a prototype tool to identify when computer files formatted in the Microsoft Office 1997-2003 formats link to other computer files and which files they link to (in order to identify the component files that make up the complex digital object).
The technical work will involve the following:
- Analysis of the Microsoft specifications to determine how document linking, and other metadata that may be of use for preservation purposes, is implemented for Word, Excel and PowerPoint documents for the period 1997-2003.
- Review of existing frameworks and related tools such as the open source “format identification, validation, and characterization” tool JHOVE.
- Writing a specification for a modular tool for identifying linked documents given a root Microsoft Office document. The specification will include an evaluation of the feasibility of extending an existing tool versus creating a standalone implementation from scratch.
- Implementation of a prototype tool for at least one document format. Time permitting, the tool will be extended either to handle a wider range of document formats or a wider range of preservation metadata.
- Testing of the tool against a selection of files supplied by National Archives.
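As a very rough illustration of the identification step described above (this is a sketch of one possible heuristic, not the project's actual design): Word 97-2003 files store hyperlink targets as HYPERLINK field codes, typically encoded as UTF-16LE text inside the document's streams, so a crude first pass can simply scan the raw bytes for that marker before doing any proper OLE2 parsing.

```java
import java.nio.charset.StandardCharsets;

// Naive heuristic sketch: scan raw bytes for the UTF-16LE encoding of a
// marker string such as "HYPERLINK". A real tool would parse the OLE2
// container and field structures instead; this only shows the idea.
public class LinkHeuristic {

    // Returns true if the UTF-16LE encoding of `marker` occurs in `data`.
    static boolean containsUtf16Marker(byte[] data, String marker) {
        byte[] needle = marker.getBytes(StandardCharsets.UTF_16LE);
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) continue outer;
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] sample = "xx HYPERLINK \"report.xls\"".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(containsUtf16Marker(sample, "HYPERLINK")); // prints true
    }
}
```

A heuristic like this would of course produce false positives and misses; the point of the project is precisely to replace such guesswork with parsing based on the published format specifications.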
This project is first and foremost a research project, and there is no guarantee that a working tool will be produced. However, if a useful tool does emerge, the intention is to release it as an open source product that anyone can incorporate into their preservation workflows.
Niklas’s first steps will include looking at existing tools that might be extended to perform the function outlined above. These include JHOVE, JHOVE2, DROID and the National Library of New Zealand metadata extraction tool. In addition he will be investigating the available Java libraries that may be of use for this purpose.
If anyone in the OPF community has advice on the best places to start with this project, or any other advice they would like to offer, we would greatly appreciate it. We have a number of good leads already but would welcome any help the community can offer.
Apache POI
In DROID we used the Apache POI library to process Office documents. We actually only used the POIFS component (the part that reads OLE2 files as a file system), which works very well. There are also more detailed APIs for dealing with more advanced concepts in Office files.
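POIFS handles the container format properly; as a standard-library-only illustration of what it is dealing with (this sketch is my own, not DROID's code), every OLE2/Compound File starts with a fixed 8-byte signature, and the header's sector-shift field gives the sector size into which the internal "file system" is divided.

```java
import java.util.Arrays;

// Stdlib-only sketch of the check POIFS performs when opening a file:
// an OLE2/Compound File begins with the 8-byte signature below, and the
// header's sector-shift field (offset 30, little-endian ushort) gives
// the sector size used by the container's internal file system.
public class Ole2Header {
    static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean isOle2(byte[] header) {
        return header.length >= 8
            && Arrays.equals(Arrays.copyOf(header, 8), SIGNATURE);
    }

    // Sector size in bytes: 2^sectorShift (512 for Office 97-2003 files).
    static int sectorSize(byte[] header) {
        int shift = (header[30] & 0xFF) | ((header[31] & 0xFF) << 8);
        return 1 << shift;
    }

    public static void main(String[] args) {
        byte[] header = new byte[512];
        System.arraycopy(SIGNATURE, 0, header, 0, 8);
        header[30] = 0x09; // sector shift 9 -> 512-byte sectors
        System.out.println(isOle2(header) + " " + sectorSize(header)); // prints: true 512
    }
}
```

In practice there is no reason to hand-roll this: POIFS exposes the container as a tree of directory and document entries that can be traversed to find the streams where link data lives.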
Apache POI library
Thanks for this suggestion Matt.
One of Niklas’s first steps was to look into the libraries used by tools such as DROID. Your message confirms that the Apache POI library is one of the best options for use in this project.
Related AQuA code
As part of the AQuA project, I started writing a basic Office binary-format analyser using Apache POI. The aim there was to identify the version of Office/Word/whatever that created a file, but I had difficulty determining whether the supplied version identifier denoted the authoring version or the compatibility version. e.g. if you Save As 2003-2007 or whatever from Office 2010, can you tell both the source application version (2010) and the target compatibility version (2003-2007) from the file alone?
I then started exploring the POI API to extract further properties, e.g. lists of fonts (declared rather than used, I think). Feel free to fork/copy the code if it helps.
See https://github.com/openplanets/AQuA/tree/master/office-analyser
Implications of Complex Objects on View Paths
The findings of this investigation might directly influence the concept of view paths. View paths (or pathways, depending on which literature you are referring to) are a formalisation of the technical metadata needed to compute the requirements for accessing, rendering or running a digital object of a certain type. For simple objects, such as self-contained document files, the view path is usually seen as a linear chain from the object of interest to an actual working environment (a software and hardware combination) for the user. The object type more or less determines the rendering application; then a number of secondary software objects, such as operating systems, helper programs and drivers, have to be taken into account.
If the objects get more complex, the view path might have to reflect this, as additional applications or components may be required, e.g. to render a movie that is linked into a presentation document such as a PowerPoint file. In such cases the traditional view path model needs to become more versatile, as the assumed linearity is no longer sufficient: additional applications or other components have to be considered to render all parts of the object. This might require a much more complex setup of a suitable environment.
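The shift from a linear chain to a set of requirements can be sketched in a few lines. In this toy model (all names and mappings are invented for illustration, not taken from any view-path standard) each format has a rendering chain, and the environment needed for a compound object is the union of the chains of its components:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only: for a simple object the view path is a linear
// chain (object -> renderer -> OS); for a compound object the required
// environment is the union of the view paths of every linked component,
// so linearity is lost as soon as there is more than one component.
public class ViewPaths {
    // Hypothetical mapping from a format to the software needed to render it.
    static final Map<String, List<String>> RENDER_CHAIN = Map.of(
        "doc", List.of("Word 2003", "Windows XP"),
        "xls", List.of("Excel 2003", "Windows XP"),
        "mpg", List.of("Media Player", "Windows XP")
    );

    static Set<String> requiredEnvironment(List<String> componentFormats) {
        Set<String> env = new LinkedHashSet<>();
        for (String fmt : componentFormats) {
            env.addAll(RENDER_CHAIN.getOrDefault(fmt, List.of()));
        }
        return env;
    }

    public static void main(String[] args) {
        // A .doc linking to a spreadsheet and a movie clip.
        System.out.println(requiredEnvironment(List.of("doc", "xls", "mpg")));
    }
}
```

Even this toy version shows the point: the environment for the compound object is no longer derivable from the primary file's format alone, which is exactly why identifying the linked components matters.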
The problem is not limited to traditional linked documents but manifests in other scenarios, such as content management systems (CMSes), too. A CMS consists of several well-identifiable sub-objects, such as the database and web server components, plus other helper tools. To preserve such a system outside its original environment, for example if direct system imaging is to be avoided for some reason, a number of different components are required.
View paths
Thanks for the comments Dirk.
Identifying “complex objects” (for want of a better term), that include multiple files that each require different rendering applications, does seem like it will lead to more complex view paths. However it should be feasible, in most cases, to use the current mechanisms to identify the right applications to render each individual file in a complex object and then bundle/provide them all within the same emulated environment.
Problems may arise when there are multiple applications that could render a dependent object, but only one of them is compatible with the rendering application of the primary file within the complex object. For example, if a .doc file has an .xls spreadsheet embedded in it via a link, and it is opened with Microsoft Word on a system that has OpenOffice.org Calc installed rather than Microsoft Excel, the embedded spreadsheet may no longer open automatically within the Word document interface as it originally did. The user may then be unable to tell which data from the spreadsheet was originally presented through the document interface.
I suspect this particular example does not actually arise in practice, as in that case a “screenshot” showing the last known view of the spreadsheet data is often stored in the .doc file when the document is saved. However, there may be other cases where such complexity causes problems.
There is certainly complexity to be found here but that doesn’t make it complicated. The first step to solving these problems will be to identify the scope of the issue. I hope that this project will result in a mechanism to help us take this first step.
Tracing dependencies…
As we discussed at iPres, it would be possible in principle to reveal these dependencies using my kernel tracing technique (click here for pre-print), but unfortunately, there is no good system call tracing tool for Windows, as far as I can tell.
However, the STraceNT codebase is very close to what we need, and if someone wants to take a look at improving it I’d be glad to help steer things along. We would need to understand the Windows system call API well enough to know which calls correspond to opening and reading files or spawning child processes, and then add code to report the file names involved.
Any Windows hackers out there want to give it a go?
Re: tracing dependencies
This is a very interesting approach. However, in our case we would have to create some sort of web service to analyse “Windows-only” files, since almost all our infrastructure runs on Linux.
Here is an interesting article I have found on this subject: http://www.codeproject.com/KB/system/hooksys.aspx