ARC to WARC migration: How to deal with de-duplicated records?

In my last blog post about ARC to WARC migration, I compared the performance of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop. I also said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate. In this post I therefore propose one possible way to deal with de-duplicated records in an ARC to WARC migration scenario.

Before going into specifics, let me briefly recall what is meant by "de-duplication": it is a mechanism used by a web crawler to reference identical content that was already stored during a previous visit to a web site. Its main purpose is to avoid storing content redundantly and thereby reduce the required storage capacity.

The Netarchive Suite uses a Heritrix module for de-duplication, which operates at the level of a harvest definition. The following diagram roughly outlines the most important information items and their dependencies.

[Figure: arc2warc-deduplication.png – information items and their dependencies across two crawl jobs of the same harvest definition]

The example shows two subsequent jobs executed as part of the same harvest definition. Depending on configuration parameters, such as the desired size of the ARC files, each crawl job creates one or more ARC container files and a corresponding crawl metadata file. In the example above, the first crawl job (1001) produced two ARC files, each containing ARC metadata, a DNS record and one HTML page. Additionally, the first ARC file contains a PNG image file that is referenced in the HTML file. The second crawl job (1002) produced equivalent content, except that the PNG image file is not contained in the first ARC file of this job; it is only referred to as a de-duplicated item in the crawl metadata, using the notation {job-id}-{harvest-id}-{serialno}.
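
To make the notation concrete, here is a minimal Python sketch that splits such a reference into its components. The concrete reference value and the harvest id used in the example are made up for illustration, and the exact layout of the crawl-metadata lines may differ between Netarchive Suite versions.

```python
import re

# A de-duplication reference of the form {job-id}-{harvest-id}-{serialno},
# e.g. "1001-42-00001" (the harvest id 42 is a made-up example value).
DEDUP_REF = re.compile(r"^(?P<job_id>\d+)-(?P<harvest_id>\d+)-(?P<serialno>\d+)$")

def parse_dedup_ref(ref):
    """Split a de-duplication reference into its three components."""
    match = DEDUP_REF.match(ref)
    if match is None:
        raise ValueError("not a de-duplication reference: %r" % ref)
    return match.groupdict()

# The PNG of crawl job 1002 points back to the first ARC file of job 1001:
print(parse_dedup_ref("1001-42-00001"))
# -> {'job_id': '1001', 'harvest_id': '42', 'serialno': '00001'}
```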

The question is: do we actually need the de-duplication information in the crawl-metadata file? If an index (e.g. a CDX index) is created over all ARC container files, we know – or rather, the wayback machine knows – where a file is located, and in this sense the de-duplication information could be considered obsolete. We would only lose the information about which crawl job the de-duplication actually took place in, and this concerns the informational integrity of a crawl job, because its external dependencies would no longer be explicit. Therefore, the following is a proposed way to preserve this information in a WARC-standard-compliant way.
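
As a sketch of that argument, the following snippet builds a lookup table from CDX lines that maps a URL and payload digest to the container file and offset where the content is stored. The 11-field CDX layout assumed here ("N b a m s k r M S V g") is one common variant; the field positions depend on how the index was generated.

```python
def build_cdx_index(cdx_lines):
    """Map (canonical URL, payload digest) -> (container filename, offset).

    Assumes 11-field CDX lines ("N b a m s k r M S V g"): canonicalized
    URL, timestamp, original URL, MIME type, status code, digest,
    redirect, meta tags, record size, offset, container filename.
    """
    index = {}
    for line in cdx_lines:
        if line.startswith(" CDX") or not line.strip():
            continue  # skip the format header line and blank lines
        fields = line.split()
        url, digest = fields[0], fields[5]
        offset, filename = int(fields[9]), fields[10]
        index[(url, digest)] = (filename, offset)
    return index
```

With such an index, the de-duplicated PNG of job 1002 can be located in the ARC file written by job 1001 via its URL and digest alone, without consulting the crawl-metadata file.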

Each content record of the original ARC file is converted to a response-record in the WARC file, as illustrated in the bottom left box in the diagram above. Any request/response metadata can be added as a header block to the record payload or as a separate metadata-record that relates to the response-record.
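
Using the warcio library, this conversion step can be sketched in a few lines. Note that warcio is just one possible tool here (the migration discussed in the previous post ran as a Hadoop job), and the file names are made up:

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def arc_to_warc(arc_path, warc_path):
    """Rewrite every record of an ARC file as a WARC record.

    ArchiveIterator(arc2warc=True) converts ARC records to WARC
    response-records on the fly; the writer serializes them into
    a gzipped WARC file.
    """
    with open(arc_path, "rb") as src, open(warc_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)
        for record in ArchiveIterator(src, arc2warc=True):
            writer.write_record(record)

arc_to_warc("1001-42-00001.arc.gz", "1001-42-00001.warc.gz")
```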

The de-duplicated information items available in the crawl-metadata file are converted to revisit-records, as illustrated in the bottom right box, and written to a separate WARC file (one per crawl-metadata file). The payload digest must be equal to that of the referred record, and it should be stated that the completeness of the referred record was checked successfully. The WARC-Refers-To property refers to the WARC record that contains the record payload; additionally, the fact that Content-Length is 0 explicitly states that the record payload is not available in the current record and is to be located elsewhere.
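
A revisit-record along these lines could be written with warcio as follows. The URL, digest, timestamp, and record id are made-up example values, and including WARC-Refers-To-Date alongside WARC-Refers-To is a design choice, not something mandated by the crawl metadata:

```python
from warcio.warcwriter import WARCWriter

def write_revisit(stream, url, payload_digest, refers_to_uri,
                  refers_to_date, refers_to_id):
    """Write a revisit-record for one de-duplicated item.

    The record carries no payload block of its own (Content-Length: 0);
    the digest and the WARC-Refers-To* headers point at the
    response-record that holds the actual content.
    """
    writer = WARCWriter(stream, gzip=True)
    record = writer.create_revisit_record(url, payload_digest,
                                          refers_to_uri, refers_to_date)
    # Record id of the original response-record (made-up example value):
    record.rec_headers.add_header("WARC-Refers-To", refers_to_id)
    writer.write_record(record)

with open("1002-42-metadata.warc.gz", "wb") as out:
    write_revisit(out,
                  "http://www.example.org/image.png",
                  "sha1:PD4T3PVJBPWEIVFEYH2JTFBRDIGFJCDJ",  # made-up digest
                  "http://www.example.org/image.png",
                  "2014-01-15T10:00:00Z",
                  "<urn:uuid:4885d5b8-9b3f-4d82-a2a4-7fa2b6e4a2c0>")
```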
