Bridging the gap between real world repositories and scalable preservation environments

4Citations
Citations of this article
25Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved to be a daunting task. In this paper, we will show how this integration can be achieved using software developed in the scalable preservation environments (SCAPE) project, and also how it can be achieved using a local more direct implementation at the Danish State and University Library inspired by the SCAPE project. Both allow full use of the Hadoop system for massively distributed processing without causing excessive load on the repository. We present a proof of concept SCAPE integration and an in-production local integration based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is to perform feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine readable SCAPE Control Policy vocabulary. The feature extraction is done using the Jpylyzer tool. We perform an experiment with various-sized sets of JP2 images, to test the scalability and correctness of the solution. The first use case considered from the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated from the project specification by hand. We further look at two other use cases: generation of histograms of the tonal distributions of the images; and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when having to perform preservation actions on massive collections stored in traditional digital repositories.

Cite

CITATION STYLE

APA

Jurik, B. A., Blekinge, A. A., Ferneke-Nielsen, R. B., & Møldrup-Dalum, P. (2015). Bridging the gap between real world repositories and scalable preservation environments. International Journal on Digital Libraries, 16(3–4), 267–282. https://doi.org/10.1007/s00799-015-0152-4

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free