Metadata Harvesting and the Open Archives Initiative
Available from www.arl.org
METADATA HARVESTING AND THE OPEN
ARCHIVES INITIATIVE
Purushothama Gowda M, M K Bhandi
Abstract
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a collaborative
effort that provides an application-independent interoperability framework based on
metadata harvesting. Though the OAI-PMH is a very recent development, it is regarded
as an important step towards information discovery in the digital library arena. This paper
looks into the issues leading to its development and gives an inside view of the
proposed model.
Keywords: Metadata Harvesting Protocol, Open Archives Initiative, Z39.50, Repositories
1 Introduction
There has been considerable confusion about the Open Archives Initiative (OAI) Metadata Harvesting
Protocol (MHP), mostly stemming from its name. The protocol no longer has much to
do with archiving or archives, other than in terms of its heritage. The OAI-PMH is a means of making
machine-readable metadata widely available for use. The Open Archives Initiative was originally proposed
to enhance access to e-print/pre-print archives. Gradually, however, the scope of the initiative has broadened
to cover any kind of digital content, including images and videos. It is available to all regardless of
the economic mechanism surrounding the content. The fundamental idea behind e-print servers is that authors would deposit
preprints and/or copies of published versions of their articles into such servers, thus providing readers
worldwide with a free way of obtaining access to these papers, without needing paid subscription access
to the source electronic journals. The proponents of this movement argue that the refereed scholarly
journal literature really belongs to the scholarly community and by extension to the world at large, and that
such free access is better aligned with the interests of both authors and readers. The deposit of preprints
would also speed up research at the frontiers and democratize access to new knowledge; instead
of a privileged circle of members of “invisible colleges” sharing preprints, these preprints would be
available to everyone immediately, without the delays introduced by the journal refereeing and publication
cycle. Proposals such as PubMed Central and the Public Library of Science build upon these ideas [9].
The Open Archives Metadata Harvesting Protocol grew out of an effort to solve some of the problems that
were emerging as e-print servers became more widely deployed; it originated in the community concerned
with advancing the development of e-print archives. The protocol was widely known as the Open Archives
Protocol, and the program to develop it was widely known as the Open Archives Initiative, so the decision
was made to maintain the popularly known terminology.
The protocol is now often referred to as the Open Archives Metadata Harvesting Protocol in an attempt to
reintroduce a bit more clarity. Any server can employ this Metadata Harvesting Protocol to make metadata describing
objects housed at that server available to external applications that wish to collect this metadata. A server
does not need to be part of an e-print program to use the protocol; indeed, it does not need to house
journal papers at all. Nor does the server need to offer free access to the digital objects that it stores.
4th International Convention CALIBER-2006, Gulbarga, 2-4 February, 2006 © INFLIBNET Centre, Ahmedabad
2 History of OAI
The origin of OAI can be traced back to efforts to increase interoperability among the e-print/pre-print
servers that hosted scientific and technical papers [3]. A number of factors led to the development of
pre-print archives, the most important of which was the rising cost of journals. Scholars and researchers
would deposit their articles and papers into these servers, which allowed information to be disseminated
among the scholarly community much more rapidly than through traditional print journals. The number of
e-print/pre-print repositories grew steadily in the nineties. This growth created information
overload and some other problems, which can be summarized as:
1. End-users/scholars might not know that a repository existed.
2. Repositories overlapped in their subject coverage.
3. The multi-disciplinary nature of subjects required documents to be kept at a number of repositories.
4. Discipline-specific and institution-specific archives duplicated effort.
5. End-users/scholars had to search individual repositories to get documents of interest to them.
6. It was undesirable to require scholars to deposit their work in multiple repositories.
A need was felt for a framework that would integrate these e-print/pre-print archives
and solve these problems. A meeting was convened in late 1999 at Santa Fe, New Mexico to address
the problems of the e-print world. The major work was to define an interface that would permit e-print servers to
expose metadata for the papers they held, so that search services or other similar repositories could
then harvest that metadata. These archives would then act as a federation of repositories by giving a
single search platform for multiple collections. After the meeting, the agreed principles were launched in
January 2000 as the Open Archives Initiative specification by Herbert Van de Sompel, Rick Luce, and Paul
Ginsparg, among others.
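The harvesting interface described above can be illustrated with a short sketch. The Python below builds a ListRecords request, one of the verbs defined in OAI-PMH version 2.0, and extracts identifiers and titles from a response in the oai_dc (unqualified Dublin Core) metadata format. The function names, the base URL http://example.org/oai, and the sample response are assumptions made for illustration; they are not taken from any real repository, and a production harvester would also need to handle resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response standing in for a repository's reply.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2005-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def build_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL for a repository's base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Optional argument: only records changed on or after this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records
```

A harvester would fetch the URL returned by build_list_records_url("http://example.org/oai") over HTTP and pass the response body to parse_records; here the embedded SAMPLE_RESPONSE plays that role so the sketch runs without network access.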
The Digital Library Federation, the Coalition for Networked Information, and the National Science
Foundation sponsored it. The OAI Steering Committee was formed in August 2000 to give strategic
direction to the protocol. Version 1.1 of the protocol was launched in July 2001. The Open Archives Initiative
Technical Committee (OAI-TC) was formed to develop and write version 2 of the Open Archives Protocol
for Metadata Harvesting based on feedback from implementers. The OAI-PMH version 2.0 was eventually
released in June 2002 (
3 OAI vs. Z39.50
There was a debate as to why the existing Z39.50 protocol, which is also used for the search and
transfer of metadata, should not be used instead. The OAI’s metadata-harvesting approach might look operationally very different
from Z39.50, but both achieve what is often called “federated searching.” Federated searches allow
users to gather information from multiple related resources through a single interface. The basic difference
between the two protocols is in the search approach. Z39.50 allows clients to search multiple
information servers through a single search interface in real time, whereas the OAI-PMH allows bulk transfer
of metadata from the repositories to the Service Provider’s database. Hence clients do not need
to search multiple data providers in real time; rather, they search the metadata database of the Service
Provider, which collects and aggregates the metadata from different data providers.
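The difference between the two search approaches can be made concrete with a small in-memory model. The Python sketch below is an illustration only: the class and function names are hypothetical, and neither the Z39.50 nor the OAI-PMH wire protocol is implemented. The function distributed_search queries every data provider at search time, as in the real-time model, while ServiceProvider first harvests all metadata into a local index and then answers searches from that index alone.

```python
from dataclasses import dataclass, field


@dataclass
class DataProvider:
    """A repository holding metadata records (hypothetical stand-in)."""
    name: str
    records: list  # each record is a dict with 'id' and 'title' keys

    def search(self, term):
        # Real-time model: the provider answers each query itself.
        return [r for r in self.records if term.lower() in r["title"].lower()]

    def expose_metadata(self):
        # Harvesting model: the provider hands over its metadata in bulk.
        return list(self.records)


def distributed_search(providers, term):
    """Contact every provider at query time and merge the hits."""
    hits = []
    for provider in providers:
        hits.extend(provider.search(term))
    return hits


@dataclass
class ServiceProvider:
    """Aggregates harvested metadata and searches it locally."""
    index: dict = field(default_factory=dict)

    def harvest(self, providers):
        # Bulk transfer done ahead of time, independent of any user query.
        for provider in providers:
            for record in provider.expose_metadata():
                self.index[record["id"]] = record

    def search(self, term):
        # No data provider is contacted here; only the local index is used.
        return [r for r in self.index.values()
                if term.lower() in r["title"].lower()]
```

In this model both approaches return the same hits for the same term; what differs is when the data providers are contacted. With harvesting, a slow or unavailable repository affects only the next harvest, not each user query, at the cost of results being only as fresh as the last harvest.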
There were many reasons to have a completely new protocol rather than implementing the Z39.50 as it
stands. Some of the reasons are: