Reducing semantic complexity in distributed Digital Libraries: treatment of term vagueness and document re-ranking

by Philipp Mayr, Peter Mutschke, Vivien Petras
Library Review (2007)

Abstract

The purpose of the paper is to propose models to reduce the semantic complexity in heterogeneous DLs. The aim is to introduce value-added services (treatment of term vagueness and document re-ranking) that gain a certain quality in DLs if they are combined with heterogeneity components established in the project "Competence Center Modeling and Treatment of Semantic Heterogeneity". Empirical observations show that freely formulated user terms and terms from controlled vocabularies are often not the same or match just by coincidence. Therefore, a value-added service will be developed which rephrases the natural language searcher terms into suggestions from the controlled vocabulary, the Search Term Recommender (STR). Two methods, which are derived from scientometrics and network analysis, will be implemented with the objective to re-rank result sets by the following structural properties: the ranking of the results by core journals (so-called Bradfordizing) and ranking by centrality of authors in co-authorship networks.

Available from arxiv.org
Library Review
Vol. 57 No. 3, 2008
pp. 213-224
© Emerald Group Publishing Limited
0024-2535
DOI 10.1108/00242530810865484
Received 19 October 2007
Reviewed 9 November 2007
Accepted 13 November 2007
Reducing semantic complexity in distributed digital libraries
Treatment of term vagueness and document re-ranking
Philipp Mayr, Peter Mutschke and Vivien Petras
GESIS-IZ Social Science Information Centre, Bonn, Germany
Abstract
Purpose – The general science portal “vascoda” merges structured, high-quality information
collections from more than 40 providers on the basis of search engine technology (FAST) and a
concept which treats semantic heterogeneity between different controlled vocabularies. First
experiences with the portal show some weaknesses of this approach which come out in most
metadata-driven Digital Libraries (DLs) or subject specific portals. The purpose of the paper is to
propose models to reduce the semantic complexity in heterogeneous DLs. The aim is to introduce
value-added services (treatment of term vagueness and document re-ranking) that gain a certain
quality in DLs if they are combined with heterogeneity components established in the project
“Competence Center Modeling and Treatment of Semantic Heterogeneity”.
Design/methodology/approach – Two methods, which are derived from scientometrics and
network analysis, will be implemented with the objective to re-rank result sets by the following
structural properties: the ranking of the results by core journals (so-called Bradfordizing) and ranking
by centrality of authors in co-authorship networks.
Findings – The methods, which will be implemented, focus on the query and on the result side of a
search and are designed to positively influence each other. Conceptually, they will improve the search
quality and guarantee that the most relevant documents in result sets will be ranked higher.
Originality/value – The central impact of the paper focuses on the integration of three structural
value-adding methods, which aim at reducing the semantic complexity represented in distributed
DLs at several stages in the information retrieval process: query construction, search and ranking,
and re-ranking.
Keywords Digital libraries, Worldwide web, Information management
Paper type Research paper
Introduction
In the area of scientific and academic information systems, a whole array of
bibliographic databases, disciplinary Internet portals, institutional repositories or
archival and other media type collections are increasingly accumulated and embedded
in all-encompassing information systems. Such collections are necessary in order to
meet user expectations that demand one-stop “information fulfillment”. Examples are
Elsevier’s Scirus portal[1], the Online Computer Library Center WorldCat union
catalog[2] or Tufts University’s Perseus project[3].
In Germany, an ambitious project for one-stop academic search is the vascoda
portal[4], a joint project between the BMBF (Federal Ministry for Education and
Research) and the DFG (German Research Foundation). Vascoda provides a federated
search interface for a multitude of disciplinary and interdisciplinary databases (e.g.
full-text article databases, indexing and abstracting services, library catalogs) and
internet resource collections.
The vascoda portal contains many information collections that are meticulously
developed and structured. They have sophisticated subject metadata schemes (subject
headings, thesauri or classifications) to describe and organise the content of the
documents on an individual collection level. The general search interface, however,
only provides a free-text search over all metadata fields without regard for the precise
subject access tools that were originally intended for these information collections.
If large-scale contemporary information organisation efforts like the Semantic
Web[5] (see also Krause, 2006, 2007, 2008) strive to provide more structure and
semantic resolution with respect to information content, how is it possible that
advanced interfaces for digital libraries (DLs) scale back on exactly the same issue?
Search – both in full-text collections like the Internet or more heavily structured and
less diverse collections like institutional repositories, indexing databases or library
catalogs as described above – only works as well as the matching between the
language in queries and the language in the searched documents. If the words in the
query are different from the words in a relevant document, this document will not be
found. The problem of matching query terms to document terms is a result of the
ambiguity or vagueness of language (Blair, 1990, 2003).
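The matching problem described above can be illustrated with a minimal sketch. Everything in it is assumed for illustration: the records, the subject terms and the hand-written lookup table `TERM_MAP` are hypothetical, and the lookup table is only a simple stand-in for a service that suggests controlled vocabulary terms for a searcher's free terms.

```python
def search(query_terms, records):
    """Return ids of records sharing at least one term with the query (exact, case-insensitive)."""
    wanted = {t.lower() for t in query_terms}
    return [doc_id for doc_id, subject_terms in records.items()
            if wanted & {t.lower() for t in subject_terms}]


def expand_query(query_terms, term_map):
    """Add controlled-vocabulary suggestions to the free-text query terms."""
    expanded = []
    for term in query_terms:
        expanded.append(term)
        # Hypothetical lookup: free term -> suggested controlled terms.
        expanded.extend(term_map.get(term.lower(), []))
    return expanded


# Hypothetical metadata records: id -> controlled subject terms.
RECORDS = {
    "doc1": ["Unemployment", "Labour market"],
    "doc2": ["Migration"],
}

# Hypothetical mapping from searcher language to controlled terms.
TERM_MAP = {"joblessness": ["unemployment"]}

plain_hits = search(["joblessness"], RECORDS)
expanded_hits = search(expand_query(["joblessness"], TERM_MAP), RECORDS)
```

With these assumed records, the free-text query finds nothing because no record uses the searcher's word, whereas the expanded query retrieves the record indexed with the controlled term.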
In large full-text databases, the sheer size and variation of the content make this problem
less noticeable, because nearly any query (even one containing spelling mistakes or nonsense
statements) will find documents. The problem is aggravated in collections of more
restricted volume or text (i.e. repositories that contain only formal metadata, some
subject description and just a link to the full-text). The issue becomes even more critical
when several collections with different metadata schemes are searched at the same
time – which is the case in the distributed search scenario. In this scenario, not only is
the matching between query and document terms affected by language ambiguity, but
also the matching between different subject-describing metadata schemes. In Figure 1,
we speak of vagueness 1 and vagueness 2/3 (V1 and V2/3) to denote the different areas
where language ambiguity can occur. For successful retrieval in any DL, both levels of
vagueness have to be addressed (compare Hellweg et al., 2001).
Furthermore, the result sets of transformed or expanded queries in distributed
collections are often very large and tests show that the conventional web-based ranking
methods are not appropriate for the heterogeneous metadata records. Therefore, two
methods, which are derived from scientometrics and network analysis, will be
implemented with the objective to re-rank result sets: (a) the ranking of the results by
core journals (so-called Bradfordizing) and (b) ranking by centrality of authors in
co-authorship networks.
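The two re-ranking ideas can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation planned in the project: the split into three zones by cumulative document share is one common reading of Bradfordizing, degree centrality (number of distinct co-authors) is used here as a simple stand-in for whatever centrality measure is finally chosen, and all function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def bradfordize(results):
    """Re-rank (doc_id, journal) pairs so documents from the most productive journals come first.

    Returns (doc_id, journal, zone) triples; zone 1 holds the core journals.
    """
    counts = Counter(journal for _, journal in results)
    # Journals by descending document count; ties broken alphabetically.
    ranked_journals = sorted(counts, key=lambda j: (-counts[j], j))
    total = len(results)
    zone_of = {}
    cumulative = 0
    for journal in ranked_journals:
        # Assumption: three zones, each covering about a third of the documents.
        zone_of[journal] = min(3, 1 + (3 * cumulative) // total)
        cumulative += counts[journal]
    position = {journal: i for i, journal in enumerate(ranked_journals)}
    # sorted() is stable, so the original order is kept within a journal.
    reranked = sorted(results, key=lambda r: position[r[1]])
    return [(doc, journal, zone_of[journal]) for doc, journal in reranked]


def rank_by_author_centrality(papers):
    """Order doc ids by the highest degree centrality among each paper's authors.

    papers maps doc_id -> list of author names.
    """
    neighbours = defaultdict(set)
    for authors in papers.values():
        for author in authors:
            neighbours[author].update(a for a in authors if a != author)

    def score(doc_id):
        return max((len(neighbours[a]) for a in papers[doc_id]), default=0)

    return sorted(papers, key=lambda doc_id: -score(doc_id))


# Hypothetical result set and co-authorship data.
RESULTS = [("d1", "J. B"), ("d2", "J. A"), ("d3", "J. A"),
           ("d4", "J. C"), ("d5", "J. A"), ("d6", "J. B")]
PAPERS = {"p1": ["Ann"], "p2": ["Ann", "Bob"], "p3": ["Bob", "Cy", "Di"]}

by_journal = bradfordize(RESULTS)
by_author = rank_by_author_centrality(PAPERS)
```

In this sample, the three documents from the most frequent journal move to the top of the list, and papers with a well-connected author are placed before the single-author paper.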
Figure 1. Two step methodology of vagueness treatment
