Semantic middleware for e-science knowledge spaces

by Joe Futrelle, Jeff Gaynor, Joel Plutchak, James D Myers, Robert E McGrath, Peter Bajcsy, Jason Kastner, Kailash Kotwani, Jong Sung Lee, Luigi Marini, Rob Kooper, Terry McLaren, Yong Liu
Architecture (2009)

Abstract

The Tupelo semantic content management middleware implements Knowledge Spaces that enable scientists to locate, use, link, annotate, and discuss data and metadata as they work with existing applications in distributed environments. Tupelo is built using a combination of commonly-used Semantic Web technologies for metadata management, content management technologies for data management, and workflow technologies for management of computation, and can interoperate with other tools using a variety of standard interfaces and a client and desktop API. Tupelo's primary function is to facilitate interoperability, providing a Knowledge Space "view" of distributed, heterogeneous resources such as institutional repositories, relational databases, and semantic web stores. Knowledge Spaces have driven recent work creating e-Science cyberenvironments to serve distributed, active scientific communities. Tupelo-based components deployed in desktop applications, on portals, and in AJAX applications interoperate to allow researchers to develop, coordinate and share datasets, documents, and computational models, while preserving process documentation and other contextual information needed to produce a complete and coherent research record suitable for distribution and archiving.

Available from portal.acm.org

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MGC'09, 30 November - 1 December 2009, Urbana-Champaign, Illinois, US. Copyright 2009 ACM 978-1-60558-847-6/09/11... $10.00
Joe Futrelle, Jeff Gaynor, Joel Plutchak, James D. Myers, Robert E. McGrath, Peter Bajcsy, Jason Kastner, Kailash Kotwani, Jong Sung Lee, Luigi Marini, Rob Kooper, Terry McLaren, Yong Liu
National Center for Supercomputing Applications
University of Illinois at Urbana Champaign
1205 W. Clark St., Urbana IL, 61801, USA
{futrelle, jgaynor, plutchak, jimmyers, mcgrath, pbajcsy, jkastner, kkotwani, jonglee, lmarini, kooper, tmclaren, yongliu}@illinois.edu

Categories and Subject Descriptors

C.2.4 [Distributed Systems]: Distributed applications

Keywords

Semantic web, content management, e-science

1. INTRODUCTION

Scientific research is becoming increasingly distributed and multi-disciplinary, which brings with it new challenges of integrating the work of scientific communities across organizational and technical boundaries [29, 31]. A number of best-practice technologies from digital libraries, enterprise computing, and web publishing have been applied to scientific work, but have met with limited success because most of the technologies focus on centralized management of static content collections and
are therefore primarily used to archive or disseminate scientific results after the fact, which does little to help scientists produce better results more efficiently [36]. Science automation work has instead focused largely on workflow and Grid technologies for automating routine computational analysis, which not only makes dynamic exploratory development of models cumbersome (e.g., by requiring that scientists adopt a batch programming approach) [12] but also in practice leaves most of the intermediate data products in complex scientific work processes unaccounted for outside of the immediate execution context that produced them. Because of the limitations of these approaches, much scientific data has been embedded in structural containers (e.g., file systems, specialized databases) that are typically organized based on assumed subject matter, level of granularity, hierarchical organization, and object structure, preventing users with different assumptions or organizational schemes from finding or accessing relevant information. Knowledge is typically managed by embedding it as metadata in content objects or in similarly rigid containers, such as scripts or application code, limiting its ability to be shared and “remixed” with other metadata to enable new ways of exploring data and to augment existing knowledge. Several existing approaches address some of these issues but are missing capabilities that are critical for large-scale e-Science. Semantic web technologies provide explicit semantic representations and global identification, providing strong guarantees that heterogeneous metadata can be represented and linked without reducing its semantic specificity. 
But semantic web technologies are difficult to apply to the scientific use case because much of the tooling available provides only centralized indexing and querying of metadata, with little attention given to linking it to data or providing services for collaborative authorship and exchange of metadata descriptions. Content management systems (e.g., Jackrabbit [1], Drupal [5]) and institutional repositories (e.g., Fedora [7], DSPACE [6]) as well as more automated systems such as iRODS [25] provide generalized management of content and provide support for curation and collaborative authorship (e.g., OAI-ORE [18]), but assume centralized control of data and metadata, and either preserve metadata as a static content object or provide only “dumbed down” metadata support, such as
tagging [2] or static schemas, limiting the ability to migrate or reuse content objects as they are actively developed in complex, distributed, heterogeneous work processes. An emerging “semantic grid” practice has begun to address parts of this problem by applying semantic web technologies to e-Science [4]. Efforts such as CombeChem [8] and MyExperiment [3] represent a new emphasis on active, shared development of content and workflow by scientists using a variety of web-based, desktop, and handheld interfaces that are integrated using semantic metadata. Other efforts such as nanoHub [16] extend the notion of Grid computing to include more contextual support, such as linking scientific computation through social networks. In part this new practice represents the influence of “web 2.0” practices on science automation [10]. In our view, it also points to a broader vision of digital scholarship enabled by situating scientific activity in systems that provide support not just for creating and sharing processes and data, but also for dynamically recontextualizing, annotating, revising, and tracking information as it is disseminated and used across widely separated communities and disciplines.
We have developed the Tupelo semantic middleware to enable new, more integrated Knowledge Spaces that combine the strengths of semantic web technologies and content management and address their shortcomings [30]. Knowledge Spaces enable users to locate, use, link, annotate, and discuss data and metadata as they work, without having to co-locate all data and metadata at a single institution or in a single repository, or having to abandon existing applications and services. Instead of requiring that data and metadata be restructured, packaged, and submitted to a storage management service, Tupelo allows users and applications to manage descriptions and linked information alongside existing content, as well as providing mechanisms to dynamically locate and recombine content from otherwise uncoordinated sources at various levels of granularity and specificity. At the publication phase, scientists can use knowledge spaces to publish intermediate data and executable descriptions of analytical processes so that other researchers and the public can reproduce, modify, and further share the content of complex ongoing scientific investigations, reducing time to discovery and providing a richer and more complete research record for preservation. The implementation of knowledge spaces in Tupelo middleware is guided by several important principles that have been derived from best practices in data and metadata interoperability [36]:
• Data and metadata should retain their meaning when they are moved from one container to another, because otherwise that meaning will degrade as they migrate through the network.
• Metadata should be able to be interpreted automatically as much as possible, because manual effort is not available at scale.
• An account of how data was produced is often more valuable than the data itself, and can span multiple, independent processes.
To implement these principles we have adopted semantic web technologies for representing metadata and ideas from content management systems (CMS) for managing data, and have co-developed and implemented the Open Provenance Model (OPM) [27] for describing complex process and data provenance. Rather than assuming that a single deployment framework such as a service oriented architecture (SOA) or workflow engine can subsume all the distributed resources that make up an e-Science environment, we have developed a “Context” abstraction that provides applications in a variety of different deployment scenarios (e.g., desktop, web, Grid) with a semantic content “view” of the resources at hand (e.g., file systems, databases, web services). Contexts can be “wrapped” around existing data providers, storage technologies, query engines, and services. Implementations are provided for file systems, relational databases, and RDF triple stores. Contexts can also be aggregated to provide unification, mirroring, failover, and a variety of other configurations that coordinate access to distributed, heterogeneous sources. Finally, Contexts can be used to perform computational inference, extract metadata from data, and enforce local access rules and policies.
Tupelo has been used to develop a suite of interoperable, context-aware tools, including the CyberIntegrator provenance-aware exploratory workflow tool, the CyberCollaboratory web-based collaboration tool, and the Digital Synthesis Framework for publishing interactive datasets. These tools have been deployed to create Knowledge Spaces supporting environmental and other sciences, and to provide provenance support for a growing collection of workflow projects as part of the Provenance Challenge workshop series [26, 28], which has brought together developers of workflow systems such as Kepler [21] and Taverna [32] in an attempt to achieve interoperability.

2. TUPELO ARCHITECTURE

The design of Tupelo has been informed by a number of other middleware architectures, most notably content management systems (CMS) (e.g., [1, 5]) and Grid computing [42]. Like content management systems, Tupelo manages information using an extensible content model that is decoupled from the storage and indexing technology used to manage it. Like Grid computing, Tupelo assumes that operations can be delegated and transported over the network, to allow large-scale distributed resources to be used. Unlike both approaches, Tupelo can provide uniform access to local or remote resources, even resources that are not under its control.
Tupelo is based on an abstraction called “context”, which represents a kind of semantic “view” of distributed resources. Context implementations are responsible for performing “operators”, which are atomic descriptions of requests to either retrieve or modify the contents of a context. Two primary kinds of operations are provided:
1. Metadata operations, including asserting and retracting statements (i.e., RDF statements) and searching for statements that match a query; and
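
To make the context and operator abstractions described above more concrete, the following is a minimal Python sketch. It is not Tupelo's actual API: the names `Assert`, `Retract`, `Match`, `MemoryContext`, and `FailoverContext`, the `perform` method, and the example URIs are all hypothetical and chosen only for illustration. The sketch assumes that operators are plain data objects describing a request, that a context performs them against whatever store it wraps (here an in-memory set of triples), and that an aggregating context can provide failover by delegating to the first child that responds.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# An RDF-style statement: (subject, predicate, object).
Triple = Tuple[str, str, str]


# Hypothetical operators: atomic descriptions of a request, not Tupelo's real classes.
@dataclass(frozen=True)
class Assert:
    triple: Triple


@dataclass(frozen=True)
class Retract:
    triple: Triple


@dataclass(frozen=True)
class Match:
    # None acts as a wildcard for that position.
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None


class MemoryContext:
    """Hypothetical context wrapping an in-memory set of statements."""

    def __init__(self, available: bool = True):
        self.triples: Set[Triple] = set()
        self.available = available  # simulates an unreachable store

    def perform(self, op):
        if not self.available:
            raise ConnectionError("context unavailable")
        if isinstance(op, Assert):
            self.triples.add(op.triple)
            return None
        if isinstance(op, Retract):
            self.triples.discard(op.triple)
            return None
        if isinstance(op, Match):
            pattern = (op.subject, op.predicate, op.obj)
            return {
                t for t in self.triples
                if all(want is None or want == got for want, got in zip(pattern, t))
            }
        raise TypeError(f"unsupported operator: {op!r}")


class FailoverContext:
    """Hypothetical aggregate: delegates to the first child that responds."""

    def __init__(self, children: List[MemoryContext]):
        self.children = list(children)

    def perform(self, op):
        last_error = None
        for child in self.children:
            try:
                return child.perform(op)
            except ConnectionError as err:
                last_error = err  # try the next child
        raise ConnectionError("no child context available") from last_error


# Usage: the primary store is down, so every operator lands on the backup.
primary = MemoryContext(available=False)
backup = MemoryContext()
space = FailoverContext([primary, backup])

space.perform(Assert(("urn:dataset/1", "dc:title", "Rainfall 2008")))
space.perform(Assert(("urn:dataset/1", "dc:creator", "urn:person/jane")))
titles = space.perform(Match(predicate="dc:title"))
```

Because the application only ever calls `perform` on a context, the same code would work unchanged if `space` were a single store or a different aggregate (for example one that mirrors writes to all children); that substitutability is the point of treating a context as a uniform “view”.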
