Semantic middleware for e-science knowledge spaces
- ISBN: 9781605588476
- DOI: 10.1145/1657120.1657124
Abstract
The Tupelo semantic content management middleware implements Knowledge Spaces that enable scientists to locate, use, link, annotate, and discuss data and metadata as they work with existing applications in distributed environments. Tupelo is built using a combination of commonly-used Semantic Web technologies for metadata management, content management technologies for data management, and workflow technologies for management of computation, and can interoperate with other tools using a variety of standard interfaces and a client and desktop API. Tupelo's primary function is to facilitate interoperability, providing a Knowledge Space "view" of distributed, heterogeneous resources such as institutional repositories, relational databases, and semantic web stores. Knowledge Spaces have driven recent work creating e-Science cyberenvironments to serve distributed, active scientific communities. Tupelo-based components deployed in desktop applications, on portals, and in AJAX applications interoperate to allow researchers to develop, coordinate and share datasets, documents, and computational models, while preserving process documentation and other contextual information needed to produce a complete and coherent research record suitable for distribution and archiving.
Joe Futrelle, Jeff Gaynor, Joel Plutchak, James D. Myers, Robert E. McGrath, Peter Bajcsy, Jason Kastner, Kailash Kotwani, Jong Sung Lee, Luigi Marini, Rob Kooper, Terry McLaren, Yong Liu
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
1205 W. Clark St., Urbana IL, 61801, USA
{futrelle, jgaynor, plutchak, jimmyers, mcgrath, pbajcsy, jkastner, kkotwani, jonglee, lmarini, kooper, tmclaren, yongliu}@illinois.edu

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed applications

Keywords
Semantic web, content management, e-science

1. INTRODUCTION
Scientific research is becoming increasingly distributed and multi-disciplinary, which brings with it new challenges of integrating the work of scientific communities across organizational and technical boundaries [29, 31]. A number of best-practice technologies from digital libraries, enterprise computing, and web publishing have been applied to scientific work, but have met with limited success because most of the technologies focus on centralized management of static content collections and
are therefore primarily used to archive or disseminate scientific results after the fact, which does little to help scientists produce better results more efficiently [36]. Science automation work has instead focused largely on workflow and Grid technologies for automating routine computational analysis, which not only makes dynamic exploratory development of models cumbersome (e.g., by requiring that scientists adopt a batch programming approach) [12] but also in practice leaves most of the intermediate data products in complex scientific work processes unaccounted for outside of the immediate execution context that produced them.

Because of the limitations of these approaches, much scientific data has been embedded in structural containers (e.g., file systems, specialized databases) that are typically organized based on assumed subject matter, level of granularity, hierarchical organization, and object structure, preventing users with different assumptions or organizational schemes from finding or accessing relevant information. Knowledge is typically managed by embedding it as metadata in content objects or in similarly rigid containers, such as scripts or application code, limiting its ability to be shared and “remixed” with other metadata to enable new ways of exploring data and to augment existing knowledge.

Several existing approaches address some of these issues but are missing capabilities that are critical for large-scale e-Science. Semantic web technologies provide explicit semantic representations and global identification, providing strong guarantees that heterogeneous metadata can be represented and linked without reducing its semantic specificity.
But semantic web technologies are difficult to apply to the scientific use case because much of the available tooling provides only centralized indexing and querying of metadata, with little attention given to linking it to data or providing services for collaborative authorship and exchange of metadata descriptions. Content management systems (e.g., Jackrabbit [1], Drupal [5]) and institutional repositories (e.g., Fedora [7], DSpace [6]), as well as more automated systems such as iRODS [25], provide generalized content management and support for curation and collaborative authorship (e.g., OAI-ORE [18]), but assume centralized control of data and metadata, and either preserve metadata as a static content object or provide only “dumbed down” metadata support, such as
• Data and metadata should retain their meaning when they are moved from one container to another, because otherwise that meaning will degrade as they migrate through the network.
• Metadata should be able to be interpreted automatically as much as possible, because manual effort is not available at scale.
• An account of how data was produced is often more valuable than the data itself, and can span multiple, independent processes.
To implement these principles, we have adopted semantic web technologies for representing metadata, borrowed ideas from content management systems (CMS) for managing data, and co-developed and implemented the Open Provenance Model (OPM) [27] for describing complex process and data provenance. Rather than assuming that a single deployment framework such as a service oriented architecture (SOA) or workflow engine can subsume all the distributed resources that make up an e-Science environment, we have developed a “Context” abstraction that provides applications in a variety of different deployment scenarios (e.g., desktop, web, Grid) with a semantic content “view” of the resources at hand (e.g., file systems, databases, web services). Contexts can be “wrapped” around existing data providers, storage technologies, query engines, and services. Implementations are provided for file systems, relational databases, and RDF triple stores. Contexts can also be aggregated to provide unification, mirroring, failover, and a variety of other configurations that coordinate access to distributed, heterogeneous sources. Finally, Contexts can be used to perform computational inference, extract metadata from data, and enforce local access rules and policies.

Tupelo has been used to develop a suite of interoperable, context-aware tools, including the CyberIntegrator provenance-aware exploratory workflow tool, the CyberCollaboratory web-based collaboration tool, and the Digital Synthesis Framework for publishing interactive datasets. These tools have been deployed to create Knowledge Spaces supporting environmental and other sciences, and to provide provenance support for a growing collection of workflow projects as part of the Provenance Challenge workshop series [26, 28], which has brought together developers of workflow systems such as Kepler [21] and Taverna [32] in an attempt to achieve interoperability.

2. TUPELO ARCHITECTURE
The design of Tupelo has been informed by a number of other middleware architectures, most notably content management systems (CMS) (e.g., [1, 5]) and Grid computing [42]. Like content management systems, Tupelo manages information using an extensible content model that is decoupled from the storage and indexing technology used to manage it. Like Grid computing, Tupelo assumes that operations can be delegated and transported over the network, to allow large-scale distributed resources to be used. Unlike both approaches, Tupelo can provide uniform access to local or remote resources, even resources that are not under its control.

Tupelo is based on an abstraction called “context”, which represents a kind of semantic “view” of distributed resources. Context implementations are responsible for performing “operators”, which are atomic descriptions of requests to either retrieve or modify the contents of a context. Two primary kinds of operations are provided:

1. Metadata operations, including asserting and retracting statements (i.e., RDF statements) and searching for statements that match a query; and
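
As a rough illustration of the context and operator ideas described in this section, the following Python sketch models metadata operators as small request objects and shows one way an aggregate context might unify two underlying contexts. It is not Tupelo's API: the class names (`Assert`, `Retract`, `Find`, `MemoryContext`, `UnionContext`), the `perform` method, the write-routing policy, and the `urn:example:` identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# A statement is modeled as a (subject, predicate, object) triple of strings.
Triple = Tuple[str, str, str]


# Operators are plain descriptions of a request; they carry no behavior.
@dataclass(frozen=True)
class Assert:
    triple: Triple


@dataclass(frozen=True)
class Retract:
    triple: Triple


@dataclass(frozen=True)
class Find:
    # None acts as a wildcard for that position of the triple.
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None


class MemoryContext:
    """Hypothetical context backed by an in-memory set of triples."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def perform(self, op):
        if isinstance(op, Assert):
            self.triples.add(op.triple)
            return None
        if isinstance(op, Retract):
            self.triples.discard(op.triple)
            return None
        if isinstance(op, Find):
            pattern = (op.subject, op.predicate, op.obj)
            return {
                t for t in self.triples
                if all(want is None or want == got
                       for want, got in zip(pattern, t))
            }
        raise TypeError(f"unsupported operator: {op!r}")


class UnionContext:
    """Hypothetical aggregate: unified reads over several child contexts."""

    def __init__(self, *children) -> None:
        self.children = children

    def perform(self, op):
        if isinstance(op, Find):
            # Reads return the union of matches from every child.
            found: Set[Triple] = set()
            for child in self.children:
                found |= child.perform(op)
            return found
        if isinstance(op, Assert):
            # Assumed policy: new statements are written to the first child.
            return self.children[0].perform(op)
        if isinstance(op, Retract):
            # Assumed policy: retractions are applied to every child.
            for child in self.children:
                child.perform(op)
            return None
        raise TypeError(f"unsupported operator: {op!r}")


# Usage: two independent stores seen through one unified view.
local, remote = MemoryContext(), MemoryContext()
creator = ("urn:example:dataset1", "urn:example:creator", "urn:example:alice")
title = ("urn:example:dataset1", "urn:example:title", "Example dataset")
remote.perform(Assert(creator))
space = UnionContext(local, remote)
space.perform(Assert(title))
results = space.perform(Find(subject="urn:example:dataset1"))
```

Because callers only ever hand an operator to `perform`, the same request works against a single store or an aggregate. Other configurations mentioned earlier, such as mirroring or failover, would amount to different routing policies inside a wrapper like `UnionContext`.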