Extending Semantic Provenance into the Web of Data
IEEE Internet Computing (2011)
- ISSN: 10897801
- DOI: 10.1109/MIC.2011.7
Abstract
In this article, the authors provide an example workflow, and a simple classification of user questions on the workflow's data products, to combine and interchange contextual metadata through a semantic data model and infrastructure. They also analyze their approach's potential to support enhanced semantic provenance applications.
Jun Zhao1, Satya S. Sahoo2, Paolo Missier3, Amit Sheth2, Carole Goble3
1 Department of Zoology, University of Oxford, UK, 2 Kno.e.sis Center, Wright State University, USA,
3 School of Computer Science, University of Manchester, UK
1jun.zhao@zoo.ox.ac.uk, 2{sahoo.2,amit.sheth}@wright.edu, 3{pmissier,carole}@cs.man.ac.uk
Abstract
The importance of tracking and querying the provenance of experimental data for scientific applications is only
now beginning to emerge, as a number of provenance management systems reach maturity. In addition to
provenance, domain-specific semantic annotations to data products and the Web of Data are playing an
increasingly important role as contextual metadata that can be used to assist with the interpretation of
experimental data. In this article we use an example workflow, and a simple classification of user questions on
the workflow’s data products, to explore the combination of these three strands of contextual metadata through a
semantic data model and infrastructure, and their potential to support enhanced semantic provenance
applications.
Index Terms
Primary classification: H.3.4 Systems and Software (Semantic Web); J.3 Life and Medical Sciences (Biology
and genetics)
Additional classification: D.2.1 Requirements/Specifications; D.2.12 Interoperability (Data mapping)
General Terms: Design, Software
Introduction
The increasing use of computing resources is transforming the way scientific research is carried out, and in the
process it is creating a vast amount of scientific data, especially in the life sciences domain. The challenge now
facing both life sciences and computer science researchers is not in data generation, but rather, in making sure
that any member of a scientific community has the means to correctly interpret automatically generated
information, possibly a long time after it has been produced. This involves complementing observational or
experimental data with various types of annotations, as well as with other contextual metadata. In this article we
focus specifically on provenance metadata, which describes the way data has been produced, and on semantic
annotations, whereby domain-specific terms from some agreed-upon collection of vocabularies are used to
clarify the meaning of the data. Furthermore, we restrict our attention to workflow provenance, that is, the
provenance of data products that are obtained through a (generally automated) computational process consisting
of an orchestration of individual tasks.
The recently emerging Linked Open Data (LOD)1 cloud, i.e., the Web of Data, provides a third kind of
contextual metadata. The LOD initiative promotes the publication of data in a machine-accessible format and the
linking of heterogeneous data items. It has led to the large-scale publication of interlinked data items, including
scientific datasets such as UniProt, KEGG, Reactome, DrugBank, and NCBI Entrez Gene. These datasets form
a vast graph that can be explored and navigated uniformly because all of them are represented in the RDF
data model.
Past research provides anecdotal evidence of how each of these three context elements, taken independently, can
be used effectively. Workflow provenance is useful to answer user queries regarding data products computed by
different workflow systems; semantic, domain-specific annotations find their applications mainly in the area of
information interpretation and integration; and the Web of Data exposes a vast amount of scientific data in
a structured format that can be searched using the standard RDF query language SPARQL2.
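As a rough illustration of how the three kinds of contextual metadata can sit in one graph, the following minimal Python sketch stores subject-predicate-object triples in a list and answers a simple provenance question by pattern matching. It is not SPARQL and not the authors' model: every identifier in it (the `ex:` names, `match`, `inputs_of`) is hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical example data; none of these identifiers come from the article.
TRIPLES: List[Triple] = [
    # Workflow provenance: how a data product was produced.
    ("ex:output1", "ex:wasGeneratedBy", "ex:alignmentTask"),
    ("ex:alignmentTask", "ex:used", "ex:input1"),
    # Semantic annotation: what a data item means in domain terms.
    ("ex:input1", "rdf:type", "ex:ProteinSequence"),
    # Web of Data link: a pointer to a record in an external dataset.
    ("ex:input1", "rdfs:seeAlso", "ex:externalProteinRecord"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def inputs_of(triples: List[Triple], product: str) -> List[str]:
    """Return the inputs used by the task(s) that generated `product`."""
    result = []
    for _, _, task in match(triples, s=product, p="ex:wasGeneratedBy"):
        for _, _, used in match(triples, s=task, p="ex:used"):
            result.append(used)
    return result
```

Calling `inputs_of(TRIPLES, "ex:output1")` follows the provenance edges back to the input, and a further `match(TRIPLES, s="ex:input1")` would retrieve its annotation and external link from the same graph. A real deployment would more likely hold the triples in an RDF store and express the same pattern as a SPARQL query.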
1 http://www.linkeddata.org
2 http://www.w3.org/TR/rdf-sparql-query/


