An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence.
- PubMed: 18395495
Satya S. Sahoo1,2, Olivier Bodenreider2,$, Joni L. Rutter3, Karen J. Skinner3, and Amit P. Sheth1
1Kno.e.sis Center, Wright State University, Dayton, OH
2LHNCBC, National Library of Medicine, Bethesda, MD
3DBNBR, National Institute on Drug Abuse, Bethesda, MD
Abstract
Objectives—This paper illustrates how Semantic Web technologies (especially RDF, OWL, and
SPARQL) can support information integration and make it easy to create semantic mashups
(semantically integrated resources). In the context of understanding the genetic basis of nicotine
dependence, we integrate gene and pathway information and show how three complex biological
queries can be answered by the integrated knowledge base.
Methods—We use an ontology-driven approach to integrate two gene resources (Entrez Gene and
HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc), for five organisms,
including humans. We created the Entrez Knowledge Model (EKoM), an information model in OWL
for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway
resources. The integrated schema is populated with data from the pathway resources, publicly
available in BioPAX-compatible format, and gene resources for which a population procedure was
created. The SPARQL query language is used to formulate queries over the integrated knowledge
base to answer the three biological queries.
Results—Simple SPARQL queries could easily identify hub genes, i.e., those genes whose gene
products participate in many pathways or interact with many other gene products. The identification
of the genes expressed in the brain turned out to be more difficult, due to the lack of a common
identification scheme for proteins.
Conclusion—Semantic Web technologies provide a valid framework for information integration
in the life sciences. Ontology-driven integration represents a flexible, sustainable and extensible
solution to the integration of large volumes of information. Additional resources, which enable the
creation of mappings between information sources, are required to compensate for heterogeneity
across namespaces.
Resource page—
http://knoesis.wright.edu/research/lifesci/integration/structured_data/JBI-2008/
© 2008 Elsevier Inc. All rights reserved.
$Corresponding author: Dr. Olivier Bodenreider, National Library of Medicine, 8600 Rockville Pike - MS 3841 (Bldg 38A, Rm B1N28U),
Bethesda, MD 20894 - USA, phone: 301 435-3246 - fax: 301 480-3035, olivier@nlm.nih.gov.
NIH Public Access
Author Manuscript
J Biomed Inform. Author manuscript; available in PMC 2009 October 23.
Published in final edited form as:
J Biomed Inform. 2008 October ; 41(5): 752–765. doi:10.1016/j.jbi.2008.02.006.
Keywords
Semantic Web; Semantic mashup; Nicotine dependence; Information integration; Ontologies
1 Introduction
It is estimated that, worldwide, over one billion people smoke tobacco. The detrimental
consequences of smoking on health are well known and include coronary heart disease, lung
cancer and chronic obstructive pulmonary disease. The heritability of nicotine dependence has
long been established and we know that approximately 40–60% of nicotine dependence is due
to genetic contributions, while the remainder is largely environmental [1–3]. In the past few
years, genome-wide linkage and association studies have identified several candidate genes
(e.g., GABAB2, CHRNA4, DDC, BDNF, and COMT) [4–6]. Saccone et al. identified and
screened 449 human genes putatively involved with nicotine dependence [6]. In addition to
identifying the genes, it is important to understand their functions and interactions, including
their involvement in biological pathways. For example, from a research management
perspective, identifying “hub” genes (i.e., genes involved in multiple pathways) can help
prioritize further research efforts.
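The notion of a hub gene can be made concrete with a small sketch. The Python below is only an illustration, not the method used in this paper: the pathway labels are placeholders rather than real annotations, and the threshold `min_pathways` is an assumed parameter, since the text only says "multiple pathways".

```python
from collections import defaultdict

# Illustrative (gene, pathway) pairs. The gene symbols appear in the text above,
# but the pathway labels are placeholders, not real annotations.
GENE_PATHWAY_PAIRS = [
    ("CHRNA4", "pathway_A"),
    ("CHRNA4", "pathway_B"),
    ("CHRNA4", "pathway_C"),
    ("BDNF", "pathway_A"),
    ("BDNF", "pathway_B"),
    ("COMT", "pathway_C"),
    ("COMT", "pathway_C"),  # duplicate record, counted once
]


def find_hub_genes(pairs, min_pathways=2):
    """Return {gene: count} for genes found in at least min_pathways distinct pathways."""
    pathways_by_gene = defaultdict(set)
    for gene, pathway in pairs:
        pathways_by_gene[gene].add(pathway)
    return {
        gene: len(pathways)
        for gene, pathways in pathways_by_gene.items()
        if len(pathways) >= min_pathways
    }


hubs = find_hub_genes(GENE_PATHWAY_PAIRS)
```

Collecting pathways in a set per gene means repeated records for the same gene and pathway do not inflate the count.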
Complex biological queries generally require the integration of information from several
sources. For example, gene information sources, such as Entrez Gene [7], might need to be
integrated with pathway information sources, such as KEGG (Kyoto Encyclopedia of Genes
and Genomes) [8]. Moreover, comparing results across model organisms requires homology
information (provided for example by HomoloGene [9]). These resources, described in detail
later in section 4.4, are generally cross-referenced, which makes it possible for users to navigate
among them in web-based environments. However, interlinking is not the same as integration,
and these resources do not support the automatic and high-throughput information processing
required for answering complex queries over large amounts of data from heterogeneous
sources. An effective integration strategy is also critical to support the e-Science paradigm that
is characterized by the large volumes of data generated by industrial-scale in silico processes
[10].
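To illustrate the difference between following cross-references by hand and querying integrated data, the sketch below joins three toy sources (genes, homology groups, pathway membership) in a single function call. All identifiers and records are hypothetical and stand in for the kind of content held by Entrez Gene, HomoloGene and a pathway resource. The in-memory dictionaries are an assumption made for brevity, not the architecture described in this paper.

```python
# Three toy "sources" with hypothetical identifiers; none of these records
# are taken from Entrez Gene, HomoloGene or KEGG.
GENES = {
    "gene:1": {"symbol": "CHRNA4", "organism": "human"},
    "gene:2": {"symbol": "Chrna4", "organism": "mouse"},
    "gene:3": {"symbol": "BDNF", "organism": "human"},
}
HOMOLOGY_GROUPS = {"group:1": {"gene:1", "gene:2"}}
PATHWAY_MEMBERS = {
    "pathway:X": {"gene:1"},
    "pathway:Y": {"gene:2"},
    "pathway:Z": {"gene:3"},
}


def homologs_of(gene_id):
    """Return the gene itself plus every gene sharing a homology group with it."""
    related = {gene_id}
    for members in HOMOLOGY_GROUPS.values():
        if gene_id in members:
            related |= members
    return related


def pathways_across_organisms(gene_id):
    """Map each pathway to the organisms in which the gene or a homolog participates."""
    related = homologs_of(gene_id)
    result = {}
    for pathway, members in PATHWAY_MEMBERS.items():
        organisms = sorted(GENES[g]["organism"] for g in members & related)
        if organisms:
            result[pathway] = organisms
    return result


answer = pathways_across_organisms("gene:1")
```

With interlinked but unintegrated resources, a user would have to perform each of these lookups manually, one web page at a time.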
The first obstacle to integration is the format used for the representation of these information
sources. The resources available from the National Center for Biotechnology Information
(NCBI) Entrez system, such as Entrez Gene and HomoloGene, are available in multiple
formats, including XML. Although XML standardizes the representation of information from
a syntactic perspective, it does not make explicit the relations among the various types of
entities in a given resource or across resources. In other words, although the XML file for
Entrez Gene is machine-processable, it cannot be integrated easily or automatically with other
information sources without human intervention. In contrast, the pathway research community
has created a common, formal knowledge model called BioPAX [11] to represent biological
pathway data. BioPAX also provides an information model for representing those data with
formally defined semantics, which includes explicitly modeling the relationships between
different pathway entities.
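The contrast between syntactic and semantic representation can be sketched as follows. In the Python below, the XML record, its element names, and the predicate names (`has_symbol`, `has_product`, `participates_in`) are all hypothetical: they are neither the Entrez Gene XML schema nor BioPAX properties. The point is only that the meaning of the XML nesting must be supplied by a person (here, the `ELEMENT_TO_PREDICATE` mapping), whereas data with explicitly named relations can be merged and traversed directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record; the element names are NOT the real Entrez Gene schema.
XML_RECORD = """
<gene id="gene:1">
  <symbol>CHRNA4</symbol>
  <product id="protein:1"/>
</gene>
"""

# A person has to decide what each nesting means; the XML itself does not say.
ELEMENT_TO_PREDICATE = {"symbol": "has_symbol", "product": "has_product"}


def xml_to_triples(xml_text):
    """Turn the nested record into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    triples = set()
    for child in root:
        predicate = ELEMENT_TO_PREDICATE[child.tag]
        obj = child.get("id") or (child.text or "").strip()
        triples.add((subject, predicate, obj))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


gene_triples = xml_to_triples(XML_RECORD)
# Pathway-side data already names its relations explicitly (placeholder triple).
pathway_triples = {("protein:1", "participates_in", "pathway:X")}
merged = gene_triples | pathway_triples

# Follow two explicit relations: gene -> product -> pathway.
pathways = {
    pathway
    for (_, _, product) in match(merged, s="gene:1", p="has_product")
    for (_, _, pathway) in match(merged, s=product, p="participates_in")
}
```

Once both sources are expressed as triples, merging them is a set union and the query is a chain of pattern matches over named relations.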
Recent research in Semantic Web technologies has delivered promising results for information
integration across heterogeneous knowledge sources [12–15]. In effect, the Semantic Web
provides a robust framework that enables the integration, sharing, and reuse of data from
multiple sources. Additionally, the use of a representation based on a formal language allows
software applications to reason over information. Commonly used Semantic Web technologies
include ontology modeling languages such as the Web Ontology Language (OWL) [16], data
models such as the Resource Description Framework (RDF) [17], the SPARQL query language