BioMed Central, BMC Bioinformatics, Open Access, Methodology

Advancing translational research with the Semantic Web

Alan Ruttenberg1, Tim Clark2, William Bug3, Matthias Samwald4, Olivier Bodenreider5, Helen Chen6, Donald Doherty7, Kerstin Forsberg8, Yong Gao9, Vipul Kashyap10, June Kinoshita11, Joanne Luciano12, M Scott Marshall13, Chimezie Ogbuji14, Jonathan Rees15, Susie Stephens16, Gwendolyn T Wong11, Elizabeth Wu11, Davide Zaccagnini17, Tonya Hongsermeier10, Eric Neumann18, Ivan Herman19 and Kei-Hoi Cheung*20

Address: 1Millennium Pharmaceuticals, Cambridge, MA, USA, 2Initiative in Innovative Computing, Harvard University, Cambridge, MA, USA, 3Laboratory for Bioimaging and Anatomical Informatics, Department of Neurobiology and Anatomy, Drexel University College of Medicine, Philadelphia, PA, USA, 4Section on Medical Expert and Knowledge-Based Systems, Medical University of Vienna, Vienna, Austria, 5National Library of Medicine, Bethesda, MD, USA, 6Agfa Healthcare, Waterloo, Ontario, Canada, 7Brainstage Research, Pittsburgh, PA, USA, 8AstraZeneca, Mölndal, Sweden, 9MassGeneral Institute for Neurodegenerative Disease, Massachusetts General Hospital, Charlestown, MA, USA, 10Partners HealthCare System, Wellesley, MA, USA, 11Alzheimer Research Forum, Boston, MA, USA, 12Harvard Medical School, Boston, MA, USA, 13Integrative Bioinformatics Unit, University of Amsterdam, Amsterdam, The Netherlands, 14Cleveland Clinic Foundation, Cleveland, OH, USA, 15Science Commons, Cambridge, MA, USA, 16Oracle, Burlington, MA, USA, 17Language & Computing, Reston, VA, USA, 18Teranode Corporation, Seattle, WA, USA, 19World Wide Web Consortium (W3C) and 20Center for Medical Informatics, Yale University School of Medicine, New Haven, CT, USA

Email: Alan Ruttenberg - firstname.lastname@example.org; Tim Clark - email@example.com; William Bug - William.Bug@drexelmed.edu; Matthias Samwald - firstname.lastname@example.org; Olivier Bodenreider - email@example.com; Helen Chen - firstname.lastname@example.org; Donald Doherty - email@example.com; Kerstin Forsberg - firstname.lastname@example.org; Yong Gao - email@example.com; Vipul Kashyap - firstname.lastname@example.org; June Kinoshita - email@example.com; Joanne Luciano - firstname.lastname@example.org; M Scott Marshall - email@example.com; Chimezie Ogbuji - firstname.lastname@example.org; Jonathan Rees - email@example.com; Susie Stephens - firstname.lastname@example.org; Gwendolyn T Wong - email@example.com; Elizabeth Wu - firstname.lastname@example.org; Davide Zaccagnini - email@example.com; Tonya Hongsermeier - firstname.lastname@example.org; Eric Neumann - email@example.com; Ivan Herman - firstname.lastname@example.org; Kei-Hoi Cheung* - email@example.com

* Corresponding author

Published: 9 May 2007
BMC Bioinformatics 2007, 8(Suppl 3):S2 doi:10.1186/1471-2105-8-S3-S2
From the supplement "Semantic E-Science in Biomedicine", edited by Yimin Wang, Zhaohui Wu and Huajun Chen (Research).
This article is available from: http://www.biomedcentral.com/1471-2105/8/S3/S2
© 2007 Ruttenberg et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: A fundamental goal of the U.S. National Institutes of Health (NIH) "Roadmap" is to strengthen Translational Research, defined as the movement of discoveries in basic research to application at the clinical level. A significant barrier to translational research is the lack of uniformly structured data across related biomedical domains. The Semantic Web is an extension of the current Web that enables navigation and meaningful use of digital resources by automatic processes. It is based on common formats that support aggregation and integration of data drawn from diverse sources. A variety of technologies have been built on this foundation that, together, support identifying, representing, and reasoning across a wide range of biomedical data. The Semantic Web Health Care and Life Sciences Interest Group (HCLSIG), set up within the framework of the World Wide Web Consortium, was launched to explore the application of these technologies in a variety of areas. Subgroups focus on making biomedical data available in RDF, working with biomedical ontologies, prototyping clinical decision support systems, working on drug safety and efficacy communication, and supporting disease researchers navigating and annotating the large amount of potentially relevant literature.
Results: We present a scenario that shows the value of the information environment the Semantic Web can support for aiding neuroscience researchers. We then report on several projects by members of the HCLSIG, in the process illustrating the range of Semantic Web technologies that have applications in areas of biomedicine.

Conclusion: Semantic Web technologies present both promise and challenges. Current tools and standards are already adequate to implement components of the bench-to-bedside vision. On the other hand, these technologies are young. Gaps in standards and implementations still exist and adoption is limited by typical problems with early technology, such as the need for a critical mass of practitioners and installed base, and growing pains as the technology is scaled up. Still, the potential of interoperable knowledge sources for biomedicine, at the scale of the World Wide Web, merits continued work.

Background

Translational research and the information ecosystem

Starting in 2002, the NIH began a process of charting a "roadmap" for medical research in the 21st century, identifying gaps and opportunities in biomedical research that crossed the boundaries of then extant research institutions. A key initiative that came out of this review is a move to strengthen Translational Research, defined as the movement of discoveries in basic research (the Bench) to application at the clinical level (the Bedside).

Much of the ability of biomedical researchers and health care practitioners to work together (exchanging ideas, information, and knowledge across organizational, governance, socio-cultural, political, and national boundaries) is mediated by the Internet and its ever-increasing digital resources.
These resources include scientific literature, experimental data, summaries of knowledge of gene products, diseases, and compounds, and informal scientific discourse and commentary in a variety of forums. Together this information comprises the scientific "information ecosystem". Despite the revolution of the Web, the structure of this information, as evidenced by a large number of heterogeneous data formats, continues to reflect a high degree of idiosyncratic domain specialization, lack of schematization, and schema mismatch.

The lack of uniformly structured data affects many areas of biomedical research, including drug discovery, systems biology, and individualized medicine, all of which rely heavily on integrating and interpreting data sets produced by different experimental methods at different levels of granularity. Complicating matters is that advances in instrumentation and data acquisition technologies, such as high-throughput genotyping, DNA microarrays, protein arrays, mass spectrometry, and high-volume anonymized clinical research and patient data are resulting in an exponential growth of healthcare as well as life science data. This data has been provided in numerous disconnected databases, sometimes referred to as data silos. It has become increasingly difficult to even discover these databases, let alone characterize them.

Together, these aspects of the current information ecosystem work against the interdisciplinary knowledge transfer needed to improve the bench-to-bedside process.

Curing and preventing disease requires a synthesis of understanding across disciplines

In applying research to cure and prevent diseases, an integrated understanding across subspecialties becomes essential. Consider the study of neurodegenerative diseases such as Parkinson's Disease (PD), Alzheimer's Disease (AD), Huntington's Disease (HD), Amyotrophic Lateral Sclerosis (ALS), and others.
Research on these diseases spans the disciplines of psychiatry, neurology, microscopic anatomy, neuronal physiology, biochemistry, genetics, molecular biology, and bioinformatics.

As an example, AD affects four million people in the U.S. population, causes great suffering, and incurs enormous healthcare costs. Yet there is still no agreement on exactly how it is caused, or where best to intervene to treat or prevent it. The Alzheimer Research Forum records more than twenty-seven significant hypotheses related to aspects of the etiology of AD, most of them combining supporting data and interpretations from multiple biomedical specialist areas.

One recent hypothesis on the cause of AD illustrates the typical situation. The hypothesis combines data from research in mouse genetics, cell biology, animal neuropsychology, protein biochemistry, neuropathology, and other areas. Though commensurate with the "ADDL hypothesis" of AD etiology, essential claims in Lesné et al. conflict with those in other equally well-supported hypotheses, such as the amyloid cascade and alternative amyloid cascade.

Consider also HD, an inherited neurodegenerative disease. Although its genetic basis is relatively simple and it has been a model for autosomal dominant neurogenetic disorders for many years, the mechanisms by which the disorder causes pathology are still not understood. In the case of PD, despite its having been studied for many decades, there are profound difficulties with some of the
existing treatments [9,10], and novel or modified treatments are still being developed [11,12].

Increasingly, researchers recognize that AD, PD, and HD share various features at the clinical, neural [14-17], cellular [18-20], and molecular levels [21,22]. Nonetheless, it is still common for biologists in different subspecialties to be unaware of the key literature in one another's domain.

These observations lead us to a variety of desiderata for the information environment that can support such synthesis. It should take advantage of the Web's ability to enable dissemination of and access to vast amounts of information. Queries need to be made across experimental data regardless of the community in which it originates. Making cross-disease connections and combining knowledge from the molecular to the clinical level has to be practical in order to enable cross-disciplinary projects. Both well-structured, standardized representation of data and the linking and discovery of convergent and divergent interpretations of it must be supported to serve the activities of scientists and clinicians. Finally, the elements of this information environment should be linked to both the current and evolving scientific publication process and culture.

The Semantic Web

The Semantic Web [23,24] is an extension of the current Web that enables navigation and meaningful use of digital resources by automatic processes. It is based on common formats that support aggregation and integration of data drawn from diverse sources.

Currently, links on Web pages are uncharacterized. There is no explicit information that tells a machine that the mRNA described by <a href="/entrez/viewer.fcgi?val=NM_000546.2"> on the Entrez page about the Human TP53 gene is related to TP53 in any specific way.
By contrast, on the Semantic Web, the relationship between the gene and the transcribed mRNA product would be captured in a statement that identifies the two entities and the type of the relationship between them. Such statements are called "triples" because they consist of three parts: subject, predicate, and object. In this case we might say that the subject is human TP53 gene, the predicate (or relationship) hasGeneProduct, and the object human TP53 mRNA. Just as the subject and object (the pages describing the gene and mRNA) are identified by Uniform Resource Identifiers (URIs), so, too, is the relationship, the full name of which might be http://www.ncbi.nlm.nih.gov/entrez/hasGeneProduct. A Web browser viewing that location might show the human readable definition of the relationship.

Since URIs can be used as names, all information accessible on the Web today can be part of statements in the Semantic Web. If two statements refer to identical URIs, this means that their subjects of discourse are identical. This makes it possible to merge data references. This process is the basis of data and knowledge integration on the Semantic Web.

With this as a foundation, a number of existing approaches for organizing knowledge are being adapted for use on the Semantic Web. Among these are thesauri, ontologies, rule systems, frame based representation systems, and various other forms of knowledge representation. Together, the uniform naming of elements of discourse by URIs, the shared standards and technologies around these methods of organization, and the growing set of shared practices in using those, are known as Semantic Web technologies.

The formal definition of relations among Web resources is at the basis of the Semantic Web. The Resource Description Framework (RDF) is one of the fundamental building blocks of the Semantic Web, and gives a formal specification for the syntax and semantics of statements (triples).
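The triple structure and the URI-based merging described above can be illustrated with a minimal Python sketch. It is not an RDF implementation: triples are plain tuples, and every URI except the hasGeneProduct name suggested in the text is a hypothetical placeholder under example.org, as are the describedBy relationship and the helper statements_about.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One statement: subject, predicate, and object, each named by a URI."""
    subject: str
    predicate: str
    object: str


# The predicate URI is the one the text says the relationship "might" have.
HAS_GENE_PRODUCT = "http://www.ncbi.nlm.nih.gov/entrez/hasGeneProduct"
# Everything below is a hypothetical placeholder, invented for illustration.
TP53_GENE = "http://example.org/gene/TP53"
TP53_MRNA = "http://example.org/mrna/NM_000546.2"
DESCRIBED_BY = "http://example.org/vocab/describedBy"
MRNA_PAGE = "http://example.org/page/NM_000546.2"

# Two independently published sets of statements.
source_a: Set[Triple] = {Triple(TP53_GENE, HAS_GENE_PRODUCT, TP53_MRNA)}
source_b: Set[Triple] = {Triple(TP53_MRNA, DESCRIBED_BY, MRNA_PAGE)}

# Because both sources name the mRNA with the identical URI, merging is a
# plain set union: no schema mapping or record matching is required.
merged: Set[Triple] = source_a | source_b


def statements_about(graph: Set[Triple], uri: str) -> Set[Triple]:
    """Return every statement in which the URI is the subject or the object."""
    return {t for t in graph if uri in (t.subject, t.object)}


for statement in sorted(statements_about(merged, TP53_MRNA)):
    print(statement)
```

After the union, the shared mRNA URI connects a statement from each source, which is the sense in which identical URIs make data references mergeable.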
Beyond RDF, a number of additional building blocks are necessary to achieve the Semantic Web vision.

• The specification of a query language, SPARQL, by which one can retrieve answers from a body of statements.
• Languages to define the controlled vocabularies and ontologies that aid interoperability: the RDF Schema (RDFS), Simple Knowledge Organization System (SKOS), and the Web Ontology Language (OWL).
• Tools and strategies to extract or translate from non-RDF data sources to enable their interoperability with data organized as statements. For example, GRDDL (Gleaning Resource Descriptions from Dialects of Languages) defines a way of associating XML with a transformation that turns it into RDF. There are also a variety of RDF extraction tools and interfaces to traditional databases.

Specifications of some of these technologies have been published and are stable, while others are still under development. RDF and OWL are about three years old, a long time on the Web scale, but not such a long time for the development of good tools and general acceptance by the technical community. Other technology specifications (SKOS, GRDDL, SPARQL, etc.) will only be published as standards in the coming years, though usable implementations already exist.
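To give a rough sense of what retrieving answers from a body of statements involves, the following Python sketch matches a single triple pattern against a small list of tuples. It is only loosely analogous to one SPARQL triple pattern and is not a SPARQL engine; the function match, the example.org URIs, and the describedBy relationship are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), each a URI

# Hypothetical URIs, except the predicate name suggested in the text.
GENE = "http://example.org/gene/TP53"
MRNA = "http://example.org/mrna/NM_000546.2"
PAGE = "http://example.org/page/NM_000546.2"
HAS_GENE_PRODUCT = "http://www.ncbi.nlm.nih.gov/entrez/hasGeneProduct"
DESCRIBED_BY = "http://example.org/vocab/describedBy"

graph = [
    (GENE, HAS_GENE_PRODUCT, MRNA),
    (MRNA, DESCRIBED_BY, PAGE),
]


def match(
    statements: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield statements that agree with every position that is not None.

    A None position acts like a query variable: it matches any value.
    """
    for s, p, o in statements:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Roughly the question a SPARQL query of this shape would ask:
#   SELECT ?product WHERE { <GENE> <HAS_GENE_PRODUCT> ?product }
products = [o for _, _, o in match(graph, subject=GENE, predicate=HAS_GENE_PRODUCT)]
print(products)
```

A real query language adds joins across several patterns, filters, and result ordering; the sketch covers only the single-pattern case.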