Word add-in for ontology recognition: semantic enrichment of scientific literature

by J Lynn Fink, Pablo Fernicola, Rahul Chandran, Savas Parastatidis, Alex Wade, Oscar Naim, Gregory B Quinn, Philip E Bourne
BMC Bioinformatics (2010)

Abstract

Background: In the current era of scientific research, efficient communication of information is paramount. As such, the nature of scholarly and scientific communication is changing; cyberinfrastructure is now absolutely necessary, and new media are allowing information and knowledge to be more interactive and immediate. One approach to making knowledge more accessible is the addition of machine-readable semantic data to scholarly articles.

Results: The Word add-in presented here will assist authors in this effort by automatically recognizing and highlighting words or phrases that are likely information-rich, and by allowing authors to associate semantic data with those words or phrases and to embed that data in the document as XML. The add-in and source code are publicly available at http://www.codeplex.com/UCSDBioLit.

Conclusions: The Word add-in for ontology term recognition makes it possible for an author to add semantic data to a document as it is being written, and it encodes these data using XML tags that are effectively a standard in life sciences literature. Allowing authors to mark up their own work will help increase the amount and quality of machine-readable literature metadata.

SOFTWARE Open Access

J Lynn Fink1*, Pablo Fernicola2, Rahul Chandran1, Savas Parastatidis2, Alex Wade2, Oscar Naim2, Gregory B Quinn3, Philip E Bourne1

Background

In the current era of scientific research, efficient communication of information is paramount. Scientists are uncomfortably aware of the exponential growth of digital literature archives and the disproportionate growth of effective data-mining tools. It is currently a major effort in the bioinformatics community to automate the extraction of knowledge from literature [1,2].
Automated knowledge extraction is crucial for 21st century research, especially as research is becoming increasingly interdisciplinary and needs to be easier to navigate, to support the translation of natural language to information quanta, and to support data integration efforts [3-5]. In response, the nature of scholarly and scientific communication is changing; cyberinfrastructure is now absolutely necessary, and new media are allowing information and knowledge to be more interactive and immediate [6,7].

While this revolution in scholarly communication has been imminent, the approach to dealing with it has not evolved at the same pace. Many basic tools to assist in knowledge extraction from literature already exist (such as cyberinfrastructure, electronic databases, ontologies, and machine-readable document standards), but the scientific community has yet to use them effectively on a large scale. The Semantic Web - an extension of the World Wide Web that enables more meaningful use of electronic resources via automated processes - is an ideal platform for these efforts [8-10], but there is a significant gap to be bridged between the providers and users of the information and the structure of the information. In a recent review, Krallinger, Valencia, and Hirschman nicely summarize the current challenges and the resulting applications in the biological sciences that attempt to bridge this divide [11]. Ruttenberg et al. discuss the activities of the Semantic Web Health Care and Life Sciences Interest Group (HCLSIG), which aims to explore and enable the Semantic Web in biomedical domains [5].

* Correspondence: jlfink@ucsd.edu
1 Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA, 92093-0444 USA

Fink et al. BMC Bioinformatics 2010, 11:103
http://www.biomedcentral.com/1471-2105/11/103
© 2010 Fink et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

One notable innovation is the creation and application of ontologies - specifications of entities, their attributes, and relationships to other entities in a defined domain. Ontologies underpin our efforts to translate natural language into quantized, standardized information. In the biological sciences, ontologies have attained so much popularity that it has been suggested that their proliferation is increasing in tandem with biological data [12,13]. Considering that the creation of an ontology can require years of work by a large team of experts, this popularity underscores the perceived importance of these efforts. The Gene Ontology in particular is currently widely used in the annotation of many biological databases [14]. However, the reliable assignment of an ontology term to an entity in one of these databases necessitates a manual review by expert biocurators - a slow process and one that does not scale to the current level of research output [15].

A particularly advantageous use of ontologies is applying them to scientific literature in order to automatically identify, or infer, terms from one or more ontologies in the text of a document. Several groups have made significant contributions here, although every method has limited accuracy (see [1,2,15-21] for a few examples).

Another challenge that is equally daunting is making these data available in the most useful and easily accessible way possible. Currently, the results from automated literature annotation projects are distributed over a number of databases and websites, and there is no unified method of either storing or distributing these data. Two excellent approaches to resolve these issues, at least in part, have been undertaken by both authors and publishers. The Royal Society of Chemistry Publishing Group's Project Prospect (endnote 1) has semantically enriched all articles published in their journals in a machine-readable way.
The project won the 2007 ALPSP/Charlesworth Award for Publishing Innovation, a strong indicator of community approval and interest because the judging panel represented not only publishers but also scientists and librarians. A similar approach for a single article was undertaken by bioinformaticians in collaboration with the original article authors, and serves as an elegant example of how much can be gained by both semantic enrichment and author-assisted curation [22,23]. Both initiatives use their own mark-up syntax.

These projects illustrate the need for, and promise of, semantic enrichment, but there is a noticeable dearth of tools that assist authors in these efforts. Several exist, but they have been developed for specific groups of users or very specific applications and are generally not publicly available for use or modification. A few others are available, such as the domain-agnostic Semantic MediaWiki extension (endnote 2) and WYSIWYM [24,25], and the biomedical-specific OnTheFly [26], but these lack ease of use, flexibility, or extensibility, or do not allow for author-mediated curation.

As a community, we are certainly making progress in automated approaches for inferring and assigning semantic data in literature. However, this process will likely never be perfectly accurate or complete. There are three points that virtually all researchers interested in these efforts will agree upon: 1) adding semantic data to scientific articles is highly beneficial (indeed necessary for the Semantic Web path); 2) accurate and complete inference of these data without at least some human expert curation is not currently possible; and 3) accurate and complete inference of these data after the document has been made widely available is an intractable problem. To overcome these challenges, we must prevail upon the authors to assign semantic data to their articles prior to publication or distribution.
The Word add-in presented here will assist authors in this effort using community standards and by making it possible for the author of the document, the absolute expert on the content, to do so during the authoring process and to provide this information in the original source document.

Implementation

Architecture

This software functions as an add-in for Microsoft Word 2007. It was developed using the .NET platform and can be installed via a Windows installer. The add-in relies on two key features of Word 2007: its default use of an XML-based file format (Office Open XML, specified by the international standards ISO/IEC 29500 and Ecma-376) and Word's extensibility at both the user interface and file format levels. At runtime, the add-in generates and stores a configuration file on the end-user's system.

The add-in presents a custom ribbon (a new user interface element introduced in Word 2007), a side panel, and custom dialogs to interact with the end-user. It was a design goal to shield the author from having to be aware of the underlying file format or XML tags. Instead, the goal was to present a user experience that was as intuitive as possible and that assisted the end-user in their task in a largely automated fashion.

The add-in also relies on the Smart Tag architecture in Word, which enables actions to be presented to end-users when text in the document is recognized through regular expressions or text matching.

The add-in contains knowledge of the National Center for Biomedical Ontology (NCBO) BioPortal [12,27] and three major biological databases: the Protein Data Bank (PDB), the UniProt Knowledgebase (UniProtKB) [28], and the NCBI databases GenBank and RefSeq [29,30]. When the end-user selects an ontology, the add-in downloads the ontology file via NCBO web services. The biological database identifiers are recognized via pattern matching.
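
The pattern-matching step for database identifiers described above can be sketched roughly as follows. This is a minimal Python illustration, not the add-in's own .NET code: the `IdentifierMatch` type, the `find_identifiers` function, and the regular expressions themselves are assumptions introduced here. The PDB and RefSeq patterns are deliberately simplified, and GenBank accessions are omitted for brevity.

```python
import re
from typing import List, NamedTuple


class IdentifierMatch(NamedTuple):
    database: str  # database the candidate identifier appears to belong to
    text: str      # the matched substring
    start: int     # character offset of the match in the input text


# Hypothetical, simplified patterns (assumptions, not the add-in's expressions).
PATTERNS = {
    # PDB: a digit 1-9 followed by three alphanumerics; at least one letter is
    # required so that plain years such as "2007" are not flagged.
    "PDB": re.compile(r"\b[1-9](?=[0-9]{0,2}[A-Za-z])[A-Za-z0-9]{3}\b"),
    # UniProtKB: the accession number format documented by UniProt.
    "UniProtKB": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
    # RefSeq: a few common prefixes only, with an optional version suffix.
    "RefSeq": re.compile(r"\b(?:NM|NP|NC|NR|XM|XP)_[0-9]{6,9}(?:\.[0-9]+)?\b"),
}


def find_identifiers(text: str) -> List[IdentifierMatch]:
    """Return candidate database identifiers in document order."""
    matches = [
        IdentifierMatch(database, m.group(0), m.start())
        for database, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(matches, key=lambda hit: hit.start)


if __name__ == "__main__":
    sample = (
        "The structure 1TUP shows p53 (UniProtKB P04637) bound to DNA; "
        "see also NM_000546.5."
    )
    for hit in find_identifiers(sample):
        print(hit.database, hit.text, hit.start)
```

Patterns this loose will produce false positives (for example, any four-character token that starts with a digit and contains a letter), so hits are best treated as candidates for the author to confirm, in keeping with the author-mediated curation argued for in the Background.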
