Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures.

by K Humphreys, G Demetriou, R Gaizauskas
Pacific Symposium on Biocomputing (2000)

Abstract

Information extraction technology, as defined and developed through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from newswire texts and primarily in domains concerned with human activity. In this paper we consider the application of this technology to the extraction of information from scientific journal papers in the area of molecular biology. In particular, we describe how an information extraction system designed to participate in the MUC exercises has been modified for two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure. Progress to date provides convincing grounds for believing that IE techniques will deliver novel and effective ways for scientists to make use of the core literature which defines their disciplines.


Available from www.ncbi.nlm.nih.gov

Kevin Humphreys, George Demetriou, Robert Gaizauskas
Department of Computer Science, University of Sheffield
Regent Court, Portobello Street, Sheffield S1 4DP, UK
Pacific Symposium on Biocomputing 5:502-513 (2000)

1 Introduction

Information Extraction (IE) may be defined as the activity of extracting details of predefined classes of entities and relationships from natural language texts and placing this information into a structured representation called a template [1, 2]. The prototypical IE tasks are those defined by the U.S. DARPA Message Understanding Conferences (MUCs), requiring the filling of a complex template from newswire texts on subjects such as joint venture announcements, management succession events, or rocket launchings [3, 4]. While the performance of current technology is not yet at human levels overall, it is approaching human levels for some component tasks (e.g. the recognition and classification of named entities in text) and is at a level at which comparable technologies, such as information retrieval and machine translation, have found useful application.

IE is particularly relevant where large volumes of text make human analysis infeasible, where template-oriented information seeking is appropriate (i.e. where there is a relatively stable information need and a set of texts in a relatively narrow domain), where conventional information retrieval technology is inadequate, and where some error can be tolerated.

One area where we believe these criteria are met, and where IE techniques have as yet been applied only in a very limited way, is the construction of databases of scientific information from journal articles, for use by researchers in
molecular biology [a]. The explosive growth of textual material in this area means that no one can keep up with what is being published. Conventional retrieval technology returns both too little, because of the complex, non-standardised terminology in the area, and too much, because what is sought is not whole texts in which key terms appear, but facts buried in the texts. Further, useful templates can be defined for some scientific tasks. For example, scientists working on drug discovery have an ongoing interest in reactions catalysed by enzymes in metabolic pathways. These reactions may be viewed as a class of events, like corporate management succession events, in which various classes of entities (enzymes, compounds) with attributes (names, concentrations) are related by participating in the event in particular roles (substrate, catalyst, product). Finally, some error can be tolerated in these applications, because scientists can verify the information against the source texts: the technology serves to assist, not to replace, investigation.

In this paper we describe the use of the technology developed through MUC evaluations in two bioinformatics applications. The next section describes the general functionality of an IE system, and section 3 then describes the two specific applications on which we are working: extraction of information about enzymes and metabolic pathways and extraction of information about protein structure, in both cases from scientific abstracts and journal papers. Section 4 describes the principal processing stages and techniques of our system, and section 5 describes preliminary results. While neither of these systems is yet complete, indications are that IE can indeed be successfully applied to the task of extracting information from scientific journal papers.
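The view of an enzyme-catalysed reaction as an event whose participants fill roles can be made concrete with a small data structure. The following is a minimal sketch only, not the template format used by EMPathIE: the class names `Entity` and `ReactionEvent`, their fields, and the `role_of` helper are all hypothetical, and the example values are illustrative rather than drawn from the paper's corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """An entity mentioned in a text, with optional attributes (hypothetical schema)."""
    name: str
    kind: str                            # "enzyme" or "compound"
    concentration: Optional[str] = None  # kept as text, e.g. "5 mM"


@dataclass
class ReactionEvent:
    """One enzyme-catalysed reaction; entities are related by the roles they fill."""
    catalyst: Entity
    substrates: List[Entity] = field(default_factory=list)
    products: List[Entity] = field(default_factory=list)

    def role_of(self, name: str) -> Optional[str]:
        """Return the role played by the named entity, or None if it takes no part."""
        if self.catalyst.name == name:
            return "catalyst"
        if any(e.name == name for e in self.substrates):
            return "substrate"
        if any(e.name == name for e in self.products):
            return "product"
        return None


# Illustrative values only; the concentration is made up for the example.
event = ReactionEvent(
    catalyst=Entity("hexokinase", "enzyme"),
    substrates=[Entity("glucose", "compound", concentration="5 mM")],
    products=[Entity("glucose 6-phosphate", "compound")],
)
```

Calling `event.role_of("glucose")` returns the role string for that entity. Keeping roles as separate fields, rather than a flat list of entities, mirrors the way a template slot names the relation an entity has to the event.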
2 Information Extraction Technology

The most recent MUC evaluation (MUC-7) [4] specified five separate component tasks, which illustrate the main functional capabilities of current IE systems:

1. Named Entity recognition requires the recognition and classification of named entities such as organisations, persons, locations, dates and monetary amounts.

2. Coreference resolution requires the identification of expressions in the text that refer to the same object, set or activity. These include variant forms of name expression (Ford Motor Company ... Ford), definite noun phrases and their antecedents (Ford ... the American car manufacturer), and pronouns and their antecedents (President Clinton ... he).

[a] The only other application of IE techniques to texts in the biological sciences of which we are aware is the work of Fukada et al. on identifying protein names in MEDLINE abstracts [5].
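The simplest of the coreference cases listed above, variant forms of a name, can be illustrated with a toy heuristic. This is a sketch under a stated assumption, not the method used in the authors' system or in any MUC entry: a later mention is linked to an earlier one when its tokens form a proper prefix of the earlier mention's tokens. The function name `link_name_variants` is hypothetical. Definite noun phrases and pronouns need far more than string matching and are not handled.

```python
from typing import Dict, List


def link_name_variants(mentions: List[str]) -> Dict[int, int]:
    """Map the index of each shortened mention to the index of the
    earliest preceding mention whose leading tokens it repeats."""
    links: Dict[int, int] = {}
    for i, mention in enumerate(mentions):
        tokens = mention.split()
        if not tokens:
            continue  # ignore empty mentions
        for j in range(i):
            earlier = mentions[j].split()
            # A variant is strictly shorter and matches the start of the earlier name.
            if len(tokens) < len(earlier) and earlier[:len(tokens)] == tokens:
                links[i] = j
                break
    return links


# Mentions in document order, using the names from the examples above.
mentions = ["Ford Motor Company", "President Clinton", "Ford"]
links = link_name_variants(mentions)
```

Here `links` maps the position of "Ford" to the position of "Ford Motor Company" and leaves "President Clinton" unlinked. A prefix test is deliberately crude: it would miss variants that drop a leading word, and it would wrongly link distinct names that share a first token.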

