
Structured Literature Image Finder

by Amr Ahmed, Andrew Arnold, Pedro Coelho, Joshua Kangas, Abdul-saboor Sheikh, Eric Xing, William Cohen, Robert F Murphy
Machine Learning (2001)


Amr Ahmed (1,6), Andrew Arnold (1), Luís Pedro Coelho (2,3,4), Joshua Kangas (2,3,4),
Abdul-Saboor Sheikh (3), Eric Xing (1,5,6), William Cohen (1), Robert F. Murphy (1,2,3,4,5,7)
Abstract
The Structured Literature Image Finder tackles two related problems posed by the vastness of the biomedical literature: how to make it more accessible to scientists in the field and how to take advantage of the primary data often locked inside papers. Towards this goal, the SLIF project developed an innovative combination of text and image processing methods.
Images from papers are classified according to their type (fluorescence microscopy image, gel, ...) and their captions are parsed for biologically relevant entities such as protein names. This enables targeted queries for primary data (a feature that a user study revealed to be highly valued by scientists). Finally, using a novel extension to latent topic models, we model papers at multiple levels and provide the ability to find figures similar to a query and refine these findings with interactive relevance feedback.
SLIF is most advanced in processing fluorescent microscopy images, which are further categorised according to the depicted subcellular localization pattern.
The results of SLIF are made available to the community through a user-friendly web interface (http://slif.cbi.cmu.edu).
1 Introduction
Biomedical research worldwide results in a very high volume of information in the form of publications. Biologists are faced with the daunting task of querying and searching these publications to keep up with recent developments and to answer specific questions.
Affiliations: (1) Machine Learning Department, (5) Department of Biological Sciences, and (7) Department of Biomedical Engineering, Carnegie Mellon University; (2) Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology; (3) Center for Bioimage Informatics, Carnegie Mellon University; (4) Lane Center for Computational Biology, Carnegie Mellon University; (6) Language Technologies Institute, Carnegie Mellon University. to whom correspondence should be addressed.
Figure 1: Screenshot of the SLIF search engine showing the results of a search.
In the biomedical literature, data is most often presented in the form of images. A fluorescent micrograph image (FMI) or a gel is sometimes the key to a whole paper. Compared to figures in other scientific disciplines, biomedical figures are often a stand-alone source of information that summarizes the findings of the research under consideration. A random sampling of such figures in the publicly available PubMed Central database reveals that in some, if not most, of the cases, a biomedical figure can provide as much information as a normal abstract. The information-rich, highly-evolving knowledge source of the biomedical literature calls for automated systems that would help biologists find information quickly and satisfactorily. These systems should provide biologists with a structured way of browsing the otherwise unstructured knowledge in a way that would inspire them to ask questions that they never thought of before, or reach a piece of information that they would have never considered pertinent to start with.
Relevant to this goal, we developed the first system for automated information extraction from images in biological journal articles (the "Subcellular Location Image Finder," or SLIF, first described in 2001 [9]). Since then, we have made major enhancements and additions to the SLIF system [3, 8, 7], and now report not only additional enhancements but the broadening of its reach beyond fluorescent microscopy images. Reflecting this, we have now rechristened SLIF as the "Structured Literature Image Finder."
SLIF reached the final stage in the Elsevier Grand Challenge (4 out of 70), a contest sponsored by Elsevier to "improve the way scientific information is communicated and used."
2 Overview
SLIF provides both a pipeline for extracting structured information from papers (illustrated in Figure 2) and a web-accessible searchable database of the processed information (depicted in Figure 1).
The pipeline begins by finding all figure-caption pairs. Each caption is then processed to identify biological entities (e.g., names of proteins and cell lines) and these are linked to external databases. Pointers from the caption to the image are identified, and the caption is broken into "scopes" so that terms can be linked to specific parts of the figure.
The image processing module begins by splitting each figure into its constituent panels, and then identifying the type of image contained in each panel. The patterns in FMIs are described using a set of biologically relevant image features [9], and the subcellular location depicted in each image is recognized.
The last step in the pipeline is to discover latent topics that are present in the collection of papers. These topics serve as the basis for visualization and semantic representation. Each topic consists of a triplet of distributions over words, image features, and proteins (possibly extended to include gene ontology terms and subcellular locations). Each figure in turn is represented as a distribution over these topics, and this distribution reflects the themes addressed in the figure. This representation serves as the basis for various tasks like image-based retrieval, text-based retrieval and multimodal-based retrieval. Moreover, these discovered topics provide an overview of the information content of the collection, and structurally guide its exploration.
All results of processing are stored in a database, which is accessible via a web interface or SOAP queries. The results of queries always include links back to the panel, figure, caption and the full paper. Users can query the database for various information appearing in captions or images, including specific words, protein names, panel types, patterns in figures, or any combination of the above. Using the latent topic representation, we built an innovative interface that allows browsing through figures by their inferred topics and jumping to related figures from any currently viewed figure.
3 Caption Processing
In order to identify the protein depicted in an image, we look for protein names in the caption.
The structure of captions can be complex (especially for multipanel figures). We therefore implemented a system for processing captions with three goals: identifying the "image pointers" (e.g., "(A)" or "(red)") in the caption that refer to specific panel labels or panel colors in the figure [3]; dividing the caption into fragments that refer to an individual panel, color, or the entire figure; and recognizing protein names and cell types.
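The first two goals can be illustrated with a minimal sketch. It assumes image pointers are parenthesised single-letter panel labels or colour words, and it attributes the text following a pointer to that pointer. The pattern, the colour list, and the function names are illustrative assumptions, not the rules used by SLIF [3].

```python
import re

# Hypothetical pointer pattern: "(A)", "(a, b)", "(A-C)" or a colour word such as "(red)".
COLORS = ("red", "green", "blue", "yellow")
POINTER_RE = re.compile(
    r"\((?P<label>[A-Za-z](?:\s*[,-]\s*[A-Za-z])*|" + "|".join(COLORS) + r")\)"
)


def find_image_pointers(caption):
    """Return (pointer_text, start, end) for every image pointer in the caption."""
    return [(m.group("label"), m.start(), m.end())
            for m in POINTER_RE.finditer(caption)]


def split_into_scopes(caption):
    """Split a caption into a figure-level scope and one scope per pointer.

    Simplifying assumption: text before the first pointer describes the whole
    figure, and text after a pointer belongs to it until the next pointer.
    """
    pointers = find_image_pointers(caption)
    if not pointers:
        return [("figure", caption.strip())]
    scopes = []
    head = caption[:pointers[0][1]].strip()
    if head:
        scopes.append(("figure", head))
    for i, (label, _start, end) in enumerate(pointers):
        stop = pointers[i + 1][1] if i + 1 < len(pointers) else len(caption)
        scopes.append((label, caption[end:stop].strip()))
    return scopes
```

Real captions also contain pointers that follow the text they describe (for example a colour mentioned after a protein name), which this prefix-only sketch does not handle.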
Errors in optical character recognition can lead to low accuracy in matching image pointers to panel labels. Exploiting regularities in the arrangement of the labels corrects some of these errors [7]: for example, if the letters A through D are found as image pointers and the panel labels are recognized as A, B, G and D, then the G should be corrected to a C. On a test set from PNAS, the precision of the final matching process was found to be 83% and the recall 74% [5].
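The A, B, G, D example can be reproduced with a much simpler heuristic than the stacked graphical model of [7]. The sketch below assumes panels are listed in reading order and that their labels follow the sorted order of the caption pointers; both the function name and the majority-agreement rule are assumptions made for illustration.

```python
def correct_panel_labels(ocr_labels, caption_pointers):
    """Repair OCR-recognized panel labels using the pointers found in the caption.

    Assumption (not from the source): panels are given in reading order and
    should carry the caption pointers in sorted order.
    """
    expected = sorted(set(caption_pointers))
    if len(expected) != len(ocr_labels):
        return list(ocr_labels)  # cannot align the two sequences; leave as-is
    agree = sum(o == e for o, e in zip(ocr_labels, expected))
    # Only trust the alignment when more than half of the positions agree.
    if agree * 2 <= len(expected):
        return list(ocr_labels)
    return list(expected)
```

For instance, `correct_panel_labels(["A", "B", "G", "D"], ["A", "B", "C", "D"])` returns `["A", "B", "C", "D"]`, while inputs that mostly disagree are returned unchanged.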
Recognition of named entities (such as protein names and cell types) in free text is a difficult task that may be even more difficult in condensed text such as captions. We have implemented two schemes for recognizing protein names. The first (which is also used for cell type recognition) uses prefix and suffix features along with immediate context to identify candidate protein names. This approach has low precision but excellent recall (which is useful to enable database searches on abbreviations or synonyms that might not be present in structured protein databases). The second approach [6] uses a dictionary of names extracted from protein databases in combination with soft match learning methods to obtain recall and precision above 70%. The occurrences of the names found in the captions are stored as being associated with either a panel or a figure, depending on the scope in which the protein name was found. The system also assigns subcellular locations to proteins by looking up GO terms in the UniProt database, making it possible to find images depicting particular subcellular patterns.
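As a rough illustration of dictionary lookup with soft matching, the sketch below scores each caption token against a small name list with a plain string-similarity ratio. This is a stand-in: the method in [6] learns its soft-match function, whereas here `difflib.SequenceMatcher`, the 0.85 threshold, and the example dictionary are all assumptions.

```python
from difflib import SequenceMatcher


def soft_match(token, dictionary, threshold=0.85):
    """Return the most similar dictionary name if it meets the threshold, else None."""
    best_name, best_score = None, 0.0
    for name in dictionary:
        score = SequenceMatcher(None, token.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def tag_proteins(caption, dictionary, threshold=0.85):
    """Return (token, matched_name) pairs for tokens that softly match the dictionary."""
    hits = []
    for raw in caption.split():
        token = raw.strip(".,;:()[]")  # drop surrounding punctuation
        if not token:
            continue
        match = soft_match(token, dictionary, threshold)
        if match is not None:
            hits.append((token, match))
    return hits
```

Lowering the threshold trades precision for recall, which mirrors the contrast the text draws between the two recognition schemes.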
Finally, the task of simply segmenting a paper and extracting the caption, even without named entity recognition or panel scoping, has proven very useful to our users, allowing easy search of free text which can be limited to the captions, and therefore the figures, of a paper.
Figure 2: SLIF pipeline. This figure shows the general pipeline through which papers are processed.
4 Image Processing
Since, in most cases, figures are composed of multiple panels, the first step in our image processing pipeline is to divide the figures into panels. We employ a figure-splitting algorithm that recursively finds constant-intensity boundary regions between panels, a method which we have previously shown can effectively split figures with complex panel layouts [9].
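A minimal sketch of recursive splitting on constant-intensity bands is given below, assuming a 2-D grayscale NumPy array as input. It is not the algorithm of [9]: the `min_gap` and `tol` parameters, the rule of cutting along rows before columns, and the function names are assumptions.

```python
import numpy as np


def _content_segments(is_boundary, min_gap):
    """Group line indices into content segments separated by >= min_gap boundary lines."""
    segments, start, last_content, gap = [], None, None, 0
    for i, boundary in enumerate(is_boundary):
        if boundary:
            gap += 1
            continue
        if start is None:
            start = i
        elif gap >= min_gap:
            segments.append((start, last_content + 1))
            start = i
        last_content, gap = i, 0
    if start is not None:
        segments.append((start, last_content + 1))
    return segments


def split_figure(image, min_gap=5, tol=0, origin=(0, 0)):
    """Recursively split a grayscale figure; returns (top, left, bottom, right) boxes."""
    top, left = origin
    # A row or column is a boundary line when its intensity range is within tol.
    rows = _content_segments(np.ptp(image, axis=1) <= tol, min_gap)
    cols = _content_segments(np.ptp(image, axis=0) <= tol, min_gap)
    if not rows or not cols:
        return []  # nothing but constant background
    if len(rows) == 1 and len(cols) == 1:
        (r0, r1), (c0, c1) = rows[0], cols[0]
        if (r0, r1, c0, c1) == (0, image.shape[0], 0, image.shape[1]):
            return [(top, left, top + r1, left + c1)]  # cannot be split further
        # Trim the constant border and look again inside it.
        return split_figure(image[r0:r1, c0:c1], min_gap, tol, (top + r0, left + c0))
    boxes = []
    if len(rows) > 1:
        for r0, r1 in rows:
            boxes += split_figure(image[r0:r1, :], min_gap, tol, (top + r0, left))
    else:
        for c0, c1 in cols:
            boxes += split_figure(image[:, c0:c1], min_gap, tol, (top, left + c0))
    return boxes
```

Because each band is re-examined on its own, a layout such as one wide panel above two narrow ones is separated even though no single vertical cut spans the whole figure.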
SLIF was originally designed to process only FMI panels, and subsequent systems created by others have included classifiers to distinguish other figure types [11, 4]. We have now expanded the classification to other panel types: (1) FMI, (2) gel, (3) graph or illustration, (4) photograph, (5) X-ray, or (6) light microscopy. Using active learning [10], we selected ca. 700 panels to label.
Given its importance to working scientists, we focused on the gel class. Currently, the system proceeds through three classification levels: the first level classifies the image into FMI or non-FMI using image-based features (as previously reported); the second level uses textual features to identify gels with high precision (91%) and moderate recall (66%); finally, if neither classifier has fired, a general-purpose support vector machine classifier operating on image-based features does the final classification (accuracy: 61%).
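The control flow of the three levels can be sketched as a simple cascade. The three classifiers are passed in as callables because their internals are not given here; the keyword rule in `toy_is_gel_from_text` is a hypothetical placeholder and not the textual classifier the system uses.

```python
def classify_panel(panel, caption, is_fmi, is_gel_from_text, general_classifier):
    """Three-level cascade: FMI detector, then textual gel detector, then fallback.

    All three callables are stand-ins supplied by the caller (assumptions).
    """
    if is_fmi(panel):                 # level 1: image-based FMI vs. non-FMI
        return "fmi"
    if is_gel_from_text(caption):     # level 2: text-based gel detection
        return "gel"
    return general_classifier(panel)  # level 3: general-purpose classifier


def toy_is_gel_from_text(caption):
    """Toy textual rule (an assumption): fire on a few gel-related keywords."""
    text = caption.lower()
    return any(key in text for key in ("western blot", "sds-page", "immunoblot"))
```

Ordering the levels this way lets the high-precision detectors claim their panels first, so the less accurate general classifier only sees what remains.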
Perhaps the most important task that SLIF supports is the classification of FMI panels based on the depicted subcellular localization. To provide training data for pattern classifiers, we hand-labeled a set of images, again selected through active learning, into four different subcellular location classes: (1) nuclear, (2) cytoplasmic, (3) punctate, and (4) other.
We computed previously described features to represent the image patterns. If the scale can be inferred from the image, then we normalize the area-dependent feature values to square microns. Otherwise, we assume a default scale of 1 µm/pixel. On the three main classes (nuclear, cytoplasmic, and punctate), we obtained 75% accuracy (as before, reported accuracies are estimated using 10-fold cross-validation and the classifier used was libsvm-based). On the four classes, we obtained 61% accuracy.
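The scale normalization amounts to multiplying a pixel-area measurement by the square of the pixel size. The sketch below assumes the feature is an area counted in pixels; the function and constant names are illustrative.

```python
DEFAULT_UM_PER_PIXEL = 1.0  # default scale stated in the text: 1 µm/pixel


def area_in_square_microns(area_px, um_per_pixel=None):
    """Convert an area measured in pixels to square microns.

    um_per_pixel is the inferred scale; None means no scale was found,
    in which case the default of 1 µm/pixel is used.
    """
    scale = DEFAULT_UM_PER_PIXEL if um_per_pixel is None else um_per_pixel
    if scale <= 0:
        raise ValueError("scale must be positive")
    return area_px * scale ** 2
```

With an inferred scale of 0.5 µm/pixel, an object covering 100 pixels has an area of 25 µm²; without a scale it is reported as 100 µm².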
5 Topic Discovery
The goal of the topic discovery phase is to enable the user to structurally browse the otherwise unstructured collection. This problem is reminiscent of the actively evolving field of multimedia information management and retrieval. However, structurally-annotated biomedical figures pose a set of new challenges due to the hierarchical structure of the domain (panels contained within figures), which results in scoped and global annotation schemes, and the presence of various image annotations (free-form text, protein mentions, etc.) in the caption with different frequency profiles.
Our model, the structured correspondence topic model [1], addresses the aforementioned challenges by extending the correspondence latent Dirichlet allocation model that was successfully employed for modeling annotated images [2]. The input to the topic modeling system is the panel-segmented, structurally and multimodally annotated biomedical figures. The goal of our approach is to discover a set of latent themes in the collection. These themes are called topics and serve as the basis for visualization and semantic representation. Each biomedical figure, panel, and protein entity is then represented as a distribution over these latent topics. This unified representation enables comparing figures with radically different numbers of panels and serves as the basis for various tasks like image-based retrieval, text-based image retrieval, multimodal-based image retrieval and image annotation. We compared our model to various baselines with favorable results [1].
Furthermore, the latent topic representation facilitates the implementation of features such as finding objects similar to an example that the user has found interesting (this can be done at any level: panel, figure, or paper).
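Because panels, figures, and papers all live in the same topic space, similarity search reduces to comparing probability vectors. The sketch below ranks items by Hellinger distance; the choice of distance, the function names, and the dictionary layout are assumptions, since the measure used by SLIF is not stated here.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def most_similar(query_id, topic_mix, k=3):
    """Return the ids of the k items whose topic distributions are closest to the query.

    topic_mix maps an item id (panel, figure, or paper) to its distribution
    over the same set of latent topics.
    """
    query = topic_mix[query_id]
    scored = sorted(
        (hellinger(query, dist), item_id)
        for item_id, dist in topic_mix.items()
        if item_id != query_id
    )
    return [item_id for _, item_id in scored[:k]]
```

Interactive relevance feedback could be layered on top by re-ranking with a query vector updated from the items a user marks as relevant, though that step is not sketched here.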
6 Discussion
We have presented SLIF, a system which analyzes images in the biomedical literature. It processes both text and images, combining them through latent topic discovery. This enables users to browse through a collection of papers by looking for related topics or images that are similar to an image of interest.
Although it is crucial that individual components achieve good results (and we have shown good results in our sub-tasks), good component performance is not sufficient for a working system. SLIF is a production system that has been shown to yield usable results in real collections of papers.
The project is ongoing and many avenues for improvement are being explored. Among those are better semantic understanding of FMI data, more advanced image processing of gels, exploitation of the full text, as well as continuing improvement of all the components in the pipeline.
6.1 Acknowledgements
Development of SLIF is supported by National Institutes of Health grant R01 GM078622.
References
[1] Amr Ahmed, Eric P. Xing, William W. Cohen, and Robert F. Murphy. Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature. In KDD '09: Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[2] David M. Blei and Michael I. Jordan. Modeling annotated data. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127-134, New York, NY, USA, 2003. ACM Press.
[3] William W. Cohen, Richard Wang, and Robert F. Murphy. Understanding captions in biomedical publications. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 499-504, New York, NY, USA, 2003. ACM.
[4] Jan-Mark Geusebroek, Minh Anh Hoang, Jan van Gemert, and Marcel Worring. Genre-based search through biomedical images. Volume 1, pages 271-274, 2002.
[5] Zhenzhen Kou, William W. Cohen, and Robert F. Murphy. Extracting information from text and images for location proteomics. In Mohammed Javeed Zaki, Jason Tsong-Li Wang, and Hannu Toivonen, editors, BIOKDD, pages 2-9, 2003.
[6] Zhenzhen Kou, William W. Cohen, and Robert F. Murphy. High-recall protein entity recognition using a dictionary. In Bioinformatics, vol. 21, pages i266-273, 2005.
[7] Zhenzhen Kou, William W. Cohen, and Robert F. Murphy. A stacked graphical model for associating sub-images with sub-captions. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E. Klein, editors, Pacific Symposium on Biocomputing, pages 257-268. World Scientific, 2007.
[8] Robert F. Murphy, Zhenzhen Kou, Juchang Hua, Matthew Joffe, and William W. Cohen. Extracting and structuring subcellular location information from on-line journal articles: The subcellular location image finder. In IASTED International Conference on Knowledge Sharing and Collaborative Engineering, pages 109-114, 2004.
[9] Robert F. Murphy, Meel Velliste, Jie Yao, and Gregory Porreca. Searching online journals for fluorescence microscope images depicting protein subcellular location patterns. In BIBE '01: Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering, pages 119-128, Washington, DC, USA, 2001. IEEE Computer Society.
[10] Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proc. 18th International Conf. on Machine Learning, pages 441-448. Morgan Kaufmann, 2001.
[11] Hagit Shatkay, Nawei Chen, and Dorothea Blostein. Integrating image data into biomedical text categorization. In Bioinformatics, vol. 22, pages 446-453, 2006.
