Structured Literature Image Finder: Extracting Information from Text and Images in Biomedical Literature.

by Luís Pedro Coelho, Amr Ahmed, Andrew Arnold, Joshua Kangas, Abdul-Saboor Sheikh, Eric P Xing, William W Cohen, Robert F Murphy
Lecture Notes in Computer Science (2010)

Abstract

SLIF uses a combination of text-mining and image processing to extract information from figures in the biomedical literature. It also uses innovative extensions to traditional latent topic modeling to provide new ways to traverse the literature. SLIF provides a publicly available searchable database originally focused on fluorescence microscopy images. We have now extended it to classify panels into more image types. We also improved the classification into subcellular classes by building a more representative training set. To get the most out of the human labeling effort, we used active learning to select images to label. We developed models that take into account the structure of the document (with panels inside figures inside papers) and the multi-modality of the information (free and annotated text, images, information from external databases). This has allowed us to provide new ways to navigate a large collection of documents.

Structured Literature Image Finder: Extracting
Information from Text and Images in Biomedical
Literature
Luís Pedro Coelho1,2,3, Amr Ahmed4,5, Andrew Arnold4, Joshua Kangas1,2,3,
Abdul-Saboor Sheikh3, Eric P. Xing4,5,6, William W. Cohen1,4, and
Robert F. Murphy1,2,3,4,6,7
1 Lane Center for Computational Biology, Carnegie Mellon University
2 Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in
Computational Biology
3 Center for Bioimage Informatics, Carnegie Mellon University
4 Machine Learning Department, Carnegie Mellon University
5 Language Technologies Institute, Carnegie Mellon University
6 Department of Biological Sciences, Carnegie Mellon University
7 Department of Biomedical Engineering, Carnegie Mellon University
Abstract. SLIF uses a combination of text-mining and image processing
to extract information from figures in the biomedical literature. It also
uses innovative extensions to traditional latent topic modeling to provide
new ways to traverse the literature. SLIF provides a publicly available
searchable database (http://slif.cbi.cmu.edu).
SLIF originally focused on fluorescence microscopy images. We have now
extended it to classify panels into more image types. We also improved
the classification into subcellular classes by building a more
representative training set. To get the most out of the human labeling
effort, we used active learning to select images to label.
We developed models that take into account the structure of the document
(with panels inside figures inside papers) and the multi-modality
of the information (free and annotated text, images, information from
external databases). This has allowed us to provide new ways to navigate
a large collection of documents.
1 Introduction
Thousands of papers are published each day in the biomedical domain. Working
scientists therefore struggle to keep up with all the results that are relevant to
them. Traditional approaches to this problem have focused solely on the text
of papers. However, images are also very important as they often contain the
primary experimental results being reported. A random sampling of such
figures in the publicly available PubMed Central database reveals that in
some, if not most, cases a biomedical figure can provide as much information
as a normal abstract. Thus, researchers in the biomedical field need automated
systems that can help them find information quickly and satisfactorily. These
systems should provide them with a structured way of browsing the otherwise
unstructured knowledge in a way that inspires them to ask questions that they
never thought of before.
Our team developed the first system for automated information extraction
from images in biological journal articles (SLIF, the "Subcellular Location Image
Finder," first described in 2001 [1]). Since then, we have reported a number of
improvements to the SLIF system [2-4]. In part reflecting this, we are
rechristening SLIF as the "Structured Literature Image Finder."
Most recently, we have added support for more image types, improved
classification methods, and added features based on multi-modal latent topic
modeling. Topic modeling allows for innovative user-visible features such as
"browse by topic," retrieval of topic-similar images or figures, or interactive
relevance feedback. Traditional latent topic approaches have had to be adapted
to the setting where documents are composed of free and annotated text and images
arranged in a structured fashion. We have also added a powerful tool for
organizing figures by topics inferred from both image and text, and have provided
a new interface that allows browsing through figures by their inferred topics and
jumping to related figures from any currently viewed figure. We have performed
a user study where we asked users to perform typical tasks with SLIF and report
whether they found the tool to be useful. The great majority of responses were
very positive [5].
SLIF provides both a pipeline for extracting structured information from
papers and a web-accessible searchable database of the processed information.
Users can query the database for various information appearing in captions or
images, including specific words, protein names, panel types, patterns in figures,
or any combination of the above.
2 Overview
Fig. 1. Overview of the SLIF pipeline
The SLIF processing pipeline is illustrated in Figure 1. After preprocessing,
image and caption processing proceed in parallel. The results of these two
modules then serve as input to the topic modeling framework.
The first step in image processing is to split the image into its panels, then
identify the type of image in each panel. If the panel is a fluorescence
micrograph image (FMI), the depicted subcellular localization is automatically
identified [1]. In addition, panel labels are identified through optical
character recognition, and scale bars, if present, are identified. Annotations
such as white arrows are removed.
In parallel, the caption is parsed and relevant biological entities (protein
and cell types) are extracted from the caption using named entity recognition
techniques. Also, the caption is broken up into logical scopes (sub-captions,
identified by markers such as "(A)"), which will be subsequently linked to panels.
The last step in the pipeline aggregates the results of image and caption
processing by using them to infer underlying themes in the collection of papers.
These are based on the free text in the caption, on the annotated text (i.e.,
protein and cell type names are not processed as simple text), and on the image
features and subcellular localization. This results in a low-dimensional
representation of the data, which is used to implement retrieval by example
("find similar papers") or even interactive relevance feedback navigation.
Access to the results of this pipeline is provided via a web interface or
programmatically with SOAP queries. Results presented always link back to the
full paper for user convenience.
3 Caption Processing
A typical caption, taken from [6], reads as:
    S1P induces relocalization of both p130Cas and MT1-MMP to peripheral
    actin-rich structures. (A) HUVEC were stimulated for 15 min
    with 1 µM S1P and stained with polyclonal MT1-MMP [...]. (B) Cells
    were stimulated with S1P as described above [...]. Scale bars are 10 µm.
This caption contains several pieces of information which are of interest
to SLIF. The text contains both a global portion (the first sentence) and portions
scoped to particular panels (marked by "(A)" and "(B)"). Thus the caption is
broken up into three parts, one global and two specific to a panel. In order to
understand what the image represents, SLIF extracts the names of proteins present
(p130Cas, MT1-MMP, ...), as well as the cell line (HUVEC), using techniques
described previously. Additionally, SLIF extracts the length(s) of any scale bars
to be associated with scale bars extracted from the image itself.
The implementation of this module is described in greater detail elsewhere [2,
4, 5, 7].
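As a rough illustration of the scoping step described above, the following minimal Python sketch splits a caption on panel markers such as "(A)" and pulls out scale bar lengths. It is not the SLIF implementation (which is described in the cited papers); the function names, the regular expressions, and the list of length units are all assumptions made for this example.

```python
import re

# Hypothetical patterns: single capital letters in parentheses mark panel scopes,
# and a number followed by a length unit is read as a scale bar length.
SCOPE_MARKER = re.compile(r"\(([A-Z])\)")
SCALE_BAR = re.compile(r"(\d+(?:\.\d+)?)\s*(µm|um|nm|mm)")


def split_caption(caption):
    """Split a caption into a global part and a dict of panel-scoped parts."""
    # re.split with a capturing group keeps the labels:
    # [global_text, label_1, text_1, label_2, text_2, ...]
    pieces = SCOPE_MARKER.split(caption)
    global_text = pieces[0].strip()
    scopes = {}
    for label, text in zip(pieces[1::2], pieces[2::2]):
        scopes[label] = text.strip()
    return global_text, scopes


def scale_bar_lengths(caption):
    """Return (value, unit) pairs for every length mentioned in the caption."""
    return [(float(value), unit) for value, unit in SCALE_BAR.findall(caption)]
```

Note that this naive split attaches any trailing global sentence (such as "Scale bars are 10 µm.") to the last panel scope; a real system needs additional logic to recognise such sentences as global.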
4 Image Processing
4.1 Figure Splitting
The first step in our image processing pipeline is to divide the extracted figures
into their constituent components, since in the majority of cases figures
comprise multiple panels depicting similar conditions, corresponding analyses,
etc. For this purpose, we employ a figure-splitting algorithm that recursively
finds constant-intensity boundary regions to break up the image hierarchically.
Large regions are considered panels, while small regions are most likely
annotations. This method was previously shown to perform well [1].
4.2 "Ghost" Detection
Fig. 2. Example of a ghost image: (a) color image, (b) blue channel. Although the
color image is obviously a two-channel image (red and green), there is a strong
bleed-through into the blue component.
FMI panels are often false color images composed of related channels. However,
due to treatment of the image for publication or compression artifacts,
it is common that an image that contains one or two logical colors (and is so
perceived by the human reader) will have signal in all three color channels.
We call the extra channel a "ghost" of the signal-carrying channels. Figure 2
illustrates this phenomenon.
To detect ghosts, we first compute the white component of the image, i.e.,
the pixel-wise minimum of the 3 channels. We then subtract this component
from each channel so that the regions with homogeneous intensities across all
channels (e.g., annotations or pointers) get suppressed. Then, for each channel,
we verify whether its 95th-percentile pixel value is at least 10% of the overall
highest pixel value. These two values were found empirically to reject almost all
ghosts, with a low rate of errors on true signal (a signal-carrying channel in
which fewer than 5% of the pixels are bright will be falsely rejected, but we
found the rate of such errors to be low enough to be acceptable). Algorithm 1
illustrates this process in pseudo-code.
Algorithm 1: Ghost Detection Algorithm
  White := pixelwise-min(R, G, B)
  M := max(R - White, G - White, B - White)
  foreach ch in (R, G, B) do
    Residual := ch - White
    sort pixels from Residual
    if 95th-percentile pixel < 10% of M then
      ch is a ghost
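Algorithm 1 translates almost line by line into NumPy. The sketch below assumes the panel is available as a height x width x 3 array in R, G, B order; the function name and parameter names are illustrative, and `numpy.percentile` replaces the explicit sort.

```python
import numpy as np


def find_ghost_channels(rgb, percentile=95, fraction=0.10):
    """Return the names of channels flagged as ghosts, following Algorithm 1."""
    rgb = np.asarray(rgb, dtype=float)
    # White component: pixel-wise minimum over the three channels.
    white = rgb.min(axis=2)
    residual = rgb - white[:, :, None]
    # M: highest residual value over all channels.
    overall_max = residual.max()
    ghosts = []
    for index, name in enumerate("RGB"):
        level = np.percentile(residual[:, :, index], percentile)
        if level < fraction * overall_max:
            ghosts.append(name)
    return ghosts
```

With the default values, a channel is kept only if at least about 5% of its residual pixels reach 10% of the brightest residual value in the image.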
4.3 Panel Type Classification
SLIF was originally designed to process only FMI panels. Recently, we expanded
the classification to other panel types, in a way similar to other recent
systems [8-10].
Panels are classified into one of six panel classes: (1) FMI, (2) gel, (3) graph or
illustration, (4) light microscopy, (5) X-ray, or (6) photograph. To build a
training set for this classification problem while minimizing labeling effort,
we used empirical risk reduction, an active learning algorithm [11]. We used a
LIBSVM-based classifier as the base algorithm. In order to speed up the process,
at each round we labeled the 10 highest ranked images plus 10 randomly selected
images. The process was seeded by initially labeling 50 randomly selected images.
This resulted in ca. 700 labeled images.
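The composition of each labeling round (10 top-ranked images plus 10 random ones) can be sketched as follows. The active-learning score itself (the estimated error reduction of [11]) is assumed to be computed elsewhere and passed in as a dictionary; `select_batch` and its parameters are hypothetical names for this example.

```python
import random


def select_batch(scores, n_top=10, n_random=10, rng=None):
    """Pick the images to label in one round.

    scores maps each unlabeled image id to its active-learning score
    (higher means labeling it is expected to help more).
    """
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:n_top]
    # The random images are drawn from what is left, so nothing is picked twice.
    rest = ranked[n_top:]
    extra = rng.sample(rest, min(n_random, len(rest)))
    return top + extra
```

Mixing in random images keeps the labeled set from being dominated by whatever the current classifier finds most informative.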
The previous version of SLIF already had a good FMI classifier, which we have
kept. Given its frequency and importance, we focused on the gel class as the next
important class. Towards this goal, we defined a set of features based on whether
certain marker words that would signal gels appeared in the caption (the positive
markers were: Western, Northern, Southern, blot, lane, RT (for "reverse
transcriptase"), RNA, PAGE, agarose, electrophoresis, and expression), as well as
a set of substrings for the inverse class (the negative markers were: bar (for
bar charts), patient, CT, and MRI). A classifier based on these boolean
features was learned using the ID3 decision tree algorithm [12] with precision on
the positive class as the function being maximized. This technique was shown,
through 10-fold cross-validation, to obtain very high precision (91%) at the cost
of moderate recall (66%). Therefore, examples considered positive are labeled
as such, but examples considered negative are passed on to a classifier based
on image features. In addition to the features developed for FMI classification,
we measure the fraction of variance that remains in the image formed by the
differences between horizontally adjacent pixels:
\[
h(I) = \frac{\operatorname{var}(I_{y,x-1} - I_{y,x})}{\operatorname{var}(I_{y,x})} \tag{1}
\]
Gels, consisting of horizontal bars, score much lower on this measure than other
types of images. Furthermore, we used 26 Haralick texture features [13]. Images
were then classified into the six panel type classes using a support vector machine
(SVM) based classifier. On this system, we obtain an overall accuracy of 69%.
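Equation 1 is straightforward to compute. The sketch below assumes a 2-D grayscale array indexed as `image[y, x]`; the function name is illustrative, and returning 0 for a completely uniform image is a choice made here to avoid dividing by zero.

```python
import numpy as np


def horizontal_difference_variance_ratio(image):
    """Equation 1: var(I[y, x-1] - I[y, x]) / var(I[y, x])."""
    image = np.asarray(image, dtype=float)
    # Differences between horizontally adjacent pixels.
    differences = image[:, :-1] - image[:, 1:]
    total_variance = image.var()
    if total_variance == 0:
        return 0.0  # uniform image: the ratio is undefined, report 0
    return float(differences.var() / total_variance)
```

An image made only of horizontal bars has identical horizontal neighbours everywhere, so the numerator vanishes and the measure is 0, while an image with vertical structure scores much higher.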
Therefore, the system proceeds through 3 classification levels: the first level
classifies the image into FMI or non-FMI using image-based features; the
second level uses the textual features described above to identify gels with
high precision; finally, if both classifiers gave negative answers, an SVM
operating on image-based features does the final classification.
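The three-level cascade can be sketched as below. The marker lists are the ones given in the text; everything else is an assumption for illustration: the three classifiers are passed in as plain callables standing in for the trained FMI classifier, the ID3 tree, and the SVM, and the matching rules (case-sensitive, whole words for positive markers, substrings for negative ones) are a guess at reasonable behaviour rather than the documented one.

```python
import re

POSITIVE_MARKERS = ["Western", "Northern", "Southern", "blot", "lane", "RT",
                    "RNA", "PAGE", "agarose", "electrophoresis", "expression"]
NEGATIVE_MARKERS = ["bar", "patient", "CT", "MRI"]


def marker_features(caption):
    """Boolean caption features, one per marker."""
    features = {}
    for word in POSITIVE_MARKERS:
        pattern = r"\b" + re.escape(word) + r"\b"
        features["has_" + word] = re.search(pattern, caption) is not None
    for substring in NEGATIVE_MARKERS:
        features["has_" + substring] = substring in caption
    return features


def classify_panel(panel, caption, is_fmi, text_says_gel, image_classifier):
    """Run the cascade: FMI check, then text-based gel check, then image SVM."""
    if is_fmi(panel):                            # level 1: image features
        return "fmi"
    if text_says_gel(marker_features(caption)):  # level 2: caption markers
        return "gel"
    return image_classifier(panel)               # level 3: final image-based call
```

Because the text-based gel detector is tuned for precision rather than recall, its negative answers are not trusted and simply fall through to the last level.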
4.4 Subcellular Location Pattern Classification
Perhaps the most important task that SLIF supports is to extract information
based on the subcellular localization depicted in FMI panels.
To provide training data for pattern classifiers, we hand-labeled a set of
images into four different subcellular location classes: (1) nuclear,
(2) cytoplasmic, (3) punctate, and (4) other, following the active learning
methodology described above for labeling panel types. The active learning loop
was seeded using images from a HeLa cell image collection that we have previously
used to demonstrate the feasibility of automated subcellular pattern
classification [14].
The dataset was filtered to remove images that, once thresholded using the
methods we described previously [14], led to fewer than 80 above-threshold pixels,
a value which was empirically determined. This led to the rejection of 4% of
images. In classification, if an image meets the rejection criterion, it is
assigned to a special "don't know" class.
We computed previously described field-level features to represent the image
patterns (field-level features are features that do not require segmentation of
images into individual cell regions). We added a new feature for the size of
the median object (which is a more robust statistic than the previously used
mean object size). Experiments using stepwise discriminant analysis as a feature
selection algorithm [15] showed that this was an informative feature. If the scale is
inferred from the image, then we normalize this feature value to square microns.
Otherwise, we assume a default scale of 1 µm/pixel.
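A minimal sketch of the median object size feature follows. It assumes the image has already been thresholded into a boolean mask and defines an object as a 4-connected group of above-threshold pixels; the connectivity, the function names, and the pure-Python flood fill are choices made for this example only.

```python
import numpy as np


def object_sizes(binary):
    """Pixel counts of the 4-connected objects in a boolean mask."""
    binary = np.asarray(binary, dtype=bool)
    seen = np.zeros(binary.shape, dtype=bool)
    rows, cols = binary.shape
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r, c] or seen[r, c]:
                continue
            # Flood fill from this unvisited object pixel.
            stack, size = [(r, c)], 0
            seen[r, c] = True
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes


def median_object_size(binary, microns_per_pixel=1.0):
    """Median object area in square microns (default scale: 1 micron per pixel)."""
    sizes = object_sizes(binary)
    if not sizes:
        return 0.0
    return float(np.median(sizes)) * microns_per_pixel ** 2
```

The robustness argument is easy to see on a mask with objects of 1, 4, and 20 pixels: the mean (about 8.3) is pulled up by the single large object, while the median stays at 4.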
We also adapted the threshold adjacency statistic features (TAS) from
Hamilton et al. [16] to a parameter-free version. The original features depended
on a manually controlled two-step binarization of the image. For the first step,
we use the Ridler-Calvard algorithm [17] to identify a threshold instead of a
fixed threshold. The second binarization step involves finding those pixels that
fall into a given interval such as [µ - M, µ + M], where µ is the average pixel
value of the above-threshold pixels and M is a margin (set to 30 in the original
paper). We set M to the standard deviation of the above-threshold pixels (see
footnote 10). We call these parameter-free TAS.
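The two binarization steps can be sketched with NumPy alone. Two parts of this example are assumptions rather than statements from the text: the Ridler-Calvard threshold is implemented as the usual iterative mean-of-class-means scheme, and the adjacency statistics are taken, as we understand Hamilton et al. [16], to be the normalized histogram of how many of its 8 neighbours are also white, computed over white pixels. Function names are illustrative.

```python
import numpy as np


def ridler_calvard_threshold(image, tol=0.5, max_iter=100):
    """Iterative selection threshold: midpoint of the two class means."""
    image = np.asarray(image, dtype=float)
    threshold = image.mean()
    for _ in range(max_iter):
        low, high = image[image <= threshold], image[image > threshold]
        if low.size == 0 or high.size == 0:
            break
        new_threshold = 0.5 * (low.mean() + high.mean())
        converged = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if converged:
            break
    return threshold


def parameter_free_tas(image):
    """Nine adjacency statistics using a data-driven threshold and margin."""
    image = np.asarray(image, dtype=float)
    above = image[image > ridler_calvard_threshold(image)]
    if above.size == 0:
        return np.zeros(9)
    mu, margin = above.mean(), above.std()  # M is the standard deviation
    white = (image >= mu - margin) & (image <= mu + margin)
    # Count white neighbours of every pixel by summing the 8 shifted copies.
    padded = np.pad(white.astype(int), 1)
    height, width = white.shape
    neighbours = np.zeros((height, width), dtype=int)
    for dy in range(3):
        for dx in range(3):
            if (dy, dx) != (1, 1):
                neighbours += padded[dy:dy + height, dx:dx + width]
    counts = np.bincount(neighbours[white], minlength=9)
    if counts.sum() == 0:
        return np.zeros(9)
    return counts / counts.sum()
```

For a single 3 x 3 bright square on a dark background, the four corners have 3 white neighbours, the four edge pixels have 5, and the centre has 8, so the statistics are 4/9, 4/9, and 1/9 in positions 3, 5, and 8.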
On the 3 main classes (nuclear, cytoplasmic, and punctate), we obtained
75% accuracy (as before, reported accuracies are estimated using 10-fold
cross-validation and the classifier used was an SVM). On the four classes, we
obtained 61% accuracy.
4.5 Panel and Scope Association
As discussed above, figures are composed of a set of panels and a set of
sub-images which are too small to be panels. To associate panels with their
caption pointers (e.g., identifying which panel is panel "A" if such a mention
is made in the caption), we parse all panels and other sub-images using optical
character recognition (OCR). In the simple case, the panel contains the panel
annotation and there is a one-to-one match to annotations in the caption.
Otherwise, we match panels to the nearest found in-image annotation.
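The matching rule can be sketched as follows, under simplifying assumptions: OCR has already produced a list of (label, position) pairs, panels are axis-aligned boxes, and "nearest" is measured from the panel centre. The data layout and the function name are hypothetical, and the check against the labels mentioned in the caption is left out.

```python
import math


def associate_panels(panels, annotations):
    """Assign a label to every panel.

    panels: {panel_id: (left, top, right, bottom)}
    annotations: [(label, (x, y)), ...] as found by OCR in the figure
    """
    assignment = {}
    for panel_id, (left, top, right, bottom) in panels.items():
        inside = [label for label, (x, y) in annotations
                  if left <= x <= right and top <= y <= bottom]
        if len(inside) == 1:
            # Simple case: the panel contains exactly one annotation.
            assignment[panel_id] = inside[0]
        elif annotations:
            # Otherwise fall back to the annotation nearest to the panel centre.
            cx, cy = (left + right) / 2, (top + bottom) / 2
            label, _ = min(annotations,
                           key=lambda a: math.hypot(a[1][0] - cx, a[1][1] - cy))
            assignment[panel_id] = label
        else:
            assignment[panel_id] = None
    return assignment
```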
5 Topic Discovery
The previous modules result in panel-segmented, structurally and multi-modally
annotated figures: each figure is composed of multiple panels, and the caption
of the whole figure is parsed into scoped captions, a global caption, and protein
entities. Each scoped caption is associated with a single panel and the global
caption is shared across panels and provides contextual information. Given this
organization, we would like to build a system for querying across modality and
granularity. For instance, the user might want to search for biological figures
given a query composed of key words and protein names (across-modality), or the
user might want to retrieve figures similar to a given panel (across-granularity)
or to another given figure of interest. In this section, we describe our approach
to address this problem using topic models.
Topic models aim to discover a set of latent themes present in the
collection of papers. These themes are called topics and serve as the basis for
visualization and semantic representation. Each topic $k$ consists of a triplet of
distributions: a multinomial distribution over words $\beta_k$, a multinomial
distribution over protein entities $\Omega_k$, and a Gaussian distribution over
every image feature $s$, $(\mu_{k,s}, \sigma_{k,s})$. Given these topics, a
graphical model is defined that generates a figure $f$ (see [18] for a full
description). There are two main steps involved in building our topic model:
learning and inference. In learning, given a set of figures, the goal is to learn
the set of topics $(\beta_k, \Omega_k, \{\mu_{k,s}, \sigma_{k,s}\})$ that
generates the collection using Bayesian inference [18]. In inference, on the other
hand, given the discovered topics and a new figure $f$, the goal is to deduce the
latent representation of this figure,
$\theta_f = (\theta_{f,1}, \ldots, \theta_{f,K})$, where the component
$\theta_{f,k}$ defines how likely topic $k$ is to appear in figure $f$. Moreover,
for each panel $p$ in figure $f$, the inference step also deduces its latent
representation: $\theta_{f,p} = (\theta_{f,p,1}, \ldots, \theta_{f,p,K})$. In
addition, from the learning step, each word $w$ and protein entity $r$ can also be
represented as a point in the topic space:
$\beta_w = (\beta_{1,w}, \ldots, \beta_{K,w})$ and
$\Omega_r = (\Omega_{1,r}, \ldots, \Omega_{K,r})$.
Footnote 10: Other methods for binarizing the image presented by Hamilton et al.
are handled analogously.
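This results in a unified space where each figure, panel, word and protein
entity is described using a point in this space, which facilitates querying across
modality and granularity. For instance, given a query
$q = (w_1, \ldots, w_n, r_1, \ldots, r_m)$ composed of a set of text words and
protein entities, we can rank figures according to this query using the query
language model [19] as follows:
\[
\begin{aligned}
P(q \mid f) &= \prod_{w \in q} P(w \mid f) \prod_{r \in q} P(r \mid f) \\
 &= \prod_{w \in q} \Big[ \sum_k \theta_{f,k} \beta_{k,w} \Big]
    \prod_{r \in q} \Big[ \sum_k \theta_{f,k} \Omega_{k,r} \Big] \\
 &= \prod_{w \in q} \big[ \theta_f \cdot \beta_w \big]
    \prod_{r \in q} \big[ \theta_f \cdot \Omega_r \big]
\end{aligned}
\tag{2}
\]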
Equation 2 is a simple dot product operation between the latent representations
of each query item and the latent representation of the figure in the induced
topical space. The above measure can then be used to rank figures for retrieval.
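A small NumPy sketch shows how Equation 2 turns into a ranking. It assumes the learned quantities are available as dense arrays: `theta` with one row of topic proportions per figure, `beta` with one row per topic over the word vocabulary, and `omega` with one row per topic over protein entities. The array layout and function names are assumptions for this example, not the SLIF code.

```python
import numpy as np


def query_likelihood(theta_f, beta, omega, word_ids, protein_ids):
    """P(q | f) from Equation 2 for one figure.

    theta_f: shape (K,); beta: shape (K, V); omega: shape (K, R).
    word_ids and protein_ids are lists of integer indices forming the query.
    """
    theta_f = np.asarray(theta_f, dtype=float)
    beta = np.asarray(beta, dtype=float)
    omega = np.asarray(omega, dtype=float)
    # One dot product theta_f . beta_w per query word, likewise per protein.
    word_probs = theta_f @ beta[:, word_ids]
    protein_probs = theta_f @ omega[:, protein_ids]
    return float(np.prod(word_probs) * np.prod(protein_probs))


def rank_figures(theta, beta, omega, word_ids, protein_ids):
    """Return figure indices ordered from most to least likely for the query."""
    scores = [query_likelihood(t, beta, omega, word_ids, protein_ids)
              for t in theta]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

For long queries the product of many small probabilities can underflow; summing logarithms instead gives the same ranking.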
Moreover, given a figure of interest $f$, other figures in the database can be
ranked based on similarity to this figure as follows:
\[
\operatorname{sim}(f' \mid f) = \sum_k \theta_{f,k} \theta_{f',k} = \theta_f \cdot \theta_{f'} \tag{3}
\]
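Equation 3 amounts to one matrix-vector product over the stored figure representations. The sketch below assumes they are stacked in an array `theta` with one row per figure; the function name and the `top_n` parameter are illustrative.

```python
import numpy as np


def similar_figures(theta, query_index, top_n=5):
    """Rank the other figures by the dot-product similarity of Equation 3."""
    theta = np.asarray(theta, dtype=float)
    # sim(f' | f) for every figure f' at once.
    scores = theta @ theta[query_index]
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if i != query_index][:top_n]
```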
In addition to the above capabilities, the discovered topics endow the user
with a bird's eye view over the paper collection and can serve as the basis for
visualization and structured browsing. Each topic $k$ summarizes a theme in the
collection and can be represented to the user along three dimensions: top words
(having high values of $\beta_{k,w}$), top protein entities (having high values of
$\Omega_{k,r}$), and a set of representative panels (panels with high values of
$\theta_{f,p,k}$). Users can decide to display all panels (figures) that are
relevant to a particular topic of interest [5, 18].
6 Discussion
We have presented a new version of SLIF, a system that analyzes images and
their associated captions in biomedical papers. SLIF demonstrates how
text-mining and image processing can intermingle to extract information from
scientific figures. Figures are broken down into their constituent panels, which
are handled separately. Panels are classified into different types, with the
current focus on FMI and gel images. FMIs are further processed by classifying
them into their depicted subcellular location pattern. The results of this
pipeline are made available through either a web interface or programmatically
using SOAP technology.
A new addition to our system is latent topic discovery, which is performed
using both text and image. This is based on extending traditional models to
handle the structure of the literature and allows us to customize these models
with domain knowledge (by integrating the subcellular localization looked up
from a database, we can see relations between papers using knowledge present
outside of them).
Our most recent human-labeling efforts (of panel types and subcellular
location) were performed using active learning to extract the most out of the
human effort. We plan to replicate this approach in the future for any other
labeling effort (e.g., adding a new collection of papers). Our current labeling
efforts were necessary to collect a dataset that mimicked the characteristics of
the task at hand (images from published literature) and improve on our
previous use of datasets that did not show all the variations present in real
published datasets. These datasets are available for download from the SLIF
webpage (http://slif.cbi.cmu.edu) so that they can be used by other system
developers and for building improved pattern classifiers.
6.1 Acknowledgments
The SLIF project is currently supported by NIH grant R01 GM078622. L.P.C. was
partially supported by a grant from the Fundação para a Ciência e Tecnologia
(grant SFRH/BD/37535/2007).
References
1. Murphy, R.F., Velliste, M., Yao, J., Porreca, G.: Searching online journals for
fluorescence microscope images depicting protein subcellular location patterns.
In: BIBE '01: Proceedings of the 2nd IEEE International Symposium on
Bioinformatics and Bioengineering, Washington, DC, USA, IEEE Computer Society
(2001) 119-128
2. Cohen, W.W., Wang, R., Murphy, R.F.: Understanding captions in biomedical
publications. In: KDD '03: Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, New York, NY, USA, ACM
(2003) 499-504
3. Murphy, R.F., Kou, Z., Hua, J., Joffe, M., Cohen, W.W.: Extracting and
structuring subcellular location information from on-line journal articles: The
subcellular location image finder. In: Proceedings of IASTED International
Conference on Knowledge Sharing and Collaborative Engineering. (2004) 109-114
4. Kou, Z., Cohen, W.W., Murphy, R.F.: A stacked graphical model for associating
sub-images with sub-captions. In Altman, R.B., Dunker, A.K., Hunter, L., Murray,
T., Klein, T.E., eds.: Proceedings of the Pacific Symposium on Biocomputing,
World Scientific (2007) 257-268
5. Ahmed, A., Arnold, A., Coelho, L.P., Kangas, J., Sheikh, A.S., Xing, E.P., Cohen,
W.W., Murphy, R.F.: Structured literature image finder: Parsing text and figures
in biomedical literature. Journal of Web Semantics (in press) (2009)
6. Gingras, D., Michaud, M., Tomasso, G.D., Béliveau, E., Nyalendo, C., Béliveau,
R.: Sphingosine-1-phosphate induces the association of membrane-type 1 matrix
metalloproteinase with p130Cas in endothelial cells. FEBS Letters 582(3) (2008)
399-404
7. Kou, Z., Cohen, W.W., Murphy, R.F.: High-recall protein entity recognition using
a dictionary. Bioinformatics 21 (2005) i266-i273
8. Geusebroek, J.M., Hoang, M.A., van Gemert, J., Worring, M.: Genre-based search
through biomedical images. In: Proceedings of 16th International Conference on
Pattern Recognition. Volume 1. (2002) 271-274
9. Shatkay, H., Chen, N., Blostein, D.: Integrating image data into biomedical text
categorization. Bioinformatics 22(14) (2006) e446-e453
10. Rafkind, B., Lee, M., Chang, S., Yu, H.: Exploring text and image features to
classify images in bioscience literature. In: Proceedings of the BioNLP Workshop on
Linking Natural Language Processing and Biology at HLT-NAACL, Morristown,
NJ, USA, Association for Computational Linguistics (2006) 73-80
11. Roy, N., McCallum, A.: Toward optimal active learning through sampling
estimation of error reduction. In: Proc. 18th International Conf. on Machine
Learning, Morgan Kaufmann (2001) 441-448
12. Mitchell, T.M.: Machine Learning. McGraw-Hill (1997)
13. Haralick, R.M.: Statistical and structural approaches to texture. Proceedings of
the IEEE 67 (1979) 786-804
14. Boland, M.V., Murphy, R.F.: A neural network classifier capable of recognizing
the patterns of all major subcellular structures in fluorescence microscope images
of HeLa cells. Bioinformatics 17(12) (2001) 1213-1223
15. Jennrich, R.: Stepwise Regression & Stepwise Discriminant Analysis. In: Statistical
Methods for Digital Computers. John Wiley & Sons, Inc, New York (1977) 58-95
16. Hamilton, N., Pantelic, R., Hanson, K., Teasdale, R.: Fast automated cell
phenotype image classification. BMC Bioinformatics 8(1) (2007) 110
17. Ridler, T., Calvard, S.: Picture thresholding using an iterative selection method.
IEEE Trans. Systems, Man and Cybernetics 8(8) (August 1978) 629-632
18. Ahmed, A., Xing, E.P., Cohen, W.W., Murphy, R.F.: Structured correspondence
topic models for mining captioned figures in biological literature. In: Proceedings
of The Fifteenth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD 2009), New York, NY, USA, ACM (2009) 39-47
19. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval.
In: SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference
on Research and development in information retrieval, New York, NY, USA, ACM
(1998) 275-281
