Unsupervised Unmixing of Subcellular Location Patterns
Abstract
With the advent of high-throughput microscopes, researchers can routinely image hundreds of different proteins per day, generating thousands of images. To be able to organize these images and extract meaningful information, we need automatic methods. The state-of-the-art in automated subcellular localization is classification in the space of image features. This approach is not suited, however, for handling mixture patterns (the pattern of a protein present in more than one location). We have previously described methods for determining the fraction of fluorescence in various subcellular locations when the basic locations in which a protein can be present are given a priori. However, knowing all fundamental patterns a priori may be problematic. The alternative is unsupervised unmixing: given a set of images from different proteins, identify the basic patterns that best explain all the observed images as either examples of such basic patterns or combinations thereof. We extend our previous work to handle this problem. Using a validation dataset, we show that this method can recover the underlying mixed patterns. It identifies meaningful basis patterns and mixture coefficients that correlate well with the probe concentrations that generated the dataset (the probe concentrations were kept hidden from the algorithm).
Luís Pedro Coelho lpc@cmu.edu
Robert F. Murphy murphy@cmu.edu
Lane Center for Computational Biology, Carnegie Mellon University and Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, 4400 Fifth Ave, Pittsburgh, PA 15217 USA
1. Introduction
The current best method for determining protein location is by fluorescence imaging of tagged cells. For proteome-wide studies of subcellular localization, the amounts of data are so large that automated methods are required.
The traditional approach for automated protein subcellular localization from images is a discriminative feature-based approach. In this approach, a set of features is computed from each image and computation proceeds in this space of features, where machine learning methods can be applied. This approach has shown good results when used to identify images based on training data or for clustering unlabeled images (Boland & Murphy, 2001; Chen & Murphy, 2005; Hamilton et al., 2007).
The discriminative approach is not suited, however, for handling mixture patterns. A mixture of patterns occurs when a protein (or other marker) is present in more than one location. For example, one expects that some proteins are present in endosomes, others in lysosomes, while others will be present in both of these. With the discriminative approach, this will often result in three independent classes, a situation where the relationship between the pattern classes is neither represented nor discoverable from data. In general, there is no reason to expect that feature values of a mixture pattern will have any meaningful relationship with the feature values of the base patterns that compose it.
We have previously described methods for determining the fraction of fluorescence in various subcellular patterns (Zhao et al., 2005; Peng et al., 2009). However, this "pattern unmixing" approach requires that the user specify the fundamental patterns of which all other patterns are composed.
Deciding in advance which are the fundamental patterns that occur in a large collection of images may be challenging and require a lot of guessing by the researcher. The alternative is unsupervised unmixing: given a set of images with different proteins tagged, identify the basic patterns that best explain all the observed images as either examples of such basic patterns or combinations thereof.
The main contribution of this paper is a method to
solve the unsupervised subcellular pattern unmixing
problem. This method is validated in a test dataset
where two probes were mixed in known concentrations.
These concentrations were, however, kept hidden from
the algorithm and are used for validation only.
2. Methods
The methods presented here are an extension of the
methods presented previously for supervised pattern
unmixing (Zhao et al., 2005).
The algorithm starts by processing each image to extract salient objects. Objects are described by an 11-dimensional feature vector (using the same features as in Zhao et al., 2005). These features are used to cluster objects into object types. The intuition is that different patterns consist of different object types, while mixed patterns show objects from their constituent classes in proportion to the mixture coefficients.
Object Detection
Objects are defined as contiguous regions of non-background fluorescence. The previous work on pattern unmixing used a global threshold to find objects (Zhao et al., 2005). For the images used in the current study, we found that global thresholding methods correctly differentiate between the cell region and background, but do not differentiate between bright objects inside the cell and general cellular fluorescence. Local thresholds, on the other hand, work as intended inside the cell region, but pick up noise in background regions. Therefore, we employ a hybrid method and consider a pixel to be inside an object only if it is above both a global and a locally determined threshold. Once images have been binarized, objects are defined as above-threshold contiguous regions. In this work, we used Ridler-Calvard (Ridler & Calvard, 1978) for global thresholding and the mean value of the 15 × 15 pixel window centred on the pixel of interest as the local threshold.
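The hybrid thresholding step can be sketched as follows. This is a minimal pure-Python illustration and not the implementation used in this work: the function names, the convergence tolerance `tol`, the clipping of the local window at image borders, and the use of 4-connectivity to define contiguous regions are all assumptions made for the example.

```python
from collections import deque

def ridler_calvard(pixels, tol=0.5):
    """Iterative selection: move the threshold to the midpoint of the two class means."""
    t = sum(pixels) / len(pixels)
    while True:
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            return t  # degenerate (flat) image: nothing to separate
        new_t = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
        if abs(new_t - t) < tol:
            return new_t
        t = new_t

def local_mean(img, r, c, half=7):
    """Mean of the (2*half+1)-wide window centred on (r, c); half=7 gives 15 x 15.
    Assumption: the window is clipped at the image borders."""
    rows, cols = len(img), len(img[0])
    vals = [img[i][j]
            for i in range(max(0, r - half), min(rows, r + half + 1))
            for j in range(max(0, c - half), min(cols, c + half + 1))]
    return sum(vals) / len(vals)

def hybrid_binarize(img, half=7):
    """A pixel is foreground only if it exceeds BOTH the global and the local threshold."""
    g = ridler_calvard([p for row in img for p in row])
    return [[img[r][c] > g and img[r][c] > local_mean(img, r, c, half)
             for c in range(len(img[0]))] for r in range(len(img))]

def label_objects(mask):
    """Group foreground pixels into objects (assumption: 4-connected regions)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                i, j = queue.popleft()
                pixels.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and mask[ni][nj] and not seen[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            objects.append(pixels)
    return objects
```

Calling `label_objects(hybrid_binarize(img))` on a grey-level image stored as a list of rows returns one list of pixel coordinates per object. A faint isolated pixel in the background can pass the local test while failing the global one, which is the case the hybrid rule is meant to reject.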
Clustering and Unmixing
Each object is characterized by its feature vector (normalized to z-scores). All the objects in the collection are then clustered using k-means clustering. For each value of k, 10 random restarts are performed. The final number of clusters is determined by the Bayesian information criterion.
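As an illustration of this step, the sketch below normalises features to z-scores, runs k-means with random restarts, and picks k by a BIC score. The exact BIC formula is not stated in the text; the one used here (hard-assignment likelihood under spherical Gaussians with a shared variance, plus cluster proportions) is an assumption, as are all function names and the fixed random seed.

```python
import math
import random

def zscore(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            for p in points]

def kmeans(points, k, rng, max_iter=100):
    """One k-means run from k randomly chosen data points; returns labels and SSE."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, sse

def bic(labels, sse, n, d, k):
    """Assumed BIC: spherical Gaussians, shared variance, hard assignments (lower is better)."""
    var = max(sse, 1e-12) / (n * d)
    sizes = [labels.count(j) for j in range(k)]
    log_lik = sum(s * math.log(s / n) for s in sizes if s > 0)
    log_lik -= 0.5 * n * d * (math.log(2 * math.pi * var) + 1)
    n_params = k * d + k  # k centers, k-1 proportions, 1 variance
    return -2 * log_lik + n_params * math.log(n)

def select_k(points, k_values, restarts=10, seed=0):
    """Best-of-`restarts` k-means for each k; return the k with the lowest BIC."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    best = None
    for k in k_values:
        labels, sse = min((kmeans(points, k, rng) for _ in range(restarts)),
                          key=lambda run: run[1])
        score = bic(labels, sse, n, d, k)
        if best is None or score < best[0]:
            best = (score, k, labels)
    return best[1], best[2]
```

A call such as `select_k(zscore(features), range(1, 30))` returns the chosen number of clusters and one cluster index per object; the range of k values tried is not given in the text and is a placeholder here.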
After clustering, each object is assigned a cluster index. An image can then be summarised by simply counting the number of its constituent objects that belong to each class, i.e., for each image we obtain a vector of counts $x \in \mathbb{R}^d$, where $d$ is the number of object clusters.
In order to compensate for differing cell counts in each image, we normalise the vector $x$ to be a vector of object fractions. Furthermore, preliminary results showed that the fit was dominated by frequent object types, which are present in great numbers in many images. Being so common, these objects were not discriminative. This led us to remove object types that appear in over 90% of images.
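The construction of the per-image vectors can be sketched as below. The function name `object_fractions` and its arguments are hypothetical, and the order of operations (dropping the overly common object types first, then normalising the remaining counts to sum to one) is an assumption, since the text does not say which comes first.

```python
def object_fractions(labels_per_image, n_types, max_presence=0.9):
    """Turn per-image lists of object cluster indices into vectors of object fractions.

    Object types present in more than `max_presence` of the images are dropped.
    Assumption: fractions are computed over the retained types only.
    """
    counts = [[labels.count(t) for t in range(n_types)]
              for labels in labels_per_image]
    n_images = len(counts)
    keep = [t for t in range(n_types)
            if sum(1 for row in counts if row[t] > 0) <= max_presence * n_images]
    fractions = []
    for row in counts:
        kept = [row[t] for t in keep]
        total = sum(kept)
        # An image with no retained objects is represented by an all-zero vector.
        fractions.append([c / total if total else 0.0 for c in kept])
    return keep, fractions
```

The returned `keep` list records which object types survive, so that fractions can be mapped back to the original cluster indices.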
Our generative model for images is that a set of basis vectors $B = \{b_j\}_j$ combines with a vector of counts $\alpha$ to form an image by

$$x = \sum_j \alpha_j b_j \tag{1}$$
In the unsupervised unmixing problem, we wish to invert the generative process. Given a set of images represented as vectors of counts $D = \{x_i\}_i$, we need to find a set of basis vectors $B = \{b_j\}_j$ and mixing coefficients $\alpha_{ij}$, such that

$$x_i = \sum_j \alpha_{ij} b_j + \varepsilon_i, \tag{2}$$

where we wish to minimize $\sum_i \varepsilon_i^2$, subject to the constraint $\alpha_{ij} \geq 0$.
We add additional restrictions that guide the system towards a more meaningful answer. In particular, we restrict possible basis vectors to elements of the collection (i.e., $b_j \in D$). This encodes the idea that our collection is broad enough to contain examples of the base classes and aids the interpretability of the final result. In the Discussion section, we reflect upon the implications of this choice.
Additionally, we consider that each mixture is a mixture of a small number of bases and therefore bias the search towards sparser solutions. As the simplest implementation of this idea, we constrain the mixture vector $\alpha_i$ corresponding to image $i$ to have a small number of non-zero components.
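To make the constrained problem concrete, the sketch below fits non-negative coefficients with at most two non-zero components and selects the basis vectors from the collection itself. The search procedure is not described in the text: the exhaustive search over subsets of images, the limit of two active bases per image, and all function names are assumptions chosen to keep the example short, and the search cost grows combinatorially with the size of the collection.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residual(x, bases, coefs):
    """Squared error of approximating x by the weighted sum of the bases."""
    approx = [sum(c * b[i] for c, b in zip(coefs, bases)) for i in range(len(x))]
    return sum((xi - ai) ** 2 for xi, ai in zip(x, approx))

def fit_sparse_nonneg(x, bases, max_nonzero=2):
    """Least-squares fit with non-negative coefficients, at most 2 of them non-zero.

    Every support of size 1 or 2 is tried; a support is admissible only if its
    unconstrained least-squares solution is non-negative.
    """
    m = len(bases)
    candidates = []
    for j in range(m):
        bb = dot(bases[j], bases[j])
        if bb > 0:
            coefs = [0.0] * m
            coefs[j] = max(0.0, dot(x, bases[j]) / bb)
            candidates.append(coefs)
    if max_nonzero >= 2:
        for j, k in combinations(range(m), 2):
            a11, a12, a22 = dot(bases[j], bases[j]), dot(bases[j], bases[k]), dot(bases[k], bases[k])
            det = a11 * a22 - a12 * a12
            if det <= 1e-12:
                continue  # (nearly) collinear pair: skip
            r1, r2 = dot(x, bases[j]), dot(x, bases[k])
            cj = (r1 * a22 - r2 * a12) / det  # Cramer's rule on the 2x2 normal equations
            ck = (a11 * r2 - a12 * r1) / det
            if cj >= 0 and ck >= 0:
                coefs = [0.0] * m
                coefs[j], coefs[k] = cj, ck
                candidates.append(coefs)
    best = ([0.0] * m, dot(x, x))  # all-zero coefficients as the fallback
    for coefs in candidates:
        err = residual(x, bases, coefs)
        if err < best[1]:
            best = (coefs, err)
    return best

def unmix(images, n_bases=2, max_nonzero=2):
    """Choose n_bases images from the collection as bases (b_j in D) by exhaustive search."""
    best = None
    for idx in combinations(range(len(images)), n_bases):
        bases = [images[i] for i in idx]
        fits = [fit_sparse_nonneg(x, bases, max_nonzero) for x in images]
        total = sum(err for _, err in fits)
        if best is None or total < best[0]:
            best = (total, idx, [coefs for coefs, _ in fits])
    return best[1], best[2]
```

`unmix(fractions)` returns the indices of the images selected as bases and, for every image, its vector of mixing coefficients with respect to those bases.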
3. Results
Dataset and Criteria
In order to validate this model, we used a test dataset of images created for the first validation of subcellular pattern unmixing in real images for cells (Peng et al., 2009). In this collection, cells were labeled with 8 different concentrations of lysotracker and mitotracker (for a total of 64 possible combinations). The emission spectra of both probes are similar, and thus the amount of each added cannot be distinguished using filters. This mimics the situation in which a protein is present in varying amounts in lysosomes and mitochondria.

Figure 1. Examples of selected bases: (a) Lyso-dominant, (b) Mitotracker. These are the first images corresponding to the selected samples. Images are false color panels, with red showing the nuclear channel and green the tracker channel. Images have been manually contrast stretched for publication.
Two criteria were used for evaluating the quality of the
solution obtained:
1. The discovered basis sets should correspond to the basis patterns (one basis vector corresponding to the lysosomal pattern and the other to the mitochondrial pattern).
2. The inferred mixture coefficients $\alpha_{ij}$ for a given sample should correlate with the actual probe concentrations used to label that sample. We will compare our results to those obtained by supervised unmixing.
Unmixing Results
The algorithm returned two of the test samples as the bases for unmixing. One of the bases was a pure mitotracker sample, while in the other, lysotracker is dominant. Figure 1 shows the first image corresponding to each basis as an example.
Given that the $\{x_i\}_i$ vectors were normalised to one, we compare the coefficients obtained from the algorithms with the underlying concentration fractions. For some images, the estimated fractions were zero. Discarding those images leads to correlations with the underlying fractions of 77% and 67%, respectively. This compares with 84% and 71% in the supervised case. The unsupervised algorithm does not do as well as the supervised version, but the difference is not very large. The full results are displayed as heat maps in Figure 2, where we can see that the unsupervised results are qualitatively very similar to the supervised results.
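The comparison with the hidden concentrations can be sketched as follows. Several details are assumptions made for illustration only: the use of the Pearson correlation, the definition of the underlying fraction of a probe as its concentration divided by the sum of both concentrations, and the rule that an image is discarded when all of its estimated coefficients are zero. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fraction_correlation(coefs, concentrations, basis):
    """Correlate estimated and underlying fractions for one basis pattern.

    coefs:          per-image mixture coefficients, one entry per basis
    concentrations: per-image probe concentrations, in the same order as the bases
    basis:          index of the pattern to evaluate
    """
    estimated, underlying = [], []
    for alpha, conc in zip(coefs, concentrations):
        if sum(alpha) == 0 or sum(conc) == 0:
            continue  # assumption: skip images with no estimate or no probe
        estimated.append(alpha[basis] / sum(alpha))
        underlying.append(conc[basis] / sum(conc))
    return pearson(estimated, underlying)
```

The function would be called once per basis pattern, giving one correlation value for each of the two probes.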
4. Discussion
We have presented an extension of the supervised unmixing problem to handle the unsupervised case. The model is based on automatic object detection and clustering. Images are then summarised by a simple vector in a low dimensional space and unmixing proceeds there. This extends our previously presented methods for supervised unmixing.
A validation dataset where conditions were known showed that the method can recover most of the structure of mixture patterns. One of the returned bases was a pure mitotracker pattern, while the other was a lysotracker dominant sample (Figure 1). The inferred mixture coefficients are highly correlated with the underlying concentrations that were used to generate the dataset.
The additive model we presented is a variation on traditional types of unsupervised dimensionality reduction. However, the restrictions we added, particularly the restriction that basis vectors be present in the source dataset, significantly change the nature of the problem and the interpretability of the solution. They rule out approaches based on principal component analysis or non-negative matrix approximation (Lee & Seung, 2000; Berry et al., 2007). We believe that the solutions obtained by our method are more appropriate when the goal is to organise a large collection of images in a meaningful way. That we can point to example images such as those presented in Figure 1 allows one to easily communicate the results of the algorithm. Other dimensionality reduction procedures often require a posteriori interpretation of the inferred basis, which might be appropriate in some domains (e.g., interpreting a set of words in a textual problem), but would be cumbersome for large sets of images.

Figure 2. Estimation results of mixture fractions for supervised and unsupervised unmixing: (a) Supervised Unmixing, (b) Unsupervised Unmixing. In both the case of supervised and of unsupervised unmixing, we plot the estimate of the fraction of lysosomal pattern as a function of the hidden concentration of mitotracker and lysotracker. In both cases, dark brown corresponds to 0.0 and white corresponds to 1.0. Mitotracker concentration varies along the horizontal axis (ticks: 0, 29, 49, 84, 142, 242, 412, 700), while lysotracker varies along the vertical axis (ticks: 0, 40, 56, 78, 109, 153, 214, 300).
The validation dataset used here was obtained using an automated microscope and used without any hand filtering of the data or special processing. This enables us to use this method in the large-data setting where human inspection of the images is impossible.
We are currently working on a graphical user interface to these methods. This will make our software implementation usable by a wider audience.
Acknowledgements
This work was funded by NIH grant GM075205. LPC was partially funded by the Fundação Para a Ciência e Tecnologia (grant SFRH/BD/37535/2007) as well as a fellowship from the Fulbright Program.
The authors thank Tao Peng, Estelle Glory, Ghislain Bonamy, Sumit Chanda, and Daniel Rines for providing images as well as many helpful discussions.
References
Berry, M. W., Browne, M., Langville, A. N., Pauca, P. V., & Plemmons, R. J. (2007). Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52, 155–173.
Boland, M. V., & Murphy, R. F. (2001). A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics, 17, 1213–1223.
Chen, X., & Murphy, R. F. (2005). Objective clustering of proteins based on subcellular location patterns. Journal of Biomedicine and Biotechnology, 2005, 87–95. doi:10.1155/JBB.2005.87.
Hamilton, N., Pantelic, R., Hanson, K., & Teasdale, R. (2007). Fast automated cell phenotype image classification. BMC Bioinformatics, 8, 110.
Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. NIPS (pp. 556–562).
Peng, T., Bonamy, G. M., Glory, E., Rines, D., Chanda, S. K., & Murphy, R. F. (2009). Automated unmixing of subcellular patterns: Determining the distribution of probes between different subcellular locations. Proceedings of the National Academy of Sciences (Submitted).
Ridler, T., & Calvard, S. (1978). Picture thresholding using an iterative selection method. IEEE Transactions on Systems, Man and Cybernetics, 8, 630–632.
Zhao, T., Velliste, M., Boland, M. V., & Murphy, R. F. (2005). Object type recognition for automated analysis of protein subcellular location. IEEE Transactions on Image Processing, 14, 1351–1359.