
Revisiting the Hubble sequence in the SDSS DR7 spectroscopic sample: a publicly available Bayesian automated classification

by Marc Huertas-Company, J A L Aguerri, M Bernardi, S Mei, J Sánchez Almeida
Astronomy & Astrophysics (2010)

Abstract

We present an automated morphological classification in 4 types (E,S0,Sab,Scd) of ~700,000 galaxies from the SDSS DR7 spectroscopic sample based on support vector machines. The main new property of the classification is that we associate with each galaxy a probability of being in each of the four morphological classes instead of assigning a single class. The classification is therefore better adapted to nature, where we expect a continuous transition between different morphological types. The algorithm is trained with a visual classification and then compared to several independent visual classifications including the Galaxy Zoo first release catalog. We find a very good correlation between the automated classification and classical visual ones. The compiled catalog is intended for use in different applications and can be downloaded at http://gepicom04.obspm.fr/sdss_morphology/Morphology_2010.html and soon from the CasJobs database.


Available from arxiv.org

arXiv:1010.3018v2 [astro-ph.CO] 29 Oct 2010
Astronomy & Astrophysics manuscript no. 15735 © ESO 2010
November 1, 2010
Revisiting the Hubble sequence in the SDSS DR7 spectroscopic
sample: a publicly available Bayesian automated classification
M. Huertas-Company¹,², J.A.L. Aguerri³, M. Bernardi⁴, S. Mei¹,², and J. Sánchez Almeida³
¹ GEPI, Paris-Meudon Observatory 5, Place Jules Janssen, 92190, Meudon, France
e-mail: marc.huertas@obspm.fr
² Université Paris Diderot, 75205 Paris Cedex 13, France
³ Instituto de Astrofísica de Canarias, C/ Vía Láctea s/n, 38200 La Laguna, Spain
⁴ Department of Physics & Astronomy, University of Pennsylvania, 209 S. 33rd St., Philadelphia, PA 19104, USA
Received September 15, 1996; accepted March 16, 1997
ABSTRACT
We present an automated morphological classification in 4 types (E,S0,Sab,Scd) of ∼ 700,000 galaxies from the SDSS DR7
spectroscopic sample based on support vector machines. The main new property of the classification is that we associate a
probability to each galaxy of being in the four morphological classes instead of assigning a single class. The classification
is therefore better adapted to nature where we expect a continuous transition between different morphological types. The
algorithm is trained with a visual classification and then compared to several independent visual classifications including the
Galaxy Zoo first-release catalog. We find a very good correlation between the automated classification and classical visual ones.
The compiled catalog is intended for use in different applications and is therefore freely available through a dedicated webpage⋆
and soon from the CasJobs database.
Key words. Catalogs, Astronomical databases, Galaxies:evolution, Galaxies:formation, Galaxies:fundamental parameters
1. Introduction
Classification of objects is a key step in understanding and
analyzing an astrophysical sample. In particular, morphol-
ogy is a powerful tracer of the structure of a galaxy. Since
Hubble’s first classification of galaxies according to their
shape (Hubble 1926), it has been shown that this phenomeno-
logical description hides important physical differences be-
tween galaxies and probably different evolutionary tracks.
Elliptical galaxies appear with old stellar populations, high
velocity dispersion, and small fraction of gas while spiral
galaxies are more gas-rich, with younger stellar populations
whose motion is rotation dominated.
The main problem with morphology comes from estima-
tion, since, even when done through visual inspection, there
are several intrinsic problems that can hardly be overcome.
First, when one goes to high redshift, several new galaxies
appear that do not necessarily fit in the Hubble fork (e.g.
Abraham et al. 1994; Abraham et al. 1996; Conselice et al.
2008; Delgado-Serrano et al. 2010), and secondly, everybody
who has looked at galaxies in detail has realized how difficult
it is to classify them by eye since there are lots of objects that
do not fall in a clear box (e.g. Postman et al. 2005). This be-
comes even worse when other parameters are included such
as colors or stellar dynamics. For example, Schawinski et al.
(2009) and Kannappan et al. (2009) have found a signifi-
cant fraction of elliptical galaxies with blue colors in the lo-
cal Universe. In the SAURON project (e.g. Emsellem et al.
⋆ http://gepicom04.obspm.fr/sdss_morphology/Morphology_2010.html
2007), one of the main conclusions is that a significant frac-
tion of morphologically defined early-type galaxies present
features similar to late-type ones, such as rotation in their
cores. The definition of an early or late type galaxy is con-
sequently not very clear. What defines a given galaxy type?
Is it just a shape and bulge fraction? or is it shape and stellar
populations? or is it stellar dynamics? Almost eighty years af-
ter Hubble’s definition, these questions remain unanswered.
It seems that, instead of being a closed definition, there
is more like a continuous population of galaxies with some
canonical objects, prototypes of elliptical, or spiral galaxies
and then some galaxies that are more or less close to the defi-
nition. Consequently, it makes more sense to assign distances
or probabilities of being in one of the canonical classes in-
stead of having a binary definition that is not necessarily very
close to reality.
In addition to these intrinsic issues, there are methodolog-
ical problems as well because morphological classifications
are, by definition, done by visual inspection. This job can be
done on small samples but becomes an impossible task in very
large surveys such as the SDSS, unless it is done through the
aggregated efforts of hundreds of thousands of people over
the course of many months as for the Galaxy Zoo project
(Lintott et al. 2008, 2010).
Lots of effort has been made to try to determine mor-
phology in an automated and simple way by measuring
some parameters, such as concentration, asymmetry, clumpi-
ness, Gini index (e.g. Abraham et al. 1996; Conselice et al.
2000; Abraham et al. 2003) or through 1D (Prieto et al.
2001; Trujillo et al. 2001) or 2D-fitting algorithms (e.g.
Simard et al. 2002; Peng et al. 2002; de Souza et al. 2004;
Méndez-Abreu et al. 2008). More sophisticated classifications
include colors and color gradients (e.g. Neichel et al.
2008) or use neural networks (e.g. Ball et al. 2004); however,
all these methods deal with a finite number of classes and/or at
some point require a degree of human intervention. Moreover,
one can still argue that automated classifications are not real
morphological classifications since we are just measuring pa-
rameters of the light distribution while morphology is a much
more complex pattern recognition problem.
In Huertas-Company et al. (2008, 2009b) we presented a
method based on support vector machines (galSVM). It was
initially designed for high-redshift galaxies, and it has the
advantage of dealing with an unlimited number of parame-
ters and assigning probabilities instead of binary classes. We
showed that, when applied to poorly resolved samples, it in-
creases the accuracy by a factor of ∼ 3, compared to more
classical methods. The method has already been used and
validated in a variety of different cases on space and ground-
based data to study, for instance, the fraction of blue early-
type galaxies in the field (Huertas-Company et al. 2010) and
the morphological mixing in clusters at intermediate redshift
(Huertas-Company et al. 2009a).
In this paper, we revisit the Hubble sequence in the SDSS
DR7 spectroscopic sample using this method and assign a
probability to each galaxy of being in the following morpho-
logical classes: E,S0,Sab,Scd, instead of a closed class. The
paper proceeds as follows. In section 2 we describe the sample
used, and in section 3 the method employed for the classi-
fication is presented in detail. We discuss the robustness of
the classification at the faint end in section 4 and a compar-
ison with a detailed visual classification of ∼14000 galaxies
(Nair & Abraham 2010) and with the Galaxy Zoo first release
catalog (Lintott et al. 2010) is shown in section 5. Finally, we
show some examples of how to use this catalog in section 6.
2. The sample
We used all the SDSS DR7 spectroscopic sample as the
starting base. Then, the selection of objects was based
on Sánchez Almeida et al. (2010), who performed an unsupervised
automated classification of all the SDSS spectra.
Basically, we chose galaxies with redshift below 0.25 and
with good photometric data and clean spectra, meaning that we
excluded objects too close to the edges, saturated, or not
properly deblended. The final catalog contains 698420 objects for
which we estimate the morphology as shown below. No addi-
tional selection criteria were added so that the catalog is not
biased to any particular application.
3. The method
The classification method is based on support vector
machines (SVM) implemented in the libSVM library
(Chang & Lin 2001). SVM is a machine learning algorithm
that tries to find the optimal boundary (not necessarily lin-
ear) between several clouds of points in an N-dimensional
space. More information about the algorithm can be found in
Huertas-Company et al. (2008). There are several interesting
properties that make this algorithm attractive for galaxy clas-
sification. First, it can deal with an unlimited number of di-
mensions so that everything that is related to the classes one
would like to separate can be included in the classification
process. Second, it does not deliver a binary classification but
a probability of belonging to a given class. This probability
is related to the accuracy of the classification, the higher it is,
the higher the success rate (and so the closer are the objects
to the canonical classes), so that the accuracy of the classi-
fication can be studied in an objective way. This property is
lacking in most of the existing classification schemes (especially
in the visual techniques).
3.1. Training sample
The SVM method needs a training sample, and all the be-
havior of the learning algorithm depends on how close this
training sample is to the real sample one wants to classify.
For morphological classification, the training sample is typ-
ically built using a visually classified subsample. The prob-
lem is that, usually, visual classifications are performed on the
brightest objects because it is obviously easier, and one would
like to go fainter in automated classifications. This causes
a mismatch in the properties of the galaxies in the training
sample and in the real sample, which can lead to misclassi-
fications. One solution, as shown in Huertas-Company et al.
(2008), is to simulate faint galaxies. In this paper, we decided
not to include any simulations to be able to use the parameters
measured in the SDSS database so as to be consistent in
the way parameters were measured in the training and real
samples. The effects of such (risky) decisions are carefully
studied in sections 4 and 5. We therefore used the Fukugita et al.
(2007) classification as the training sample. In their paper,
they provide a visual classification of 2253 SDSS galaxies
brighter than mr = 16 (compared to the full DR7 sample,
which goes up to mr ∼ 18). Since our goal is to classify
galaxies in 4 main classes (E,S0,Sab,Scd), we group them ac-
cording to their morphological index T (Fukugita et al. 2007,
Table 1): E: T < 1, S0: T = 1, Sab: 2 < T < 4, and Scd:
4 ≤ T < 7 before using them for training the algorithm. We
included irregulars (T = 6) in the Scd class since there are
not enough objects in the local universe (and in particular in
the Fukugita et al. (2007) catalog) to make a separate class
for the training.
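The grouping of the training sample described above can be sketched as a small lookup function. This is only an illustration: the function name is hypothetical, and the inequalities are copied literally from the text, so values of T that fall outside every quoted range (for example T = 1.5) are returned as unassigned rather than guessed.

```python
from typing import Optional


def fukugita_class(t: float) -> Optional[str]:
    """Map a Fukugita et al. (2007) index T to one of the 4 training classes.

    Hypothetical helper: the ranges follow the text literally, and any T
    not covered by a quoted range returns None.
    """
    if t < 1:
        return "E"
    if t == 1:
        return "S0"
    if 2 < t < 4:
        return "Sab"
    if 4 <= t < 7:
        return "Scd"  # irregulars (T = 6) are folded into this class
    return None
```

A training list would then be built by keeping only the galaxies for which the function returns a class.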
3.2. Procedure
SVMs were originally designed to separate 2 classes. Some implementations
add multi-class separation, but their
accuracy is more difficult to assess. To avoid dealing with
multi-class problems, in this paper we proceeded in two steps.
First we separated the sample in two main classes, i.e. early-
type galaxies, which includes ellipticals and S0 galaxies, and
late-type galaxies, which contain all the remaining morpho-
logical types from Sa to Scd/Im. Then we took the whole
sample and classified it again using 2 different training sets
that contain only early-type and late-type galaxies respec-
tively (see figure 1). The probability computed in this second
step can thus be seen as a conditional probability: “probabil-
ity of being S0 or E given that it is an early-type galaxy” and
“probability of being Sab or Scd given that it is a late-type
galaxy”. With this approach we were certain to have a broad
classification in two types (which is enough for many science
applications) with a high success rate, and then a more
detailed one. Each galaxy in the catalog is therefore associated
with 6 probability values, i.e. the probability of being
in the two broad classes and the probability of being in the
4 subclasses. The 4 probabilities of the 4 subclasses can be
computed with Bayes' theorem using the conditional probabilities:
$$P(\mathrm{E}) = P(\mathrm{Early}) \times P(\mathrm{E} \mid \mathrm{Early}) \qquad (1)$$
$$P(\mathrm{S0}) = P(\mathrm{Early}) \times P(\mathrm{S0} \mid \mathrm{Early}) \qquad (2)$$
$$P(\mathrm{Sab}) = P(\mathrm{Late}) \times P(\mathrm{Sab} \mid \mathrm{Late}) \qquad (3)$$
$$P(\mathrm{Scd}) = P(\mathrm{Late}) \times P(\mathrm{Scd} \mid \mathrm{Late}) \qquad (4)$$
We considered in the 4 equations above that
P(Early|E) = P(Early|S0) = P(Late|Sab) = P(Late|Scd) = 1.
Following these equations, we obviously have
P(Early) = P(E) + P(S0) and P(Late) = P(Sab) + P(Scd),
and P(E) + P(S0) + P(Sab) + P(Scd) = 1.
Fig. 1: Schematic view of the procedure used to classify the sample
and the probabilities measured in each step.
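As an illustration of equations (1) to (4), the following minimal sketch combines the output of the two classification steps into the four subclass probabilities. The function and argument names are hypothetical; the sketch assumes, consistently with the normalization stated above, that P(Late) = 1 − P(Early) and that the two conditional probabilities within each broad class are complementary.

```python
def subclass_probabilities(p_early, p_e_given_early, p_sab_given_late):
    """Combine the broad and conditional probabilities (equations 1-4).

    Assumes P(Late) = 1 - P(Early), P(S0|Early) = 1 - P(E|Early) and
    P(Scd|Late) = 1 - P(Sab|Late).
    """
    p_late = 1.0 - p_early
    return {
        "E": p_early * p_e_given_early,
        "S0": p_early * (1.0 - p_e_given_early),
        "Sab": p_late * p_sab_given_late,
        "Scd": p_late * (1.0 - p_sab_given_late),
    }
```

By construction the four returned values sum to one, and the E and S0 values sum to P(Early).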
3.3. Parameters used
The SDSS database contains lots of photometric and spec-
troscopic parameters that are related to the morphological
type of the galaxy and could hence be used for the classi-
fication. One interesting property of SVM is that they are
not degenerate, in the sense that adding extra parameters
does not lead to a decrease in the classification accuracy
(Huertas-Company et al. 2008) even if they do not bring any
extra information. However, the computing time increases
and the parameter space is less well sampled if too many pa-
rameters are included. After several tests, we decided to in-
clude three types of parameters: (1) color (g-r,r-i) k-corrected
with the Blanton et al. (2005) code, (2) shape (isoB/isoA in the
i-band and deVAB_i), and (3) light concentration (R90/R50
in the i-band). For color measurements we use model magnitudes
corrected for Galactic extinction. isoB and isoA are the
isophotal minor and major axes, respectively, and deVAB_i is
the de Vaucouleurs fit b/a. R90 and R50 are the radii containing
90% and 50% of the Petrosian flux, respectively. Adding
more parameters does not significantly change the classifica-
tion and increases the execution time. The decision to include
the color could be discussed, since, as pointed out in the intro-
duction, it is not clear how an early-type or a late-type galaxy
is actually defined. Since our approach is to define classes
as closely as possible to the canonical definition and then
compute distances to them, it makes sense to include color.
Indeed, for an elliptical to be elliptical it should be red; otherwise
it should be called a blue elliptical, which is an exception
to the normal classification. Either way, tests performed
reveal that removing the color from the parameter space does
not significantly change the classification. Fewer than 10% of
the galaxies change their main morphological class. In fig-
ure 2, we show the 4 probabilities as a function of some rep-
resentative parameters used in the classification. We observe
some obvious correlations: i.e. the probability of being ellip-
tical increases with concentration, and redder galaxies have
higher probabilities of being ellipticals. The correlations are
less clear for intermediate classes (S0 and Sab). One impor-
tant conclusion by looking at these plots is that one single pa-
rameter is not enough to select galaxies with high probability
of being in a given class. For instance, it is common to use a
concentration threshold R90/R50 > 2.6 (in the r-band) to se-
lect elliptical galaxies (e.g. Bell et al. 2003; Kauffmann et al.
2003). As shown in the top panel of figure 2 this selection re-
sults in a significant fraction of galaxies with low probabilities
of being elliptical galaxies (as also shown in Bernardi et al.
2010b).
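The five input parameters listed above can be assembled into a feature vector as in the sketch below. The container and field names are hypothetical and do not correspond to SDSS column names; the magnitudes are assumed to be already k-corrected and corrected for Galactic extinction, and the shape and size quantities are assumed to be the i-band values.

```python
from dataclasses import dataclass


@dataclass
class GalaxyPhotometry:
    # Hypothetical container; magnitudes assumed already corrected.
    g: float
    r: float
    i: float
    iso_a: float   # isophotal major axis (i band)
    iso_b: float   # isophotal minor axis (i band)
    dev_ab: float  # de Vaucouleurs fit b/a (i band)
    r90: float     # radius containing 90% of the Petrosian flux
    r50: float     # radius containing 50% of the Petrosian flux


def feature_vector(gal):
    """Return [g-r, r-i, isoB/isoA, deVAB, R90/R50] for one galaxy."""
    return [
        gal.g - gal.r,
        gal.r - gal.i,
        gal.iso_b / gal.iso_a,
        gal.dev_ab,
        gal.r90 / gal.r50,
    ]
```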
4. Robustness
4.1. Accuracy at the faint end
As pointed out in section 3, there is a critical point in our
approach, since the classified sample contains lots of galaxies
fainter than the limiting magnitude of the training sample.
Therefore, it is very important to check that these faint galax-
ies are not systematically misclassified just because they are
not represented in the training. As a first check, we computed
the probability distributions of bright (mg < 16) and faint
galaxies (mg > 16) in figure 3 to check whether faint galaxies
are systematically classified with lower probabilities. As
shown in Huertas-Company et al. (2008), the probability
is a kind of measure of how good the classification is and
how close a given galaxy is to the corresponding associated
class. Low probabilities in all the classes consequently
mean that the galaxy is not close to any of the classes of
the training, which would mean that faint galaxies are not
properly classified because they are not properly sampled
in the training set. We observe in figure 3 that there is no
evident difference between both probability distributions.
A Kolmogorov-Smirnov test gives between 55% and 99%
probability that the 2 distributions are drawn from the same
distribution, so there is no evidence that the 2 distributions
differ. The probability values seem to be
quite independent of the galaxy brightness, at least up to the
magnitude limit of the sample. The algorithm is thus able to
find a clear, closest class even for the faintest objects, which
supports the robustness of the classification.
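The comparison of the bright and faint probability distributions relies on the two-sample Kolmogorov-Smirnov statistic, i.e. the largest distance between the two empirical cumulative distributions. A minimal sketch of that statistic is given below; the function name is hypothetical, and the conversion to a probability is deliberately left out (a statistics library such as `scipy.stats.ks_2samp` would normally be used for that step).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The difference of two step functions only changes at data points,
    # so it is enough to evaluate it there.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Here the two samples would be, for example, the P(E) values of the bright and of the faint galaxies.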
As a second check, we looked at some of the images of
the faint end of the sample (Fig. 4). We confirm that high-
probability values for a given morphological class still corre-
spond to galaxies that closely look like galaxies in this given
class independently of the magnitude. It therefore seems that
the classification is robust even for the faintest objects in the
sample and that no major misclassifications are evident. In
section 5 we perform a detailed comparison with a visual classification
of faint objects.
Fig. 2: Distribution of the main parameters used (concentration (r90/r50), axis ratio (b/a), and color (g-r)) as a function of probability. All
parameters are measured in the i band (see text).
4.2. Dependence on the training set
Another important point that should be studied is the effect
of changes in the training set on the final classification. In
fact, a robust classification should not change significantly if
some elements are removed from the training sample. On the
contrary, if removing some elements leads to a completely
different classification, it means that the parameter space is
not properly sampled and therefore the classification is very
unstable. To check this point, we performed 10 different clas-
sifications with slightly different training sets. The samples
were generated by randomly selecting a subset of 500 galax-
ies from the Fukugita et al. (2007) sample. We then compared
the different classifications in terms of probability. These 10
runs on the full data set take only a few minutes on a normal
laptop. The average scatter over the 10 runs of the probability
of being early-type (or late-type) is 12%. In other words,
when one changes the training set, the probability for a given
galaxy changes ∼ 12% on average. This 12% scatter is compatible
with, and even smaller than, the typical scatter found when several
people perform visual classifications on the same sample
(e.g. Postman et al. 2005; Fukugita et al. 2007).
Fig. 4: Examples of galaxies with their computed probability values.
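The stability test described in this section can be summarized numerically as in the sketch below. The text does not specify how the scatter is defined, so the sketch assumes it is the standard deviation of each galaxy's probability over the runs, averaged over all galaxies; the function name and input layout are hypothetical.

```python
from statistics import mean, pstdev


def average_scatter(runs):
    """Mean over galaxies of the per-galaxy scatter across runs.

    `runs` is a list of runs; each run is a list with one probability
    per galaxy, in the same galaxy order. The scatter is taken to be
    the population standard deviation (an assumption).
    """
    per_galaxy = zip(*runs)  # regroup values galaxy by galaxy
    return mean(pstdev(values) for values in per_galaxy)
```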
4.3. Uncertain objects
Another way of assessing the robustness of the classification
is by measuring the fraction of objects whose classification
is uncertain. If this fraction appears to be too high it would
imply that the algorithm is not working for a large fraction of
the sample. We define uncertain objects as those for which
the difference between the maximum and the minimum prob-
ability value is less than 0.15; i.e. the four probabilities are in
a range less than 0.15, so the galaxy does not clearly fit in any
of the four morphological classes.
There are 3013 objects satisfying this condition, 0.4% of
the whole sample. The vast majority of the objects are there-
fore close to one (or two) morphological classes and very few
are in an uncertain region. A visual inspection of these galax-
ies (fig. 5) reveals that they are small, compact, and/or dis-
turbed objects, for which the visual morphology is also dif-
ficult to assess. They are not, however, particularly distant or
faint objects since the magnitude and redshift distributions are
compatible with the ones of the full sample.
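The criterion for uncertain objects defined above translates directly into a one-line test on the four probabilities. The function name is hypothetical; the default threshold of 0.15 is the value quoted in the text.

```python
def is_uncertain(p_e, p_s0, p_sab, p_scd, max_range=0.15):
    """True when the four probabilities span a range below max_range."""
    probs = (p_e, p_s0, p_sab, p_scd)
    return max(probs) - min(probs) < max_range
```

For instance, a galaxy with probabilities (0.30, 0.25, 0.25, 0.20) is flagged as uncertain, whereas one with (0.80, 0.15, 0.04, 0.01) is not.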
5. Comparison with visual classifications
5.1. Comparison with Nair & Abraham 2010
One obvious validation check of the classification is to com-
pare it with existing visual classifications. As explained in
previous sections, we used the Fukugita et al. (2007) catalog
for training. It is therefore better to use a different indepen-
dent subsample for testing the accuracy and robustness of the
classification. In a recent paper, Nair & Abraham (2010) pub-
lished a very detailed visual catalog of 14034 galaxies in the
SDSS with mg < 16. Galaxies in this sample are included
in our classification, but most of them have not been used
to build our training sample, so they represent an ideal independent
cross check. Since the Nair & Abraham (2010) classification
is much more detailed than ours, we group their
classes into 4 groups matching the 4 classes we have defined
in this work. We consider as ellipticals the objects with
TType = −5, as S0s those with TType = −2, as Sabs those with 1 ≤ TType ≤ 3, and
finally as Scds those with 5 ≤ TType ≤ 10 (see table 1 of Nair & Abraham
2010 for a definition of the TType index used in their work).
Fig. 3: Probability distributions of bright (mg < 16, red dotted line)
and faint (mg > 16, black solid line) galaxies in the sample. The 4
panels show the 4 computed probabilities as indicated in the x-axis
labels.
Fig. 5: Examples of uncertain classifications as defined in the text.
Figure 6 shows the probability distributions of these 4 groups.
Globally, we observe a good correlation between the proba-
bility values and the visual class. For example, galaxies visu-
ally classified as ellipticals have on average a probability of
∼ 0.8 of being ellipticals and ∼ 0.2 of being S0. The two other
probabilities are almost zero. Traditionally, it is well known
that it is very difficult to separate S0 galaxies by eye. This
is reflected in the probability distributions which are more
uniform than for the pure elliptical class. A galaxy visually
classified as S0 has on average ∼ 0.4 probability of being
S0 but also ∼ 0.32 of being elliptical and 0.2 of being Sab,
which reflects the difficulty of defining the S0 class and the
fact that these galaxies are indeed a transition class in terms
of morphology between the ellipticals and the spirals. A sim-
ilar effect is seen in the Sab population which has on average
a probability of ∼ 0.55 of being Sab but also ∼ 0.15 of being
S0 or Scd. Another interesting measurement is the fraction
of catastrophic classifications, i.e. galaxies whose automated
and visual classes are completely different. We define those
cases as objects for which P(E) > 0.8 and TType > 5 or
P(Scd) > 0.8 and TType = −5, i.e. galaxies that are clearly
elliptical (Scd) for our algorithm and visually classified as Sc
or later (elliptical). There are only 2 objects satisfying these
conditions, and both are in the first case. They are indeed spi-
ral galaxies, so the algorithm is wrong, but both have a large
red bulge, which can probably account for the misclassifica-
tion.
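The grouping of the Nair & Abraham (2010) types and the definition of catastrophic classifications used in this comparison can be written as two short functions. Both names are hypothetical, and the conditions are copied from the text as written; TType values outside the quoted ranges are returned as unassigned.

```python
def nair_abraham_class(ttype):
    """Group a Nair & Abraham (2010) TType into the 4 classes used here."""
    if ttype == -5:
        return "E"
    if ttype == -2:
        return "S0"
    if 1 <= ttype <= 3:
        return "Sab"
    if 5 <= ttype <= 10:
        return "Scd"
    return None  # TType values not used in the comparison


def is_catastrophic(p_e, p_scd, ttype):
    """Automated and visual classes completely different, as defined in the text."""
    return (p_e > 0.8 and ttype > 5) or (p_scd > 0.8 and ttype == -5)
```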
5.2. Galaxy Zoo
Recently, the Galaxy Zoo¹ team (Lintott et al. 2010) has made
publicly available the visual classification of the full DR7 per-
formed through the aggregated efforts of hundreds of thou-
sands of people over the course of many months. This work
is an extraordinary effort (and probably the only way) to visu-
ally classify present and future extremely large surveys. The
main drawback, however, is that it requires plenty of time
(more than 2 years in this case) to collect all the informa-
tion and put all the catalogs in place. It is therefore a very
interesting question to see how our automated classification
behaves compared to this visual classification. Our classifica-
tion is indeed much faster and can be run several times with
different parameters in just a few minutes, but it is not obvi-
ous whether we can reach an accuracy similar to the human
brain. Moreover, this comparison also enables the comparison
for the faint end of the sample (since the GalaxyZoo catalog
contains all galaxies), hence a new evaluation of the effect of
lacking faint objects in the training sample (see section 4).
The classification made in the framework of the GalaxyZoo
is less detailed than a pure visual classification, such as the
one from Nair & Abraham (2010) or Fukugita et al. (2007);
i.e., they basically asked people if the galaxy is elliptical-like
(which should include S0s) or spiral-like (with different
subcategories like clockwise or anti-clockwise rotation), but
without submorphological types. Galaxy Zoo 2² and Hubble
Zoo³ will furnish more detailed classifications in the coming
future but are not publicly available for the moment. The con-
fidence of the classification in the current release is measured
by the fraction of votes received, since each galaxy is clas-
sified by several persons. A galaxy is then flagged as early-
type or spiral-like if the fraction of votes in one of those cat-
egories is greater than 80%. In figure 7 we show the prob-
ability distribution obtained with the galSVM classification
for galaxies flagged as elliptical like (flag ELLIPTICAL = 1)
and spiral like (flag SPIRAL = 1), respectively. We observe an
extremely good correlation between both classifications even
for faint galaxies not necessarily well represented in the train-
ing set as discussed in § 4. Galaxies flagged as ellipticals in
the Galaxy Zoo catalog have a median probability of 0.92 of
being elliptical or S0 and the same for galaxies classified as
spirals. This means that robust classifications in Galaxy Zoo
are also very sure classifications in our catalog; however, the
fraction of galaxies without a clear morphological type (i.e.
the fraction of votes is less than 80% so they lie somewhere
between a pure early-type and a pure late-type galaxy) in the Galaxy
Zoo is relatively high (∼ 60%), so it is interesting to check
where all these remaining galaxies fall.
¹ http://galaxyzoo.org/
² http://zoo2.galaxyzoo.org/
³ http://hubble.galaxyzoo.org/
Fig. 6: Probability distributions of the 4 morphological types considered in this work for 4 visual types (TType) from Nair & Abraham
(2010). Each different panel shows a different visual type. Top left panel shows galaxies with TType = −5 (Ellipticals); top right panel:
TType = −2 (S0s); bottom left panel: 1 ≤ TType ≤ 3 (Sabs) and finally bottom right panel: 5 ≤ TType ≤ 10 (Scd). Red short dashed lines
are P(E), orange dashed dotted lines are P(S0), green dashed three dotted lines are P(Sab), and blue long dashed lines are P(Scd). See text
for details of how these 4 probabilities are computed.
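The flagging rule quoted above for the Galaxy Zoo catalog (a class is assigned only when its vote fraction exceeds 80%) can be sketched as follows. The function name and return labels are hypothetical, and the sketch works on plain vote fractions; any additional processing applied in the released catalog is not reproduced here.

```python
def zoo_flag(frac_elliptical, frac_spiral, threshold=0.8):
    """Assign a Galaxy Zoo style flag from the two vote fractions."""
    if frac_elliptical > threshold:
        return "elliptical"
    if frac_spiral > threshold:
        return "spiral"
    return "uncertain"  # no category reaches the vote threshold
```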
For that purpose, we push the comparison a bit further. As
a matter of fact, since the quality of the classification in the
Galaxy Zoo is measured by the number of votes, another in-
teresting test is to compare our probability measurement to
the fraction of votes. In other words: does the probability
measurement reflect the choice of the majority? We indeed
expect to find a correlation, since certain classifications in
terms of votes should also be galaxies close to the canon-
ical definition, hence objects with high probability values.
This comparison is shown in fig. 8. There is significant scatter,
but we observe 2 clear clouds. Objects with a high fraction
of votes for being elliptical have high probability values and
vice-versa. The same behavior is measured for spirals. When
we average the fraction of votes per probability bin, the cor-
relation becomes clearer, and we find that there is a mono-
tonic relation between the fraction of votes and the probabil-
ity (Fig. 8). This fact confirms that our probability measure-
ment indeed measures the robustness of the classification for
a given object.
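The averaging of the fraction of votes per probability bin can be sketched as below. The function name, the use of equal-width bins over [0, 1], and the default number of bins are assumptions made for the illustration.

```python
def binned_mean(probabilities, vote_fractions, n_bins=10):
    """Mean vote fraction in equal-width probability bins over [0, 1].

    Empty bins are returned as None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, v in zip(probabilities, vote_fractions):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1 goes in the last bin
        sums[k] += v
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

A monotonic relation, as reported above, would show up as a list of bin averages that increases with the bin index.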
In figures 9 and 10 we compare the fraction of votes with
the 4 more detailed probabilities computed in this work
(P(E),P(S0),P(Sab), and P(Scd)). We again find a clear cor-
relation between the number of votes given by people and the
probability computed in an automated way by galSVM.
6. How to use the catalog?
The most important new point of the classification presented
in this work is the measurement of probabilities. Therefore, a
morphological class is not defined as a closed box, but there
is more like a continuous transition from one class to another.
How can this new property be used for selecting a particular
population and studying its properties? If one wants
to compute luminosity or mass functions for a given morphological
type, the optimal way (in terms of optimal estimation)
is to make use of the probability measure as a weight for the
galaxy counts. As shown in Huertas-Company et al. (2009b),
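we can define a random variable Y_k:
$$Y_k = \begin{cases} 0 & \text{with probability } 1 - P_{\mathrm{Type}} \\ 1 & \text{with probability } P_{\mathrm{Type}} \end{cases}$$
This way, the number of galaxies of a given morphological
type in a mass or luminosity bin is simply given by its
mathematical expectation,
$$N_{\mathrm{Type}} = \sum_{N_{\mathrm{obj}}} P_{\mathrm{Type}}, \qquad (5)$$
and the 1σ error is the square root of the variance:
$$\sigma_{\mathrm{Type}} = \sqrt{\sum_{N_{\mathrm{obj}}} P_{\mathrm{Type}} \times (1 - P_{\mathrm{Type}})}. \qquad (6)$$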
All the galaxies contribute to the mass function of a given
morphological type, each weighted by its probability. As a result, a
galaxy that is 95% Scd and 0.5% E will still contribute to the
mass function of elliptical galaxies with a weight of 0.005.
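As a worked illustration of the probability weighting, the expected number of galaxies of a given type in a bin and its 1σ error can be computed as in the sketch below, which treats each galaxy as an independent 0/1 random variable as defined above. The function name is hypothetical.

```python
from math import sqrt


def weighted_count(probabilities):
    """Expected count and 1-sigma error for one type in one bin.

    `probabilities` holds the probability of being of that type for
    every galaxy in the bin; the error is the square root of the
    summed variances p * (1 - p).
    """
    n_type = sum(probabilities)
    sigma = sqrt(sum(p * (1.0 - p) for p in probabilities))
    return n_type, sigma
```

For a bin containing three galaxies with probabilities 1.0, 0.5 and 0.0, the expected count is 1.5 with an error of 0.5, and only the ambiguous galaxy contributes to the error.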
Another approach is to make probability cuts: we
decide that galaxies belong to a given class by applying
a probability threshold. This approach, even if not optimal,
is closer to the classical approach of visual classifications,
in which each galaxy contributes to only one class.
The threshold to apply depends on the application. For example,
it is interesting to determine which threshold
best reproduces the distributions obtained with visual classifications.
In figure 12, we compare the two estimates of the
observed distribution of stellar masses with the ones obtained
from the visual classification of Nair & Abraham (2010). We
use a threshold of P_Type > 0.45 for each type and obtain similar
distributions for all morphological types. Stellar masses
are taken from the Nair & Abraham (2010) catalog, which adopts
the Kauffmann et al. (2003) estimates.
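A probability cut of this kind could be applied as in the sketch below. The function name `select_by_cut` and the array layout (one row per galaxy, columns in the order E, S0, Sab, Scd) are assumptions for illustration; the three example rows are the probabilities of objects 1, 4 and 9 of Table A.1.

```python
import numpy as np

TYPES = ("E", "S0", "Sab", "Scd")

def select_by_cut(probs, threshold=0.45):
    """Return one boolean mask per type, True where P_Type > threshold."""
    probs = np.asarray(probs, dtype=float)
    return {name: probs[:, j] > threshold for j, name in enumerate(TYPES)}

# p(E), p(S0), p(Sab), p(Scd) for objects 1, 4 and 9 of Table A.1.
probs = np.array([
    [0.790, 0.150, 0.032, 0.026],
    [0.011, 0.049, 0.440, 0.498],
    [0.001, 0.003, 0.451, 0.543],
])

masks = select_by_cut(probs)
for name in TYPES:
    print(name, masks[name])
```

With a threshold below 0.5, a galaxy can pass the cut for two types at once (object 9 exceeds 0.45 in both Sab and Scd), and some galaxies pass none. How such cases should be handled depends on the application and is not specified here.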
In figure 11 we show the observed distribution of stel-
lar masses for the whole sample for different morphologi-
cal types using the probability estimator. In this case, stel-
lar masses are computed with the Bell et al. (2003) formula,
adapted from Bernardi et al. (2010b) to account for evolution:
$$ \log_{10}(M_*^{\rm Bell}/M_\odot) = 1.097\,(g-r) - 0.406 - 0.4\,(M_r - 4.67) - 0.19\,z. \qquad (7) $$
We observe the expected trend, i.e., the mass function peaks
at lower values for later morphological types. In the same figure,
we compare with the distribution of masses obtained from the
Galaxy Zoo classification: the distribution for
galaxies flagged as ellipticals (FLAG ELLIPTICAL = 1) is compared with
those obtained using the two estimators described above,
i.e., galaxies having p(E) > 0.5 and probability weighting.
The same is computed for spirals. There is an almost perfect
match with the distributions computed using galSVM, which
again confirms the accuracy of the automated classification
presented in this paper.
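For reference, Eq. (7) translates directly into a small function. This is a sketch only: the function and argument names are hypothetical, and the example colors, magnitude and redshift are made up rather than taken from the catalog.

```python
def log_stellar_mass_bell(g_minus_r, abs_mag_r, z):
    """log10(M*/Msun) from Eq. (7): (g - r) color, r-band absolute magnitude, redshift."""
    return 1.097 * g_minus_r - 0.406 - 0.4 * (abs_mag_r - 4.67) - 0.19 * z

# Hypothetical galaxy: g - r = 0.8, M_r = -21.0, z = 0.1.
log_mass = log_stellar_mass_bell(0.8, -21.0, 0.1)
print(round(log_mass, 4))  # 10.7206
```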
Another common application is to study the color-stellar
mass diagram for different "robust" morphological types.
Again, the probability estimator can be used by computing
the 2D histogram of galaxies in the color-mass plane
weighted with the probabilities. Figure 13 shows the probability
contours in the color-stellar mass plane for the 4 morphological
types. We observe the expected trend: elliptical
and S0 galaxies are redder with less scatter, while Sab and
Scd are bluer. An interesting feature of Sab galaxies (and
of some Scd) is that there seem to be 2 distinct populations:
one red population and another one lying in the so-called
green valley between the blue cloud and the red sequence.
After careful visual inspection of a significant fraction
of these red galaxies, we can confirm that most of
them are in fact edge-on spirals probably reddened by
dust. A small fraction are, however, real passive spirals, as
shown and carefully studied by Masters et al. (2010a,b). Most
of them are classified as Sab galaxies with high probability
(see figure 4). This result confirms that a pure color selection
is not enough to select ellipticals or S0 galaxies, since it is
heavily contaminated by edge-on spirals, as already shown in previous
works (e.g. Schawinski et al. 2007; Lintott et al. 2008;
Bernardi et al. 2010b).
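The probability-weighted 2D histogram mentioned above can be sketched with `numpy.histogram2d`, which accepts per-object weights. The function name, the toy data and the bin edges below are assumptions for illustration only; contours such as those of Figure 13 would then be drawn from the resulting array with any plotting tool.

```python
import numpy as np

def weighted_color_mass_hist(log_mass, color, p_type, mass_edges, color_edges):
    """2D histogram in the color-mass plane, each galaxy weighted by P_Type."""
    hist, _, _ = np.histogram2d(
        log_mass, color, bins=[mass_edges, color_edges], weights=p_type
    )
    return hist  # shape: (number of mass bins, number of color bins)

# Toy (hypothetical) sample of three galaxies.
log_mass = np.array([10.2, 10.3, 11.2])
color = np.array([0.40, 0.90, 0.95])      # e.g. g - r
p_sab = np.array([0.1, 0.8, 0.9])         # probability of being Sab
mass_edges = np.array([10.0, 10.5, 11.0, 11.5])
color_edges = np.array([0.0, 0.6, 1.2])

hist = weighted_color_mass_hist(log_mass, color, p_sab, mass_edges, color_edges)
print(hist)
```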
These plots are just shown here to validate the morpho-
logical classification. A more detailed analysis of the funda-
mental parameters of galaxies is expected to come in future
dedicated papers.
7. Summary and conclusions
We have presented an automated morphological classification
of the SDSS DR7 spectroscopic sample. The algorithm used
is based on SVM, and its most interesting new property
is that it associates a probability value with each galaxy instead
of a single class. This way, the transition between one class
and another is continuous, which should be a better approximation
to nature and to visual classifications. Indeed,
when the brain decides which morphological class is
closest to a given object, it probably also
implicitly measures some parameters and computes distances
in this virtual parameter space to decide which
canonical class is closest to the object it is classifying. In that
sense, even if the list of parameters we measure is reduced
and much more simplistic than what our brain can do (e.g., we
include neither spiral arms nor tidal features, which certainly
play an important role in a visual classification), the spirit
of our approach is closer to a classical visual classification
than other existing automated methods. The results obtained
are in good agreement with existing visual classifications
and are robust even at the faint end of the sample. The main
advantage of this approach is that it is fast (a few minutes
on a regular laptop) and reproducible. Moreover, we obtain
a classification into 4 morphological types instead of the 2
obtained in the Galaxy Zoo. The probability measurements
can be used as a weighting factor for computing statistical
quantities, such as luminosity or mass functions, or as
a selection criterion to ensure that a clean sample of
galaxies is selected. The classification is intended for use in
many different applications and is therefore freely available at
http://gepicom04.obspm.fr/sdss_morphology/Morphology_2010.html
and soon from the CasJobs database. In subsequent papers,
the classification will be used to compare spectroscopic
and morphological classifications and investigate possible
transitions in color-mass space (Sanchez-Almeida et al. in
preparation) and to study the morphological properties of
galaxies around BCGs (Bernardi et al. in preparation).
Fig. 8: Probability values computed with galSVM compared to the fraction of votes for ellipticals (left panel) and spirals (right panel).
Gray scales are scaled to the data; i.e., white is maximum and black is minimum. Lines show the average fraction of votes in 0.05 probability
bins.
Acknowledgements. The authors are grateful to F. Hammer for reading the
manuscript and providing interesting input.
References
Abraham, R., van den Bergh, S., Glazebrook, K., et al. 1996, ApJ
Supplement, 107, 1
Abraham, R. G., Valdes, F., Yee, H. K. C., & van den Bergh, S. 1994, ApJ,
432, 75
Abraham, R. G., van den Bergh, S., & Nair, P. 2003, ApJ, 588, 218
Ball, N. M., Loveday, J., Fukugita, M., et al. 2004, MNRAS, 348, 1038
Bell, E. F., McIntosh, D. H., Katz, N., & Weinberg, M. D. 2003, ApJS, 149,
289
Bernardi, M., Roche, N., Shankar, F., & Sheth, R. K. 2010a, ArXiv e-prints
Bernardi, M., Shankar, F., Hyde, J. B., et al. 2010b, MNRAS, 404, 2087
Blanton, M. R., Schlegel, D. J., Strauss, M. A., et al. 2005, AJ, 129, 2562
Chang, C.-C. & Lin, C.-J. 2001, LIBSVM: a library
for support vector machines, software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Conselice, C. J., Bershady, M. A., & Jangren, A. 2000, ApJ, 529, 886
Conselice, C. J., Rajgor, S., & Myers, R. 2008, MNRAS, 386, 909
de Souza, R. E., Gadotti, D. A., & dos Anjos, S. 2004, ApJS, 153, 411
Delgado-Serrano, R., Hammer, F., Yang, Y. B., et al. 2010, A&A, 509, A78+
Emsellem, E., Cappellari, M., Krajnović, D., et al. 2007, MNRAS, 379, 401
Fukugita, M., Nakamura, O., Okamura, S., et al. 2007, AJ, 134, 579
Hubble, E. P. 1926, ApJ, 64, 321
Huertas-Company, M., Aguerri, J. A. L., Tresse, L., et al. 2010, A&A, 515,
A3+
Huertas-Company, M., Foex, G., Soucail, G., & Pelló, R. 2009a, A&A, 505,
83
Huertas-Company, M., Rouan, D., Tasca, L., Soucail, G., & Le Fèvre, O.
2008, A&A, 478, 971
Huertas-Company, M., Tasca, L., Rouan, D., et al. 2009b, A&A, 497, 743
Kannappan, S. J., Guie, J. M., & Baker, A. J. 2009, ArXiv e-prints
Kauffmann, G., Heckman, T. M., White, S. D. M., et al. 2003, MNRAS, 341,
33
Lintott, C., Schawinski, K., Bamford, S., et al. 2010, ArXiv e-prints
Lintott, C. J., Schawinski, K., Slosar, A., et al. 2008, MNRAS, 389, 1179
Masters, K. L., Mosleh, M., Romer, A. K., et al. 2010a, MNRAS, 405, 783
Masters, K. L., Nichol, R., Bamford, S., et al. 2010b, MNRAS, 404, 792
Méndez-Abreu, J., Aguerri, J. A. L., Corsini, E. M., & Simonneau, E. 2008,
A&A, 487, 555
Nair, P. B. & Abraham, R. G. 2010, ApJS, 186, 427
Neichel, B., Hammer, F., Puech, M., et al. 2008, A&A, 484, 159
Peng, C. Y., Ho, L. C., Impey, C. D., & Rix, H.-W. 2002, AJ, 124, 266
Postman, M., Franx, M., Cross, N. J. G., et al. 2005, ApJ, 623, 721
Prieto, M., Aguerri, J. A. L., Varela, A. M., & Muñoz-Tuñón, C. 2001, A&A,
367, 405
Sánchez Almeida, J., Aguerri, J. A. L., Muñoz-Tuñón, C., & de Vicente, A.
2010, ApJ, 714, 487
Schawinski, K., Lintott, C., Thomas, D., et al. 2009, MNRAS, 396, 818
Schawinski, K., Thomas, D., Sarzi, M., et al. 2007, MNRAS, 382, 1415
Simard, L., Willmer, C. N. A., Vogt, N. P., et al. 2002, ApJS, 142, 1
Trujillo, I., Aguerri, J. A. L., Cepa, J., & Gutiérrez, C. M. 2001, MNRAS,
321, 269
Fig. 9: Comparison between the fraction of votes for a galaxy to be elliptical-like from Galaxy Zoo and the computed probabilities in this
work. Gray scales are scaled to the data; i.e., white is maximum and black is minimum. Solid line shows the average relation. The average is
computed in 0.05 probability bins.
Appendix A: Catalog
Fig. 10: Comparison between the fraction of votes for a galaxy to be spiral-like from Galaxy Zoo and the computed probabilities in this
work. Gray scales are scaled to the data; i.e., white is maximum and black is minimum. Solid line shows the average relation. The average is
computed in 0.05 probability bins.
Fig. 11: Observed distribution of masses for different morphological types computed using the different estimators described in the text.
In the left panel, the whole sample is shown using the probability weighting. Red short-dashed line: ellipticals; yellow dash-dotted
line: S0s; green dash-triple-dotted line: Sabs; blue long-dashed line: Scds. In the right panel, we show galaxies flagged as SPIRAL
and ELLIPTICAL in Galaxy Zoo. Red solid lines are galaxies flagged as ellipticals in Galaxy Zoo (FLAG ELLIPTICAL = 1), the red dashed
line is the distribution obtained using probability weighting, and red dots are galaxies with p(E) > 0.5. Blue solid lines are galaxies flagged
as spirals (FLAG SPIRAL = 1) in Galaxy Zoo, the blue dashed line is the distribution obtained using probability weighting, and blue dots
are galaxies with p(Sab) + p(Scd) > 0.5.
Table A.1: First 10 objects in the catalog. Columns are: id, identification number; SpecObjId, id from the SDSS spectroscopic catalog; RA,
right ascension; DEC, declination; z, redshift from the SDSS database; p(Early), probability of being early-type (E or S0); p(E), probability
of being elliptical; p(S0), probability of being S0; p(Sab), probability of being Sa or Sb; p(Scd), probability of being Sc or Sd; and ask class,
the spectral class from Sánchez Almeida et al. (2010).
id  SpecObjId         RA           DEC         z      p(Early)  p(E)   p(S0)  p(Sab)  p(Scd)  ask class
1   7509409297491...  146.7441406  -0.6522176  0.203  0.941     0.790  0.150  0.032   0.026   2.0
2   7509409298330...  146.6285706  -0.7651463  0.064  0.145     0.023  0.121  0.641   0.213   0.0
3   7509409301266...  146.9341278  -0.670413   0.121  0.969     0.861  0.108  0.016   0.013   0.0
4   7509409301685...  146.9638977  -0.5450143  0.056  0.061     0.011  0.049  0.440   0.498   10.0
5   7509409302105...  146.9635162  -0.7593367  0.09   0.802     0.169  0.632  0.135   0.062   3.0
6   7509409302524...  146.9499969  -0.5922154  0.064  0.120     0.020  0.100  0.762   0.116   10.0
7   7509409303363...  146.8598328  -0.8089029  0.126  0.834     0.038  0.796  0.089   0.076   1.0
8   7509409303783...  146.5927277  -0.7602585  0.064  0.188     0.026  0.161  0.618   0.193   9.0
9   7509409304202...  146.8576965  -0.6628734  0.084  0.004     0.001  0.003  0.451   0.543   9.0
10  7509409304621...  146.727951   -0.5568492  0.089  0.939     0.721  0.217  0.031   0.029   0.0
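As a quick sanity check when reading the catalog, the probability columns of the ten rows above can be tested for internal consistency. The sketch below assumes, from the column description and the printed values rather than from any stated rule, that p(Early) should match p(E) + p(S0) and that the four type probabilities should sum to about 1, both up to rounding in the third decimal. The names `ROWS` and `consistency` are hypothetical.

```python
# Probability columns of Table A.1, in the order
# p(Early), p(E), p(S0), p(Sab), p(Scd).
ROWS = [
    (0.941, 0.790, 0.150, 0.032, 0.026),
    (0.145, 0.023, 0.121, 0.641, 0.213),
    (0.969, 0.861, 0.108, 0.016, 0.013),
    (0.061, 0.011, 0.049, 0.440, 0.498),
    (0.802, 0.169, 0.632, 0.135, 0.062),
    (0.120, 0.020, 0.100, 0.762, 0.116),
    (0.834, 0.038, 0.796, 0.089, 0.076),
    (0.188, 0.026, 0.161, 0.618, 0.193),
    (0.004, 0.001, 0.003, 0.451, 0.543),
    (0.939, 0.721, 0.217, 0.031, 0.029),
]

def consistency(rows):
    """Largest deviations of p(Early) from p(E)+p(S0) and of the type sum from 1."""
    early_dev = max(abs(p_early - (p_e + p_s0)) for p_early, p_e, p_s0, _, _ in rows)
    sum_dev = max(abs(1.0 - (p_e + p_s0 + p_sab + p_scd))
                  for _, p_e, p_s0, p_sab, p_scd in rows)
    return early_dev, sum_dev

max_early_dev, max_sum_dev = consistency(ROWS)
print(max_early_dev, max_sum_dev)
```

For these ten rows both deviations stay at the level of a few thousandths, consistent with values printed to three decimals.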
Fig. 12: Observed distribution of masses for different morphological types in the Nair & Abraham (2010) sample using different estimators
(see text for details). Black solid lines: visual classification; red filled circles: probability cuts; red dashed line: probability estimates. Each
panel shows a visual morphological class from Nair & Abraham (2010), selected as described in the text. For the probability cuts, we use
P > 0.45 for the given type.
Fig. 13: Color-magnitude relation for the 4 morphological types. Contours are computed by probability weighting. For reference, we show
in the 4 panels the best fit to the elliptical red sequence from Bernardi et al. (2010a). Top left panel: ellipticals; top right panel: S0s; bottom
left panel: Sab galaxies; bottom right panel: Scd galaxies.