
Supervised Feature Selection via Dependence Estimation

by Le Song, Alex Smola, Arthur Gretton, Karsten Borgwardt, Justin Bedo
Proceedings of the 24th International Conference on Machine Learning (2007)

Abstract

We introduce a framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximise such dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm. We demonstrate the usefulness of our method on both artificial and real world datasets.


Available from discovery.ucl.ac.uk
Le Song lesong@it.usyd.edu.au
NICTA, Statistical Machine Learning Program, Canberra, ACT 0200, Australia; and University of Sydney
Alex Smola alex.smola@gmail.com
NICTA, Statistical Machine Learning Program, Canberra, ACT 0200, Australia; and ANU
Arthur Gretton arthur.gretton@tuebingen.mpg.de
MPI for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
Karsten Borgwardt borgwardt@dbs.ifi.lmu.de
LMU, Department "Institute for Informatics", Oettingenstr. 67, 80538 München, Germany
Justin Bedo bedo@ieee.org
NICTA, Statistical Machine Learning Program, Canberra, ACT 0200, Australia
1 Introduction
In supervised learning problems, we are typically given $m$ data points $x \in \mathcal{X}$ and their labels $y \in \mathcal{Y}$. The task is to find a functional dependence between $x$ and $y$, $f : x \mapsto y$, subject to certain optimality conditions. Representative tasks include binary classification, multi-class classification, regression and ranking. We often want to reduce the dimension of the data (the number of features) before the actual learning (Guyon & Elisseeff, 2003); a larger number of features can be associated with higher data collection cost, more difficulty in model interpretation, higher computational cost for the classifier, and decreased generalisation ability. It is therefore important to select an informative feature subset.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

The problem of supervised feature selection can be cast as a combinatorial optimisation problem. We have a full set of features, denoted $\mathcal{S}$ (whose elements correspond to the dimensions of the data). We use these features to predict a particular outcome, for instance the presence of cancer: clearly, only a subset $\mathcal{T}$ of features will be relevant. Suppose the relevance of $\mathcal{T}$ to the outcome is quantified by $Q(\mathcal{T})$, and is computed by restricting the data to the dimensions in $\mathcal{T}$. Feature selection can then be formulated as
$$\mathcal{T}_0 = \arg\max_{\mathcal{T} \subseteq \mathcal{S}} Q(\mathcal{T}) \quad \text{subject to } |\mathcal{T}| \le t, \tag{1}$$
where $|\cdot|$ computes the cardinality of a set and $t$ upper bounds the number of selected features. Two important aspects of problem (1) are the choice of the criterion $Q(\mathcal{T})$ and the selection algorithm.
Feature Selection Criterion. The choice of $Q(\mathcal{T})$ should respect the underlying supervised learning tasks: estimate the dependence function $f$ from training data and guarantee that $f$ predicts well on test data. Therefore, good criteria should satisfy two conditions:
I: $Q(\mathcal{T})$ is capable of detecting any desired (nonlinear as well as linear) functional dependence between the data and labels.
II: $Q(\mathcal{T})$ is concentrated with respect to the underlying measure. This guarantees with high probability that the detected functional dependence is preserved in the test data.
arXiv:0704.2668v1 [cs.LG] 20 Apr 2007
While many feature selection criteria have been explored, few take these two conditions explicitly into account. Examples include the leave-one-out error bound of SVM (Weston et al., 2000) and the mutual information (Koller & Sahami, 1996). Although the latter has good theoretical justification, it requires density estimation, which is problematic for high dimensional and continuous variables. We sidestep these problems by employing a mutual-information-like quantity, the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005). HSIC uses kernels for measuring dependence and does not require density estimation. HSIC also has good uniform convergence guarantees. As we show in section 2, HSIC satisfies conditions I and II, required for $Q(\mathcal{T})$.
Feature Selection Algorithm. Finding a global optimum for (1) is in general NP-hard (Weston et al., 2003). Many algorithms transform (1) into a continuous problem by introducing weights on the dimensions (Weston et al., 2000, 2003). These methods perform well for linearly separable problems. For nonlinear problems, however, the optimisation usually becomes non-convex and a local optimum does not necessarily provide good features. Greedy approaches (forward selection and backward elimination) are often used to tackle problem (1) directly. Forward selection tries to increase $Q(\mathcal{T})$ as much as possible for each inclusion of features, and backward elimination tries to achieve this for each deletion of features (Guyon et al., 2002). Although forward selection is computationally more efficient, backward elimination provides better features in general since the features are assessed within the context of all others.
BAHSIC. In principle, HSIC can be employed using either the forwards or backwards strategy, or a mix of strategies. However, in this paper, we will focus on a backward elimination algorithm. Our experiments show that backward elimination outperforms forward selection for HSIC. Backward elimination using HSIC (BAHSIC) is a filter method for feature selection. It selects features independent of a particular classifier. Such decoupling not only facilitates subsequent feature interpretation but also speeds up the computation over wrapper and embedded methods.
Furthermore, BAHSIC is directly applicable to binary, multiclass, and regression problems. Most other feature selection methods are only formulated either for binary classification or regression. The multi-class extension of these methods is usually accomplished using a one-versus-the-rest strategy. Still fewer methods handle classification and regression cases at the same time. BAHSIC, on the other hand, accommodates all these cases in a principled way: by choosing different kernels, BAHSIC also subsumes many existing methods as special cases. The versatility of BAHSIC originates from the generality of HSIC. Therefore, we begin our exposition with an introduction of HSIC.
2 Measures of Dependence
We define $\mathcal{X}$ and $\mathcal{Y}$ broadly as two domains from which we draw samples $(x, y)$: these may be real valued, vector valued, class labels, strings, graphs, and so on. We define a (possibly nonlinear) mapping $\phi(x) \in \mathcal{F}$ from each $x \in \mathcal{X}$ to a feature space $\mathcal{F}$, such that the inner product between the features is given by a kernel function $k(x, x') := \langle \phi(x), \phi(x') \rangle$: $\mathcal{F}$ is called a reproducing kernel Hilbert space (RKHS). Likewise, let $\mathcal{G}$ be a second RKHS on $\mathcal{Y}$ with kernel $l(\cdot, \cdot)$ and feature map $\psi(y)$. We may now define a cross-covariance operator between these feature maps, in accordance with Baker (1973); Fukumizu et al. (2004): this is a linear operator $C_{xy} : \mathcal{G} \mapsto \mathcal{F}$ such that
$$C_{xy} = \mathbf{E}_{xy}\left[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)\right], \tag{2}$$
where $\otimes$ is the tensor product, and $\mu_x = \mathbf{E}_x[\phi(x)]$ and $\mu_y = \mathbf{E}_y[\psi(y)]$ are the mean elements. The square of the Hilbert-Schmidt norm of the cross-covariance operator (HSIC), $\|C_{xy}\|^2_{\mathrm{HS}}$, is then used as our feature selection criterion $Q(\mathcal{T})$. Gretton et al. (2005) show that HSIC can be expressed in terms of kernels as
$$\begin{aligned} \mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) &= \|C_{xy}\|^2_{\mathrm{HS}} \\ &= \mathbf{E}_{xx'yy'}[k(x, x')\, l(y, y')] + \mathbf{E}_{xx'}[k(x, x')]\, \mathbf{E}_{yy'}[l(y, y')] \\ &\quad - 2\, \mathbf{E}_{xy}\left[\mathbf{E}_{x'}[k(x, x')]\, \mathbf{E}_{y'}[l(y, y')]\right], \end{aligned} \tag{3}$$
where $\mathbf{E}_{xx'yy'}$ is the expectation over both $(x, y) \sim \Pr_{xy}$ and an additional pair of variables $(x', y') \sim \Pr_{xy}$ drawn independently according to the same law. Previous work used HSIC to measure independence between two sets of random variables (Gretton et al., 2005). Here we use it to select a subset $\mathcal{T}$ from the first full set of random variables $\mathcal{S}$. We now describe further properties of HSIC which support its use as a feature selection criterion.
Property (I) Gretton et al. (2005, Theorem 4) show that whenever $\mathcal{F}, \mathcal{G}$ are RKHSs with universal kernels $k, l$ on respective compact domains $\mathcal{X}$ and $\mathcal{Y}$ in the sense of Steinwart (2002), then $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) = 0$ if and only if $x$ and $y$ are independent. In terms of feature selection, a universal kernel such as the Gaussian RBF kernel or the Laplace kernel permits HSIC to detect any dependence between $\mathcal{X}$ and $\mathcal{Y}$. HSIC is zero if and only if features and labels are independent.
In fact, non-universal kernels can also be used for HSIC, although they may not guarantee that all dependencies are detected. Different kernels incorporate distinctive prior knowledge into the dependence estimation, and they focus HSIC on dependence of a certain type. For instance, a linear kernel requires HSIC to seek only second order dependence. Clearly HSIC is capable of finding and exploiting dependence of a much more general nature by kernels on graphs, strings, or other discrete domains.
Property (II) Given a sample $Z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of size $m$ drawn from $\Pr_{xy}$, we derive an unbiased estimate of HSIC,
$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z) = \frac{1}{m(m-3)} \left[ \operatorname{tr}(KL) + \frac{\mathbf{1}^\top K \mathbf{1}\, \mathbf{1}^\top L \mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\, \mathbf{1}^\top K L \mathbf{1} \right], \tag{4}$$
where $K$ and $L$ are computed as $K_{ij} = (1 - \delta_{ij})\, k(x_i, x_j)$ and $L_{ij} = (1 - \delta_{ij})\, l(y_i, y_j)$. Note that the diagonal entries of $K$ and $L$ are set to zero. The following theorem, a formal statement that the empirical HSIC is unbiased, is proved in the appendix.
Theorem 1 (HSIC is Unbiased) Let $\mathbf{E}_Z$ denote the expectation taken over $m$ independent observations $(x_i, y_i)$ drawn from $\Pr_{xy}$. Then
$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) = \mathbf{E}_Z\left[\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)\right]. \tag{5}$$
This property contrasts with the mutual information, which can require sophisticated bias correction strategies (e.g. Nemenman et al., 2002).
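As a minimal sketch of how equation (4) can be evaluated, the NumPy code below computes the unbiased estimate from two precomputed kernel matrices. The function names `gaussian_kernel` and `hsic_unbiased` are illustrative choices, not part of the paper or of any released package, and the Gaussian kernel is assumed to use the parametrisation $\exp(-\sigma \|x - x'\|^2)$.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix exp(-sigma * ||x - x'||^2) for the rows of X (assumed form)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sigma * np.maximum(sq_dists, 0.0))

def hsic_unbiased(K, L):
    """Unbiased HSIC estimate of equation (4) for symmetric K, L; requires m >= 4."""
    m = K.shape[0]
    K = K - np.diag(np.diag(K))                   # set diagonal entries to zero
    L = L - np.diag(np.diag(L))
    tr_KL = np.sum(K * L)                         # tr(KL) for symmetric matrices
    sums = K.sum() * L.sum()                      # (1^T K 1)(1^T L 1)
    one_KL_one = K.sum(axis=0) @ L.sum(axis=1)    # 1^T K L 1
    return (tr_KL + sums / ((m - 1) * (m - 2))
            - 2.0 * one_KL_one / (m - 2)) / (m * (m - 3))
```

Because the estimate is unbiased rather than non-negative by construction, it can come out slightly below zero when features and labels are (close to) independent.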
U-Statistics. The estimator in (4) can be alternatively formulated using U-statistics,
$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z) = (m)_4^{-1} \sum_{(i,j,q,r) \in \mathbf{i}_4^m} h(i, j, q, r), \tag{6}$$
where $(m)_n = \frac{m!}{(m-n)!}$ is the Pochhammer coefficient and where $\mathbf{i}_r^m$ denotes the set of all $r$-tuples drawn without replacement from $\{1, \ldots, m\}$. The kernel $h$ of the U-statistic is defined by
$$h(i, j, q, r) = \frac{1}{4!} \sum_{(s,t,u,v)}^{(i,j,q,r)} \left( K_{st} L_{st} + K_{st} L_{uv} - 2 K_{st} L_{su} \right), \tag{7}$$
where the sum in (7) represents all ordered quadruples $(s, t, u, v)$ selected without replacement from $(i, j, q, r)$.
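The equivalence between the U-statistic form (6)-(7) and the matrix form (4) can be checked numerically on a small sample. The sketch below is an illustration only: averaging $h$ over all 4-tuples is the same as averaging the summand of (7) over all ordered quadruples of distinct indices, which is what the hypothetical helper `hsic_ustat_bruteforce` does. Its cost grows as $m^4$, so it is meant for verification on tiny inputs, not for use in feature selection.

```python
import itertools
import numpy as np

def hsic_unbiased(K, L):
    """Matrix form of equation (4) for symmetric K, L with m >= 4."""
    m = K.shape[0]
    K = K - np.diag(np.diag(K))
    L = L - np.diag(np.diag(L))
    return (np.sum(K * L)
            + K.sum() * L.sum() / ((m - 1) * (m - 2))
            - 2.0 * (K.sum(axis=0) @ L.sum(axis=1)) / (m - 2)) / (m * (m - 3))

def hsic_ustat_bruteforce(K, L):
    """Average of the summand in (7) over all ordered quadruples of distinct indices."""
    m = K.shape[0]
    total, count = 0.0, 0
    for s, t, u, v in itertools.permutations(range(m), 4):
        total += K[s, t] * L[s, t] + K[s, t] * L[u, v] - 2.0 * K[s, t] * L[s, u]
        count += 1                     # count ends up equal to (m)_4
    return total / count

# Small symmetric matrices standing in for kernel matrices (arbitrary test data).
rng = np.random.default_rng(1)
A, B = rng.random((6, 6)), rng.random((6, 6))
K, L = A + A.T, B + B.T
print(hsic_unbiased(K, L), hsic_ustat_bruteforce(K, L))
```

The two printed values agree up to floating-point error, which is the content of the identity between (4) and (6).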
We now show that $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$ is concentrated. Furthermore, its convergence in probability to $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy})$ occurs with rate $1/\sqrt{m}$, which is a slight improvement over the convergence of the biased estimator by Gretton et al. (2005).
Theorem 2 (HSIC is Concentrated) Assume $k, l$ are bounded almost everywhere by 1, and are non-negative. Then for $m > 1$ and all $\delta > 0$, with probability at least $1 - \delta$ for all $\Pr_{xy}$,
$$\left| \mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z) - \mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy}) \right| \le 8 \sqrt{\log(2/\delta)/m}.$$
By virtue of (6) we see immediately that HSIC is a U-statistic of order 4, where each term is bounded in $[-2, 2]$. Applying Hoeffding's bound as in Gretton et al. (2005) proves the result.
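To get a feel for the scale of the bound in Theorem 2, it can be evaluated directly. The helper names below are hypothetical; the second function simply inverts the bound to give the sample size at which it drops below a target deviation $\epsilon$.

```python
import math

def hsic_deviation_bound(m, delta):
    """Right-hand side of Theorem 2: 8 * sqrt(log(2 / delta) / m)."""
    return 8.0 * math.sqrt(math.log(2.0 / delta) / m)

def samples_for_deviation(eps, delta):
    """Smallest m for which the bound of Theorem 2 is at most eps."""
    return math.ceil(64.0 * math.log(2.0 / delta) / eps ** 2)

print(hsic_deviation_bound(400, 0.05))    # roughly 0.77
print(samples_for_deviation(0.1, 0.05))
```

The bound is distribution-free and therefore conservative: at the sample sizes used in the experiments it mainly conveys the $1/\sqrt{m}$ rate rather than a tight error bar.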
These two theorems imply the empirical HSIC closely reflects its population counterpart. This means the same features should consistently be selected to achieve high dependence if the data are repeatedly drawn from the same distribution.
Asymptotic Normality. It follows from Serfling (1980) that under the assumptions $\mathbf{E}(h^2) < \infty$ and that the data and labels are not independent, the empirical HSIC converges in distribution to a Gaussian random variable with mean $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, \Pr_{xy})$ and variance
$$\sigma^2_{\mathrm{HSIC}} = \frac{16}{m} \left( R - \mathrm{HSIC}^2 \right), \quad \text{where} \tag{8}$$
$$R = \frac{1}{m} \sum_{i=1}^{m} \Big( (m-1)_3^{-1} \sum_{(j,q,r) \in \mathbf{i}_3^m \setminus \{i\}} h(i, j, q, r) \Big)^2,$$
and $\mathbf{i}_r^m \setminus \{i\}$ denotes the set of all $r$-tuples drawn without replacement from $\{1, \ldots, m\} \setminus \{i\}$. The asymptotic normality allows us to formulate statistics for a significance test. This is useful because it may provide an assessment of the dependence between the selected features and the labels.
Simple Computation. Note that $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$ is simple to compute, since only the kernel matrices $K$ and $L$ are needed, and no density estimation is involved. For feature selection, $L$ is fixed through the whole process. It can be precomputed and stored for speedup if needed. Note also that $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, Z)$ does not need any explicit regularisation parameter. This is encapsulated in the choice of the kernels.
3 Feature Selection via HSIC
Having defined our feature selection criterion, we now describe an algorithm that conducts feature selection on the basis of this dependence measure. Using HSIC, we can perform both backward (BAHSIC) and forward (FOHSIC) selection of the features. In particular, when we use a linear kernel on the data (there is no such requirement for the labels), forward selection and backward selection are equivalent: the objective function decomposes into individual coordinates, and thus feature selection can be done without recursion in one go. Although forward selection is computationally more efficient, backward elimination in general yields better features, since the quality of the features is assessed within the context of all other features. Hence we present the backward elimination version of our algorithm here (a forward greedy selection version can be derived similarly).
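The decomposition claimed for the linear data kernel can be made concrete. Since estimator (4) is linear in $K$, and a linear kernel matrix is a sum of one rank-one matrix per feature, the HSIC of any feature subset is the sum of per-feature scores. The sketch below illustrates this; the function names and the random data are assumptions made for the example.

```python
import numpy as np

def hsic_unbiased(K, L):
    """Matrix form of equation (4) for symmetric K, L with m >= 4."""
    m = K.shape[0]
    K = K - np.diag(np.diag(K))
    L = L - np.diag(np.diag(L))
    return (np.sum(K * L)
            + K.sum() * L.sum() / ((m - 1) * (m - 2))
            - 2.0 * (K.sum(axis=0) @ L.sum(axis=1)) / (m - 2)) / (m * (m - 3))

def linear_kernel_feature_scores(X, L):
    """One HSIC score per feature, using the rank-one kernel x_j x_j^T for feature j."""
    return np.array([hsic_unbiased(np.outer(X[:, j], X[:, j]), L)
                     for j in range(X.shape[1])])

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(30))
L = np.outer(y, y)                        # linear kernel on the labels

scores = linear_kernel_feature_scores(X, L)
subset = [0, 2, 3]
joint = hsic_unbiased(X[:, subset] @ X[:, subset].T, L)
print(joint, scores[subset].sum())        # equal up to floating-point error
```

With such additive scores, sorting the features once by score gives the same ranking as either greedy strategy, which is why no recursion is needed in the linear case.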
BAHSIC appends the features from $\mathcal{S}$ to the end of a list $\mathcal{S}^\dagger$ so that the elements towards the end of $\mathcal{S}^\dagger$ have higher relevance to the learning task. The feature selection problem in (1) can be solved by simply taking the last $t$ elements from $\mathcal{S}^\dagger$. Our algorithm produces $\mathcal{S}^\dagger$ recursively, eliminating the least relevant features from $\mathcal{S}$ and adding them to the end of $\mathcal{S}^\dagger$ at each iteration. For convenience, we also denote HSIC as $\mathrm{HSIC}(\sigma, \mathcal{S})$, where $\mathcal{S}$ are the features used in computing the data kernel matrix $K$, and $\sigma$ is the parameter for the data kernel (for instance, this might be the size of a Gaussian kernel $k(x, x') = \exp(-\sigma \|x - x'\|^2)$).
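Algorithm 1 BAHSIC
Input: The full set of features $\mathcal{S}$
Output: An ordered set of features $\mathcal{S}^\dagger$
1: $\mathcal{S}^\dagger \leftarrow \varnothing$
2: repeat
3:   $\sigma \leftarrow \Xi$
4:   $\mathcal{I} \leftarrow \arg\max_{\mathcal{I}} \sum_{j \in \mathcal{I}} \mathrm{HSIC}(\sigma, \mathcal{S} \setminus \{j\}), \quad \mathcal{I} \subset \mathcal{S}$
5:   $\mathcal{S} \leftarrow \mathcal{S} \setminus \mathcal{I}$
6:   $\mathcal{S}^\dagger \leftarrow \mathcal{S}^\dagger \cup \mathcal{I}$
7: until $\mathcal{S} = \varnothing$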
Step 3 of the algorithm denotes a policy for adapting the kernel parameters, e.g. by optimising over the possible parameter choices. In our experiments, we typically normalize each feature separately to zero mean and unit variance, and adapt the parameter for a Gaussian kernel by setting $\sigma$ to $1/(2d)$, where $d = |\mathcal{S}| - 1$. If we have prior knowledge about the type of nonlinearity, we can use a kernel with fixed parameters for BAHSIC. In this case, step 3 can be omitted.
Step 4 of the algorithm is concerned with the selection of a set $\mathcal{I}$ of features to eliminate. While one could choose a single element of $\mathcal{S}$, this would be inefficient when there are a large number of irrelevant features. On the other hand, removing too many features at once risks the loss of relevant features. In our experiments, we found a good compromise between speed and feature quality was to remove 10% of the current features at each iteration.
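Algorithm 1 together with the choices described for steps 3 and 4 can be sketched as follows. This is an illustrative reading, not the authors' released code: the function names are hypothetical, the kernel adaptation uses the $\sigma = 1/(2d)$ heuristic, the features are assumed to be standardised already, and the `frac` argument stands in for the "10% of the current features" rule (with at least one feature removed per iteration).

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix exp(-sigma * ||x - x'||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-sigma * np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

def hsic_unbiased(K, L):
    """Matrix form of equation (4) for symmetric K, L with m >= 4."""
    m = K.shape[0]
    K = K - np.diag(np.diag(K))
    L = L - np.diag(np.diag(L))
    return (np.sum(K * L)
            + K.sum() * L.sum() / ((m - 1) * (m - 2))
            - 2.0 * (K.sum(axis=0) @ L.sum(axis=1)) / (m - 2)) / (m * (m - 3))

def bahsic(X, L, frac=0.1):
    """Return feature indices ordered from least to most relevant (the list S-dagger)."""
    remaining = list(range(X.shape[1]))           # the current set S
    ordered = []                                  # the list S-dagger
    while len(remaining) > 1:
        sigma = 1.0 / (2.0 * (len(remaining) - 1))            # step 3 heuristic
        # Step 4: HSIC of the data with feature j left out, for every j in S.
        scores = np.array([
            hsic_unbiased(
                gaussian_kernel(X[:, [f for f in remaining if f != j]], sigma), L)
            for j in remaining])
        n_drop = max(1, int(frac * len(remaining)))
        # Features whose removal leaves HSIC highest are the least relevant.
        drop = [remaining[p] for p in np.argsort(-scores)[:n_drop]]
        ordered.extend(drop)                                   # step 6
        remaining = [f for f in remaining if f not in drop]    # step 5
    return ordered + remaining
```

Taking the last $t$ entries of the returned list gives the selected subset for problem (1). The label kernel matrix `L` is computed once by the caller and reused in every iteration.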
4 Connections to Other Approaches
We now explore connections to other feature selectors. For binary classification, an alternative criterion for selecting features is to check whether the distributions $\Pr(x \mid y = 1)$ and $\Pr(x \mid y = -1)$ differ. For this purpose one could use Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006). Likewise, one could use Kernel Target Alignment (KTA) (Cristianini et al., 2003) to test directly whether there exists any correlation between data and labels. KTA has been used for feature selection. Formally it is defined as $\operatorname{tr}(KL) / (\|K\| \|L\|)$. For computational convenience the normalisation is often omitted in practice (Neumann et al., 2005), which leaves us with $\operatorname{tr}(KL)$. We discuss this unnormalised variant below.
Let us consider the output kernel $l(y, y') = \rho(y)\rho(y')$, where $\rho(1) = m_+^{-1}$ and $\rho(-1) = -m_-^{-1}$, and $m_+$ and $m_-$ are the numbers of positive and negative samples, respectively. With this kernel choice, we show that MMD and KTA are closely related to HSIC. The following theorem is proved in the appendix.
Theorem 3 (Connection to MMD and KTA) Assume the kernel $k(x, x')$ for the data is bounded and the kernel for the labels is $l(y, y') = \rho(y)\rho(y')$. Then
$$\left| \mathrm{HSIC} - (m-1)^{-2}\, \mathrm{MMD} \right| = O(m^{-1}),$$
$$\left| \mathrm{HSIC} - (m-1)^{-2}\, \mathrm{KTA} \right| = O(m^{-1}).$$
This means selecting features that maximise HSIC also maximises MMD and KTA. Note that in general (multiclass, regression, or generic binary classification) this connection does not hold.
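A small numerical illustration of why this label kernel ties KTA to MMD: with $L = \rho\rho^\top$, the unnormalised alignment $\operatorname{tr}(KL) = \rho^\top K \rho$ is the squared distance between the two class means in feature space, i.e. the plug-in (biased) MMD estimate. The sketch below checks this with a linear data kernel, where the class means can be formed explicitly; the helper names are hypothetical and the check is an illustration, not the proof of Theorem 3.

```python
import numpy as np

def binary_label_weights(y):
    """rho(1) = 1/m_+ and rho(-1) = -1/m_- for labels in {+1, -1}."""
    y = np.asarray(y)
    m_pos, m_neg = np.sum(y == 1), np.sum(y == -1)
    return np.where(y == 1, 1.0 / m_pos, -1.0 / m_neg)

def unnormalised_kta(K, L):
    """tr(KL) for symmetric K and L."""
    return np.sum(K * L)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = np.where(rng.random(40) < 0.4, 1, -1)
X[y == 1] += 0.5                                  # shift the positive class

rho = binary_label_weights(y)
kta = unnormalised_kta(X @ X.T, np.outer(rho, rho))
mean_gap = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
print(kta, mean_gap @ mean_gap)                   # equal up to floating-point error
```

Note also that the entries of `rho` sum to zero, so centring the label kernel leaves it unchanged; this is specific to the $\rho$ weighting and does not carry over to arbitrary label kernels.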
5 Variants of BAHSIC
New variants can be readily derived from BAHSIC by
combining the two building blocks of BAHSIC: a ker-
nel on the data and another one on the labels. Here
we provide three examples using a Gaussian kernel on
the data, while varying the kernel on the labels. This
provides us with feature selectors for three problems:
Binary classification (BIN) We set $m_+^{-1}$ as the label for positive class members, and $-m_-^{-1}$ for negative class members. We then apply a linear kernel.
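Multiclass classification (MUL) We apply a linear kernel on the labels using the label vectors below, as described for a 3-class example. Here $m_i$ is the number of samples in class $i$ and $\mathbf{1}_{m_i}$ denotes a vector of all ones with length $m_i$.
$$Y = \begin{pmatrix} \dfrac{\mathbf{1}_{m_1}}{m_1} & \dfrac{\mathbf{1}_{m_1}}{m_2 - m} & \dfrac{\mathbf{1}_{m_1}}{m_3 - m} \\[2ex] \dfrac{\mathbf{1}_{m_2}}{m_1 - m} & \dfrac{\mathbf{1}_{m_2}}{m_2} & \dfrac{\mathbf{1}_{m_2}}{m_3 - m} \\[2ex] \dfrac{\mathbf{1}_{m_3}}{m_1 - m} & \dfrac{\mathbf{1}_{m_3}}{m_2 - m} & \dfrac{\mathbf{1}_{m_3}}{m_3} \end{pmatrix}_{m \times 3}. \tag{9}$$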
Regression (REG) A Gaussian RBF kernel is also used on the labels. For convenience the kernel width $\sigma$ is fixed as the median distance between points in the sample (Schölkopf & Smola, 2002).
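The three label kernels can be sketched in NumPy as below. The function names are illustrative. For REG the text fixes the width as the median distance but does not spell out how the width enters the kernel, so the form $\exp(-(y - y')^2 / (2w^2))$ used here is an assumption. The MUL construction follows (9) for an arbitrary number of classes (at least two).

```python
import numpy as np

def bin_label_kernel(y):
    """BIN: linear kernel on labels 1/m_+ (positive) and -1/m_- (negative), y in {+1, -1}."""
    y = np.asarray(y)
    m_pos, m_neg = np.sum(y == 1), np.sum(y == -1)
    rho = np.where(y == 1, 1.0 / m_pos, -1.0 / m_neg)
    return np.outer(rho, rho)

def mul_label_kernel(y):
    """MUL: linear kernel on the label matrix Y of equation (9)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    m = y.shape[0]
    Y = np.tile(1.0 / (counts - m), (m, 1))       # off-class entries 1 / (m_j - m)
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0 / counts[j]            # own-class entries 1 / m_j
    return Y @ Y.T

def reg_label_kernel(y):
    """REG: Gaussian RBF kernel on real labels, width = median pairwise distance (assumed form)."""
    y = np.asarray(y, dtype=float)
    dist = np.abs(y[:, None] - y[None, :])
    width = np.median(dist[np.triu_indices(len(y), k=1)])
    return np.exp(-dist ** 2 / (2.0 * width ** 2))
```

In both classification cases the label vectors sum to zero over the sample, so the resulting label kernel matrices are already centred.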
For the above variants a further speedup of BAHSIC is possible by updating the entries of the kernel matrix incrementally, since we are using an RBF kernel. We use the fact that $\|x - x'\|^2 = \sum_j \|x_j - x'_j\|^2$. Hence $\|x - x'\|^2$ needs to be computed only once. Subsequent updates are effected by subtracting $\|x_j - x'_j\|^2$ (the subscript here indexes dimension).
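A minimal sketch of this incremental update, with hypothetical names: the per-feature squared differences are stored once, the full squared-distance matrix is their sum, and removing a feature only requires one subtraction before the exponential is re-applied.

```python
import numpy as np

def per_feature_sq_diffs(X):
    """Array D with D[:, :, j] = (x_ij - x_i'j)^2 for every pair of samples i, i'."""
    return (X[:, None, :] - X[None, :, :]) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
D = per_feature_sq_diffs(X)

sq_dist = D.sum(axis=2)            # ||x - x'||^2 over all features, computed once
sq_dist -= D[:, :, 2]              # drop feature 2 by subtraction only
K = np.exp(-0.5 * sq_dist)         # refreshed Gaussian kernel (sigma = 0.5 chosen arbitrarily)
```

Storing `D` costs memory proportional to $m^2$ times the number of features, so for wide data one may prefer to recompute a single slice `D[:, :, j]` on demand instead of keeping the whole array.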
We will use BIN, MUL and REG as the particular in-
stances of BAHSIC in our experiments. We will refer
to them commonly as BAHSIC since the exact mean-
ing will be clear depending on the datasets encoun-
tered. Furthermore, we also instantiate FOHSIC us-
ing the same kernels as BIN, MUL and REG, and we
adopt the same convention when we refer to it in our
experiments.
6 Experimental Results
We conducted three sets of experiments. The characteristics of the datasets and the aims of the experiments are: (i) artificial datasets illustrating the properties of BAHSIC; (ii) real datasets that compare BAHSIC with other methods; and (iii) a brain computer interface dataset showing that BAHSIC selects meaningful features.
6.1 Artificial datasets
We constructed 3 artificial datasets, as illustrated in Figure 1, to illustrate the difference between BAHSIC variants with linear and nonlinear kernels. Each dataset has 22 dimensions: only the first two dimensions are related to the prediction task and the rest are just Gaussian noise. These datasets are (i) Binary XOR data: samples belonging to the same class have multimodal distributions; (ii) Multiclass data: there are 4 classes but 3 of them are collinear; (iii) Nonlinear regression data: labels are related to the first two dimensions of the data by $y = x_1 \exp(-x_1^2 - x_2^2) + \epsilon$, where $\epsilon$ denotes additive Gaussian noise. We compare BAHSIC to FOHSIC, Pearson's correlation, mutual information (Zaffalon & Hutter, 2002), and RELIEF (RELIEF works only for binary problems). We aim to show that when nonlinear dependencies exist in the data, BAHSIC with nonlinear kernels is very competent in finding them.

Figure 1: Artificial datasets and the performance of different methods when varying the number of observations. Left column, top to bottom: Binary, multiclass, and regression data. Different classes are encoded with different colours. Right column: Median rank (y-axis) of the two relevant features as a function of sample size (x-axis) for the corresponding datasets in the left column. (Blue circle: Pearson's correlation; Green triangle: RELIEF; Magenta downward triangle: mutual information; Black triangle: FOHSIC; Red square: BAHSIC.)

We instantiate the artificial datasets over a range of sample sizes (from 40 to 400), and plot the median rank, produced by various methods, for the first two dimensions of the data. All numbers in Figure 1 are averaged over 10 runs. In all cases, BAHSIC shows good performance. More specifically, we observe:
Binary XOR Both BAHSIC and RELIEF correctly select the first two dimensions of the data even for small sample sizes, while FOHSIC, Pearson's correlation, and mutual information fail. This is because the latter three evaluate the goodness of each feature independently. Hence they are unable to capture nonlinear interaction between features.
Multiclass Data BAHSIC, FOHSIC and mutual information select the correct features irrespective of the size of the sample. Pearson's correlation only works for large sample size. The collinearity of 3 classes provides linear correlation between the data and the labels, but due to the interference of the fourth class such correlation is picked up by Pearson's correlation only for a large sample size.
Nonlinear Regression Data The performance
of Pearson's correlation and mutual information is
slightly better than random. BAHSIC and FOHSIC
quickly converge to the correct answer as the sample
size increases.
In fact, we observe that as the sample size increases, BAHSIC is able to rank the relevant features (the first two dimensions) almost correctly in the first iteration (results not shown). While this does not prove BAHSIC with nonlinear kernels is always better than that with a linear kernel, it illustrates the competence of BAHSIC in detecting nonlinear features. This is obviously useful in real-world situations. The second advantage of BAHSIC is that it is readily applicable to both classification and regression problems, by simply choosing a different kernel on the labels.
6.2 Real world datasets
Algorithms In this experiment, we show that the performance of BAHSIC can be comparable to other state-of-the-art feature selectors, namely SVM Recursive Feature Elimination (RFE) (Guyon et al., 2002), RELIEF (Kira & Rendell, 1992), L0-norm SVM (L0) (Weston et al., 2003), and R2W2 (Weston et al., 2000). We used the implementation of these algorithms as given in the Spider machine learning toolbox, since those were the only publicly available implementations.¹ Furthermore, we also include filter methods, namely FOHSIC, Pearson's correlation (PC), and mutual information (MI), in our comparisons.
Datasets We used various real world datasets taken from the UCI repository,² the Statlib repository,³ the LibSVM website,⁴ and the NIPS feature selection challenge⁵ for comparison. Due to scalability issues in Spider, we produced a balanced random sample of size less than 2000 for datasets with more than 2000 samples.
Experimental Protocol We report the performance of an SVM using a Gaussian kernel on a feature subset of size 5 and 10-fold cross-validation. These 5 features were selected per fold using different methods. Since we are comparing the selected features, we used the same SVM for all methods: a Gaussian kernel with $\sigma$ set as the median distance between points in the sample (Schölkopf & Smola, 2002) and regularization parameter $C = 100$. On classification datasets, we measured the performance using the error rate, and on regression datasets we used the percentage of variance not-explained (also known as $1 - r^2$). The results for binary datasets are summarized in the first part of Table 1. Those for multiclass and regression datasets are reported respectively in the second and the third parts of Table 1.

¹ http://www.kyb.tuebingen.mpg.de/bs/people/spider
² http://www.ics.uci.edu/~mlearn/MLSummary.html
³ http://lib.stat.cmu.edu/datasets/
⁴ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
⁵ http://clopinet.com/isabelle/Projects/NIPS2003/

To provide a concise summary of the performance of various methods on binary datasets, we measured how the methods compare with the best performing one in each dataset in Table 1. We recorded the best absolute performance of all feature selectors as the baseline, and computed the distance of each algorithm to the best possible result. In this context it makes sense to penalize catastrophic failures more than small deviations. In other words, we would like to have a method which is at least almost always very close to the best performing one. Taking the $\ell_2$ distance achieves this effect, by penalizing larger differences more heavily. It is also our goal to choose an algorithm that performs homogeneously well across all datasets. The $\ell_2$ distance scores are listed for the binary datasets in Table 1. In general, the smaller the $\ell_2$ distance, the better the method. In this respect, BAHSIC and FOHSIC have the best performance. We did not produce the $\ell_2$ distance for multiclass and regression datasets, since the limited number of such datasets did not allow us to draw statistically significant conclusions.
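The $\ell_2$ score described above can be recomputed from the mean error rates of the binary datasets in Table 1. The sketch below reads the score as the Euclidean norm, over datasets, of each method's gap to the best method on that dataset; this reading is an assumption, although it appears to agree with the $\ell_2$ row of Table 1 up to the rounding of the tabulated means. The array simply transcribes the means (without the standard errors).

```python
import numpy as np

methods = ["BAHSIC", "FOHSIC", "PC", "MI", "RFE", "RELIEF", "L0", "R2W2"]
# Mean errors (%) for the 21 binary datasets of Table 1, one row per dataset.
errors = np.array([
    [26.3, 37.9, 40.3, 26.7, 33.0, 42.7, 43.4, 44.2],   # covertype
    [12.3, 12.8, 12.3, 13.1, 20.2, 11.7, 35.9, 13.7],   # ionosphere
    [27.9, 25.0, 25.5, 26.9, 21.6, 24.0, 36.5, 32.3],   # sonar
    [14.8, 14.4, 16.7, 15.2, 21.9, 21.9, 30.7, 19.3],   # heart
    [3.8, 3.8, 4.0, 3.5, 3.4, 3.1, 32.7, 3.4],          # breastcancer
    [14.3, 14.3, 14.5, 14.5, 14.8, 14.5, 35.9, 14.5],   # australian
    [22.6, 22.6, 22.8, 21.9, 20.7, 22.3, 45.2, 24.0],   # splice
    [20.8, 20.9, 21.2, 20.4, 21.0, 21.6, 23.3, 23.9],   # svmguide3
    [24.8, 24.4, 18.3, 21.6, 21.3, 24.4, 24.7, 100.0],  # adult
    [19.0, 20.5, 21.9, 19.5, 20.9, 22.4, 25.2, 21.5],   # cleveland
    [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 24.3, 0.3],          # derm
    [13.8, 15.0, 15.0, 15.0, 15.0, 17.5, 16.3, 17.5],   # hepatitis
    [29.9, 29.6, 26.9, 31.9, 34.7, 27.7, 42.6, 36.4],   # musk
    [0.5, 0.5, 0.5, 3.4, 3.0, 0.9, 12.5, 0.8],          # optdigits
    [20.0, 20.0, 18.8, 18.8, 37.5, 26.3, 36.3, 31.3],   # specft
    [5.3, 5.3, 5.3, 6.7, 7.7, 7.2, 16.7, 6.8],          # wdbc
    [1.7, 1.7, 1.7, 1.7, 3.4, 4.2, 25.1, 1.7],          # wine
    [29.2, 29.2, 26.2, 26.2, 27.2, 33.2, 32.0, 24.8],   # german
    [12.4, 13.0, 16.0, 50.0, 42.8, 16.7, 42.7, 100.0],  # gisette
    [22.0, 19.0, 31.0, 45.0, 34.0, 30.0, 46.0, 32.0],   # arcene
    [37.9, 38.0, 38.4, 51.6, 41.5, 38.6, 51.3, 100.0],  # madelon
])

best = errors.min(axis=1, keepdims=True)                 # best method per dataset
l2 = dict(zip(methods, np.sqrt(((errors - best) ** 2).sum(axis=0))))
for name, score in l2.items():
    print(f"{name:7s} {score:6.1f}")
```

Squaring the gaps is what makes a single failed run (recorded as 100.0) dominate a method's score, which is the penalising behaviour the text asks for.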
6.3 Brain-computer interface dataset
In this experiment, we show that BAHSIC selects features that are meaningful in practice: we use BAHSIC to select a frequency band for a brain-computer interface (BCI) data set from the Berlin BCI group (Dornhege et al., 2004). The data contains EEG signals (118 channels, sampled at 100 Hz) from five healthy subjects ('aa', 'al', 'av', 'aw' and 'ay') recorded during two types of motor imaginations. The task is to classify the imagination for individual trials.
Table 1: Classification error (%) or percentage of variance not-explained (%). The best result, and those results not significantly worse than it, are highlighted in bold (one-sided Welch t-test with 95% confidence level). 100.0±0.0: program is not finished in a week or crashed. -: not applicable.
Data          BAHSIC     FOHSIC     PC         MI         RFE        RELIEF     L0         R2W2
covertype     26.3±1.5   37.9±1.7   40.3±1.3   26.7±1.1   33.0±1.9   42.7±0.7   43.4±0.7   44.2±1.7
ionosphere    12.3±1.7   12.8±1.6   12.3±1.5   13.1±1.7   20.2±2.2   11.7±2.0   35.9±0.4   13.7±2.7
sonar         27.9±3.1   25.0±2.3   25.5±2.4   26.9±1.9   21.6±3.4   24.0±2.4   36.5±3.3   32.3±1.8
heart         14.8±2.4   14.4±2.4   16.7±2.4   15.2±2.5   21.9±3.0   21.9±3.4   30.7±2.8   19.3±2.6
breastcancer  3.8±0.4    3.8±0.4    4.0±0.4    3.5±0.5    3.4±0.6    3.1±0.3    32.7±2.3   3.4±0.4
australian    14.3±1.3   14.3±1.3   14.5±1.3   14.5±1.3   14.8±1.2   14.5±1.3   35.9±1.0   14.5±1.3
splice        22.6±1.1   22.6±1.1   22.8±0.9   21.9±1.0   20.7±1.0   22.3±1.0   45.2±1.2   24.0±1.0
svmguide3     20.8±0.6   20.9±0.6   21.2±0.6   20.4±0.7   21.0±0.7   21.6±0.4   23.3±0.3   23.9±0.2
adult         24.8±0.2   24.4±0.6   18.3±1.1   21.6±1.1   21.3±0.9   24.4±0.2   24.7±0.1   100.0±0.0
cleveland     19.0±2.1   20.5±1.9   21.9±1.7   19.5±2.2   20.9±2.1   22.4±2.5   25.2±0.6   21.5±1.3
derm          0.3±0.3    0.3±0.3    0.3±0.3    0.3±0.3    0.3±0.3    0.3±0.3    24.3±2.6   0.3±0.3
hepatitis     13.8±3.5   15.0±2.5   15.0±4.1   15.0±4.1   15.0±2.5   17.5±2.0   16.3±1.9   17.5±2.0
musk          29.9±2.5   29.6±1.8   26.9±2.0   31.9±2.0   34.7±2.5   27.7±1.6   42.6±2.2   36.4±2.4
optdigits     0.5±0.2    0.5±0.2    0.5±0.2    3.4±0.6    3.0±1.6    0.9±0.3    12.5±1.7   0.8±0.3
specft        20.0±2.8   20.0±2.8   18.8±3.4   18.8±3.4   37.5±6.7   26.3±3.5   36.3±4.4   31.3±3.4
wdbc          5.3±0.6    5.3±0.6    5.3±0.7    6.7±0.5    7.7±1.8    7.2±1.0    16.7±2.7   6.8±1.2
wine          1.7±1.1    1.7±1.1    1.7±1.1    1.7±1.1    3.4±1.4    4.2±1.9    25.1±7.2   1.7±1.1
german        29.2±1.9   29.2±1.8   26.2±1.5   26.2±1.7   27.2±2.4   33.2±1.1   32.0±0.0   24.8±1.4
gisette       12.4±1.0   13.0±0.9   16.0±0.7   50.0±0.0   42.8±1.3   16.7±0.6   42.7±0.7   100.0±0.0
arcene        22.0±5.1   19.0±3.1   31.0±3.5   45.0±2.7   34.0±4.5   30.0±3.9   46.0±6.2   32.0±5.5
madelon       37.9±0.8   38.0±0.7   38.4±0.6   51.6±1.0   41.5±0.8   38.6±0.7   51.3±1.1   100.0±0.0
ℓ2            11.2       14.8       19.7       48.6       42.2       25.9       85.0       138.3
satimage      15.8±1.0   17.9±0.8   52.6±1.7   22.7±0.9   18.7±1.3   -          22.1±1.8   -
segment       28.6±1.3   33.9±0.9   22.9±0.5   27.1±1.3   24.5±0.8   -          68.7±7.1   -
vehicle       36.4±1.5   48.7±2.2   42.8±1.4   45.8±2.5   35.7±1.3   -          40.7±1.4   -
svmguide2     22.8±2.7   22.2±2.8   26.4±2.5   27.4±1.6   35.6±1.3   -          34.5±1.7   -
vowel         44.7±2.0   44.7±2.0   48.1±2.0   45.4±2.2   51.9±2.0   -          85.6±1.0   -
usps          43.4±1.3   43.4±1.3   73.7±2.2   67.8±1.8   55.8±2.6   -          67.0±2.2   -
housing       18.5±2.6   18.9±3.6   25.3±2.5   18.9±2.7   -          -          -          -
bodyfat       3.5±2.5    3.5±2.5    3.4±2.5    3.4±2.5    -          -          -          -
abalone       55.1±2.7   55.9±2.9   54.2±3.3   56.5±2.6   -          -          -          -

Table 2: Classification errors (%) on BCI data after selecting a frequency range.
Subject   aa         al        av         aw        ay
CSP       17.5±2.5   3.1±1.2   32.1±2.5   7.3±2.7   6.0±1.6
CSSP      14.9±2.9   2.4±1.3   33.0±2.7   5.4±1.9   6.2±1.5
CSSSP     12.2±2.1   2.2±0.9   31.8±2.8   6.3±1.8   12.7±2.0
BAHSIC    13.7±4.3   1.9±1.3   30.5±3.3   6.1±3.8   9.0±6.0

Figure 2: HSIC, encoded by the colour value for different frequency bands (axes correspond to upper and lower cutoff frequencies). The figures, left to right, top to bottom correspond to subjects 'aa', 'al', 'av', 'aw' and 'ay'.

Our experiment proceeded in 3 steps: (i) A Fast Fourier transformation (FFT) was performed on each channel and the power spectrum was computed. (ii) The power spectra from all channels were averaged to obtain a single spectrum for each trial. (iii) BAHSIC was used to select the top 5 discriminative frequency components based on the power spectrum. The 5 selected frequencies and their 4 nearest neighbours were used to reconstruct the temporal signals (with all other Fourier coefficients eliminated). The result was then passed to a normal CSP method (Dornhege et al., 2004) for feature extraction, and then classified using a linear SVM.
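The spectral steps (i) and (ii), and the band-limited reconstruction that follows the selection in step (iii), can be sketched with NumPy's real FFT. This is an illustration under stated assumptions: the function names are hypothetical, the selected frequency bins are taken as given (in the experiment they come from BAHSIC), and "4 nearest neighbours" is read as two FFT bins on each side of a selected bin. The CSP and SVM stages are not shown.

```python
import numpy as np

def average_power_spectrum(trial):
    """Steps (i)-(ii): per-channel power spectrum, averaged over channels.

    trial has shape (channels, samples); the result has one value per FFT bin.
    """
    spectrum = np.fft.rfft(trial, axis=1)
    return np.mean(np.abs(spectrum) ** 2, axis=0)

def band_limited_reconstruction(trial, selected_bins, n_neighbours=4):
    """Keep the selected FFT bins and their nearest neighbours, zero the rest, invert."""
    n_samples = trial.shape[1]
    spectrum = np.fft.rfft(trial, axis=1)
    half = n_neighbours // 2                      # assumed: neighbours split evenly
    keep = np.zeros(spectrum.shape[1], dtype=bool)
    for b in selected_bins:
        keep[max(0, b - half): b + half + 1] = True
    return np.fft.irfft(np.where(keep, spectrum, 0.0), n=n_samples, axis=1)

# Toy two-channel trial: 2 s at 100 Hz, so bin k corresponds to k / 2 Hz.
t = np.arange(200) / 100.0
slow, fast = np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 30 * t)
trial = np.vstack([slow + fast, 2 * slow + fast])

power = average_power_spectrum(trial)             # feature vector for this trial
filtered = band_limited_reconstruction(trial, selected_bins=[20])   # keeps about 10 Hz
```

In this toy example the reconstruction retains only the 10 Hz component of each channel, which is the effect the frequency selection is meant to have before CSP is applied.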
We compared automatic filtering using BAHSIC to other filtering approaches: normal CSP method with manual filtering (8-40 Hz), the CSSP method (Lemm et al., 2005), and the CSSSP method (Dornhege et al., 2006). All results presented in Table 2 are obtained using 50 × 2-fold cross-validation. Our method is very competitive and obtains the first or second place for 4 of the 5 subjects. While the CSSP and the CSSSP methods are specialised embedded methods (w.r.t. the CSP method) for frequency selection on BCI data, our method is entirely generic: BAHSIC decouples feature selection from CSP.
In Figure 2, we use HSIC to visualise the responsiveness of different frequency bands to motor imagination. The horizontal and the vertical axes in each subfigure represent the lower and upper bounds for a frequency band, respectively. HSIC is computed for each of these bands. Dornhege et al. (2006) report that the μ rhythm (approx. 12 Hz) of EEG is most responsive to motor imagination, and that the β rhythm (approx. 22 Hz) is also responsive. We expect that HSIC will create a strong peak at the μ rhythm and a weaker peak at the β rhythm, and the absence of other responsive frequency components will create block patterns. Both predictions are confirmed in Figure 2. Furthermore, the large area of the red region for subject 'al' indicates good responsiveness of his μ rhythm. This also corresponds well with the lowest classification error obtained for him in Table 2.
7 Conclusion
This paper proposes a backward elimination procedure
for feature selection using the Hilbert-Schmidt
Independence Criterion (HSIC). The idea behind the
resulting algorithm, BAHSIC, is to choose the feature
subset that maximises the dependence between the
data and labels. With this interpretation, BAHSIC
provides a unified feature selection framework for any
form of supervised learning. The absence of bias and
good convergence properties of the empirical HSIC
estimate provide a strong theoretical justification for
using HSIC in this context. Although BAHSIC is a filter
method, it still demonstrates good performance compared
with more specialised methods on both artificial
and real world data. It is also very competitive in
terms of runtime performance.[6]
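
The backward elimination idea summarised above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `hsic_biased` and `backward_elimination` are hypothetical, the kernels are fixed linear kernels chosen purely for brevity, the biased estimate tr KHLH / (m-1)^2 is used as the score, and exactly one feature is removed per step. The published algorithm may differ in all of these details.

```python
import numpy as np

def hsic_biased(K, L):
    """Biased HSIC estimate tr(KHLH) / (m - 1)^2 with centring matrix H."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def backward_elimination(X, y, n_keep):
    """Greedy sketch: repeatedly drop the feature whose removal
    leaves the highest HSIC between the remaining features and y."""
    L = np.outer(y, y)                      # linear kernel on labels (assumption)
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        scores = []
        for f in active:
            rest = [g for g in active if g != f]
            K = X[:, rest] @ X[:, rest].T   # linear kernel on remaining features
            scores.append(hsic_biased(K, L))
        # Remove the feature whose absence hurts dependence the least.
        active.pop(int(np.argmax(scores)))
    return active

# Toy data: feature 0 copies the labels, feature 1 is constant,
# feature 2 is orthogonal to the labels.
y = np.array([1.0, -1.0] * 4)
X = np.column_stack([y, np.ones(8), np.array([1, 1, -1, -1, 1, 1, -1, -1.0])])
selected = backward_elimination(X, y, n_keep=1)
```

On this toy data the only informative feature is column 0, so it is the one that survives elimination.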
Acknowledgments NICTA is funded through the
Australian Government's Backing Australia's Ability
initiative, in part through the ARC. This research was
supported by the Pascal Network (IST-2002-506778).
Appendix
Proof [Theorem 1] Recall that $K_{ii} = L_{ii} = 0$. We
prove the claim by constructing unbiased estimators
for each term in (3). Note that we have three types
of expectations, namely $\mathbf{E}_{xy}\mathbf{E}_{x'y'}$, a partially
decoupled expectation $\mathbf{E}_{xy}\mathbf{E}_{x'}\mathbf{E}_{y'}$, and
$\mathbf{E}_{x}\mathbf{E}_{y}\mathbf{E}_{x'}\mathbf{E}_{y'}$, which
takes all four expectations independently.
If we want to replace the expectations by empirical
averages, we need to take care to avoid using the same
discrete indices more than once for independent random
variables. In other words, when taking expectations
over $r$ independent random variables, we need
$r$-tuples of indices where each index occurs exactly once.
The sets $\mathbf{i}_r^m$ satisfy this property. Their cardinalities
are given by the Pochhammer symbols $(m)_r = m!/(m-r)!$. Jointly
drawn random variables, on the other hand, share the
same index. We have
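\[
\mathbf{E}_{xy}\mathbf{E}_{x'y'}\left[k(x,x')\,l(y,y')\right]
= \mathbf{E}_Z\Big[(m)_2^{-1} \sum_{(i,j)\in\mathbf{i}_2^m} K_{ij}L_{ij}\Big]
= \mathbf{E}_Z\Big[(m)_2^{-1} \operatorname{tr} KL\Big].
\]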
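[6] Code is freely available as part of the Elefant package
at http://elefant.developer.nicta.com.au.
In the case of the expectation over three independent
terms $\mathbf{E}_{xy}\mathbf{E}_{x'}\mathbf{E}_{y'}$ we obtain
\[
\mathbf{E}_Z\Big[(m)_3^{-1} \sum_{(i,j,q)\in\mathbf{i}_3^m} K_{ij}L_{iq}\Big]
= \mathbf{E}_Z\Big[(m)_3^{-1}\left(\mathbf{1}^\top KL\mathbf{1} - \operatorname{tr} KL\right)\Big].
\]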
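For four independent random variables $\mathbf{E}_{x}\mathbf{E}_{y}\mathbf{E}_{x'}\mathbf{E}_{y'}$,
\[
\mathbf{E}_Z\Big[(m)_4^{-1} \sum_{(i,j,q,r)\in\mathbf{i}_4^m} K_{ij}L_{qr}\Big]
= \mathbf{E}_Z\Big[(m)_4^{-1}\left(\mathbf{1}^\top K\mathbf{1}\,\mathbf{1}^\top L\mathbf{1}
- 4\,\mathbf{1}^\top KL\mathbf{1} + 2\operatorname{tr} KL\right)\Big].
\]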
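To obtain an expression for HSIC we only need to take
linear combinations using (3). Collecting terms related
to $\operatorname{tr} KL$, $\mathbf{1}^\top KL\mathbf{1}$, and $\mathbf{1}^\top K\mathbf{1}\,\mathbf{1}^\top L\mathbf{1}$ yields
\[
\mathrm{HSIC}(\mathcal{F},\mathcal{G},\Pr_{xy})
= \frac{1}{m(m-3)}\,\mathbf{E}_Z\Big[\operatorname{tr} KL
+ \frac{\mathbf{1}^\top K\mathbf{1}\,\mathbf{1}^\top L\mathbf{1}}{(m-1)(m-2)}
- \frac{2}{m-2}\,\mathbf{1}^\top KL\mathbf{1}\Big].
\]
This is the expected value of $\mathrm{HSIC}[\mathcal{F},\mathcal{G},Z]$.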
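
As a sanity check on the algebra above, the following Python sketch compares the closed-form expression with a brute-force average over tuples of distinct indices. It is an illustration only: the function names `hsic_unbiased` and `hsic_bruteforce` are hypothetical, the linear kernels and random data are arbitrary choices, and the brute-force version is only feasible for very small m.

```python
import itertools
import numpy as np

def hsic_unbiased(K, L):
    """Closed-form estimate from the proof; diagonals are zeroed first."""
    m = K.shape[0]
    K = K - np.diag(np.diag(K))
    L = L - np.diag(np.diag(L))
    ones = np.ones(m)
    tr_KL = np.sum(K * L)                        # tr(KL) for symmetric K, L
    s_KL = ones @ K @ L @ ones                   # 1^T K L 1
    s_K_s_L = (ones @ K @ ones) * (ones @ L @ ones)
    return (tr_KL + s_K_s_L / ((m - 1) * (m - 2))
            - 2.0 * s_KL / (m - 2)) / (m * (m - 3))

def hsic_bruteforce(K, L):
    """Average each of the three terms over tuples of distinct indices."""
    idx = range(K.shape[0])
    t2 = np.mean([K[i, j] * L[i, j]
                  for i, j in itertools.permutations(idx, 2)])
    t3 = np.mean([K[i, j] * L[i, q]
                  for i, j, q in itertools.permutations(idx, 3)])
    t4 = np.mean([K[i, j] * L[q, r]
                  for i, j, q, r in itertools.permutations(idx, 4)])
    return t2 + t4 - 2.0 * t3

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.normal(size=(6, 1))
K, L = X @ X.T, y @ y.T                          # linear kernels (assumption)
closed_form = hsic_unbiased(K, L)
brute_force = hsic_bruteforce(K, L)
```

The two values agree up to floating-point error, which mirrors the way the proof collects the coefficients of the three matrix expressions.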
Proof [Theorem 3] We first relate a biased estimator
of HSIC to the biased estimator of MMD. The former
is given by
\[
\frac{1}{(m-1)^2} \operatorname{tr} KHLH
\quad\text{where}\quad H = I - m^{-1}\mathbf{1}\mathbf{1}^\top,
\]
and the bias is bounded by $O(m^{-1})$, as shown by
Gretton et al. (2005). An estimator of MMD with bias
$O(m^{-1})$ is
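\[
\mathrm{MMD}[\mathcal{F}, Z]
= \frac{1}{m_+^2} \sum_{i,j}^{m_+} k(x_i, x_j)
+ \frac{1}{m_-^2} \sum_{i,j}^{m_-} k(x_i, x_j)
- \frac{2}{m_+ m_-} \sum_{i}^{m_+} \sum_{j}^{m_-} k(x_i, x_j)
= \operatorname{tr} KL.
\]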
If we choose $l(y, y') = \rho(y)\rho(y')$ with $\rho(1) = m_+^{-1}$
and $\rho(-1) = -m_-^{-1}$, we can see $L\mathbf{1} = 0$. In this case
$\operatorname{tr} KHLH = \operatorname{tr} KL$, which shows that the biased
estimators of MMD and HSIC are identical up to a constant
factor. Since the bias of $\operatorname{tr} KHLH$ is $O(m^{-1})$,
this implies the same bias for the MMD estimate.
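
The identity used in this argument can be checked numerically. The sketch below is an illustration under stated assumptions: the names `biased_mmd`, `label_kernel` and `centred_trace` are hypothetical, the data are random, and a linear kernel on the inputs is used only for brevity.

```python
import numpy as np

def biased_mmd(K, y):
    """Biased MMD estimate from kernel matrix K and labels y in {+1, -1}."""
    pos, neg = (y == 1), (y == -1)
    m_pos, m_neg = pos.sum(), neg.sum()
    return (K[np.ix_(pos, pos)].sum() / m_pos ** 2
            + K[np.ix_(neg, neg)].sum() / m_neg ** 2
            - 2.0 * K[np.ix_(pos, neg)].sum() / (m_pos * m_neg))

def label_kernel(y):
    """L = rho rho^T with rho(+1) = 1/m_+ and rho(-1) = -1/m_-."""
    rho = np.where(y == 1, 1.0 / np.sum(y == 1), -1.0 / np.sum(y == -1))
    return np.outer(rho, rho)

def centred_trace(K, L):
    """tr(KHLH) with the centring matrix H = I - (1/m) 1 1^T."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
y = np.array([1, 1, 1, -1, -1])
K = X @ X.T                       # linear kernel on inputs (assumption)
L = label_kernel(y)

row_sums = L @ np.ones(5)         # should vanish, i.e. L 1 = 0
mmd_value = biased_mmd(K, y)
trace_KL = np.trace(K @ L)
trace_KHLH = centred_trace(K, L)
```

Because the rows of L sum to zero, centring leaves L unchanged, so all three scalar quantities coincide up to floating-point error.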
To see the same result for Kernel Target Alignment,
note that for equal class size the normalisations with
regard to $m_+$ and $m_-$ become irrelevant, which yields
the corresponding MMD term.
References
Baker, C. (1973). Joint measures and cross-covariance
operators. Transactions of the American Mathematical
Society, 186, 273-289.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel,
H.-P., Schölkopf, B., & Smola, A. J. (2006). Integrating
structured biological data by kernel maximum mean
discrepancy. Bioinformatics (ISMB), 22(14), e49-e57.
Cristianini, N., Kandola, J., Elisseeff, A., & Shawe-Taylor,
J. (2003). On optimizing kernel alignment. Tech. rep.,
UC Davis Department of Statistics.
Dornhege, G., Blankertz, B., Curio, G., & Müller, K.
(2004). Boosting bit rates in non-invasive EEG
single-trial classifications by feature combination and
multi-class paradigms. IEEE Trans. Biomed. Eng., 51,
993-1002.
Dornhege, G., Blankertz, B., Krauledat, M., Losch, F.,
Curio, G., & Müller, K. (2006). Optimizing spatio-temporal
filters for improving BCI. In NIPS, vol. 18.
Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004).
Dimensionality reduction for supervised learning with
reproducing kernel Hilbert spaces. JMLR, 5, 73-99.
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B.
(2005). Measuring statistical dependence with
Hilbert-Schmidt norms. In ALT, 63-78.
Guyon, I., & Elisseeff, A. (2003). An introduction to
variable and feature selection. Journal of Machine Learning
Research, 3, 1157-1182.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002).
Gene selection for cancer classification using support
vector machines. Machine Learning, 46, 389-422.
Kira, K., & Rendell, L. (1992). A practical approach to
feature selection. In Proc. 9th Intl. Workshop on Machine
Learning, 249-256.
Koller, D., & Sahami, M. (1996). Toward optimal feature
selection. In ICML, 284-292.
Lemm, S., Blankertz, B., Curio, G., & Müller, K.-R.
(2005). Spatio-spectral filters for improving the
classification of single trial EEG. IEEE Trans. Biomed. Eng.,
52, 1541-1548.
Nemenman, I., Shafee, F., & Bialek, W. (2002). Entropy
and inference, revisited. In NIPS, vol. 14.
Neumann, J., Schnörr, C., & Steidl, G. (2005). Combined
SVM-based feature selection and classification. Machine
Learning, 61, 129-150.
Schölkopf, B., & Smola, A. (2002). Learning with Kernels.
Cambridge, MA: MIT Press.
Serfling, R. (1980). Approximation Theorems of
Mathematical Statistics. New York: Wiley.
Steinwart, I. (2002). On the influence of the kernel on the
consistency of SVMs. JMLR, 2, 67-93.
Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M.
(2003). Use of zero-norm with linear models and
kernel methods. JMLR, 3, 1439-1461.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M.,
Poggio, T., & Vapnik, V. (2000). Feature selection for
SVMs. In NIPS, vol. 13.
Zaffalon, M., & Hutter, M. (2002). Robust feature selection
using distributions of mutual information. In UAI.
