
Microarray Design Using the Hilbert – Schmidt Independence Criterion

by Justin Bedo
Design (2008)

Justin Bedo
The Australian National University,
NICTA, and the University of Melbourne
Abstract. This paper explores the design problem of selecting a small
subset of clones from a large pool for creation of a microarray plate.
A new kernel-based unsupervised feature selection method using the
Hilbert-Schmidt independence criterion (hsic) is presented and evaluated
on three microarray datasets: the Alon colon cancer dataset, the
van 't Veer breast cancer dataset, and a multiclass cancer of unknown
primary dataset. The experiments show that subsets selected by the hsic
resulted in equivalent or better performance than supervised feature
selection, with the added benefit that the subsets are not target specific.
1 Introduction
Feature selection is an important procedure in data mining. The elimination of
features leads to smaller and more interpretable models and can improve
generalisation performance. Supervised methods produce feature subsets tailored
towards the prediction target and are applicable when labels are available. In
contrast, unsupervised methods select features that capture some of the
information contained within the whole dataset without requiring labels; as no
labels are used, the feature selections are not target specific.
The problem of designing a sugarcane microarray plate by choosing a subset
of approximately 7000 clones from an initial pool of 50,000 clones is studied
herein. As the initial pool of clones contains some highly correlated pairs, there
is a preference towards decorrelation. Furthermore, the array must remain as
general as possible and not be tailored towards any specific phenotypes. As
such, this is an unsupervised selection problem.
The Hilbert-Schmidt independence criterion [1] (hsic) is a dependence
measure between two random variables which is closely related to kernel target
alignment [2] and maximum mean discrepancy [3] (mmd). Previous papers [4,5]
used the hsic for supervised feature selection and demonstrated that the method
had good performance and flexibility on several genomics datasets. This paper
presents an unsupervised variant named unsupervised feature selection by the
hsic (ubhsic, pronounced [ˈu.bə-sik]).
Ubhsic was evaluated in several experiments comparing the selection using
various kernels to supervised feature selection. As labels are not available on the
sugarcane dataset, ubhsic was evaluated on three cancer genomics datasets: the
Alon colon cancer dataset [6,7,8,9], the van 't Veer breast cancer dataset [10],
and a multiclass cancer of unknown primary (cup) dataset [11]. The cup dataset
closely resembles the sugarcane problem as it was intended for the development
of a clinical test on a lower resolution platform.
2 Hsic and Ubhsic
The hsic is a quantity that measures the mutual dependence between two
variables. For the task of unsupervised feature selection, the dependence between
subsets of features and the full set of features is measured by the hsic; a subset
with maximum dependence on the full dataset is desired. This section gives an
overview of the hsic and specifies the unsupervised feature selection problem as
a constrained optimisation problem.
Let
$$X := \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}$$
be a finite dataset in matrix form with $x_{ij} \in \mathbb{R}$, where $n$ is the
number of samples and $m$ is the number of features. Each row $x_i$ corresponds
to a sample, and each column $x_j$ corresponds to a feature.
Let $\sigma \in 2^m$ be a subset of features, where $2^m$ denotes the power set of
$\{1, \dots, m\}$, and define $X_\sigma$ as the dataset restricted to only the
features in $\sigma$, i.e., the features with indices not in $\sigma$ are discarded.
By this definition, $X_\sigma$ is a matrix with dimension $n \times |\sigma|$, where
$|\cdot|$ denotes set cardinality. The dependence between the reduced dataset
$X_\sigma$ and the full dataset $X$ is the quantity we wish to maximise. The hsic
measures this dependence through kernel functions [12,13].
A kernel function defines the inner product between two points of a Hilbert
space, and can be considered intuitively as a measure of similarity. Indeed, the
correlation function $\mathrm{cor}(x, x') := \frac{\langle x, x' \rangle}{\|x\| \|x'\|}$,
where $\langle \cdot, \cdot \rangle$ denotes the inner product and
$\|x\| = \sqrt{\langle x, x \rangle}$, is a kernel function used in the experiments
section. Given a kernel function $k$, the kernel matrix is defined
$[K_{ij}]_{1 \le i,j \le n} := k(x_i, x_j)$. The kernel matrix of the full dataset
$X$ is referred to as $K$, and the kernel matrix of the reduced dataset $X_\sigma$
as $K_\sigma$.
An estimator for the hsic using these two kernel matrices [1] is
$$\mathrm{tr}(K_\sigma H K H), \tag{1}$$
where tr is the matrix trace (the sum over the elements of the main diagonal),
$H := \mathrm{Id} - \frac{1}{n}$, Id is the identity matrix, and the subtraction is
element-wise.
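The estimator can be sketched in a few lines of NumPy. The block below is an illustration only: it assumes a linear kernel for concreteness, and the names `centering_matrix`, `linear_kernel_matrix`, `hsic`, `K_sub` and `K_full` are hypothetical labels for the centring matrix, the kernel matrices of the reduced and full datasets, and the trace in (1).

```python
import numpy as np


def centering_matrix(n):
    """H = Id - 1/n, with 1/n subtracted from every element of the identity."""
    return np.eye(n) - np.full((n, n), 1.0 / n)


def linear_kernel_matrix(X):
    """Kernel matrix of inner products between the rows (samples) of X."""
    return X @ X.T


def hsic(K_sub, K_full):
    """Trace estimator tr(K_sub H K_full H) for two n-by-n kernel matrices."""
    H = centering_matrix(K_full.shape[0])
    return np.trace(K_sub @ H @ K_full @ H)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 4))        # toy data: 6 samples, 4 features
subset = [0, 2]                        # hypothetical feature subset
K_full = linear_kernel_matrix(X_toy)
K_sub = linear_kernel_matrix(X_toy[:, subset])
score = hsic(K_sub, K_full)
```

Both kernel matrices are computed over the same samples, so they share the dimension n-by-n regardless of how many features the subset keeps.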
Using this dependence measure, the unsupervised selection task can simply be
stated as
$$\max_{\sigma} \ \mathrm{tr}(K_\sigma H K H) \tag{2}$$
such that
$$|\sigma| = m'$$
for some $1 \le m' < m$. Solving this optimisation equation for a set $\sigma$
gives the ubhsic solution.
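To make the constrained problem concrete, the following sketch solves it by brute force on a toy dataset. It is not a practical method: it enumerates every subset of the requested size, which is only feasible for a handful of features. The function names and the default linear kernel are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def linear_kernel(Z):
    """Kernel matrix of inner products between the rows (samples) of Z."""
    return Z @ Z.T


def exhaustive_ubhsic(X, m_prime, kernel=linear_kernel):
    """Return the size-m_prime feature subset maximising tr(K_sub H K H)."""
    n, m = X.shape
    H = np.eye(n) - 1.0 / n            # identity minus 1/n, element-wise
    M = H @ kernel(X) @ H              # does not depend on the subset

    def objective(subset):
        return np.trace(kernel(X[:, list(subset)]) @ M)

    # Enumerates all C(m, m_prime) subsets, so only usable for tiny m.
    return max(combinations(range(m), m_prime), key=objective)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 6))       # toy data: 10 samples, 6 features
best_subset = exhaustive_ubhsic(X_toy, m_prime=2)
```

The number of candidate subsets grows combinatorially with the number of features, which is why cheaper search strategies are needed on real microarray data.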
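The solution to the optimisation equation is explicit in the linear kernel case
where $K_\sigma := X_\sigma X_\sigma^T$. Let $M := HKH$. The hsic estimator is then
$$\begin{aligned}
\mathrm{tr}(K_\sigma M) &= \mathrm{tr}(X_\sigma X_\sigma^T M) \\
&= \mathrm{tr}(X_\sigma^T M X_\sigma) \\
&= \sum_{j \in \sigma} x_j^T M x_j.
\end{aligned}$$
Thus, in the case of a linear kernel the features are independent and can be
ranked by $x_j^T M x_j$ and greedily selected.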
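The linear-kernel ranking can be sketched as follows, assuming NumPy; the function names are hypothetical. Each feature column receives the quadratic-form score from the derivation above, and the highest-scoring columns are kept.

```python
import numpy as np


def linear_ubhsic_scores(X):
    """Per-feature scores x_j^T M x_j with M = H K H and K = X X^T."""
    n = X.shape[0]
    H = np.eye(n) - 1.0 / n
    M = H @ (X @ X.T) @ H
    # Column j of (M @ X) is M x_j, so the column sums of X * (M @ X)
    # are the quadratic forms x_j^T M x_j.
    return np.sum(X * (M @ X), axis=0)


def select_linear_ubhsic(X, m_prime):
    """Indices of the m_prime highest-scoring features, best first."""
    scores = linear_ubhsic_scores(X)
    return np.argsort(scores)[::-1][:m_prime]


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(8, 5))        # toy data: 8 samples, 5 features
chosen = select_linear_ubhsic(X_toy, m_prime=2)
```

Because the objective is a sum of per-feature terms, no search is needed in this case: sorting the scores once gives the optimal subset for every subset size.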
For other kernels, an analytical solution does not exist and a good subset
must be found through searching. The forward selection and recursive
elimination greedy nested subset strategies [14] can be used to find a good
solution if the number of features is not large. This approach was used for the
supervised variant presented by Song et al. [4,5]. Alternatively, a good solution
can be found using combinatorial optimisation algorithms such as simulated
annealing. For the sugarcane dataset, a good solution to (2) is found using an
annealing algorithm, as nested subset selection is unattractive due to the large
pool of initial features. Efficiently solving the optimisation problem remains an
open problem.
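A simulated annealing search over fixed-size subsets might look like the sketch below. The paper does not describe its annealing algorithm, so every detail here is an assumption: the swap move, the geometric cooling schedule, the parameter defaults, the fixed RBF width `gamma`, and all function names.

```python
import math

import numpy as np


def rbf_kernel_matrix(Z, gamma):
    """RBF kernel matrix exp(-gamma * squared distance) over the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0)
    return np.exp(-gamma * d2)


def anneal_ubhsic(X, m_prime, kernel, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Search for a size-m_prime subset with a large tr(K_sub H K H)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    H = np.eye(n) - 1.0 / n
    M = H @ kernel(X) @ H              # fixed part of the objective

    def objective(subset):
        return np.trace(kernel(X[:, subset]) @ M)

    current = list(rng.choice(m, size=m_prime, replace=False))
    cur_val = objective(current)
    best, best_val = list(current), cur_val
    temp = t0
    for _ in range(steps):
        # Propose swapping one selected feature for one unselected feature.
        proposal = list(current)
        unselected = np.setdiff1d(np.arange(m), current)
        proposal[rng.integers(m_prime)] = rng.choice(unselected)
        new_val = objective(proposal)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if new_val >= cur_val or rng.random() < math.exp((new_val - cur_val) / temp):
            current, cur_val = proposal, new_val
            if cur_val > best_val:
                best, best_val = list(current), cur_val
        temp *= cooling
    return sorted(int(j) for j in best), best_val


rng = np.random.default_rng(1)
X_toy = rng.normal(size=(8, 6))        # toy data: 8 samples, 6 features
kern = lambda Z: rbf_kernel_matrix(Z, gamma=0.5)
subset, value = anneal_ubhsic(X_toy, m_prime=3, kernel=kern, steps=200)
```

The swap move keeps the subset size constant, so the cardinality constraint holds at every step and never needs to be enforced separately.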
3 Results and Discussion
The proposed method was analysed on several cancer genomics datasets using
different kernels. These kernels are defined as follows:
Radial basis function (rbf): $k(x, x') := \exp(-\gamma \|x - x'\|_2^2)$ with $\gamma$ set as the
inverse median of the squared distances $\|x - x'\|_2^2$ between points in the
dataset
Linear: $k(x, x') := \langle x, x' \rangle$
Polynomial: $k(x, x') := (\langle x, x' \rangle + 1)^d$ for $d \in \{2, 3\}$
Variance: $k(x, x') := \frac{\langle x, x' \rangle^2}{\langle x, x \rangle \langle x', x' \rangle}$
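The four kernels can be written as kernel-matrix functions over the rows of a data matrix, as in the NumPy sketch below. The function names are hypothetical, the RBF width is called `gamma` here, and two details are assumptions: the median is taken over distinct pairs of points only, and every sample is assumed to be a nonzero vector so that the variance kernel is defined.

```python
import numpy as np


def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    sq = np.sum(X**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)


def rbf_kernel(X):
    """exp(-gamma * d^2), gamma = inverse median squared distance."""
    d2 = squared_distances(X)
    # Assumption: the median is over distinct pairs (self-distances excluded).
    upper = np.triu_indices_from(d2, k=1)
    gamma = 1.0 / np.median(d2[upper])
    return np.exp(-gamma * d2)


def linear_kernel(X):
    """Inner products between samples."""
    return X @ X.T


def polynomial_kernel(X, d):
    """(inner product + 1) raised to the power d."""
    return (X @ X.T + 1.0) ** d


def variance_kernel(X):
    """Squared inner product divided by the squared norms (squared cosine)."""
    G = X @ X.T
    norms = np.diag(G)                 # <x, x> for every sample
    return G**2 / np.outer(norms, norms)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(7, 5))        # toy data: 7 samples, 5 features
K_rbf = rbf_kernel(X_toy)
K_poly2 = polynomial_kernel(X_toy, d=2)
K_var = variance_kernel(X_toy)
```

Each function returns an n-by-n matrix, so any of them can be plugged into the trace estimator in place of the linear kernel.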
The variance kernel was chosen to produce highly decorrelated selections. The
preference towards decorrelation is indirectly encoded, since
$\langle x, x' \rangle / \sqrt{\langle x, x \rangle \langle x', x' \rangle}$
is the cosine of the angle between the two vectors $x$ and $x'$. Thus, as adding
a feature highly correlated with another already selected feature will not affect
the angle between the vectors as much as a feature orthogonal to all selected
features, one may postulate that the kernel used with ubhsic will produce highly
decorrelated selections.
Three cancer genomics datasets were analysed. The first two are the van 't Veer
breast cancer dataset [10] and a colon cancer dataset [6,7,8,9]. The van 't Veer
dataset consists
of 98 samples, 46 with a distant metastasis and 52 with no metastasis. Each
sample has 5952 dimensions. The colon cancer dataset has 62 samples, 22 normal
and 40 cancerous, and 2000 dimensions per sample. Both datasets are two-class
classification problems.
The final cancer genomics dataset is a cancer of unknown primary (cup)
dataset [11]. This is a multiclass classification dataset where the aim is to develop
a predictor for the site of origin of a tumour from a microarray of a sample. The
dataset consists of 14 classes, 220 samples, and 9630 features. The classes are
not equally represented, with the smallest class containing only 3 samples and the
largest containing 34.
To gauge the utility of feature subsets selected by ubhsic for prediction, the
reduced datasets were evaluated using supervised classification and
generalisation estimation. The performance achievable from the reduced datasets
was also compared to a fully supervised selection approach.
The classification and supervised feature selection algorithm used was a
centroid based classifier and supervised feature selector [15]. This method was
chosen as it is simple, fast, and has performed well on these particular datasets [15].
For the multiclass cup dataset, a one-vs-all architecture [16] was used in
conjunction with the centroid classifier to produce a multiclass classifier. For
generalisation performance estimation, the $\epsilon$-0 bootstrap estimator [17] was used
with 200 repetitions. The area under the roc curve (aroc) [18] was used as
a performance metric for the two-class datasets. A multiclass extension to the
aroc was used [19] for the cup dataset.
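The two-class evaluation loop can be sketched roughly as below. This is an illustration under stated assumptions rather than the procedure of [15] or [17]: the classifier is a plain difference-of-centroids scorer, the bootstrap estimate trains on each resample and tests on the samples left out of it, degenerate resamples are skipped, and all function names are hypothetical. Supervised feature selection is omitted.

```python
import numpy as np


def centroid_scores(X_train, y_train, X_test):
    """Project test samples onto the difference of the two class centroids."""
    w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ w


def auc(scores, y):
    """Area under the roc curve as the fraction of correctly ordered pairs."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def bootstrap_auc(X, y, repetitions=200, seed=0):
    """Mean out-of-sample aroc over bootstrap resamples of the dataset."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(repetitions):
        train = rng.integers(0, n, size=n)          # sample with replacement
        test = np.setdiff1d(np.arange(n), train)    # samples left out
        # Skip resamples where either split lacks one of the two classes.
        if len(np.unique(y[train])) < 2 or len(np.unique(y[test])) < 2:
            continue
        scores = centroid_scores(X[train], y[train], X[test])
        estimates.append(auc(scores, y[test]))
    return float(np.mean(estimates))


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))       # toy data: 40 samples, 5 features
y_toy = np.repeat([0, 1], 20)          # 20 samples per class
X_toy[y_toy == 1] += 10.0              # make the classes well separated
estimate = bootstrap_auc(X_toy, y_toy, repetitions=50)
```

Because the aroc depends only on the ordering of the scores, the centroid scorer needs no bias term or decision threshold.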
Each dataset was analysed by applying ubhsic with the various kernels to
reduce the full dataset. The centroid classifier and supervised feature selector
was then applied to the ubhsic reduced datasets to evaluate the performance.
The same centroid classifier and supervised feature selector was applied to the
full dataset to obtain the performance achievable using supervised selection only
without any ubhsic pre-filtering.
Figure 1 shows the results of pre-filtering using ubhsic down to 50
(Subfigure 1a) and 500 features (Subfigure 1b) on the van 't Veer dataset. With the
reduction to 500 features, the linear, rbf and variance kernels do very well; they
achieve a level of performance equivalent to the full dataset at higher numbers
of features and exceed the performance at lower numbers of features. The two
polynomial kernels initially perform poorly, but after mild supervised feature
selection the performance equals that of the other kernels and the full dataset.
Under aggressive reduction down to 50 features, somewhat surprising results are
obtained; the maximum performance achieved was substantially better than the
full dataset using a polynomial kernel of degree 2 despite operating with only
32 features. Furthermore, the variance kernel achieves very high performance at
the eight-feature operating point. Both are significantly fewer than the original
70 genes proposed for classification by the original paper [10].
Performing the same experiments on the colon cancer dataset yielded the
results in Figure 2. Again, strong performance when using the variance and
rbf kernels is observable in Subfigure 2b; rbf produced very good results after
further supervised filtering down to a few features (4) while the variance kernel
produced very similar results to the full dataset. The linear and polynomial
[Figure 1 plots: AROC (y-axis) against N. Features (x-axis, 1 to 5951) for the
Full, Linear, Polynomial d=2, Polynomial d=3, RBF and Var. curves. Panels:
(a) Aggressive reduction to 50 features; (b) Reduction to 500 features.]
Fig. 1: van 't Veer dataset with centroid classifier and feature selector. Results
are using the $\epsilon$-0 bootstrap with 200 repetitions. Error bars show 95% confidence
interval. Subfigure (a) shows the performance of the dataset reduced to 50
features using the ubhsic procedure and various kernels. Each plot corresponds to
a different kernel, with the purple plot corresponding to the cfs-centroid method
on the entire dataset (i.e., without prefiltering using ubhsic). The 5 plots where
prefiltering using ubhsic was used do not extend above 50 features, and further
supervised filtering using the cfs was applied to determine the maximum
performance achievable from the reduced datasets. Subfigure (b) is similar to
subfigure (a), except with less aggressive ubhsic reduction (reduced to 500
features instead of 50).
kernels do not perform well on this dataset; this is supported by the results
shown in Subfigure 2a where the linear and polynomial kernels again perform
poorly, but the rbf and variance kernels perform well.
Finally, the results of applying the unsupervised feature selection to the cup
dataset are shown in Figure 3. As this dataset is larger (220 samples)
than both the colon and van 't Veer datasets, a less aggressive filtering was
applied. Subfigure 3b shows the performance curves obtained after filtering to 500
features. At 500 features, the variance kernel produces a subset with equivalent
performance to the full dataset. At the aggressive reduction to 100 features, the
performance does not suffer greatly for the variance kernel. The other kernels do
not perform well on this dataset.
Furthermore, the 500 feature subset selected by the variance kernel
outperformed the full dataset at low numbers of features. The performance achieved
below 32 features is greater than the performance at the same operating point
obtained with the full dataset. Given this performance, a satisfactory operating
[Figure 2 plots: AROC (y-axis) against N. Features (x-axis, 1 to 2000) for the
Full, Linear, Polynomial d=2, Polynomial d=3, RBF and Var. curves. Panels:
(a) Aggressive reduction to 50 features; (b) Reduction to 500 features.]
Fig. 2: Colon cancer dataset with centroid classifier and feature selector. $\epsilon$-0
bootstrap with 200 repetitions. Error bars show 95% confidence interval. The
experiment is identical to Figure 1, except with a different dataset.
point at 16 features or even 8 features per class may be chosen, resulting in a
very sparse predictor.
In summary, these results show that unsupervised pre-filtering does not
degrade the classification performance and can actually improve the performance
at few features. The rbf and variance kernels perform very well across both
two-class datasets, with the other kernels not performing as consistently. On the
multiclass dataset, the variance kernel is the only kernel that performed well.
The aggressive feature reduction down to 50 features for the two-class datasets
and 100 features for the cup dataset showed surprisingly good performance,
suggesting that the full datasets contain significant redundancy and can be highly
compressed without significant loss of performance.
3.1 van 't Veer in detail
To gain a better understanding of the relation between features selected by
ubhsic, the feature subsets obtained on the van 't Veer data were visualised.
Subfigure 4a shows the full unfiltered dataset projected down onto the first two
principal components with each sample represented by a number. It is clear
from the projection that sample 10 is an outlier, sitting far away from the other
samples. Excluding this sample and reprojecting the data obtains the embedding
shown in Subfigure 4b. Here one can observe that the samples roughly form two
groups separated mostly by the first principal component (x-axis).
[Figure 3 plots: AROC (y-axis) against N. Features (x-axis, 1 to 9630) for the
Full, Linear, Polynomial d=2, Polynomial d=3, RBF and Var. curves. Panels:
(a) Reduction to 100 features; (b) Reduction to 500 features.]
Fig. 3: cup cancer dataset with centroid classifier and feature selector. $\epsilon$-0
bootstrap with 200 repetitions. Error bars show 95% confidence interval. Number of
features shown is per class, not overall. Experiment details are as in Figure 1.
Subfigure 5a displays a biplot [20] of the dataset filtered down to 100 features
using the linear kernel and ubhsic. In the figure, samples are shown as black
points and features as red vectors. If two feature vectors have a small angle then
they are highly correlated. From the figure the two-group structure observable on
the original projection (Subfigure 4b) is maintained. Furthermore, the selected
features are strongly positioned along the first principal component. This is not
unexpected as a linear kernel is expected to favour the first principal component,
and as features are selected independently it is also expected to select highly
correlated feature sets. Indeed, a selection of 100 features most correlated with
the first principal component yields a subset of features with 77 features in
common with the subset selected by the linear kernel and ubhsic.
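The overlap comparison described above can be sketched as follows on synthetic data. The function names are hypothetical, and two details are assumptions because the text does not state them: features are ranked by absolute Pearson correlation with the first principal component scores, and no feature is constant.

```python
import numpy as np


def top_pc1_correlated(X, k):
    """The k features most correlated (in absolute value) with PC1."""
    Xc = X - X.mean(axis=0)
    # First principal component scores from the SVD of the centred data.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    pc1 = U[:, 0] * s[0]
    # Pearson correlation; pc1 has zero mean because Xc is centred.
    corr = (Xc.T @ pc1) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(pc1))
    return set(np.argsort(np.abs(corr))[::-1][:k].tolist())


def top_linear_ubhsic(X, k):
    """The k features with the largest linear-kernel scores x_j^T M x_j."""
    n = X.shape[0]
    H = np.eye(n) - 1.0 / n
    M = H @ (X @ X.T) @ H
    scores = np.sum(X * (M @ X), axis=0)
    return set(np.argsort(scores)[::-1][:k].tolist())


rng = np.random.default_rng(0)
latent = rng.normal(size=30)           # one dominant hidden factor
loadings = rng.normal(size=12)
# Toy data: 30 samples, 12 features driven by the factor plus noise.
X_toy = np.outer(latent, loadings) + 0.1 * rng.normal(size=(30, 12))
k = 4
by_pc1 = top_pc1_correlated(X_toy, k)
by_ubhsic = top_linear_ubhsic(X_toy, k)
overlap = len(by_pc1 & by_ubhsic)      # number of features in common
```

On real data the two selections need not coincide, since the linear-kernel score also depends on the scale of each feature while correlation does not.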
The biplot produced using the rbf kernel (Figure 6) resembles the linear
kernel results in that the two-group structure is preserved with many features
selected along the first principal component. However, in comparison the features
are more spread out in two fan-like structures, each spanning one of the groups
well, whereas the "fans" formed by the linear kernel are not as spread out and
well aligned with the groups. The rbf kernel is selecting sets with high
cross-correlation; this is evident from the number of feature vectors with small
interior angles.
Running the same analysis using the polynomial filter of degree 2 yields
the results shown in Figure 6. Interestingly, the selected feature subset appears
to have generated an outlier that is clearly visible in Subfigure 6a; removing
this outlier produces a vastly different projection as shown in Subfigure 6b.
In this figure the feature vectors can be observed to have a "radial" pattern,
[Figure 4 plots: samples labelled by number and projected onto PC1 (x-axis)
and PC2 (y-axis). Panels: (a) All samples; (b) Outlier removed.]
Fig. 4: Biplot of samples and features projected onto the first two principal
components using the full van 't Veer dataset. The x-axis is the first principal
component, and the y-axis is the second. The sample marked as 10 in subfigure (a)
is clearly an outlier; removing the outlier and reprojecting the samples produces
the embedding shown in subfigure (b).
indicating the selected features do not have high cross-correlation. Similar results
are obtained with the polynomial kernel of degree 3 (not shown). The indication
here is that polynomial kernels tend to favour feature subsets with lower
cross-correlation than the rbf and linear kernels.
Finally, the variance kernel is shown in Figure 7. Unlike the polynomial
kernel, the variance kernel did not produce any new outliers and resulted in a more
"radial" pattern than the polynomial filter. This indicates that the selected
features are highly decorrelated as postulated previously.
These results indicate the linear and rbf kernels produce subsets with high
cross-correlations; the linear kernel is especially highly cross-correlated and aligned
with the first principal component while the rbf kernel spans the samples well
and is less cross-correlated. The polynomial and variance kernels result
in much more decorrelated results, with the variance kernel producing highly
decorrelated selections. Given the classification performance observed on the
van 't Veer dataset, the rbf and variance kernels are both good choices and
can be selected depending on whether one wishes to obtain whitened data or not.
4 Conclusions
A method for unsupervised feature selection, ubhsic, was presented and
evaluated on several bioinformatics datasets. The results were very promising: on the
[Figure 7 plot: biplot with samples labelled by number and the selected features
labelled by clone identifier (e.g. Contig52320, NM_002634), projected onto
PC1 (x-axis) and PC2 (y-axis).]
Fig. 7: Biplot after filtering the van 't Veer dataset down to 100 features using the
variance kernel. A highly radial pattern is visible, more so than with the polynomial
kernel, with no clear outliers.
9. Huang, T., Kecman, V.: Gene extraction for cancer diagnosis by support vector
machines - an improvement. Artificial Intelligence in Medicine (Jan 2005)
10. van 't Veer, L., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M.,
Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J.,
Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., Friend, S.: Gene
expression profiling predicts clinical outcome of breast cancer. Nature 415(6871)
(Jan 2002) 530-6
11. Tothill, R.W., Kowalczyk, A., Rischin, D., Bousioutas, A., Haviv, I., van Laar,
R.K., Waring, P.M., Zalcberg, J., Ward, R., Biankin, A., Sutherland, R.L.,
Henshall, S.M., Fong, K., Pollack, J.R., Bowtell, D., Holloway, A.J.: An
expression-based site of origin diagnostic method designed for clinical application
to cancer of unknown origin. Cancer Res 65(10) (May 2005) 4031-40
12. Berlinet, A., Thomas-Agnan, C.: Reproducing Kernel Hilbert Spaces in Probability
and Statistics. Springer (2003)
13. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press (2002)
14. Guyon, I.: An introduction to variable and feature selection. JMLR 3 (Oct 2003)
1157-1182
15. Bedo, J., Sanderson, C., Kowalczyk, A.: An efficient alternative to svm based
recursive feature elimination with applications in natural language processing and
bioinformatics. Proceedings of the Australian Joint Conference on Artificial
Intelligence (2006)
16. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. The Journal of
Machine Learning Research (Jan 2004)
17. Efron, B.: How biased is the apparent error rate of a prediction rule? Journal of
the American Statistical Association (Jan 1986)
18. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver
operating characteristic (roc) curve. Radiology 143(1) (Apr 1982) 29-36
19. Hand, D., Till, R.: A simple generalisation of the area under the roc curve for
multiple class classification problems. Machine Learning (Jan 2001)
20. Gabriel, K.: The biplot graphic display of matrices with application to principal
component analysis. Biometrika (Jan 1971)
