Simple SVM based whole-genome segmentation

by Justin Bedo, Geoff Macintyre, Izhak Haviv, Adam Kowalczyk
Nature Precedings (2009)

Abstract. We present a support vector machine (SVM) based framework for DNA segmentation
into binary classes. Two applications are explored: transcription start site prediction
and transcription factor binding prediction. Experiments demonstrate our approach has
significantly better performance than other methods on both tasks.
1. Introduction
Recently, there has been a substantial increase in annotations for the human genome. This
is in part due to the next-generation sequencing revolution that has substantially increased the
amount of data available. Now, high-quality annotations of the genome are available in a quantity
that was previously unheard of and the main challenge is to mine this data for useful information.
Classical approaches such as motif detection and position weight matrices (PWM) have modelled
biological processes through the collection of short sequences and generation of motifs using
alignment. These motif models are then used to predict activity on new genomic sequences.
This paper presents a simple SVM based framework for segmenting and labelling the entire
genome into binary classes. We explore two applications, one in transcription start site (TSS)
prediction and the other in transcription factor binding site (TFBS) prediction. Experiments
show our simple approach outperforms another more complex SVM based predictor for TSS
prediction, and also outperforms the commonly used PWMs for TFBS.
2. Method
Let $\vec{s} \in \{a, g, t, c\}^n$ be the sequence under analysis and let $\vec{l} \in \{-1, 1\}^n$ be a corresponding
label vector. We shall assume the minority class is given the positive label 1. Consider the
consecutive segments $\vec{x}_i \in \{a, g, t, c\}^{\omega}$ and $\vec{y}_i \in \{-1, 1\}^{\omega}$ of fixed length $\omega$. The problem is to
label each segment $\vec{x}_i$ into one of two classes derived from the nucleotide labels $\vec{l}$. In this paper
we shall consider labelling schemes of the form $y_i \triangleq [\sum_{j=1}^{\omega} y_{ij} > \tau]$.
There are three simple schemes arising from such labelling. The first is to take $\tau = 0$. This
corresponds to labelling any segment containing a majority of positive nucleotides as a positive
segment, with the remaining segments forming the negative samples. This scheme would be
suitable for nucleotide labels $\vec{l}$ where the positive labels form contiguous blocks of length
greater than the window size $\omega$.
The second scheme is to take $\tau = -\omega$. This corresponds to labelling all segments containing
at least one positive base as a positive segment. This scheme can be used with both point and
contiguous block labels.
The final scheme is to take $\tau = \omega$, which corresponds to labelling only windows completely
contained within the positive segment as positive (since the sum can never exceed $\omega$, the
inequality has to be read as $\geq$ here, or equivalently the threshold taken just below $\omega$). This
form of labelling may be useful for applications with large positive regions or small window sizes $\omega$.
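The three labelling schemes can be illustrated with a short Python sketch. It is not the authors' code: the function name `label_segments` and the parameter names `omega` and `tau` are hypothetical, the windows are assumed to be non-overlapping, a trailing partial window is assumed to be dropped, and the bracketed condition is assumed to map to the labels +1 and -1. The "completely positive" scheme uses a threshold of `omega - 1` so that the strict inequality admits only windows whose labels are all positive.

```python
def label_segments(labels, omega, tau):
    """Label consecutive, non-overlapping windows of length omega.

    labels: per-nucleotide labels in {-1, +1}.
    Returns +1 for a window whose label sum exceeds tau, else -1.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    segments = []
    for start in range(0, len(labels) - omega + 1, omega):
        window_sum = sum(labels[start:start + omega])
        segments.append(1 if window_sum > tau else -1)
    return segments


# Three windows with label sums -2, 2 and 4.
nucleotide_labels = [-1, -1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1]
omega = 4

majority = label_segments(nucleotide_labels, omega, tau=0)
any_positive = label_segments(nucleotide_labels, omega, tau=-omega)
all_positive = label_segments(nucleotide_labels, omega, tau=omega - 1)

print(majority)      # [-1, 1, 1]
print(any_positive)  # [1, 1, 1]
print(all_positive)  # [-1, -1, 1]
```

The first window holds a single positive base, so it is positive only under the "at least one positive" scheme; the last window is entirely positive and is the only one accepted by the strictest scheme.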
Features for learning were generated from the sequences $\vec{x}_i$ using k-mers, which we define as
the frequency count of all sub-sequences of length $k$, where $k \leq \omega$. We use $\phi : \{a, g, t, c\}^m \to \mathbb{N}^{4^k}$
to denote the map between sequences and the k-mer feature space.
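A minimal sketch of such a k-mer feature map is given below. The names `kmer_index` and `kmer_counts` are hypothetical, the k-mers are assumed to be counted with overlap (every start position in the segment), and any k-mer containing a symbol outside the four-letter alphabet is assumed to be skipped.

```python
from itertools import product

ALPHABET = "agtc"


def kmer_index(k):
    """Assign every possible k-mer over ALPHABET a fixed feature position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_counts(segment, k, index=None):
    """Count overlapping k-mers in segment; returns a list of length 4**k."""
    if index is None:
        index = kmer_index(k)
    counts = [0] * len(index)
    for start in range(len(segment) - k + 1):
        kmer = segment[start:start + k]
        if kmer in index:  # skip k-mers with other symbols (assumption)
            counts[index[kmer]] += 1
    return counts


index = kmer_index(2)
features = kmer_counts("agagt", k=2, index=index)
print(len(features))          # 16
print(features[index["ag"]])  # 2
```

The segment "agagt" contains the 2-mers ag, ga, ag and gt, so the resulting vector has four non-zero counts in total spread over 16 positions.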
Nature Precedings : doi:10.1038/npre.2009.3811.1 : Posted 29 Sep 2009
We used a linear least-squares support vector machine (SVM) [Schölkopf and Smola, 2002] to
classify each segment $\vec{x}_i$ with binary labels $y_i$. The linear prediction function for the ith segment
is thus¹
$$f(\vec{x}_i) \triangleq \langle \phi(\vec{x}_i), \vec{\beta} \rangle$$
where $\vec{\beta} \in \mathbb{R}^{4^k}$. The weights $\vec{\beta}$ are found by minimising the objective
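$$\arg\min_{\vec{\beta}} \Psi(\vec{\beta}) \triangleq \frac{1}{2} \sum_i \max(1 - y_i \langle \phi(\vec{x}_i), \vec{\beta} \rangle, 0)^2 + \frac{\lambda}{2} \|\vec{\beta}\|^2,$$
where $\lambda$ is the regularisation hyperparameter. If we let $X$ denote a matrix where the ith row is
the sample $\phi(\vec{x}_i)$ in feature space and $Y$ denote the vector $[y_i]$, then we can write $\Psi$ in matrix
form as
$$\Psi(\vec{\beta}) \triangleq \frac{1}{2} (X\vec{\beta} - Y)^T I (X\vec{\beta} - Y) + \frac{\lambda}{2} \|\vec{\beta}\|^2,$$
where $I$ is a diagonal matrix with entries $I_{ii} \triangleq [1 - y_i \langle \phi(\vec{x}_i), \vec{\beta} \rangle > 0]$.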
Minimisation of $\Psi$ can be done easily for small $k$ in the primal domain. This consists of
iterating
$$\vec{\beta}_{t+1} \leftarrow (X^T I_t X + \Lambda)^{-1} X^T I_t Y$$
where $I_t$ is the matrix $I$ evaluated at $\vec{\beta}_t$ and $\Lambda$ is a diagonal matrix with entries
$\Lambda_{ii} \triangleq \lambda$. This is a variant of the well-known ridge-regression solution
[Hastie et al., 2001] with the additional $I$ matrix. This is effectively a descent
along the subgradient of $\Psi$.
For large $k$, $\Psi$ can still be minimised using a large-scale SVM learning algorithm such as the
Pegasos algorithm [Shalev-Shwartz et al., 2007]. All experiments in this paper were conducted
using the primal solver described above.
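The primal iteration can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the solver used for the experiments: the names `solve` and `fit_primal`, the zero initialisation, the stopping rule (stop once the set of margin-violating samples no longer changes) and the iteration cap are all assumptions. The diagonal matrix with the indicator entries is applied implicitly by summing only over samples whose margin is below 1.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_primal(X, y, lam, max_iter=50):
    """Iterate beta <- (X^T I_t X + lam * Id)^(-1) X^T I_t y."""
    d = len(X[0])
    beta = [0.0] * d
    active = None
    for _ in range(max_iter):
        margins = [yi * sum(xj * bj for xj, bj in zip(xi, beta))
                   for xi, yi in zip(X, y)]
        new_active = [m < 1.0 for m in margins]  # diagonal of I_t
        if new_active == active:                 # active set is stable
            break
        active = new_active
        rows = [(xi, yi) for xi, yi, a in zip(X, y, active) if a]
        A = [[sum(xi[p] * xi[q] for xi, _ in rows) + (lam if p == q else 0.0)
              for q in range(d)] for p in range(d)]
        b = [sum(xi[p] * yi for xi, yi in rows) for p in range(d)]
        beta = solve(A, b)
    return beta


# Toy data; the last column is the constant feature that absorbs the bias.
X = [[2.0, 1.0], [1.5, 1.0], [-2.0, 1.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
beta = fit_primal(X, y, lam=0.1)
scores = [sum(xj * bj for xj, bj in zip(xi, beta)) for xi in X]
```

With a positive regularisation value the linear system is positive definite, so the elimination never meets a zero pivot. On the toy data the sign of each score agrees with its label.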
To evaluate model performance, two metrics were used. The first is the receiver operating
characteristic (ROC) and the area under the ROC (AUC) [Hanley and McNeil, 1982]. The ROC
is defined as the plot of the true positive rate (TPR) $TP/(TP + FN)$ vs the false positive rate
(FPR) $FP/(FP + TN)$ as the decision threshold is varied². The AUC is the area under this
curve and has been shown to be equivalent to the probability of correctly ordering class pairs
[Hanley and McNeil, 1982]. I.e., the AUC of a hypothesis $f$ can be calculated by
$$a_{roc}(f) = P_{(\vec{x}, y), (\vec{x}', y')}[f(\vec{x}) > f(\vec{x}') \mid y = 1 \;\&\; y' = -1].$$
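The pairwise interpretation of the AUC can be computed directly by enumerating all positive-negative pairs, as in the sketch below. The function name `auc_pairwise` is hypothetical, and counting tied scores as half a correct ordering is an assumption that the formula above does not address. The quadratic loop is meant for illustration only, since it does not scale to whole-genome sample counts.

```python
def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half (assumption)
    return correct / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, -1, 1, -1, -1]
auc_value = auc_pairwise(scores, labels)  # 5 of the 6 pairs are ordered correctly
```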
The second metric we use is the precision-recall curve (PRC). The precision is defined as the
true discovery rate ($TP/(TP + FP)$) and the recall is equivalent to the TPR. Similarly to the
ROC, we take the area under the PRC (APRC) as a general measure of the performance across
all thresholds.
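One way to obtain the PRC and an area under it is sketched below. The text does not state how the area is integrated, so the step-wise sum used here (often called average precision) is an assumption, as are the function names and the simplification that tied scores are treated as distinct thresholds.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs as the threshold is lowered."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def area_under_prc(scores, labels):
    """Step-wise area under the precision-recall curve (assumed rule)."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(scores, labels):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, -1, 1, -1, -1]
points = precision_recall_points(scores, labels)
aprc = area_under_prc(scores, labels)
```

Only the ranks at which a positive sample is recovered contribute to the sum, because recall does not change at the other ranks.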
To reduce the number of features used in the model, we combined the SVM with recursive
feature elimination (RFESVM) [Guyon et al., 2002]. The RFESVM eliminates features by obtaining
the SVM weights using the above procedure and then discarding the feature(s) with
the smallest magnitude $|\beta_i|$. This process is then repeated recursively until a model of the
desired size is obtained. To accelerate the process, the worst 10% of features were discarded at each
step when the model size was above 100 features, and features were discarded individually below that.
To optimise the model size and regularisation parameter $\lambda$, 3-fold cross-validation [Hastie et al., 2001]
was used with a grid search for $\lambda$, and the model with the greatest average APRC was chosen.
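The elimination loop can be sketched as follows. The function `rfe` and its parameters are hypothetical names; the weight-fitting routine is passed in as a callable, and `toy_fit` below is a stand-in used only to make the example runnable, not the SVM solver of the paper. Rounding the 10% batch down to a whole number of features is an assumption, and the cross-validated choice of model size and regularisation parameter is not shown.

```python
import math


def rfe(X, y, fit, target_size, batch_fraction=0.10, batch_threshold=100):
    """Recursive feature elimination.

    fit(X, y) must return one weight per column of X.
    Returns the original column indices of the surviving features.
    """
    kept = list(range(len(X[0])))
    while len(kept) > target_size:
        X_sub = [[row[j] for j in kept] for row in X]
        weights = fit(X_sub, y)
        if len(kept) > batch_threshold:
            n_drop = max(1, math.floor(batch_fraction * len(kept)))
        else:
            n_drop = 1  # discard features one at a time for small models
        n_drop = min(n_drop, len(kept) - target_size)
        ranked = sorted(range(len(kept)), key=lambda p: abs(weights[p]))
        drop = set(ranked[:n_drop])  # smallest-magnitude weights go first
        kept = [j for p, j in enumerate(kept) if p not in drop]
    return kept


def toy_fit(X, y):
    """Stand-in weight estimate: mean of feature times label (not an SVM)."""
    n = len(X)
    return [sum(row[j] * yi for row, yi in zip(X, y)) / n
            for j in range(len(X[0]))]


X = [[1.0, 0.1, 2.0, 0.0],
     [-1.0, 0.1, -2.0, 0.0],
     [1.0, -0.1, 2.0, 0.1],
     [-1.0, -0.1, -2.0, 0.1]]
y = [1, -1, 1, -1]
selected = rfe(X, y, toy_fit, target_size=2)
print(selected)  # [0, 2]
```

Columns 0 and 2 follow the labels while columns 1 and 3 are uninformative, so the two uninformative columns are eliminated first.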
¹ The typical bias term has been absorbed into the weights $\vec{\beta}$ for simplicity. This is a standard trick that
involves adding a constant feature of value 1 to all samples.
² $TP$, $FP$, $TN$ and $FN$ are the number of true positives, false positives, true negatives and false negatives
respectively.