Sign up & Download
Sign in

The poisson margin test for normalization-free significance analysis of NGS data.

by Adam Kowalczyk, Justin Bedo, Thomas Conway, Bryan Beresford-Smith
Journal of computational biology a journal of computational molecular cell biology (2011)

Abstract

The current methods for the determination of the statistical significance of peaks and regions in next generation sequencing (NGS) data require an explicit normalization step to compensate for (global or local) imbalances in the sizes of sequenced and mapped libraries. There are no canonical methods for performing such compensations; hence, a number of different procedures serving this goal in different ways can be found in the literature. Unfortunately, the normalization has a significant impact on the final results. Different methods yield very different numbers of detected "significant peaks" even in the simplest scenario of ChIP-Seq experiments that compare the enrichment in a single sample relative to a matching control. This becomes an even more acute issue in the more general case of the comparison of multiple samples, where a number of arbitrary design choices will be required in the data analysis stage, each option resulting in possibly (significantly) different outcomes. In this article, we investigate a principled statistical procedure that eliminates the need for a normalization step. We outline its basic properties, in particular the scaling upon depth of sequencing. For the sake of illustration and comparison, we report the results of re-analyzing a ChIP-Seq experiment for transcription factor binding site detection. In order to quantify the differences between outcomes, we use a novel method based on the accuracy of in silico prediction by support vector machine (SVM) models trained on part of the genome and tested on the remainder. See Kowalczyk et al. 2009 for supplementary material.

Cite this document (BETA)

Available from www.ncbi.nlm.nih.gov
Page 1
hidden

The poisson margin test for normalization-free significance analysis of NGS data.

The Poisson Margin Test for Normalisation Free
Significance Analysis of NGS Data
Adam Kowalczyk 1,2, Justin Bedo 1,2, Thomas Conway 1,3 & Bryan
Beresford-Smith 1,2
1NICTA, Victoria Research Laboratory, 2Department of Electrical and Electronic
Engineering, 3Department of Computer Science and Software Engineering,
The University of Melbourne, Parkville, VIC 3010, Australia.
Abstract. Motivation: The current methods for the determination
of the statistical significance of peaks and regions in NGS data require
an explicit normalisation step to compensate for (global or local) im-
balances in the sizes of sequenced and mapped libraries. There are no
canonical methods for performing such compensations, hence a number
of different procedures serving this goal in different ways can be found in
the literature. Unfortunately, the normalisation has a significant impact
on the final results. Different methods yield very different numbers of
detected “significant peaks” even in the simplest scenario of ChIP-Seq
experiments which compare the enrichment in a single sample relative to
a matching control. This becomes an even more acute issue in the more
general case of the comparison of multiple samples, where a number of
arbitrary design choices will be required in the data analysis stage, each
option resulting in possibly (significantly) different outcomes.
Results: In this paper we investigate a principled statistical procedure
which eliminates the need for a normalisation step. We outline its basic
properties, in particular the scaling upon depth of sequencing. For the
sake of illustration and comparison we report the results of re-analysing
a ChIP-Seq experiment for transcription factor binding site detection.
In order to quantify the differences between outcomes we use a novel
method based on the accuracy of in silico prediction by SVM-models
trained on part of the genome and tested on the remainder.
Availability: The supplementary material is available at [1].
Contact: Adam.Kowalczyk@nicta.com.au
1 Introduction
Current short read sequencing technology (routinely referred to as Next
Generation Sequencing or NGS) allows for genome wide scans for various
phenomena of interest, such as methylation, transcription factor binding
sites, etc. In order to derive meaningful results from the mapping of short
reads (or tags) to the reference genome, a number of statistical filters
based on the binomial distribution, Poisson distribution and their variants
Page 2
hidden
2have been proposed in the literature [2–5]. These methods are also very
similar to SAGE data analysis [6, 7], which also deals with short sequence
data.
In order to discern a meaningful signal from tags mapped to a refer-
ence genome, a number of biases have to be dealt with and corrected for,
as the signal “is actually the convolution of a number of effects: the den-
sity of mappable bases in a region, the underlying chromatin structure
and the actual signal from transcription factor binding” [2]. The natu-
ral way to mitigate these issues is to introduce control samples, so that
the detected signal is in the form of local enrichment of tag counts with
respect to the control. The closer the preparation and processing of the
control to the target sample, the more reliable the mitigation of biases.
However, experimenters cannot ensure that the target and control
samples are prepared and processed in a completely equivalent manner
and in practice the number of tags derived from two separate sequencing
reaction can differ by a significant factor. This situation becomes endemic
if one attempts to develop local models [2] compensating for local biases
along the reference DNA. In the simplest situation such as ChIP-Seq ex-
perimental detection of binding sites for a transcription factor [2], where
one attempts to detect enrichments in the target sample with respect
to a carefully prepared control, there is a plausible argument for scal-
ing the control sample counts to the level of the target, especially when
considering pre-filtered narrow regions of significant enrichment.
In practice, there are other scenarios where such scaling is less appli-
cable. An example is where one looks for the differential peaks between
two different tissue samples, e.g. differences in methylation between two
cell lineages [8, 5]. Here, even the direction of scaling (local or global) is
not obvious. Moreover, as argued in [5], the common level to which the
sample counts are adjusted has a profound impact on the statistical sig-
nificance of peaks when either Poisson or binomial models are used (see
Section 3.3), thus the number of detected peaks depends significantly on
the choices of scaling strategies.
The situation becomes even more cumbersome for experiments that
involve multiple samples, for example when quantifying the difference
between two cell lineages using pairs of samples collected from multiple
subjects in order to account for patient specific heterogeneity. One possi-
bility is to adjust all counts to a common size across the whole collection.
As we have noted, this common size impacts on the number of “significant
peaks” as in practice scaling could be by a factor of 2 or more with the
resulting variation in p-values being by a factor of 4 or more. Moreover,
Page 3
hidden
3the addition of new samples to the analysis could lead logically to read-
justment of the updated “common size” distorting the previous results.
The ad hoc nature of some of these adjustments undermines the princi-
pled statistical analysis, introducing arbitrary design choices and obscure
data driven adjustments.
In this paper we propose a different statistical technique that does not
require an explicit sample size adjustment and thus functions directly on
the original counts. Any adjustments can be used as an additional means
for accounting for other biases in the data. An example of this is the
known variations in the density of mappable tags (i.e., the effective depth
of sequencing) [2] along the DNA sequence.
2 Background
We now outline a conceptual model which can be used while reading this
paper. Using a specific protocol (sonication, enzymatic reactions cutting
the DNA at specific locations, protein immunoprecipitation selecting spe-
cific fragments of protein bound to them, etc.) a library L of DNA frag-
ments is prepared. From this library a subset R is randomly sampled and
for each of the sampled fragments a part of it, a short read, is sequenced
providing a tag, which is a k-mer of DNA bases. The tag is then mapped
to one or many matching locations in an a priori known reference DNA
sequence, the human genome in our case, using a specific protocol, e.g.
only exact matching, or only exact and unique, or only unique with up
to one error, etc. We are interested in the reference genome locations
where significant over/under-representation of the mapped tags occur, so
called peaks or peak ranges, as these can be interpreted as evidence for
some specific property of DNA or its epigenetic modifications. In some
experiments such as SAGE-Seq or Digital Karyotyping [8, 5] there are
natural peak regions, as the tags congregate at specific DNA locations
determined by the enzymes used to cut the source DNA. In other cases,
such as ChIP-Seq experiments using sonication, the peak regions have
to be determined from the data using specific algorithms (e.g. [2, 9]), or
perhaps just defined by a simple partitioning of the genome into uniform
small blocks, say of the order of a thousand bases.
In this section we assume that a set of peak regions of interest is given
to us. Let us consider a single peak range r. We denote by X = X(R) a
random variable of the count of tags mapped to r, and by x its particular
realisation. If we denote by λ, 0 < λ < 1, the proportion (fraction) of
reads in the library L with tags mappable to r, then X can be modelled
Page 4
hidden
4as a binomial random variable, Bin(λ, |R|):
P[x = X(R)] =
(
|R|
x
)
λx(1− λ)|R|−x, (1)
for x = 0, 1, ..., |R|. In a typical case of interest λ  1 and the distribution
of X is very well approximated by the Poisson distribution [10], Poi(µ):
P[x = X] = µx e
−µ
x!
, (2)
where x = 0, 1, ..., |R| and the Poisson rate is defined as µ = λ|R|.
3 The Poisson Margin Test
Now we focus on comparing two libraries, LA and LB, from which ran-
dom samples of mappable tags RA and RB were drawn, respectively. Sup-
pose we have observed counts xA and xB of tags mapped to the specific
peak range r of interest, and λA and λB are the corresponding propor-
tions of tags in the libraries LA and LB mappable to r, respectively. Let
(a, b) ∈ {(A,B), (B,A)} and the following relation for empirical propor-
tions holds:
λˆa :=
xa
|Ra|
< λˆb :=
xb
|Rb|
. (3)
How strong is this evidence for λa < λb? We approach this issue in a
typical statistical hypothesis testing manner. Namely, we are interested
in testing the alternative hypothesis (H1) that λa < λb versus the null
hypothesis (H0) that λa ≥ λb, i.e. that the complementary relation holds.
As a natural test statistic we can use the maximal probability of observ-
ing at least as extreme counts Xa, Xb under the null hypothesis (H0).
This probability we shall quantify in two different ways, a test based
on the binomial distribution (Binomial Margin, MBi) and its Poisson
approximation (Poisson Margin, MPo), respectively:
MBi(xa, xb) := sup
λa≥λb>0
P
[
Xa ≤ xa & xb < Xb

∣ Xi ∼ Bin(λi, |Ri|)
]
,
MPo(xa, xb) := sup
λa≥λb>0
P
[
Xa ≤ xa & xb < Xb

∣ Xi ∼ Poi(λi|Ri|)
]
, (4)
assuming the observed proportions relation (3) holds and, otherwise:
MBi(xa, xb) = MPo(xa, xb) := 1. (5)
Page 5
hidden
5In practice both tests are numerically equivalent, but MPo is easier to
handle analytically and computationally, and will be the primary focus
of the rest of this paper.
We observe that the above definitions do not require that the sample
sizes |RA| and |RB| be equal. By definition, the Poisson margin MPo is
the tightest universal upper bound on the following probability
P
[
Xa ≤ xa & xb < Xb & λa ≥ λb

∣ Xi ∼ Poi(λi|Ri|)
]
≤MPo(xa, xb).
In this sense it is a very conservative p-value, corresponding to the worst
case scenario test.
3.1 Computation of Poisson Margin
The following result facilitates the efficient numerical evaluation of MPo;
the proof is presented in the on-line Supplement [1]. Let
ρ := |Ra|
/
|Rb|, and χ := xa
/
xb.
Theorem 1. If the empirical relation for proportions (3) holds, then
MPo(xa, xb) = sup
0<µ
e−2µ
xa

i=0
(2µ)i
i!(1 + ρ)i

j>xb
(2µρ)j
j!(1 + ρ)j
, (6)
where the supremum is achieved for µ = µ∗, the only solution of the
reduced critical equation:
0 = E(µ) := ρ+ ρ
xa

i=1
i−1

j=0
(1 + ρ)(xa − j)




i=1
i

j=1
2µρ
(1 + ρ)(xb + j)
(7)
with the function E monotonically decreasing for µ > 0, from +∞ to
−∞. Moreover, if Eqn. 3 holds, then
ρ− χ+

(ρ− χ)2 + 4χρ(1 + ρ)

≤ µ∗
xb
≤ (1 + χ)(1 + ρ)
2ρ+ 1
+O
(
1
xb
)
,
(8)
where we use the“O”-notation for the negligible rounding term such that
lim
ε→0
|O(ε)/ε| < χ/4 + 1/2.
Page 6
hidden
6Equation (7) is easy to solve numerically, using Newton’s method for
example, even for large counts where xa, xb ∼ 105. The function E(µ) is
monotonically decreasing and the bounds (8) can be used for initialisation
of the solver iterations. The sums in (7) have quickly decaying terms, so
in practice they are reduced to a summation of only a few terms. One
of the aims of our derivation was to develop such a simplification and
to remove some very small nuisance factors that are below the computer
precision, say with log10 below −308 (= the limit of IEEE-754 double
precision).
Proof Outline. We express (4) explicitly as a 2-dimensional optimisa-
tion task:
MPo(xa, xb) = sup
0<λb≤λa
e−µa−µb
xa

i=0
µia
i!

j>xb
µjb
j!



∣µa=λa|Ra|
µb=λb|Rb|
, (9)
which can be reduced to the 1-dimensional optimisation
MPo(xa, xb) = sup
0<λ
e−µa−µb
xb

i=0
µia
i!

j>xb
µjb
j!



∣µa=λ|Ra|
µb=λ|Rb|
.
The latter task can be solved by finding a solution λ∗ of the critical
equation for the function of λ under “sup” above, which after some sim-
plifications, introduction of variable µ := λ |Ra|+|Rb|2 and removal of small
positive factors reduces to (7). The details are available on-line [1]. ut
Figure 2.A shows the numerical evaluation of MPo across a range
of counts which occur in practice. This figure clearly shows where the ef-
fects of the precision limits of IEEE-754 become apparent: the significant,
“dark blue” shaded part of the plot corresponds to p-values < 10−400. The
thousands of peaks in the experimental data discussed in Section 4 fall
into this region, see [1]. Note that the truncation of log10MPo at −500
is used in Figure 2.A purely for the purpose of visualisation.
3.2 Related Statistical Tests
The Poisson Margin test is closely related to some other statistical tests.
This relationship has been explored elsewhere (see [1][5]) and here we
mention only the Counts test and the Coin Toss test.
For xa < xb the Counts test is defined as
T∗(xa, xb) := sup
µa≥µb>0
P
[
Xa ≤ xa & xb < Xb

∣ Xi ∼ Poi(µi)
]
(10)
Page 7
hidden
7Fig. 1. A: Lower and upper bounds on µ∗ given by Eqn. (8) of Theorem 1 (broken
lines) and compared to the exact values µ∗ given by solution of Eqn. 7; here we show
averages for µ∗ as solid lines, the values for evaluation over xa-grid of 100 values
logarithmically spaced between 1 and 1000 and corresponding xb := xa/χ < 10, 000.
B: Computational validation of the Scaling Power Law given by Theorem 3. The plots
show clearly the asymptotical power scaling law (12): Tct(κxa, κxb) ≈ Tct(xa, xb)κ
translating to the linear dependence in the plots of the form xa 7→ log10 Tct(xa, xa/χ) ≈
xa ×A−1 log10 Tct(A,A/χ), where A 1 is a constant.
Fig. 2. A: Numerical evaluation of log10MPo(xa, xb) for the case of |Ra| = |Rb| and
B: of the relative difference (log10MPo − log10 Tct)/ log10MPo. The evaluation has
been done for the regular logarithmic grid of count values 1 ≤ xa, xb ≤ 10, 000.
Page 8
hidden
8for xa < xb and the Coin Toss test as
Tct(xa, xb) := P
[
X ≤ xa



X ∼ Bin
(
1
2
, xa + xb
)]
. (11)
The Coin Toss test was used in [3, 2]. The data from the latter paper
will be used for our experimental validation, hence, indirectly, we will
be comparing against this statistic, and a clarification of its relationship
to the Poisson Margin is appropriate here. Both papers have used Coin
Toss for computing the statistical significance of the difference between
two counts xa and xb in the same fashion as in the case of MPo, but
for the special case when libraries have equal, or equalised, sizes, namely
|Ra| ≈ |Rb|. In this special case they are “equivalent” to the MPo test
in the following sense.
Theorem 2. If |Ra| = |Rb| and the empirical proportion relation (3)
holds, then
MPo(xa, xb) = T∗(xa, xb) ≈ Tct(xa, xb).
Proof outline. If |Ra| = |Rb|, then λa < λb is equivalent to µa :=
λa|Ra| < µb := λb|Rb|, hence the equivalence MPo(xa, xb) = T∗(xa, xb)
follows from the definitions (4) and (10). The approximation by Tct has
been argued in [5]; here we demonstrate it by numerical evaluation pre-
sented in Figure 2. ut
Figure 2.B shows a numerical evaluation of the differences between
log10MPo||Ra|=|Rb| and log10 Tct over a grid of values of xa and xb which
occur in practice. Note the differences in the shading scales used in the
two sub-figures. The difference | log10MPo − log10 Tct| is < 2, hence in
the areas of significant values of log10MPo, say < −100, it composes
practically negligible correction of ≤ 1%. This also allows the extension
of the scaling properties Tct discussed below to the case of MPo||Ra|=|Rb|.
3.3 Scaling Properties
The following result can be shown formally (see [5]).
Theorem 3 (Scaling Power Law). Let 0 < 2xa < xb be two integers
and κ > 1. Then
log Tct(κxa, κxb) = κ log Tct(xa, xb) +
o(xb)
xb
, (12)
where o(xb)xb → 0 for xb →∞ denotes a ‘negligible’ correction.
Page 9
hidden
9The computational validation of this result and implied practical exten-
sion to the whole range of values xb > 0 is presented in Figure 1.B which is
sufficient for our discussion below. Plots in there show clearly the asymp-
totical power scaling law (12): Tct(κxa, κxb) ≈ Tct(xa, xb)κ translating
to the linear dependence in the plots of the form log10 Tct(xa, xa/χ) ≈
xa ×A−1 log10 Tct(A,A/χ), where A  1 is a constant.
The above Theorem and Figure 1.B facilitate a discussion of two im-
portant issues in the analysis of the NGS data, namely, (i) the impact of
the number of lanes use by the sequencing machine to map the library
and (ii) scaling of the counts, in the case of typically unequal sizes of the
sequenced and then mapped tag sets. Firstly, they tell us that having κ
times more reads sequenced, e.g. using κ lanes in the sequencing machine
rather than one, will provide an exponentially stronger (i.e. smaller, ex-
ponentiated by κ) p-values, asymptotically for large xb. This can be used,
for example, as guidance for selection of more or fewer lanes in an NGS
experiment.
However, those results also point to fragility of “count scaling”. More
precisely, if the numbers of reads actually sequenced and mapped are
significantly different, |Ra| 6≈ |Rb|, then either we need statistical tests
that are intrinsically immune to such differences or we need to normalize
the counts in a candidate peak region to make them comparable. The
MPo test (4) falls in fist category while the Tct tests represents the latter.
4 Experimental Validation
It is far from clear that the postulated statistical test will provide useful
results in practice. Simply put, the worst case scenario embraced in defi-
nition (4) may be too conservative and the generated p-values too close to
1 be informative. In order to address this concern, we have decided to fo-
cus on the well studied NGS application of ChIP-Seq peak calling, which
involves comparison of only two libraries (the target and the dedicated
control). The more complex problem of differential analysis of multiple
libraries will be addressed in future work.
In order to validate our method we have used public domain ChIP-
Seq data. In particular, [2] provides 36,998 putative locations/regions for
binding STAT1 and 24,739 locations/regions for Pol II. Note that both
databases in [2] have been used as the basis of the most recent ENCODE
data in their correponding domains, hence we have no alternative ‘gold
standards’ to evaluate the results of our analysis. In this paper we deal
with this obstacle by using a procedure evaluating the results of analysis
Page 10
hidden
10
by checking their internal consistency. This procedure is outlined below
in Section 4.3.
Since [2] also provides the Eland mappings of the tags for the controls,
STAT1 and Pol II data sets we have performed an additional independent
analysis of this data using Poisson Margin outlined in the following two
sections.
4.1 Re-ordering
We have used the list of peak ranges, peak locations and range boundaries
exactly as in [2]. For each range we have extracted (raw) counts, follow-
ing the protocol described in the original paper and then used the MPo
statistic to allocate the p-value (see Supplementary Table 1 for STAT1
and Table 2 for Pol II). Although the order according to theMPo method
is only slightly different from the original, the overlap is > 90%, see Ta-
ble 1, the differences in performance benchmarks especially for STAT1
are significant.
4.2 De-novo Significant Range Location
In our experiments described below we have implemented and run a fixed
size sliding window method across the genome comparing the number of
tags in each window for the control and target samples, for each sample
for both DNA strands pooled together. Regions of significance are then
defined by thresholding the p-values obtained from the Poisson Margin
test. This resembles the approach in [11]. This effectively separates the
tasks of finding regions of the genome where we believe a peak lies and
determining the location of the antibody binding site itself, the former
task being done efficiently on the whole genome scale and the latter be-
ing done intensively on just those regions identified by the genome wide
scan as containing a peak. The genome was scanned sequentially with a
window of 200 bp, shifted every 4 bases, which took about 17 minutes on
a workstation with 2GHz Opteron CPUs and 32Gb of main memory.
We have identified 35,229 peak ranges for STAT1 using un-adjusted
p-values <1E-4 and 28,890 using un-adjusted p-values <1E-6 for Pol II,
with thresholds chosen to match the numbers of peaks in the original
publication. The range was defined as a contiguous region of 200 BP
blocks which passed the threshold.
Such a procedure can be followed by refinements of boundaries and
more precise location of the range boundaries and more precise peak
Page 11
hidden
11
location. Example of such secondary adjustments can be found in [2, 11,
12], but we used none here.
4.3 Genome Annotation Test
We wish to quantify consistency of the putative peak ranges by quantify-
ing ability to predict those locations on the whole genome by a predictor
trained on a part of the genome. In our experiments peak ranges from
chromosome 22 were used for training and those from the remaining chro-
mosomes were used only for testing. For quantifying prediction accuracy
we have adapted protocol 1B in [13–15]. In brief, the genome is divided
into 500bp non-overleaping segments (total of 5,362,342 segments not
containing ‘N’s), with each segment labeled as positive if it overlaps a
peak range and negative otherwise (the positive segments comprise < 1%
of the total number of segments). These labels were used to build a pre-
dictor and verify its performance measured by precision and recall. We
recall that “Precision” means ratio of true positive retrieved (for a given
decision threshold) to number of retrieved cases, and “Recall” is the num-
ber of true positive retrieved divided by the total number of true positive
cases.
A linear support vector machine (SVM) was developed to label each
500bp segment independently [15]. Each 500bp segment was represented
as a feature vector containing frequency counts of 4-mers contained within.
Recursive feature elimination was used to reduce the model’s number of
features and other meta-parameters were set to maximise the area un-
der precision-recall curve (PRC) in an internal cross-validation on the
training data.
This method works very well for some tasks such as the prediction
of transcription start sites and the binding of some transcription factors
(e.g., c-Myc) [15], and seems to significantly outperform other methods
such as standard position weight matrices (PWM). From our experience,
STAT1 is one of the harder transcription factors to predict, however we
still observe much higher performance using the SVM predictor than with
PWMs [15].
4.4 Results
Table 1 and Figure 3 summarise results for few its variations of genome
annotation experiment described above, for STAT1 and Pol II data, re-
spectively. We recall, the data from chromosome 22 was used for training
exclusively and test results reported are for data from the whole genome.
Page 12
hidden
12
The following four basic variations of experiment consisted in usage of
different sets of peaks:
– I: The top 50% of the original list in [2];
– II: The top 50% of the list in [2] after sorting by MPo method;
– III: The whole 100% list in [2];
– IV: A de novo list of peaks derived as outlined in Section 4.2.
Additionally, for STAT1 we have also tested a typical position weight
(PWM) matrix from TRANSFAC r 7.0. This is reported as row “V” in
Table 1. In this case the “score” per range in [2] is defined as the max of
PWM scores for all positions within a 500BP tile. The overlap in Table 1
was calculated using the top 37k tiles as scored by the PWM.
We observe that for variants I and II for STAT1, the area under the
PRC curve (4.0%) is approximately 1.5 times that for the original [2] or-
dering (2.6%). For Pol II the corresponding difference is smaller, due to
larger similarity of the ordered lists and much higher accuracy of predic-
tions, but differences between de-novo (blue solid) and 100% list in [2]
(broken blue) in Figure 3.B is well pronounced.
5 Discussion
The basic protocol described in this paper can be extended in a number of
directions. In the experiments we have used only uniquely mapped reads.
The density of such reads varies along the genome, which can affect the
relative p-values for peaks at different locations since both MPo and the
Coin Tossing statistic (Tct) [2] are sensitive to the sequencing depth (see
Figure 1.B). A simple way around this obstacle is to scale up the observed
counts (per range) inversely to the fraction of mappable tags for the re-
gion. Such information for uniquely mapped tags of length 30 is provided
in [2], but information of uniquely mapped up to 2 mismatches (which we
prefer) is not currently available (see comment in [2, Supplement]). We
have not used this correction here.
We present an alternative to the method introduced previously, e.g.
[2], [4] or [11]. We have focussed on the first of those references since it is
one of the most recent and allows access to good quality of experimental
data, included in ENCODE. We have shown that a principled analysis of
such data using our method is feasible with minimal need for (arbitrary)
design choices and with a minimal number of data-adjustable parameters.
(An illustrating example here is the introduction in [2, p. 73] of an ad
hoc parameter 0 ≤ Pf ≤ 1 for the fraction of putative highest peaks
Page 13
hidden
13
Table 1. Summary of ordered peaks lists overlap with the list in [2] and the accuracy
prediction of binding site on the whole genome. In experiments I-IV we use data on
chromosome 22 for training SVM exclusively. All values listed are in %.
List overlap (%) with [2] Prediction for whole genome
Number of top peaks Area under Prec. at recall
Experiment 10% 20% 50% 100% Prec.-Rec. curve 10% 20%
STAT1
I: Roz.50% 100 100 100 - 2.6 7.6 4.2
II: Roz.+MPo50% 96 94 91 - 4.0 12.3 7.5
III: Roz.100% 100 100 100 100 5.5 17.1 8.0
IV: deNovo-MPo 68 69 70 83 5.7 18.2 8.2
V: Roz.100% PWM 3 5 9 12 4.1 8.4 4.4
Pol II
I: Roz.50% 100 100 100 - 24.4 56.0 50.1
II: Roz.+MPo50% 53 85 95 - 25.6 57.4 51.7
III: Roz.100% 100 100 100 100 23.2 58.8 51.2
IV: deNovo-MPo 66 71 74 77 26.9 62.4 56.1
Fig. 3. Precision–Recall curves for STAT1 (A) and Pol II (B) corresponding to Ta-
ble 1. We report test results for the whole genome for the variants I-V of the genome
annotation test experiment described in Section 4.4.
Page 14
hidden
14
to be excluded from regression for local normalisation of counts). Our
approach is robust, and can be applied to a wide variety of experimental
designs involving different numbers of samples, possibly from different cell
lineages. It is applicable precisely because it does not require scaling of
the individual libraries of reads. The results in [5] and Section 3.3 show
that such a scaling, if applied, should be done with extreme caution, if
the analysis is to be meaningful.
6 Conclusions
We have developed a principled statistical test for the detection of sig-
nificant reads concentrations which is directly applicable to libraries of
different (unmatched) sizes without any scaling of read counts and have
demonstrated that such a scaling could introduce significant bias in the
computed p-values. Although our statistical test targets differential anal-
ysis for multiple NGS libraries, the initial validation in this paper is re-
stricted to the simplest case of comparison of a target library to a match-
ing reference. Using the recent Encode Chip-Seq data we have shown
that our test delivers non-vacuous results, with peak calling accuracies
comparable or even improved with respect to the the original dedicated
algorithm. The absence of adequate gold standards for benchmarking was
circumvented by application of a novel internal consistency check based on
the accuracy of generalisation of a supervised learning predictor. Demon-
stration of the utility of that protocol is the second major contribution
of this paper.
Acknowledgements
NICTA is funded by the Australian Government’s Department of Com-
munications, Information Technology and the Arts, the Australian Re-
search Council through Backing Australia’s Ability, and the ICT Centre
of Excellence programs.
References
1. Kowalczyk, A., Bedo, J., Conway, T., Beresford-Smith, B.: Poisson Margin Test for
Normalisation Free Significance Analysis of NGS Data - Supplementary Materials.
(2009) http://www.genomics.csse.unimelb.edu.au/peakfiltsup.
2. Rozowsky, J., Euskirchen, G., Auerbach, R., Zhang, Z., Gibson, T., Bjornson, R.,
Carriero, N., Snyder, M., Gerstein, M.: Peakseq enables systematic scoring of
chip-seq experiments relative to controls. Nature Biotechnology 27 (2009) 6675
Page 15
hidden
15
3. Nix, D., Courdy, S., Boucher, K.: Empirical methods for controlling false positives
and estimating confidence in chip-seq peaks. BMC Bioinformatics 9 (2008) 523
4. Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Eu-
skirchen, G., Bernier, B., Varhol, R., Delaney, A., et al.: Genome-wide profiles
of STAT1 DNA association using chromatin immunoprecipitation and massively
parallel sequencing. Nature Methods 4 (2007) 651–657
5. Kowalczyk, A.: Some Formal Results for Significance of Short Read Concentrations
. (2009) http://www.genomics.csse.unimelb.edu.au/shortreadtheory.
6. Baggerly, K.A., Deng, L., Morris, J.S., Aldaz, C.M.: Differential expression in
SAGE: accounting for normal between-library variation. Bioinformatics 19(12)
(2003) 1477–1483
7. Robinson, M., Smyth, G.: Moderated statistical tests for assessing differences in
tag abundance. Bioinformatics 23(21) (2007) 2881–2887
8. Bloushtain-Qimron, N., Yao, J., Snyder, E.: Cell type-specific dna methylation
patterns in the human breast. PANS 105 (2008) 1407614081
9. Zang, C., Schones, D.E., Zeng, C., Cui, K., Zhao, K., Peng, W.: A clustering
approach for identification of enriched domains from histone modification ChIP-
Seq data. Bioinformatics 25(15) (2009) 1952–1958
10. Keeping, E.: Introduction to Statistical Infernce. Dover, ISBN 0-486-68502-0
(1995) Reprint of 1962 edition by D. Van Nostrand Co., Princeton, New Jersey.
11. Zhang, Y., Liu, T., Meyer, C., Eeckhoute, J., Johnson, D., Bernstein, B., Nuss-
baum, C., Myers, R., Brown, M., Li, W., Liu, X.S.: Model-based analysis of chip-seq
(macs). Genome Biology 9(9) (2008) R137+
12. Ji, H., Jiang, H., Ma, W., Johnson, D., Myers, R., Wong, W.: An integrated
software system for analyzing ChIP-chip and ChIP-seq data. Nature Biotechnology
26 (2008) 1293–1300
13. Sonnenburg, S., Zien, A., Ratsch, G.: Arts: accurate recognition of transcription
starts in human. Bioinformatics 22(14) (2006) e423–e480
14. Abeel, T., Van de Peer, Y., Saeys, Y.: Toward a gold standard for promoter
prediction evaluation. Bioinformatics 25(12) (2009) i313–320
15. Bedo, J., MacIntyre, G., Haviv, I., Kowalczyk, A.: Simple SVM based
whole-genome Segmentation. (2009) Available from Nature Precedings
http://dx.doi.org/10.1038/npre.2009.3811.1.

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

10 Readers on Mendeley
by Discipline
 
 
by Academic Status
 
30% Post Doc
 
20% Researcher (at an Academic Institution)
 
20% Student (Master)
by Country
 
20% Hong Kong
 
10% Australia
 
10% United Kingdom