Evaluation of Algorithm Performan...
Evaluation of Algorithm Performance in ChIP-Seq Peak Detection Elizabeth G. Wilbanks1,3, Marc T. Facciotti1,2,3* 1 Graduate Group in Microbiology, University of California Davis, Davis, California, United States of America, 2 Department of Biomedical Engineering, University of California Davis, Davis, California, United States of America, 3 Genome Center, University of California Davis, Davis, California, United States of America Abstract Next-generation DNA sequencing coupled with chromatin immunoprecipitation (ChIP-seq) is revolutionizing our ability to interrogate whole genome protein-DNA interactions. Identification of protein binding sites from ChIP-seq data has required novel computational tools, distinct from those used for the analysis of ChIP-Chip experiments. The growing popularity of ChIP-seq spurred the development of many different analytical programs (at last count, we noted 31 open source methods), each with some purported advantage. Given that the literature is dense and empirical benchmarking challenging, selecting an appropriate method for ChIP-seq analysis has become a daunting task. Herein we compare the performance of eleven different peak calling programs on common empirical, transcription factor datasets and measure their sensitivity, accuracy and usability. Our analysis provides an unbiased critical assessment of available technologies, and should assist researchers in choosing a suitable tool for handling ChIP-seq data. Citation: Wilbanks EG, Facciotti MT (2010) Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE 5(7): e11471. doi:10.1371/ journal.pone.0011471 Editor: Gert Jan C. Veenstra, Radboud University Nijmegen, Netherlands Received April 20, 2010 Accepted June 14, 2010 Published July 8, 2010 Copyright: �� 2010 Wilbanks, Facciotti. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This work was supported by a National Science Foundation Graduate Research Fellowship to EGW and MTF���s start-up funds from UC Davis. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: mtfacciotti@ucdavis.edu Introduction Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is a technique that provides quantitative, genome-wide mapping of target protein binding events [1,2]. Identifying putative protein binding sites from large, sequence- based datasets presents a bioinformatic challenge that has required considerable computational innovation despite the availability of numerous programs for ChIP-Chip analysis [3,4,5,6,7,8,9]. With the rising popularity of ChIP-seq, a demand for new analytical methods has led to the proliferation of available peak finding algorithms. Reviewing literature from the past three years, we noted 31 open source programs for finding peaks in ChIP-seq data (Table S1), in addition to the available commercial software. The sheer abundance of available software packages and technical variability with which they identify protein binding sites makes an assessment of the current methods timely. An appraisal of available analytical methods will better equip researchers to bridge the ������next-generation gap������ between sequencing and data analysis [10]. Recently, Pepke et al. published a review of the major steps in ChIP-seq analysis and detailed the algorithmic approaches of 12 available programs for detecting peaks (the signals of putative protein binding) from ChIP-seq data [11]. For clarity, we have provided a brief overview of the main algorithmic treatments of ChIP-seq data however, our focus here is evaluative rather than purely descriptive. The purpose of this study is to provide an impartial analysis to help readers navigate the myriad of options. Laajala et al. [12] provide some metrics for evaluating different methods, but leave many areas unexplored. Our work offers several improved ways to assess algorithm performance and address the question: which of the available methods for ChIP-seq analysis should I consider using? The ChIP protocol ideally produces a pool of DNA fragments that are significantly enriched for the target protein���s binding site. High throughput sequencing of these fragments generates millions of short sequence ���tags��� (generally 20 to 50 bp in length) that are subsequently mapped back to the reference genome. By recognizing regions in the genome with increased sequence coverage, ChIP-seq experiments identify the genomic coordinates of protein binding events. ChIP-seq peak finders must discriminate these true peaks in sequence coverage, which represent protein binding sites, from the background sequence. When examining tag density across the genome, it is important to consider that sequence tags can represent only the 59-most end of the original fragment due to the inherent 59 to 39 nature of current generation of short-read sequencing instruments. This pattern results in a strand-dependent bimodality in tag density most evident in sequence-specific binding events, such as transcription factor-cis regulatory element binding (Figure 1). Most programs perform some adjustment of the sequence tags to better represent the original DNA fragment, either by shifting tags in the 39 direction [13,14,15] or by extending tags to the estimated length of the original fragments [16,17,18,19,20,21,22,23]. When the average fragment length can be accurately inferred (either computationally or empirically), the combined density will form a single peak where the summit corresponds closely to the binding site. If paired-end sequencing technologies are used, the fragment length can actually be measured directly allowing more precise determination of binding sites, a feature currently supported by only a handful of peak calling algorithms [13,24,25]. PLoS ONE | www.plosone.org 1 July 2010 | Volume 5 | Issue 7 | e11471
The first step in peak finding is to identify genomic regions with large numbers of mapped sequence tags. One approach to this task is to identify regions where extended sequence tags (XSETs) either overlap [19,21,26] or are found within some fixed clustering distance [16,20,22,27]. Another commonly used method for finding enriched regions calculates the number of tags found in fixed width windows across the genome, an approach known as a sliding window algorithm [13,15,18,23,28,29,30,31,32]. As this histogram-type density estimator can produce edge effects dependent on the window or bin size, some programs instead employ a Gaussian kernel density estimator (G-KDE) that generates a continuous coverage estimate [14,33,34]. All these methods specify some minimum height criteria at which enrichment is considered significant, and some minimum spacing at which adjoining windows, clusters or local maxima (G-KDE) are merged into a single peak region. Rather than searching for peaks in coverage, several methods leverage the bimodal pattern in the strand-specific tag densities to identify protein binding sites, either as their main scoring method [31,32] or in an optional post-processing filtering step [19,28]. Programs that use this signal exclusively, which we call ������directional scoring methods,������ are more appropriate for proteins that bind to specific sites (transcription factors), rather than more distributed binders, such as histones or RNA polymerase (Figure 1B). CSDeconv, a recently published algorithm, uses both G-KDE and directional information in conjunction with a deconvolution approach, which enables detection of closely spaced binding sites [34]. Such an approach has been shown to have higher spatial resolution, though the intense computational demands limit the size of genomes that can be analyzed. Developed expressly for use on a bacterial genome, CSDeconv and programs like it may represent an excellent choice for microbial ChIP-seq experiments with only a few binding sites, small genome size and high sequence coverage. More specialized programs for the analysis of RNA polymerase [35,36] and epigenetic modifications [37,38,39,40,41] ChIP-seq also have been developed. These proteins bind DNA over larger regions, producing relatively broad, low-intensity peaks that can be difficult to detect. Though we focus on identifying transcription factor binding sites from ChIP-seq data, we mention these additional methods should readers find them appropriate for their specific experiments. Peak finding programs must determine the number of tags (peak height) or directionality score that constitutes ������significant������ enrichment likely to represent a protein binding site. An ad hoc method for dealing with this issue is simply to allow users to select some threshold value to define a peak [16]. However, this simplistic approach does little to assist the user in assessing the significance of peaks and is prone to error. Other, more sophisticated methods assess the significance of sequence tag enrichment relative to the null hypothesis that tags are randomly distributed throughout the genome. The background modeled by the null hypothesis has been described previously using either a Poisson [15,32] or negative binomial model [28,30] parameterized based on the coverage of low-density regions in the ChIP sample. Figure 1. Strand-dependent bimodality in tag density. The 59 to 39 sequencing requirement and short read length produce stranded bias in tag distribution. The shaded blue oval represents the protein of interest bound to DNA (solid black lines). Wavy lines represent either sense (blue) or antisense (red) DNA fragments from ChIP enrichment. The thicker portion of the line indicates regions sequenced by short read sequencing technologies. Sequenced tags are aligned to a reference genome and projected onto a chromosomal coordinate (red and blue arrows). (A) Sequence- specific binding events (e.g. transcription factors) are characterized by ������punctuate enrichment������ [11] and defined strand-dependent bimodality, where the separation between peaks (d) corresponds to the average sequenced fragment length. Panel A was inspired by Jothi et al. [32]. (B) Distributed binding events (e.g. histones or RNA polymerase) produce a broader pattern of tag enrichment that results in a less defined bimodal pattern. doi:10.1371/journal.pone.0011471.g001 Testing of ChIP-Seq Algorithms PLoS ONE | www.plosone.org 2 July 2010 | Volume 5 | Issue 7 | e11471