Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

by Jens Keilwagen, Jan Grau, Stefan Posch, Ivo Grosse
BMC Bioinformatics (2010), 11:149

Abstract

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While many comparative studies have compared different learning principles or different statistical models, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.

Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.

Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.
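
The Gaussian-like behaviour near the maximum and the Laplace-like behaviour in the tails can be illustrated in the simplest possible case. The following is a sketch under stated assumptions, not the derivation given in the paper: take a two-letter alphabet, a Beta prior (a two-component Dirichlet) with hyperparameters \alpha_1, \alpha_2 > 0 on the probability \theta, and assume the logistic parametrization \theta = e^{\lambda}/(1+e^{\lambda}). The symbols \lambda, \alpha_1, \alpha_2 and q are introduced here for illustration only.

```latex
% Beta(\alpha_1, \alpha_2) density on \theta, transformed to \lambda;
% the Jacobian d\theta/d\lambda = \theta (1 - \theta) is included.
\begin{align*}
\log q(\lambda) &= \alpha_1 \lambda - (\alpha_1 + \alpha_2)\,\log\bigl(1 + e^{\lambda}\bigr) + \text{const} \\
\lambda^{*} &= \log(\alpha_1 / \alpha_2)
  && \text{(mode)} \\
\left.\frac{d^{2}}{d\lambda^{2}} \log q(\lambda)\right|_{\lambda = \lambda^{*}}
  &= -\frac{\alpha_1 \alpha_2}{\alpha_1 + \alpha_2}
  && \text{(locally quadratic, Gaussian-like)} \\
\log q(\lambda) &\approx -\alpha_2\,\lambda + \text{const} \quad (\lambda \to +\infty)
  && \text{(linear decay, Laplace-like)} \\
\log q(\lambda) &\approx \alpha_1\,\lambda + \text{const} \quad (\lambda \to -\infty)
  && \text{(linear decay, Laplace-like)}
\end{align*}
```

In this toy case the log-density is quadratic around its mode and linear far away from it, which is the qualitative shape the abstract describes for the derived prior.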

Available from www.biomedcentral.com
RESEARCH ARTICLE (Open Access)

Keilwagen et al. BMC Bioinformatics 2010, 11:149
http://www.biomedcentral.com/1471-2105/11/149

* Correspondence: Jens.Keilwagen@ipk-gatersleben.de (Jens Keilwagen and Jan Grau contributed equally)
1 Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany

© 2010 Keilwagen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The computational recognition of short signal sequences in genomic DNA is one of the prevalent tasks in bioinformatics. It includes, e.g., the recognition of transcription factor binding sites (TFBSs) [1,2], donor or acceptor splice sites [3-5], nucleosome binding sites [6,7], or binding sites of insulators like CTCF [8]. Many different algorithms have been developed for the recognition of such DNA binding sites, with specific strengths and weaknesses, but none of them is perfect. Hence, great efforts have been made over the last decade to evaluate and compare the performance of different algorithms [2,3,9-13]. The results of such comparative studies often influence the direction of future research, because they lead to new and superior approaches by combining the advantages of existing algorithms and because they provide a deeper understanding of the mechanisms of protein-DNA interaction.

The approaches compared typically differ by (i) the statistical model employed at the heart of these algorithms, (ii) the learning principle chosen for estimating the model parameters, and (iii) the prior used for the parameters of the model, and it is non-trivial to keep the influences of these different contributions apart. The first two aspects focus on developing improved statistical models or learning principles, while the choice of
the prior is often arbitrary or determined by conjugacy. However, the choice of the prior may have a decisive effect on the recognition performance [14,15]. The goal of this paper is to derive a common prior for Markov random fields (MRFs) and mixtures of MRFs, which are at the heart of many existing algorithms for binding site recognition, allowing an unbiased comparison of different learning principles for models from this model family.

Many computer algorithms available today use statistical models for representing the distribution of sequences, and many of these statistical models are special cases of MRFs [16,17]. These models range from simple models like the position weight matrix (PWM) model [1,18,19], the weight array matrix (WAM) model [4,6,20], or Markov models of higher order [21,22] to more complex models like moral Bayesian networks [2,12,23] or general MRFs [5,24,25]. Hence, we restrict our attention to statistical models from the family of MRFs in this paper.

One of the first learning principles used in bioinformatics is the maximum likelihood (ML) principle. However, for many applications, the sequence data available for learning statistical models is very limited. This is especially true for the recognition of TFBSs, where typical data sets sometimes contain as few as 20 and seldom more than 300 sequences. For this reason, the ML principle often leads to suboptimal classification performance, e.g., due to zero-occurrences of some nucleotides or oligonucleotides in the training data sets. The maximum a-posteriori (MAP) principle, which applies a prior to the parameters of the models, establishes a theoretical foundation to alleviate this problem and at the same time allows for the inclusion of prior knowledge aside from the training data.

Recently, the application of discriminative principles instead of generative ones has been shown to be promising in the field of bioinformatics [9,21,22,24,26].
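
The zero-occurrence problem of ML estimation and the smoothing effect of a prior can be made concrete with a small sketch. It assumes a plain PWM model (independent positions), a hypothetical toy alignment, and a symmetric pseudocount per nucleotide; the pseudocount estimate (n + a) / (N + 4a) is the posterior mean under a symmetric Dirichlet prior and is used here only as a simple stand-in for prior-based estimation. The names `estimate_pwm`, `likelihood`, and `toy_sites` are illustrative and are not taken from the paper or from Jstacs.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_pwm(sites, pseudocount=0.0):
    """Estimate per-position nucleotide probabilities from aligned sites.

    pseudocount == 0 yields the ML estimate (relative frequencies).
    pseudocount > 0 adds the same pseudocount to every nucleotide count,
    i.e. the posterior mean under a symmetric Dirichlet prior.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({nt: (counts[nt] + pseudocount) / total for nt in ALPHABET})
    return pwm


def likelihood(pwm, seq):
    """Probability of seq under the PWM (product over positions)."""
    p = 1.0
    for column, nt in zip(pwm, seq):
        p *= column[nt]
    return p


# Hypothetical toy training set, far smaller than real TFBS data sets.
toy_sites = ["GGGCGG", "GGGCGG", "GGGCGT", "GGGTGG"]
unseen = "GGGAGG"  # 'A' at the fourth position never occurs in toy_sites

ml_pwm = estimate_pwm(toy_sites)                         # ML estimate
smoothed_pwm = estimate_pwm(toy_sites, pseudocount=0.5)  # prior-smoothed

print(likelihood(ml_pwm, unseen))        # zero: one unseen nucleotide vetoes the site
print(likelihood(smoothed_pwm, unseen))  # small but positive
```

With the ML estimate, a single nucleotide that is absent from the training data forces the likelihood of an otherwise plausible site to zero; with pseudocounts the site keeps a small positive probability.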
Generative learning principles aim at an accurate representation of the distribution of the training data, whereas discriminative learning principles aim at an accurate classification of the training data. The discriminative analogue to the ML principle is the maximum conditional likelihood (MCL) principle, which has been widely used in the machine learning community [27-31]. However, the effects of limited data may be even more severe when using the MCL principle compared to generative learning principles [11]. To overcome this problem, the maximum supervised posterior (MSP) principle [32,33] has been proposed as the discriminative analogue to the MAP principle.

Many different priors have been used in the past, and their choice seems arbitrary or motivated by technical aspects. Product-Gaussian and product-Laplace priors are widely used for generatively trained MRFs [16] and for discriminatively trained MRFs, also called conditional random fields [17,34]. For the generative MAP learning of Markov models and Bayesian networks, the most prevalent prior is the product-Dirichlet prior, whereas for the discriminative MSP learning, either a product-Gaussian or a product-Laplace prior is typically employed [26]. Hence, when comparing generatively and discriminatively trained Markov models, Bayesian networks, and MRFs, on many occasions apples are compared to oranges by using different priors.

The comparison of generative and discriminative learning principles is the topic of several recent studies. Ng & Jordan [11] compare generatively and discriminatively trained PWM models. To be specific, they compare the Bayesian MAP principle with the non-Bayesian MCL principle. Pernkopf & Bilmes [30] compare the ML principle to the MCL principle for estimating the parameters of Bayesian networks, while the structures of the networks are estimated by generative as well as discriminative measures. Greiner et al.
[29] compare the ML principle with a variant of the MCL principle that prevents over-fitting, and they apply these approaches to Bayesian networks. Grau et al. [26] compare the MAP principle for Markov models using a product-Dirichlet prior to the MSP principle using product-Gaussian and product-Laplace priors.

All of these studies use different priors when comparing different learning principles, rendering the conclusions regarding the superiority of one learning principle over the other questionable, because the differing influences of these priors are neglected. In fact, we are not aware of any study that uses the same a-priori information when comparing generative to discriminative learning principles.

Motivated by this lack of consistency, we aim at establishing a prior that

i) can be used for the generative (MAP) and the discriminative (MSP) principles,
ii) is conjugate to the likelihood of MRFs, which include moral Bayesian networks,
iii) contains the widely used product-Dirichlet prior as a special case when the structure of the MRF is equivalent to that of a moral Bayesian network, including all of its special cases such as PWM models, WAM models, Markov models of higher order, or Bayesian trees.

In the Methods section, we present the derivation of such a prior, which is the main result of this paper. With such a prior at hand, it becomes possible to accomplish an unbiased comparison of generative and discriminative learning principles applied to the same model using the
