Quantifying the Extent of Lateral Gene Transfer Required to Avert a “Genome of Eden”

Leo van Iersel, Charles Semple and Mike Steel*

Department of Mathematics and Statistics, University of Canterbury
Private Bag 4800, Christchurch, New Zealand

Abstract

The complex pattern of presence and absence of many genes across different species provides tantalising clues as to how genes evolved through the processes of gene genesis, gene loss and lateral gene transfer (LGT). The extent of LGT, particularly in prokaryotes, and its implications for creating a “network of life” rather than a “tree of life” is controversial. In this paper, we formally model the problem of quantifying LGT, and provide exact mathematical bounds, and new computational results. In particular, we investigate the computational complexity of quantifying the extent of LGT under the simple models of gene genesis, loss and transfer on which a recent heuristic analysis of biological data relied. Our approach takes advantage of a relationship between LGT optimization and graph-theoretical concepts such as tree width and network flow.

Keywords: tree, phylogenetic network, lateral gene transfer, tree-width

Email: l.j.j.v.iersel@gmail.com, c.semple@math.canterbury.ac.nz, m.steel@math.canterbury.ac.nz

* We thank the Allan Wilson Centre for Molecular Ecology and Evolution, and the New Zealand Marsden Fund for helping fund this work.

arXiv:0911.1146v1 [q-bio.PE] 5 Nov 2009
1 INTRODUCTION

Modern sequencing technology is providing an increasingly detailed picture of the distribution of genes across a wide array of taxa. Some molecular biologists have used these data to argue that unless ancestral genomes were considerably larger than present-day ones, extensive lateral gene transfer (LGT) must be invoked to explain the current distribution of genes [1], [2], [12]. LGT is a process by which a gene (or genes) from one species is transferred into the genotype of another species by various genetic mechanisms. The extent of LGT is controversial, but it has been argued to be widespread in prokaryotes (e.g. bacteria) and during the earlier epochs of evolution, suggesting in turn that a network, rather than a tree, best describes the evolution of life [4].

Although the pattern of presence and absence of different genes across a set of species can suggest that LGT events occurred in the evolution of these species, another explanation is that certain genes are simply lost in different lineages. As a result, various attempts to quantify the extent of LGT based on gene content have been developed, typically based either on most-parsimonious scenarios or on stochastic models of gene genesis, loss and transfer (see, for example, [1], [10], [13]).

Attempts to reconstruct evolutionary histories under the assumption that no LGT events have occurred (and that genes arise just once) imply that some common ancestors of the considered species must have had far more genes than their current-day descendants. Doolittle et al. [5] refer to such an unlikely all-encompassing ancestral genome as the “genome of Eden” hypothesis. Allowing LGT events reduces the need for genes to be present at earlier species, as illustrated for a single gene in Fig. 1.

[Figure 1 graphic: a phylogenetic tree drawn against a time axis, with gene g marked in some of the leaf genomes and ancestral vertices marked + and *.]

Figure 1.
The dilemma of ancestral genome inflation: if gene g, distributed as shown, is not transferred laterally, then under the model g must be in five ancestral genomes (*, +), not just at +.

In this paper, we exploit the combinatorial structure that underlies a key biological insight on which a recent heuristic analysis of data by [1] was based (see also [2], [12]). This insight is that simple models of gene evolution, in which a gene typically arises just once (gene genesis) but can be lost multiple times, imply lower bounds on the extent of LGT simply to prevent hypothetical ancestral genomes from becoming unfeasibly large. For such a model, we aim to bound the number of gene transfer events that have occurred in the evolution of a set of taxa, based on the presence/absence patterns of genes in each of these taxa, assuming that ancestral genomes are bounded by a given size.
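To make the inflation effect of Figure 1 concrete, the following Python sketch computes, for a rooted tree with no lateral transfer, the smallest gene set that each ancestral vertex must carry when every gene arises exactly once. Under these assumptions a gene is forced onto every vertex lying between the most recent common ancestor of the leaves that carry it and those leaves. The tree, the leaf genomes and the function name `ancestral_genomes_without_lgt` are hypothetical and chosen only for illustration; in particular, the tree is not the one drawn in Figure 1.

```python
from collections import defaultdict


def ancestral_genomes_without_lgt(parent, leaf_genomes):
    """Minimal ancestral gene sets on a rooted tree with no LGT.

    parent: dict mapping each non-root vertex to its parent (assumed input format).
    leaf_genomes: dict mapping each leaf to the set of genes observed there.
    Returns a dict mapping ancestral vertices to the genes they must carry.
    """
    def path_to_root(v):
        path = [v]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    # Group the leaves by the genes they carry.
    carriers = defaultdict(list)
    for leaf, genes in leaf_genomes.items():
        for g in genes:
            carriers[g].append(leaf)

    required = defaultdict(set)
    for g, leaves in carriers.items():
        paths = [path_to_root(x) for x in leaves]
        # Common ancestors of all carriers; the first one met on the way up
        # from any carrier is their most recent common ancestor.
        common = set(paths[0]).intersection(*paths[1:])
        mrca = next(v for v in paths[0] if v in common)
        for path in paths:
            # Every vertex strictly above the leaf, up to and including the
            # most recent common ancestor, must carry g.
            for v in path[1:path.index(mrca) + 1]:
                required[v].add(g)
    return dict(required)


# Hypothetical four-leaf tree: root r with children u and w.
parent = {"a": "u", "b": "u", "c": "w", "d": "w", "u": "r", "w": "r"}
leaf_genomes = {"a": {"g", "h"}, "b": {"h"}, "c": {"g"}, "d": set()}
required = ancestral_genomes_without_lgt(parent, leaf_genomes)
for v in sorted(required):
    print(v, sorted(required[v]))
# r ['g']
# u ['g', 'h']
# w ['g']
```

In this toy example gene g, seen at only two leaves, is forced into all three ancestral genomes, whereas gene h, confined to one clade, is forced only into the ancestor of that clade.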
Notice that we wish to count transfer events (rather than the total number of genes that are transferred), since in each transfer event, several genes may be transferred from one species into another. Thus our count of LGTs is conservative, and recognizes that genes are not independently transferred and that a transfer event may insert a section of the genome (with several genes) into an individual organism of a different species.

The structure of this paper is as follows. In the next section, we define the model of gene genesis, loss and transfer precisely, and summarize our main results. We then provide proofs of these results in subsequent sections, and end with some concluding comments and a conjecture.

2 MATHEMATICAL MODEL AND SUMMARY OF MAIN RESULTS

2.1 Definitions and model specification

We begin by recalling some notation concerning digraphs, and phylogenetic trees and networks. Let $v$ be a vertex of a digraph $D$. The indegree of $v$, denoted by $d^-(v)$, is the number of arcs directed into $v$, while the outdegree of $v$, denoted by $d^+(v)$, is the number of arcs directed out of $v$. The degree of $v$ is $d^-(v) + d^+(v)$. Furthermore, $u$ is an in-neighbour of $v$ if $(u, v)$ is an arc in $D$, while $w$ is an out-neighbour of $v$ if $(v, w)$ is an arc in $D$. A digraph $D$ is rooted if there exists a vertex, $\rho$ say, of indegree zero such that, for each vertex $v$ in $D$, there exists a directed path from $\rho$ to $v$.

Throughout the paper, $X$ will denote a finite set of taxa and $\mathcal{G}$ will denote a finite set of genes. A phylogenetic tree (on $X$) is a rooted tree whose root has degree at least two, whose other internal vertices all have degree at least three, and whose leaf set is $X$. More generally, a phylogenetic network $N$ (on $X$) is a rooted acyclic digraph with the following properties: (i) the root has outdegree at least two and, for all vertices $v$ with $d^+(v) = 1$, we have $d^-(v) \ge 2$; and (ii) the set of vertices of outdegree zero is $X$.
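The definitions above translate directly into a small check. The following Python sketch, which assumes a digraph given as a set of vertices and a set of arcs (ordered pairs), tests rootedness, acyclicity and properties (i) and (ii) of a phylogenetic network. The function names and the example network are illustrative assumptions rather than notation from the paper.

```python
from collections import defaultdict, deque


def degrees(vertices, arcs):
    """Return (indegree, outdegree) dictionaries of the digraph."""
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg


def successors(arcs):
    """Map each vertex to the list of its out-neighbours."""
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    return succ


def find_root(vertices, arcs):
    """Return the root if the digraph is rooted, otherwise None."""
    indeg, _ = degrees(vertices, arcs)
    sources = [v for v in vertices if indeg[v] == 0]
    if len(sources) != 1:  # a rooted digraph has exactly one indegree-zero vertex
        return None
    rho, succ = sources[0], successors(arcs)
    seen, queue = {rho}, deque([rho])
    while queue:  # breadth-first search along directed paths from rho
        for v in succ[queue.popleft()]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return rho if seen == set(vertices) else None


def is_acyclic(vertices, arcs):
    """Kahn's algorithm: repeatedly delete vertices of indegree zero."""
    indeg, _ = degrees(vertices, arcs)
    succ = successors(arcs)
    queue = deque(v for v in vertices if indeg[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return removed == len(set(vertices))


def is_phylogenetic_network(vertices, arcs, taxa):
    """Check the definition of a phylogenetic network on the set `taxa`."""
    rho = find_root(vertices, arcs)
    if rho is None or not is_acyclic(vertices, arcs):
        return False
    indeg, outdeg = degrees(vertices, arcs)
    # (i) root of outdegree >= 2; outdegree-one vertices have indegree >= 2
    if outdeg[rho] < 2:
        return False
    if any(outdeg[v] == 1 and indeg[v] < 2 for v in vertices):
        return False
    # (ii) the vertices of outdegree zero are exactly the taxa
    return {v for v in vertices if outdeg[v] == 0} == set(taxa)


# Hypothetical network: h has two in-neighbours (u and w) and one out-neighbour.
V = {"r", "u", "w", "h", "a", "b", "c"}
A = {("r", "u"), ("r", "w"), ("u", "a"), ("u", "h"),
     ("w", "h"), ("w", "c"), ("h", "b")}
print(is_phylogenetic_network(V, A, {"a", "b", "c"}))                 # True
print(is_phylogenetic_network(V, A - {("w", "h")}, {"a", "b", "c"}))  # False
```

Deleting the arc (w, h) leaves both w and h with outdegree one and indegree one, which violates property (i), so the second call rejects the digraph.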
The elements of $X$ are the leaves of $N$. For a subset $U$ of the vertex set of $N$, the sub-digraph of $N = (V, A)$ induced by $U$ is the digraph whose vertex set is $U$, and whose arc set is the subset $\{(u, v) : u, v \in U \text{ and } (u, v) \in A\}$ of $A$.

We now describe the model of gene genesis, loss, and transfer. For each taxon $x \in X$, assume that the subset $G(x)$ of $\mathcal{G}$ consisting of the genes in $\mathcal{G}$ that have been observed in taxon $x$ is known. We refer to the associated map $G : X \to 2^{\mathcal{G}}$ as a genome assignment. Let $N = (V, A)$ be a phylogenetic network on $X$. For a fixed positive integer $k$ and a genome assignment $G : X \to 2^{\mathcal{G}}$, a $(G, k)$-gene labelling of $N$ is a mapping $F : V \to 2^{\mathcal{G}}$ such that the following hold:

(I) $F(x) = G(x)$ for each $x \in X$;
(II) $|F(v)| \le k$ for all $v \in V$;
(III) for each gene $g \in \mathcal{G}$, the sub-digraph of $N$ induced by $\{v \in V : g \in F(v)\}$ is rooted (and therefore connected).

Note that if $x \in X$ and $|G(x)| > k$, then $N$ has no $(G, k)$-labelling. If $N$ has a $(G, k)$-labelling, we say that $N$ exhibits such a labelling.

A gene labelling describes a possible evolution of the genes observed in the taxa under consideration. Property (I) says that each leaf of the network is labelled by the set of genes observed in