Mobile genetic elements and their prediction

by Morgan G I Langille, Fengfeng Zhou, Amber Fedynak, William W L Hsiao, Ying Xu, Fiona S L Brinkman
Computational Methods for Understanding Bacterial and Archaeal Genomes (2008)

Available from Morgan Langille's profile on Mendeley.

June 26, 2008 9:18 9.75 x 6.5 B-631 ch05
CHAPTER 5
MOBILE GENETIC ELEMENTS AND THEIR PREDICTION
MORGAN G.I. LANGILLE, FENGFENG ZHOU, AMBER FEDYNAK,
WILLIAM W.L. HSIAO, YING XU and FIONA S.L. BRINKMAN
1. Introduction
Mobile genetic elements (MGEs) are regions of DNA that are able to move
themselves throughout the genome of a single organism or between organisms. These
elements all share three common hurdles to their proliferation. First, the genetic
element must be excised or transcribed from the host genome into either an RNA or
DNA molecule. Second, that element must be transmitted between organisms via
horizontal gene transfer (HGT) or within an organism and be ready for integration
as a DNA molecule. Third, the element is then integrated into a replicon in a
new location. The mobile elements of prokaryotes, including but not limited to
prophage, transposons, integrons, insertion sequences and genomic islands, use
various mechanisms to overcome these obstacles. These elements form the basis
of important mechanisms of evolution that result in the transfer, rearrangement
or deletion of genes. In this chapter, we discuss the features associated with these
various types of mobile elements, and methods used for computationally identifying
them, focusing on the most common, chromosomally-integrated elements that can
be computationally predicted. Rare mobile elements and those MGEs that are not
integrated into the genome, such as plasmids, are discussed in less detail due to the
limited literature on their computational prediction.
2. Features of Mobile Elements
2.1. Genomic Islands
The majority of the prokaryotic genomes sequenced to date appear to be littered
with horizontally acquired DNA fragments, termed genomic islands (GIs). GIs are
commonly defined as clusters of genes in prokaryotic genomes that are thought to
have originated from a horizontal transfer event. It is not clear whether GIs should
be considered a separate MGE from others such as transposons and prophage, or
rather that GIs encompass all of these MGEs as sub-classes. However, in many
cases the transmission mechanism of these genetic elements is not obvious due
to mutations that have obscured or destroyed the transmission or integration
mechanisms, thereby reducing our ability to classify these regions in more
detail. In this section, we will focus mostly on the features of GIs that are related
to their identification.
The term GI is derived from pathogenicity islands (PAIs) (Hentschel
et al., 2001), which were first identified in uropathogenic Escherichia coli as
large (>100 kb), relatively unstable (excision frequencies ~10^-4 to 10^-6) regions
containing clusters of virulence associated genes (Knapp et al., 1986; Hochhut et al.,
2006). As more genomes were sequenced, it became clear that genetic elements
which share similar structural features with PAIs can encode other important
functions (new metabolic capabilities, etc.). Hence PAIs were grouped with other
similar elements and referred to collectively as GIs. GIs appear to contribute to the
adaptation of microbes in two ways. First, genes acquired in GIs have been shown
to allow the microbes to explore new niches and to improve fitness. For example,
many rhizobia species harbour symbiotic islands containing nitrogen fixation and
nodulation genes to allow their interaction with plant hosts (Sullivan et al., 1998).
For pathogens, GIs encoding iron uptake functions, type III secretion systems,
toxins, and adhesins augment their abilities to survive and cause diseases in the host
(Dobrindt et al., 2004; Gal-Mor et al., 2006). The second type of contribution of
GIs to microbial adaptation is less well studied but may play an equally important
role. New studies are emerging that show selective loss and possible regaining of
islands may provide an additional means to modulate pathogenicity (Lawrence,
2005; Manson et al., 2006). Spontaneous excisions of PAIs have been observed in
various pathogens resulting in distinct pathogenic phenotypes compared to wild
types (Bueno et al., 2004; Middendorf et al., 2004). In the case of Salmonella
enterica serovar Typhi pathogenicity island 7, called SPI7, deletion of this GI is
associated with more rapid invasion in vitro and reduced resistance to complement
attack (Bueno et al., 2004). As the genetic requirements for initiation of infection
and long-term infection can be quite different, the capability to lose or alter certain
genes, such as surface antigens, after the initial infection has been postulated as
a means to establish long term colonization and avoid immune detection (Finlay
et al., 1997; Gogol et al., 2007).
GIs share some sequence and structural features that help to distinguish them
from the rest of a given prokaryotic genome. These features are summarized below
and in Table 1.

Table 1. List of features associated with genomic and pathogenicity islands.

Feature associated with GIs            Possible method(s) to detect these features
-------------------------------------  ------------------------------------------------------
Sporadic distribution                  Comparative genomics to identify unique and shared
                                       regions
Sequence composition bias              Various tools have been developed to detect bias
                                       (see section Detection of Genomic Islands)
Adjacent to tRNA                       Detect full or partial tRNA using BLAST (Basic Local
                                       Alignment Search Tool) (Altschul et al., 1997) or
                                       tRNAscan-SE (Lowe et al., 1997)
Usually relatively large (>10 kb)      Comparative genomics to identify large insertions
Contain genes of unknown functions     Compare to functional databases such as COG (Clusters
                                       of Orthologous Groups) (Tatusov et al., 1997)
Contain mobility genes or elements     Similarity search of mobility genes using Hidden
                                       Markov Models (HMMs) or BLAST
Flanked by direct repeats              Use repeat finders such as REPuter (Kurtz et al.,
                                       1999) to identify repeats
Unstable and can excise spontaneously  Comparative genomics to identify unique regions;
                                       targeted PCR or hybridization to detect altered
                                       regions

First, GIs are sporadically distributed in closely related species or strains of
the same species. For example, most PAIs are present in pathogen genomes but are
absent from their non-pathogenic relatives. However, it is important to keep in mind
that the concept of virulence is context specific and a particular virulence factor
(e.g. factors involved in iron-uptake) may contribute to pathogenic potential in one
species but act as an important factor for survival and replication in other ecological
niches not susceptible to infection. In such nonpathogenic hosts or environments,
these islands may more appropriately be called “fitness islands” or “ecological
islands” (Hacker et al., 1997).
Second, GIs often exhibit sequence composition bias compared to the core
genome. The classic measure of sequence composition bias is G + C content
(%G + C). However, due to its limited sensitivity (Hsiao et al., 2005), additional
measures using oligonucleotides (k-mers) have more recently been used. Since the
majority of a given genome exhibits consistent sequence composition, the average
composition from the entire genome is often used as a substitute for the core
genome.
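As an illustration of the classic %G + C measure, the following minimal sketch scans a genome with a sliding window and reports windows whose G + C fraction differs from the whole-genome average, which stands in for the core genome as described above. The function names, window size, step and deviation threshold are hypothetical choices for this example and are not taken from any published tool.

```python
def gc_fraction(seq):
    """Return the G+C fraction among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def atypical_gc_windows(genome, window=5000, step=1000, max_dev=0.05):
    """Return (start, end, gc) for windows deviating from the genome average.

    The whole-genome G+C fraction is used as a substitute for the core
    genome. The threshold max_dev is an arbitrary illustrative value.
    """
    genome_gc = gc_fraction(genome)
    hits = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - genome_gc) > max_dev:
            hits.append((start, start + window, gc))
    return hits
```

In practice a fixed threshold such as max_dev would need tuning for each genome, and, as noted above, %G + C alone has limited sensitivity.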
Third, GIs are frequently found adjacent to tRNA genes or flanked with direct
repeats (Hacker et al., 1997). tRNA genes are known phage integration sites and
therefore may serve as integration sites for MGEs that become PAIs (Reiter et al.,
1989). GIs that use tRNAs as insertion sites often carry an “identity block”, several
nucleotides long, that is identical to the 5′ or 3′ end of a tRNA; upon insertion,
the tRNA is reconstituted by the identity block, generating a pair of direct repeats
(one from the identity block and the other from the tRNA gene) at the opposite
ends of the inserted fragment (Williams, 2002).
These sites have the added benefit of being highly conserved. Because tRNA genes are
often reconstituted upon insertion or excision, GIs do not abrogate
tRNA function.
Fourth, most GIs discovered to date are relatively large, ranging from 10 to
200kb (Hacker et al., 1997). They often contain clusters of functionally synergetic
genes, which led to the formulation of the selfish operon hypothesis (Lawrence et al.,
1996). This hypothesis postulates that HGT provides a mechanism for weakly
selected clusters of functionally-synergetic genes to spread and survive better than
unlinked genes since horizontal transfer is size-limited. In the long run, this selective
pressure leads to operon structures common in prokaryotes.
Fifth, while many GIs have been characterized, previous studies have shown
that islands contain a disproportionately high number of genes with no known homologs
or with unknown function. In addition, while certain gene functional classes such
as cell surface proteins, host-interaction proteins, and DNA-binding proteins are
more often observed in GIs (Nakamura et al., 2004; Merkl, 2006), and others, such
as genes involved in information processing, are rarely observed, it appears that in
the long run, all genes are subject to HGT.
Sixth, GIs often contain functional or cryptic mobility genes (those genes related
to the movement of MGEs) such as integrases and transposases. These mobility
genes may indicate that a GI is autonomous or they could be remnants of other
embedded MGEs such as IS elements that are frequently found in GIs (Hacker
et al., 1997). Non-autonomous GIs can also depend on host encoded recombination
enzymes or recombination can occur among highly-similar or identical copies
of the embedded mobile elements (e.g. IS elements) resulting in rearrangement,
translocation or deletion of GIs (Hacker et al., 1997).
Lastly, many GIs are unstable and have been reported to be sporadically
excised; therefore, certain isolates may not contain the GI (O’Shea et al., 2002;
Middendorf et al., 2004; Hochhut et al., 2006).
While it is not necessary for every feature to be present in a region for that
region to be called a GI, the simultaneous presence of a subset of these features is
generally viewed as strong evidence for the region’s horizontal origin.
2.2. Prophage
A prophage is the latent form of a prokaryotic virus known as bacteriophage or
simply phage. The movement of DNA between prokaryotic cells via a phage is
referred to as transduction. Phage can be divided into two general groups
depending on whether they possess the ability to become dormant, called temperate
phage, or if upon infection of the host their only choice is to enter a lytic cycle (the
production of phage progeny), called virulent phage (Lwoff, 1953). The dormant
phage, upon invading the bacterial cell, will often integrate its own DNA into the
bacterium’s genome (Freifelder et al., 1970) and will be replicated for numerous
generations along with the bacterial genome. Induction provokes dormant prophage
to enter a complete lytic cycle, and this may happen spontaneously or as a
consequence of change in the bacteria’s environmental conditions. These integrated
prophages account for a large portion of the variation seen between bacterial strains
(Ohnishi et al., 2001) and can represent a substantial number of genes in a bacterial
genome (Casjens et al., 2000). Furthermore, virulence factors that contribute to a
bacterium’s pathogenicity can be mobilized by phage and are seen as a key factor
in the evolution of new pathogens (Boyd et al., 2002).
Non-computational methods for identifying prophage within a bacterial species
depend on whether or not a susceptible host is available. If such a host exists
then spontaneous induction is sufficient for proliferation of the phage; otherwise,
the induction of a prophage would require special protocols such as the addition
of mitomycin C in Yersinia and Streptococcus strains (Yamamoto, 1967; Huggins
et al., 1977; Popp et al., 2000) along with electron microscopy (EM) and genome
analysis.
Prophage regions typically contain an integrase and several phage associated
genes. However, they can often carry other genes that are not associated with the
proliferation of the phage. Similarly to GIs, the presence of a tRNA or a flanking
direct repeat (described above) is supportive evidence that phage integration may
have occurred in a region.
2.3. Integrons
Integrons are genetic elements that utilize site-specific recombination to capture
and direct expression of exogenous open reading frames (ORFs). They were first
identified in the late 1980s for their important role in the capture and spread
of antibiotic resistance genes (Stokes et al., 1989). Bacteria harboring integrons
possess the ability to incorporate and express genes with potentially adaptive
functions, including antibiotic resistance genes, and therefore pose a major problem
for treatment of infectious diseases (Rowe-Magnus et al., 2002). Furthermore, some
bacteria become resistant to multiple antibiotics by harboring integrons that have
captured multiple antibiotic resistance genes and, potentially, genes encoding other
traits which give the bacteria an adaptive advantage. Additionally, integrons are
often linked with other MGEs, such as plasmids and transposons, leading to rapid
dissemination of such traits within a population. A recent study reported that up to
9% of bacteria harbor integrons (Boucher et al., 2007) making them an important
player in acquisition and spread of adaptive traits and antibiotic resistance in
bacterial populations.

Fig. 1. Schematic representation of a class 1 integron. intI, integrase gene; attI, integration
site; Pc, promoter for expression of integrated gene cassettes; 59-be (attC), 59-base element, the site adjacent to
the ORF recognized by IntI; sul, sulphonamide resistance; qacE, quaternary ammonium compound
resistance; ORF, open reading frame. Note that the circular cassette comes
from excision of the integrated form (not shown).

Integrons consist of three key elements necessary for the capture and expression
of exogenous ORFs: an integrase gene (intI) and a recombination site (attI) are
necessary for acquisition of genes, and a promoter (Pc) ensures their expression. intI,
attI and Pc comprise the 5′ conserved segment (5′CS), and the 3′ conserved segment
(3′CS) contains known genes that confer resistance to various compounds (Fig. 1).
The integrase IntI catalyzes the recombination between attI and a recombination site at the 3′
end of the gene called attC or the 59-base element (59-be). The 59-be consists of a
variable region spanning 45–128 nucleotides in length flanked by imperfect inverted
repeats at the ends designated R′ (GTTRRRY) and R′′ (RYYYAAC), where R is a
purine and Y a pyrimidine. The recombination site in the 59-be recognized by IntI
is between the G and T bases of R′. An ORF and its associated 59-be are termed
a gene cassette. These gene cassettes have been shown to be excised as covalently
closed circles that may contain more than one gene cassette linked together (Collis
et al., 1992).
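The consensus sequences given above can be turned into a simple pattern search. The sketch below is only an illustration: it assumes that R′′ (RYYYAAC) lies at the 5′ end of the 59-be and R′ (GTTRRRY) at the 3′ end, and it requires exact matches to both consensus sequences, even though the repeats are described above as imperfect. The function names and the default length limits are hypothetical and do not correspond to any published integron finder.

```python
import re

# IUPAC ambiguity codes used in the consensus sequences: R = purine, Y = pyrimidine.
IUPAC = {"R": "[AG]", "Y": "[CT]"}


def iupac_to_regex(consensus):
    """Translate a consensus such as 'GTTRRRY' into a regular expression."""
    return "".join(IUPAC.get(base, base) for base in consensus)


def find_candidate_59be(seq, min_var=45, max_var=128):
    """Return (start, end) of candidate 59-base elements on the given strand.

    Assumption for this sketch: the element starts with R'' (RYYYAAC),
    ends with R' (GTTRRRY), and the variable region between them is
    min_var to max_var nucleotides long. Only exact consensus matches
    are reported; the shortest match is returned for each start position.
    """
    left = iupac_to_regex("RYYYAAC")
    right = iupac_to_regex("GTTRRRY")
    # A lookahead is used so that overlapping candidates are all found.
    pattern = re.compile(
        "(?=(%s[ACGT]{%d,%d}?%s))" % (left, min_var, max_var, right)
    )
    seq = seq.upper()
    return [(m.start(1), m.end(1)) for m in pattern.finditer(seq)]
```

Because short consensus sequences occur frequently by chance, hits from such a search would only be candidates; the reverse strand would also need to be scanned, and additional evidence such as an adjacent ORF or a nearby integrase gene would be needed.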
All integrons characterized to date are classified as either integrons or
superintegrons. Integrons are defined as gene cassettes associated with MGEs
such as insertion sequences, transposons, and conjugative plasmids, which serve to
disseminate genes through mechanisms of HGT. Five classes of integrons have been
described, classified based on sequence homology of their integrase genes (Mazel,
2006).
Class 1 integrons are the most clinically relevant, isolated frequently from
patients with bacterial infections. Bacteria harboring class 1 integrons often exhibit
multi-antibiotic resistance and possess gene cassettes conferring resistance to a wide variety of
antibiotics, including all known β-lactam antibiotics (Mazel, 2006). One such class
1 integron was identified in E. coli that contains 8 different antibiotic resistance
cassettes including a broad-spectrum β-lactamase gene of clinical importance (Naas
et al., 2001).
Association with MGEs can lead to rapid dissemination of integrons and
their associated gene cassettes through both intraspecies and interspecies transfer.
In support of this, extensive reports have identified integrons in diverse Gram-
negative bacteria and also in some Gram-positives (Hall et al., 1999; Mazel, 2006).
Superintegrons differ from integrons in that they are chromosomally located and
not linked to MGEs. They also differ in that their cassette arrays can be quite
large in size; one unique superintegron identified in Vibrio cholerae harbors over
170 cassettes (Mazel et al., 1998; Rowe-Magnus et al., 1999).
In addition to antibiotic resistance genes, integron and superintegron gene
cassettes have also been shown to encode proteins involved in other adaptive
functions, including virulence factors, metabolic genes, and restriction enzymes
(Ogawa et al., 1993; Rowe-Magnus et al., 2001; Vaisvila et al., 2001). However,
a recent study reported that 78% of cassette-encoded genes are uncharacterized
or have no known homologs to date (Boucher et al., 2007). Therefore, more
investigation into the function and diversity of genes encoded in integrons is
needed to gain a better understanding of their adaptive importance in microbial
evolution.
2.4. Transposons and IS Elements
Barbara McClintock was the first to observe recurring chromosomal breakages
in the same region caused by a genetic element, Ds (Dissociation), in maize in
the early 1940s (McClintock, 1941). She later found another element, Ac (Activator),
in maize that must be present for the Ds element to exert chromosomal breakage.
These two elements were later proposed to be the autonomous (Ac) and non-
autonomous (Ds) members of the same transposon family (Fedoroff et al., 1983).
More generally, transposons are DNA elements, with lengths ranging from a few
hundred base pairs (bps) to more than 65,000 bps, that proliferate in the host
genome and have been observed in all three domains of life: bacteria, archaea and
eukaryotes.
Each group of transposons may consist of autonomous and non-autonomous
members. An autonomous transposon encodes transposition catalyzing enzymes,
called transposases, and is able to transpose itself. A non-autonomous transposon
does not encode such proteins and relies on its autonomous counterparts with
similar cis signals to transpose it. Movement of transposons is usually limited to
within a single cell, but they are often contained within other MGEs such as GIs
and prophages that allow for cell-to-cell transfer. Of course, as with any genomic
region, transposons could also be transferred between naturally competent cells via
transformation. In addition, some transposons called conjugative transposons can
move via conjugation and we will discuss these at the end of this section.

Fig. 2. Structures of two types of transposons in prokaryotes. (A) TIR transposon and (B) non-TIR
transposon. Both of them have autonomous and non-autonomous members. (C) A transposon
may also encode proteins other than a transposase.

A transposon consists of one or more overlapping genes, one of which may be a
transposase (Mahillon et al., 1998; Chandler et al., 2002; Siguier et al., 2006a),
as shown in Fig. 2. For a transposon with more than one gene, the upstream
gene encodes a DNA recognition domain, while a second overlapping gene encodes
the catalytic domain in most cases (Wicker et al., 2003). Additional genes that
may alter the host phenotype, such as antibiotic resistance
genes (Stokes et al., 2007), may follow. Most transposons carry a pair of terminal inverted
repeats (TIRs), each shorter than 50 bps, at the two termini and are termed TIR
transposons (Fig. 2A), while a non-TIR transposon (Fig. 2B) does not harbor such
TIR signals at the termini. Linker sequences are located between each terminal
signal and the ORF region.
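To make the notion of terminal inverted repeats concrete, the following sketch measures how far the start of a candidate element matches the reverse complement of its end. It is a simplified illustration under stated assumptions: the element boundaries are already known, only A/C/G/T bases are present, and the function names and the mismatch parameter are hypothetical rather than part of any tool discussed in this chapter.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element, max_len=50, max_mismatches=0):
    """Return the length of the terminal inverted repeat, or 0 if none.

    The first bases of the element are compared, position by position,
    with the reverse complement of its last bases. Extension stops once
    more than max_mismatches mismatches are seen; the reported length
    always ends on a matching position.
    """
    element = element.upper()
    limit = min(max_len, len(element) // 2)
    if limit == 0:
        return 0
    left = element[:limit]
    right_rc = reverse_complement(element[-limit:])
    best = 0
    mismatches = 0
    for i in range(limit):
        if left[i] == right_rc[i]:
            best = i + 1
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
    return best
```

A return value of 0 for a well-delimited element would be consistent with a non-TIR transposon, although inaccurate boundaries would produce the same result.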
The relocations of transposons could be deleterious to the host as they may
disrupt host genes by inserting into them and may alter the expression of the
neighboring genes with their endogenous promoters (Mahillon et al., 1998; Chandler
et al., 2002). Also, homologous recombination between two transposons contributes
to reorganization and deletion of chromosomal regions in the host genome (Toussaint
et al., 2002). Later studies suggested that transposons were also able to introduce
beneficial mutations to the host genome through insertion and recombination (Blot,
1994). For example, several studies have shown that transposons can give a selective
advantage to the host in specific environments by introducing recombinations in
E. coli (Zambrano et al., 1993; Naas et al., 1994; Lenski, 2004). By taking advantage
of such mutagenesis capabilities, transposons have been extensively used in genetic
engineering to mediate global insertional mutagenesis of bacteria (Ely et al., 1982;
Berg et al., 1984; Zink et al., 1984; Rella et al., 1985). Also, transposons served as
mobile priming sites to sequence DNA segments in the 1980s (Ahmed, 1985; Adachi
et al., 1987).
Insertion sequences (IS elements) are similar to autonomous DNA transposons,
in that they encode a transposase, but unlike transposons they do not encode any
genes contributing to the phenotype of the host (Adhya et al., 1969; Shapiro, 1969;
Shapiro et al., 1969). As of today, more than 1,500 IS elements have been identified
and they are classified into 20 families, with some families being subdivided into
groups, based on their genetic structures and the sequence similarities of the encoded
transposases (Siguier et al., 2006b). Recent studies suggest that ∼99% of known IS
elements in prokaryotes have fewer than 100 copies in their host genomes (Siguier
et al., 2006b).

Fig. 3. Structure of a composite transposon, Tn5.

Two adjacent IS elements, plus intervening DNA sequence, can form a composite
transposon as shown in Fig. 3, which may carry its own protein-encoding genes
within the linking DNA sequence, e.g. the antibiotic genes in Tn5 (Berg et al., 1989;
Berg, 1989; Reznikof, 2002) and Tn10 (Haniford, 2002). Several other transposons
with much more complex structures, e.g. Tn3 (Haniford, 2002) and Tn7 (Craig,
2002), have also been characterized in prokaryotes.
Conjugative transposons (CTns) are MGEs that have features of transposons,
plasmids and phage. As with transposons, conjugative transposons excise and
integrate themselves into the genome and are traditionally named under the
nomenclature of transposons, e.g. Tn916 (Franke et al., 1981) and Tn1545
(Buu-Hoi et al., 1980; Courvalin et al., 1987). However, conjugative transposons are similar
to plasmids in that they have a covalently closed circular transfer intermediate
that can be transferred by conjugation. This allows conjugative transposons to
be integrated within the same cell or between organisms. Unlike plasmids,
conjugative transposons in their circular form cannot autonomously replicate and
must become integrated into a prokaryotic genome to maintain their survival
(Scott et al., 1988; Rice et al., 1994). Some conjugative transposons have
site-specific integration and have integrases that are highly similar to those of lambdoid phages
(Poyart-Salmeron et al., 1989; Poyart-Salmeron et al., 1990). However, they differ
from phages in several aspects, including that they do not form viral particles
and are not transferred by transduction. As far as we know, no computational
prediction of conjugative transposons has been published in the literature. Reviews
on conjugative transposons may be found elsewhere (Clewell et al., 1993; Scott
et al., 1995).
2.5. Other Mobile Elements
As we have already shown in this section, MGEs are complex elements that, due
to their mosaic nature and multiple methods of movement, are not easily classified
or defined. Indeed, differences between transposons and IS elements, or prophage
and GIs are not always clear and represent the dynamic nature of biology and
research. Besides the most common MGEs outlined above, many other rare elements
exist such as inteins, intron-like regions that are spliced out after translation
(Gogarten et al., 2006), or group II introns (Dai et al., 2003; Fedorova et al., 2007).
Although we will not discuss these elements any further here, we recommend Gogarten
et al. (2006) and Dai et al. (2003) as starting points for the identification of inteins
and group II introns, respectively.
In addition to the rare MGEs mentioned above, we also will not be discussing
MGEs that do not integrate into the host genome such as plasmids and replicative
forms of phage. In contrast to the rare elements just mentioned, these MGEs
(especially plasmids) are prevalent in prokaryotes and are of medical importance
due to their ability to pass multiple antibiotic resistance factors between organisms.
However, due to the limited need for computational prediction of these MGEs they
are beyond the scope of this chapter.
3. Computational Methods for Mobile Element Prediction
Since the formation of a mobile element is a macroevolutionary event, we often do
not have the luxury of observing such events in real time. It is therefore necessary to rely
on evidence available to us in the present time to infer the history of the organism’s
genomic evolution. Therefore, the features associated with mobile elements, as
outlined above, can be leveraged for carrying out bioinformatics analyses and for
building bioinformatics tools to detect mobile elements. Although the types and
features of mobile elements are varied, many tools use common approaches for
their detection. In particular, similarity searches conducted with tools such as
BLAST (Altschul et al., 1997) or FASTA (Pearson et al., 1988) are often used
to query previously curated databases of mobile elements to identify putative new
elements. However, different cutoff criteria and parameters along with additional
requirements, such as a minimum number of genes in a contiguous cluster, are
often used to produce tools that are optimal for identification of a particular mobile
element type.
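The additional requirement of a minimum number of genes in a contiguous cluster can be sketched as follows. The example assumes that a similarity search has already been run and that its output has been reduced to the positions (gene order indices along the replicon) of genes with a significant hit to a mobile element database. The function name and the default values for min_genes and max_gap are hypothetical and would need to be tuned for a given element type.

```python
def cluster_hits(hit_gene_indices, min_genes=3, max_gap=1):
    """Group genes with database hits into candidate mobile element regions.

    hit_gene_indices: gene order positions of genes with a significant hit.
    max_gap: number of non-hit genes tolerated between consecutive hits.
    Returns (first_gene, last_gene) for clusters with at least min_genes hits.
    """
    clusters = []
    current = []
    for idx in sorted(set(hit_gene_indices)):
        # Close the current cluster when too many non-hit genes intervene.
        if current and idx - current[-1] - 1 > max_gap:
            if len(current) >= min_genes:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(idx)
    if len(current) >= min_genes:
        clusters.append((current[0], current[-1]))
    return clusters
```

For example, with the defaults, hits at gene positions 2, 3, 5 and 6 form one cluster spanning genes 2 to 6, whereas an isolated hit at position 20 is discarded.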
The following sections describe the methods that are used for the identification
of particular mobile elements in further detail. Considering that new methods are
being constantly published we will discuss only selected methods that appear to
be commonly used. In addition, we will highlight mobile element features that may
provide additional means for detection that have not been exploited previously.
3.1. Detection of Genomic Islands
Below, we will discuss methods in the context of the two main bioinformatics
approaches to identify GIs: sequence composition and comparative genomics.
3.1.1. Sequence Composition-based Approaches
Sequence composition based approaches rely on the assumption that different
organisms exhibit different nucleotide pattern preferences that constitute their
signatures. More closely related organisms share similar preferences and,
therefore, have more similar sequence composition signatures. As a result, if the
signature of a gene or gene cluster deviates from the genome signature, a
plausible explanation is that this gene (or gene cluster) has a foreign origin and its
signature reflects that of the original donor. The basic form of a genome signature
is the G + C content (%G + C) of the genome, which can be thought of as
mononucleotide frequencies. Prokaryotic genomes sequenced today have a %G + C
range of approximately 20% to 75%.
A large number of studies using dinucleotide frequencies and many other studies
based on codon usage as genome signatures largely confirmed that closely related
species share more similar signatures than more distantly related species (Karlin
et al., 1998; Sandberg et al., 2001; Carbone et al., 2003). Higher order DNA patterns,
such as tetra-, hexa-, and octa-nucleotide frequencies, have also been proposed to be
useful as genome signatures (see Chapter 1 for further discussion).
Factors other than HGT may contribute to observed sequence composition bias
and cause false HGT detection. For example, gene expression level has been linked
to codon usage and therefore also affects the trinucleotide frequencies. In 1982,
using then-available sequences, Gouy and Gautier analyzed codon usage patterns
in bacterial genes and confirmed that codon composition is correlated to mRNA
expression level (Gouy et al., 1982). Later on, it was shown that highly expressed
genes such as ribosomal proteins exhibit atypical composition bias (Karlin, 2001).
Also, natural variation in coding sequences can produce bias, especially if the sample
size is too small (i.e. the sequence is too short) to generate a reliable signal. As a
consequence, while sequence composition measures have been used to detect HGT in single
genes (Nakamura et al., 2004; Tsirigos et al., 2005), they are perhaps more suitable
for detecting GIs because a cluster of genes is less likely to all exhibit
sequence composition bias due to random noise (Karlin, 2001; Hsiao et al., 2003;
Waack et al., 2006). Another issue associated with using composition bias to detect
genes acquired horizontally is that mutational pressure acting on a foreign gene
may cause it to adapt to the host genome signature over time in a process termed
“amelioration” (Lawrence et al., 1997). It is believed that over a period of time,
the signature from the donor is lost and is replaced by the recipient’s signature.
Therefore, sequence composition bias is more suitable for detecting recent HGT.
Lastly, based on sequence composition alone, genes that are acquired from another
organism sharing the same or very similar genome signature (presumably due to
relatedness) would not be detectable (see Chapter 6 for further discussion of HGT
detection).
Despite these issues, sequence composition based approaches for detecting
horizontally acquired genetic material have been developed and improved in the
past few years and have been shown to be capable and versatile tools for detecting
GIs. All of the methods described below essentially calculate the k-mer frequencies
(k is usually from 1 to 9) for a sub-region of a genome and compare these results with
the expected frequencies from that genome. Deviation from the genome frequencies
is scored and if the score is above a certain cut-off, these regions are marked as
putative GIs. Below, we have highlighted the advantages and disadvantages of a
selection of representative tools.
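The general scheme shared by these methods (k-mer frequencies of a sub-region compared with those of the genome, followed by a cut-off) can be sketched as below. This is a generic illustration and not the scoring function of any particular tool: the deviation measure (sum of absolute frequency differences), the fixed-size windows and all parameter values are assumptions made for the example.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Return relative frequencies of all overlapping k-mers in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def composition_deviation(region, genome_freqs, k):
    """Sum of absolute differences between region and genome k-mer frequencies."""
    region_freqs = kmer_frequencies(region, k)
    kmers = set(region_freqs) | set(genome_freqs)
    return sum(
        abs(region_freqs.get(m, 0.0) - genome_freqs.get(m, 0.0)) for m in kmers
    )


def putative_islands(genome, k=2, window=5000, step=1000, cutoff=0.5):
    """Return (start, end, score) for windows scoring above the cut-off."""
    genome_freqs = kmer_frequencies(genome, k)
    hits = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        region = genome[start:start + window]
        score = composition_deviation(region, genome_freqs, k)
        if score > cutoff:
            hits.append((start, start + window, score))
    return hits
```

Published tools differ mainly in the choice of k, in whether genes or fixed windows are scored, and in how the deviation is normalized and thresholded.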
3.1.2. SIGI and SIGI-HMM
SIGI and SIGI-HMM both use codon usage (frequency of a trinucleotide normalized
by synonymous codons) as a genome signature (Merkl, 2004; Waack et al., 2006).
The codon usage frequency table of an organism is derived either from its whole
genome if available or from its species’ entry in the CUTG codon usage database
(Nakamura et al., 1999; Tu et al., 2003). For each gene, the multiplicative-product
of the codon usage frequency from each codon in the gene is determined using the
organism’s own codon frequency table (the host table). The same multiplicative-
product is also calculated using the same gene sequence but instead of using the
organism’s own table, other organisms’ frequency tables are used (the donor tables).
Lastly, a score in the form of a normalized odds-ratio is calculated from each pair-
wise comparison between the product derived from the host table and that derived
from a donor table. The score indicates whether the codon usage of a gene more
closely resembles that of the host species or that of another (putative donor)
species. If the resemblance is closer to the latter and the score meets a
user-defined cut-off, the gene is marked as a putative foreign gene. Using a
BLAST-like extension mechanism, non-contiguous clusters of putative foreign genes
are combined into a putative GI until the frequency of putative foreign genes
within the region falls below a predetermined cut-off. In the original
SIGI paper, a local frequency of 2 times the genome frequency was used. So if
the frequency of putative foreign genes in a genome is determined to be 10%, the
frequency of foreign genes within a putative GI has to be at least 20% or higher.
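The host-versus-donor comparison and the density rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SIGI implementation: the length normalization, the pseudo-frequency `floor` for unseen codons, and all function names are hypothetical, and the tables are assumed to map codons to usage frequencies.

```python
import math

def codons(gene):
    """Split a coding sequence into consecutive codons (trailing bases dropped)."""
    return [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]

def log_product(gene, table, floor=1e-6):
    """Log of the product of per-codon frequencies under one usage table.

    floor is an assumed pseudo-frequency for codons missing from the table.
    """
    return sum(math.log(table.get(c, floor)) for c in codons(gene))

def log_odds(gene, host_table, donor_table):
    """Length-normalized log odds; positive means the donor table fits better."""
    n = len(codons(gene))
    return (log_product(gene, donor_table) - log_product(gene, host_table)) / n

def is_dense_enough(flags, genome_frequency, factor=2.0):
    """True if the fraction of flagged genes in a region is at least
    factor times the genome-wide frequency of flagged genes."""
    return sum(flags) / len(flags) >= factor * genome_frequency
```

With a genome-wide frequency of 10% and the factor of 2 quoted above, a region in which two of five genes are flagged (40%) passes the rule, whereas one in which one of ten genes is flagged (10%) does not.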
In SIGI-HMM, the odds-ratio scores are determined as in SIGI, described above,
to produce a list of putative foreign genes. However, instead of using a
BLAST-like extension mechanism to construct putative GIs from putative foreign
genes, the updated program uses a hidden Markov model (HMM). The
HMM incorporates an alternative probabilistic model based on randomly generated
nucleotide sequences using the same amino acid sequence as the real gene product.
This alternative model provides a baseline measure for the random noise in the
sample. Moreover, an additional filter to remove potentially highly expressed genes
was also incorporated into the HMM using the codon usage of ribosomal proteins
as a reference. Using a path-generating algorithm of HMM, a final list of GIs is
predicted. All genes assigned to a putative foreign state (i.e. more similar to a
donor frequency table) are considered to be in GIs, and these regions are further
combined if fewer than four native (not foreign) genes lie between them.
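The final merging step can be illustrated on a list of per-gene labels. The sketch below assumes the HMM has already labeled each gene as foreign or native; the function name and the inclusive gene-index output are hypothetical choices, and only the gap rule (fewer than four native genes) comes from the description above.

```python
def merge_foreign_runs(states, max_gap=3):
    """Return inclusive (start, end) gene-index pairs of merged foreign regions.

    states: list of booleans, True where a gene is labeled foreign.
    Foreign genes separated by at most max_gap native genes are joined,
    i.e. fewer than four native genes when max_gap is 3.
    """
    regions = []
    for i, foreign in enumerate(states):
        if not foreign:
            continue
        if regions and i - regions[-1][1] - 1 <= max_gap:
            regions[-1][1] = i  # extend the previous region across the gap
        else:
            regions.append([i, i])  # start a new region at this gene
    return [tuple(r) for r in regions]
```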
One unique feature of the SIGI and SIGI-HMM approach is its ability to
suggest the putative donor of a GI from its pair-wise comparison scores (the more
a gene resembles the codon usage of another organism, the more likely that organism
is related to the donor). Due to the current limited sampling of the Earth’s biomes,
the donor often cannot be precisely predicted. However, based on the developer’s
own preliminary analysis, the rate of false predictions at the domain level (i.e.
Archaea versus Bacteria) is less than 1%, suggesting that better sampling can help
to improve the accuracy of donor prediction. One potential shortcoming of SIGI is
its use of codon usage as the genome signature, because this measure is subject
to the influence of gene expression level.
3.1.3. PAI-IDA
The PAI-IDA program (PAthogenicity Island–Iterative Discriminant Analysis)
uses iterative discriminant analysis on 3 different genome signatures: %G + C,
dinucleotide frequency, and codon usage (Tu et al., 2003). Initial window size of
20 kb and step size of 5 kb were used to calculate the DNA signature of a window
compared to the whole genome. A small list of known PAIs from 7 genomes was used
as the initial training data to generate the parameters used in the linear functions to
discriminate anomaly regions from the rest of the genome. Then through iteration,
the discriminant function is improved by taking additional (predicted) anomaly
regions into account. The iteration ends if the status of each region stops changing.
This algorithm was the first to demonstrate that it is possible to combine multiple
genome signatures for the detection of GIs.
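The iterate-until-stable idea can be illustrated with a much simpler classifier. The sketch below swaps the linear discriminant functions for a nearest-centroid rule, so it is a stand-in for the iteration scheme only and not PAI-IDA itself; the function names and the feature layout (one vector of signature values per window) are assumptions.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def iterate_labels(features, seed_labels, max_iter=100):
    """Relabel windows until no label changes.

    features: one signature vector per window (e.g. [G+C fraction, ...]).
    seed_labels: True for windows initially known to be anomalous.
    """
    labels = list(seed_labels)
    for _ in range(max_iter):
        anomalous = [f for f, flag in zip(features, labels) if flag]
        typical = [f for f, flag in zip(features, labels) if not flag]
        if not anomalous or not typical:
            break  # one class is empty, nothing left to discriminate
        c_anom, c_typ = centroid(anomalous), centroid(typical)
        new = [sq_dist(f, c_anom) < sq_dist(f, c_typ) for f in features]
        if new == labels:
            break  # the status of every window has stopped changing
        labels = new
    return labels
```

Starting from a single seed window, the first pass pulls in the windows that resemble it, and the second pass confirms that the assignment is stable.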
3.1.4. Alien-Hunter
Alien-Hunter uses “Interpolated Variable Order Motifs” (IVOMs), which combine
k-mers of variable length and prefer longer k-mers over shorter ones as long as
there is enough information (Vernikos et al., 2006). The length k ranges from 1 to 8.
The program assigns a weight to each k-mer based on its length in order to linearly
combine all the k-mer frequencies as a score. The weights are necessary because
shorter k-mers are more likely to appear than longer k-mers but longer k-mers
contain more information and are more specific. The initial sliding window size is
5 kb and the step size is 2.5 kb. IVOM vectors from a region are compared to IVOM
vectors of the genome to derive a distance score. An HMM is then used to refine the
boundaries of the HGT regions.
The advantage of this approach is in its ability to incorporate variable length
k-mers, and based on the developer’s own analyses, longer k-mers provide better
sensitivity and specificity than shorter k-mers alone (Vernikos et al., 2006).
3.1.5. Z-Curve (GC Profile)
An approach called Z-curve plots the cumulative ratio of A+T versus G+C along
a genomic sequence and uses a segmentation algorithm to detect break points where
the A+T to G+C ratio changes abruptly (Zhang et al., 2004). These break points
are hypothesized to correspond to the insertion points of a GI. Segments in between
large break points are, therefore, putative GIs. This approach has been incorporated
into a web based tool named GC Profile and is also available for download as a
software package (Gao et al., 2006). It should be noted that this approach does
not produce a list of GIs and relies on users to interpret the graphic outputs. As a
result, it is not suitable for automated detection of GIs.
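A cumulative A+T versus G+C profile is easy to compute, which helps to show what the plotted curve represents. The sketch below is an assumption-laden simplification: it uses a plain +1/-1 walk rather than the published Z-curve transform, and it reports only the single point of greatest deviation from a straight line, whereas a real segmentation would find several break points. The function names are hypothetical.

```python
def at_gc_walk(seq):
    """Cumulative (A+T) minus (G+C) count along seq; other symbols add 0."""
    walk, total = [], 0
    for base in seq.upper():
        if base in "AT":
            total += 1
        elif base in "GC":
            total -= 1
        walk.append(total)
    return walk

def strongest_break(walk):
    """Index where the walk deviates most from the straight line that
    joins the origin to the end point of the walk."""
    n = len(walk)
    slope = walk[-1] / n
    deviations = [abs(walk[i] - slope * (i + 1)) for i in range(n)]
    return max(range(n), key=deviations.__getitem__)
```

For a sequence made of an AT-rich half followed by a GC-rich half, the walk rises and then falls, and the reported index is the last position of the AT-rich half.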
3.1.6. IslandPath
Like Z-Curve, IslandPath provides a visual interface to aid researchers in
the detection of GIs (Hsiao et al., 2003). Each gene in the genome is represented as
a small circle whose color indicates whether it deviates significantly from the
genome averages for GC content and dinucleotide bias. In addition, any
mobility genes and tRNAs are given special markers on each gene circle. The end
result is a whole genome graphical view that highlights features that are associated
with GIs and allows for manual identification of putative GIs.
3.1.7. Wn-SVM
Tsirigos and Rigoutsos improved their original approach (called Wn) by
incorporating a support vector machine (SVM) to classify a gene either as
“native” or “foreign” (Tsirigos et al., 2005). SVMs have been used in many other
bioinformatics tools to classify biological entries into different classes with very good
sensitivity and specificity (for a good example and explanation of the use of SVMs
in bioinformatics see Gardy et al., 2005). While the SVM version of the Wn approach
(Wn-SVM) did show improved sensitivity over the original approach (Tsirigos
et al., 2005), it was evaluated only on a simulated data set. As a result, the
actual improvement under realistic biological settings is not clear. Nevertheless,
using an SVM to detect GIs presents a novel strategy.
3.1.8. Comparative Genomics-based Approaches
Comparative genomics based approaches entail the use of multiple genomes to
detect GIs. In these methods, GIs are often defined as clusters of genes in one
genome that are not present in a related genome (see Table 1). They are based
on the observation that GIs are sporadically distributed among closely related
species and can sometimes be found between very distantly related species as judged
by the degrees of sequence divergence in 16S rRNAs or other orthologs (Ragan,
2001). An example of a GI between distantly related species is a 16 kb region that
is 99% identical between some strains of Pyrococcus furiosus and some strains
of Thermococcus litoralis (Diruggiero et al., 2000). These methods can roughly
be divided into gene content based approaches and whole genome (nucleotide)
alignment based approaches (Ragan, 2001). However, due to complications in
automatically picking reference strains to carry out the comparison and the difficulty
in interpreting the comparative results, no software package for detecting GIs
using comparative genomic approaches has yet been published and made publicly
available.
In general, gene content based approaches use BLAST or other similarity search
tools to detect variation in gene content between two strains. For example, 528
genes in E. coli K12 do not have homologs in E. coli O157:H7, while 1387 E. coli
O157:H7 genes cannot be detected in E. coli K12 (Perna et al., 2001). These strain-
specific genes are often found in clusters and may correspond to putative GIs.
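Once a similarity search has established which genes have a homolog in the reference strain, grouping the remaining genes into clusters is straightforward. The sketch below assumes that search has already been run; the function name, the input layout and the minimum cluster size are hypothetical and chosen only for illustration.

```python
def strain_specific_clusters(gene_order, has_homolog, min_size=3):
    """Group consecutive strain-specific genes into clusters.

    gene_order: gene identifiers in chromosomal order.
    has_homolog: set of identifiers with a hit in the reference strain.
    Only runs of at least min_size consecutive genes are returned.
    """
    clusters, current = [], []
    for gene in gene_order:
        if gene not in has_homolog:
            current.append(gene)
        else:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
    if len(current) >= min_size:  # flush a run that reaches the last gene
        clusters.append(current)
    return clusters
```

This treats the chromosome as linear; for a circular replicon, a run spanning the origin would need to be joined in a further step.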
Unusual sequence similarity between genes found in distantly related organisms
has been reported as an indicator of HGT. In the absence of full genomes for the
organisms compared, this practice is not recommended because it is impossible
to ascertain orthology of genes when a complete genome is not available. Moreover,
unusual sequence similarity between two species can be due to purifying selection
pressure acting on the species compounded with lineage-specific gene loss in the
intervening species.
Comparison at the nucleotide sequence level can be carried out using genome
aligners such as MAUVE and MUMmer (Delcher et al., 2002; Darling et al., 2004).
These approaches typically use a BLAST-like strategy to find short conserved regions
between genomes and then extend the conserved blocks by aligning intervening
regions using more robust alignment strategies such as the Smith-Waterman
algorithm. Regions that cannot be aligned then may represent putative GIs. In
general, unless the direction of evolution is known, which is rarely the case,
it is difficult to distinguish an insertion from a deletion based on comparative
approaches. Moreover, finding strains that are within the appropriate phylogenetic
distance and with which reasonable whole-genome alignment could be achieved can
be a difficult challenge.
Comparative genomic approaches may be augmented by using additional
evidence associated with HGT. For example, a strategy developed by Ou and
colleagues used tRNAs to anchor putative GIs (Ou et al., 2006). They first
identify shared tRNAs among strains of E. coli. Then, extracting regions up-
and downstream of orthologous tRNAs, the authors used MAUVE to align these
regions to identify conserved blocks. Regions that fall between aligned upstream and
downstream blocks were investigated further as possible GIs using several filters to
remove false positives. The incorporation of tRNAs, which are often used as insertion
sites (see above), as an anchor for finding GIs can aid their identification. However,
not all GIs use tRNAs as insertion sites and thus this strategy is limited in the
types of GIs that are detectable. Better sequence or structural characterization of
other insertion site types can provide additional anchoring points for GI detection.
In summary, the exponential increase in the number of available genomes for
comparison now makes it possible to develop automated methods for comparative
genomics based detection of GIs.
3.2. Detection of Prophages
Since prophages have several features in common with GIs, they can often be
identified using many of the same approaches. Deviations in base composition from
the host genome, including GC content, codon usage, and dinucleotide bias, are
common signatures of prophage regions. In addition, the presence of mobility
genes (e.g. integrases, site-specific tyrosine and serine recombinases, lysases, etc),
direct repeats, and tRNAs supports the inference that the region has been recently
integrated (see above).
Approaches for the sole detection of prophage have also been developed that
depend on features that are unique to prophage. First, many of the genes found in
phage are quite different from those that would be natively used by bacteria. Virion
structural proteins, such as head, tail, and tail fibers that appear in close proximity
to each other can be strong indicators of a prophage region. Since phage structural
and regulatory genes are strong indicators of prophages, the most common approach
for detection is to search for these genes based on sequence similarity. This approach
usually starts by taking every gene within a bacterial genome and searching for
similar genes against a database of known phage genes that have been derived
from previously sequenced phage. To reduce the number of false identifications,
multiple genes that have significant matches to the database are required to be in a
cluster. The most stringent criterion would require every gene within a window
of a given size to be identified as a phage-like gene. However, there are
a couple of scenarios that could lead to a prophage region containing a gene that
does not have a significant hit. One is that the gene truly does have phage origin,
but that phage gene does not exist in the phage database (i.e. it’s a novel phage
gene). Although many phage genomes have been sequenced and recent metagenomic
studies have rapidly increased the number of phage genes in phage databases (Casas
et al., 2007), these databases should still not be considered a complete list of all
phage genes. Another scenario is that genome rearrangement has occurred since the
phage integration and this resulted in mixing of phage and bacterial genes. To allow
for this noise, a clustering technique or a sliding window is often used to find regions
with a significant number of phage-like genes. For example, Prophage Finder (Bose
et al., 2006) clusters any hits within a certain distance cutoff, ranging from 3 to 6 kb
and uses another cutoff requirement of between 5 and 10 phage hits per prophage.
Phage Finder (Fouts, 2006) on the other hand uses a sliding window with a fixed
size of 10 kb and step size of 5 kb; searching for windows with at least four hits.
These windows are then extended gene by gene if the annotated gene is known to
be associated with prophages such as tRNAs, integrases, etc.
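The sliding-window count described for Phage Finder can be sketched as below. Only the window size, step size and minimum number of hits are taken from the text; the function name and the representation of hits as genome coordinates are assumptions, and the subsequent gene-by-gene extension is not shown.

```python
def phage_rich_windows(hit_positions, genome_length,
                       window=10_000, step=5_000, min_hits=4):
    """Return (start, end, n_hits) for windows holding at least min_hits hits.

    hit_positions: coordinates (e.g. gene midpoints) of genes with a
    significant match to a phage gene database.
    """
    windows = []
    for start in range(0, max(genome_length - window, 0) + 1, step):
        end = start + window
        n_hits = sum(1 for p in hit_positions if start <= p < end)
        if n_hits >= min_hits:
            windows.append((start, end, n_hits))
    return windows
```

Because the windows overlap by half their length, a true prophage will usually be reported by two or more adjacent windows, which would then be merged before boundary refinement.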
3.3. Detection of Integrons
Computational identification of integrons in genomic sequence is complicated by
to the considerable diversity in the integron sequence between the different classes,
and the diversity of their associated gene cassette arrays. Identification of integrons
in clinically isolated bacteria initially involves both in vitro methods to identify the
presence of integrons, and downstream bioinformatics tools to functionally annotate
genes. In silico, there is not one generally adopted method to computationally
identify integrons in genomic sequence. Often multiple bioinformatics approaches
are combined to detect integrons in genomic sequence, again followed by functional
annotation of genes.
Usually bioinformatics tools or scripts are used to detect the conserved integron
features (see above). For example, some studies use BLAST-based similarity search
to detect 59-be, integrases, transposons, or known species specific gene cassettes
(Holmes et al., 2003; Gillings et al., 2005). One study used a more high-throughput
approach to identify integrons in a global analysis encompassing multiple diverse
sequenced bacterial genomes. In this case, integrons were identified through a
BLASTP similarity search (e-value cutoff 10^-25) to previously described integrases
from Vibrio and Escherichia species (Szekeres et al., 2007).
Various software tools other than BLAST are used to identify these conserved
regions. One study used Transact-SQL, an extension of SQL query language, to
identify 59-be (Boucher et al., 2006). Another study utilized a software package
called Sequence Analysis (developed by the Genetics Computer Group at the
University of Wisconsin Biotechnology Center), to identify direct repeats and
conserved R’ and R” regions of 59-be (Vaisvila et al., 2001). Another study
combined multiple approaches, initially using MAP software (Genetics Computer
Group, Madison, Wisconsin) to detect ORFs, followed by a BLAST search of
predicted coding regions and 59-be (Holmes et al., 2003). Finally, one report used
characteristics of integron structure to identify superintegrons in Vibrionaceae. In
this case, they developed custom Perl scripts to detect 59-bes and genes, and
included various length constraints, such as maximum length of genes and attC
sites (Rowe-Magnus et al., 2003).
Subsequent identification and annotation of genes in gene cassettes are primarily
performed with a BLAST similarity search against the GenBank and/or GenPept
databases from the National Center for Biotechnology Information (NCBI).
Sometimes additional sequence databases such as EMBL, Uniprot and the NCBI
Microbial Genome database are also used (see Sec. 4). Additionally, in some analyses
open reading frames are predicted using ORF Finder or WebGeneMark.HMM
(Lukashin et al., 1998) also available through the NCBI. Similarly, a study used
the following criteria to identify hypothetical genes that may not necessarily be
identified through homology search: a reading frame in the opposite orientation to
intI; a start codon within 30 bp of attI or the 59-be; a stop codon in or adjacent to
the next 59-be; and being the largest ORF bounded by two 59-be (Gillings et al.,
2005). Another study specified the longest coding region between two 59-bes as a
probable gene (Vaisvila et al., 2001).
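The "longest coding region between two 59-be" rule can be illustrated with a small open reading frame scan. The sketch below is an assumption-laden simplification: it takes the sequence lying between two 59-be sites as input, scans only the three forward frames, and requires an ATG start and a standard stop codon. The orientation and distance criteria listed above are not implemented, and the names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF on the forward
    strand, with end exclusive, or None if no complete ORF is present."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the previous stop
            elif start is not None and codon in STOPS:
                if best is None or i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best
```

To cover cassettes encoded on the other strand, the same function could be applied to the reverse complement of the input.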
Most methods used to detect integrons are designed as in-house solutions and
are never developed into reusable tools. Unfortunately, this makes the methods
almost impossible to compare. Hopefully, integron identification will improve
and allow for further tool development.
3.4. Detection of Transposons and IS Elements
Like integron prediction, transposon and IS element prediction is fairly limited.
Primarily, identification is based on sequence similarity searches against known
transposons and IS elements. Fortunately, these elements have been previously
collected into web accessible databases such as ISfinder and ACLAME (Leplae et al.,
2004; Siguier et al., 2006b).
ACLAME (Leplae et al., 2004) is a database that was started in 2003 to deal
with MGEs from plasmids and viruses, and provides high quality classification
of all the encoded proteins through clustering. The current version ACLAME
0.2 provides browsing interfaces for individual mobile elements, mobile proteins
clustered in families, hosting organisms and functions defined in ACLAME. In
addition, Mahillon and Chandler collected and characterized ∼500 IS elements in
prokaryotes in 1998 (Mahillon et al., 1998), and organized the information into the
ISfinder database (Mahillon et al., 1998; Chandler et al., 2002; Siguier et al., 2006b).
Currently the database has ∼1,500 IS elements and has evolved into one of the most
comprehensive databases for IS elements in prokaryotes.
A recent example for the use of these databases was provided by Touchon and
Rocha (Touchon et al., 2007) when they scanned the genomes of 262 sequenced
prokaryotic organisms against the ∼1,500 IS elements in the ISfinder database
(Siguier et al., 2006b). The identified proteins were further grouped into the 20 IS
families based on their best matched IS elements in the ISfinder database (Siguier
et al., 2006b). They showed that an IS element could have multiple consecutive
and possibly overlapping ORFs. The family assignments were based on protein level
comparison as the linker sequences and the TIR signals for each IS element in ISfinder
were not considered by Touchon and Rocha. Through this genome-scale annotation
of IS elements in 262 prokaryotic organisms several interesting observations were
proposed by the authors, including that genome size is the only significant predictor
of the number of IS elements and the density of IS elements in a host genome. A
limitation of this method is that only the coding sequences of transposases, and not
other signals associated with IS elements such as the terminal signals, were used,
possibly leading to high false positive prediction rates (Zhou et al., 2007).
One alternative approach to similarity searching is to search for TIRs. A pair of
TIRs together with other features, like a coding region in between, would strongly
suggest that a region is a transposon, and also provides the boundary information
of the transposon (Prosseda et al., 2006; Alavi et al., 2007). This approach would
be limited to finding only TIR transposons and could be one of the reasons that no
such algorithm has yet been reported.
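To make the idea of a TIR search concrete, the sketch below looks for pairs of exact inverted repeats separated by a bounded distance. It is a toy illustration and not a published algorithm: real TIRs are usually imperfect, so a practical method would need to tolerate mismatches and check for a coding region in between. The function names and the default arm length and gap limits are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_tir_pairs(seq, tir_len=10, min_gap=20, max_gap=3000):
    """Return (left_start, right_end) pairs, end exclusive, where the right
    arm is the exact reverse complement of the left arm."""
    seq = seq.upper()
    positions = {}  # k-mer -> list of start positions
    for i in range(len(seq) - tir_len + 1):
        positions.setdefault(seq[i:i + tir_len], []).append(i)
    pairs = []
    for i in range(len(seq) - tir_len + 1):
        target = reverse_complement(seq[i:i + tir_len])
        for j in positions.get(target, []):
            gap = j - (i + tir_len)  # bases between the two arms
            if min_gap <= gap <= max_gap:
                pairs.append((i, j + tir_len))
    return pairs
```

Each reported pair gives candidate element boundaries, which is the boundary information referred to above.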
4. Resources
ACLAME — http://aclame.ulb.ac.be/
Sequence Resources
NCBI — http://www.ncbi.nlm.nih.gov/
NCBI Microbial Genomes — http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi
EMBL — http://www.ebi.ac.uk/embl/
UniProt — http://www.ebi.ac.uk/uniprot/
Phage
Phage Finder — http://phage-finder.sourceforge.net/
Prophage Finder — http://bioinformatics.uwp.edu/~phage/ProphageFinder.php
NCBI Phage Database — www.ncbi.nlm.nih.gov/genomes/static/phg.html
Genomic Islands
SIGI-HMM — http://www.g2l.bio.uni-goettingen.de
PAI-IDA — http://compbio.sibsnet.org/projects/pai-ida/
Alien-Hunter — http://www.sanger.ac.uk/Software/analysis/alien_hunter/
Z-Curve — http://tubic.tju.edu.cn/GC-Profile/
IslandPath — http://www.pathogenomics.sfu.ca/islandpath/
HGT-SVM — http://cbcsrv.watson.ibm.com/HGT_SVM/
Insertion Elements
ISFinder — http://www-is.biotoul.fr/
Other tools
tRNAscan-SE — http://lowelab.ucsc.edu/tRNAscan-SE/
Group II introns — http://www.fp.ucalgary.ca/group2introns/
5. Discussion
Several different classes of mobile elements each have their own set of features
that allow for different detection methods to be used. In light of this, published
algorithms and methods usually focus on detection of a single type of mobile
element to avoid the complexities of designing a method that detects all mobile
elements. However, there are some general approaches that can be extended for
the detection of various mobile elements and may allow for future integration of
multiple approaches into a single detection tool.
The most common approach to identify any genomic element is to use similarity
searches against a dataset of known genetic elements. Quite often a similarity
search tool such as BLAST (Altschul et al., 1997) is used to find genes or genomic
regions with sequence similarity to an entry in a previously-curated database. These
methods are usually quite successful in finding mobile elements in unexamined
genomes; however, this type of approach has several limitations. The largest
limitation is that the sensitivity of the tool is heavily dependent on the completeness
of the known dataset of mobile elements. Any mobile element that is not similar
to a previously known mobile element will not be detected by this approach. For
example, if we are searching for prophage using a database of phage genes we are
limited to finding only the prophage that have similarities to those phage genomes
that have been previously sequenced. Novel phage genes cannot be detected using
this approach.
The detection of compositional bias in genomic regions can be used to aid
in the identification of regions that have been horizontally transferred and this
approach does not depend on sequence similarity. However, these methods have
their own limitations (see Sec. 3.1) and usually bias detection toward relatively
recent transfers. As the number of genomes increases, comparative approaches will
become increasingly important, yet such methods remain underdeveloped to date
compared with the plethora of sequence composition-based approaches.
As with all mobile elements, transposons are actively involved in the genome
evolution, and could introduce many types of changes to the genome, including gene
rearrangement, insertion and deletion. Hence it is interesting as well as important
to study the distributions of the transposons across the sequenced genomes to
understand the possible factors that might affect the distributions of transposons in
their host genomes (Frost et al., 2005; Wagner, 2006). Another question of interest
that could be asked based on the annotation of transposons is how transposons affect
the cellular machinery of the host organism by influencing neighboring genes
through their endogenous promoters.
Currently only ∼220 out of the ∼1,500 IS elements in the ISfinder database
(Siguier et al., 2006b) are reported to appear in more than one organism. The general
distributions of IS elements across a genome or multiple genomes are not very well
understood and identification of all the known IS elements in a genome would be
difficult using experimental techniques. Therefore, prediction and analysis programs
with improved capabilities could help in annotation of all known transposable
elements in all sequenced genomes and lead to an improved understanding of their
distribution.
In addition to more tools, well curated and updated databases of mobile
elements are needed. Often, the most beneficial databases are those that are
successful in obtaining submissions from many researchers. ISfinder is a good
example of this as a number of journals, including Microbiology and Journal of
Bacteriology, now require authors reporting new IS elements to deposit them into
ISfinder.
Comprehensive databases also allow for thorough testing of existing and newly
developed tools. Many in silico tools for the detection of mobile elements are being
published in the scientific literature; however, quite often accuracy measurements
are not reported or comparisons between tools are based on different criteria.
Balanced and public evaluations of tools are needed to allow researchers to
effectively evaluate each tool’s capabilities. In particular, no study to date has been
performed to compare the accuracy (sensitivity and specificity) of the different
in-silico approaches for identifying integrons in genomic sequence.
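For such comparisons, sensitivity and specificity can be computed in the same way for every tool once predictions and a reference set are expressed over the same units. The sketch below assumes that genes are the unit of comparison and that a curated set of known positives exists; the function name and inputs are hypothetical.

```python
def sensitivity_specificity(predicted, known, universe):
    """Return (sensitivity, specificity) for set-valued predictions.

    predicted: genes a tool assigns to mobile elements.
    known: genes in the curated reference set of mobile elements.
    universe: all genes considered in the evaluation.
    """
    tp = len(predicted & known)             # correctly predicted
    fn = len(known - predicted)             # missed by the tool
    fp = len(predicted - known)             # predicted but not in the reference
    tn = len(universe - predicted - known)  # correctly left unpredicted
    return tp / (tp + fn), tn / (tn + fp)
```

Using one shared reference set and one unit of comparison for all tools is what makes the resulting numbers comparable.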
Furthermore, more investigation is needed into which features are best used to
predict mobile elements like integrons. Most current integron-prediction approaches
seem to take advantage of conserved regions of 59-be to detect integrons. However,
there are known resistance gene cassettes that harbor different 59-be regions
(Mazel, 2006). Therefore, more investigation into the diversity of these elements
is needed. In addition, to our knowledge, no paper has yet reported approaches to
identify both integrons and superintegrons in genomic sequence. With the continued
increase in genomic data, from metagenomic studies for example, there is a growing
interest in identifying all MGEs in newly sequenced genomes, and therefore more
research into producing more standard and accurate approaches is needed.
6. Summary
Mobile genetic elements are regions of DNA that are able to integrate themselves
into other genomic locations or even to other hosts. They may carry a single gene,
such as a transposase in an IS element, or a large number of genes contributing to a
common function such as pathogenicity or antimicrobial resistance. These elements
are important because they enable rapid phenotypic changes to occur, due to the
insertion of novel genes or the disruption of existing genes.
As with the detection of many other genomic elements, similarity search is
commonly used to identify MGEs. This approach can be used successfully for
the detection of MGEs because these regions contain genes such as transposases,
integrases, or phage-like genes that are not common to other genomic regions. False
positives, however, can arise through genome rearrangement. Also, similarity search
of these hallmark genes alone may not be sufficient to identify the boundaries
of the mobile elements. Since several types of mobile elements exhibit sequence
composition bias, computational approaches which measure the difference in DNA
sequence composition provide one alternative approach for identifying these elements.
Additionally, comparative genomic approaches may be increasingly useful.
Currently, each in silico detection method has fairly significant limitations.
Future methods will need to tackle these limitations and integrate many of the
current approaches into universal tools. Also, the development of robust databases
of MGEs will provide critical datasets for training and testing of the computational
methods developed. In Sec. 2, Features of Mobile Elements, we have described
the importance of MGEs in the development of adaptive changes in bacteria of
medical or environmental interest. Hopefully, this will stimulate the development of
more computational tools and databases that will address current limitations and
facilitate new insights regarding the evolution and function of these mobile genetic
regions.
7. Further Reading
Frost LS, Leplae R, Summers AO, Toussaint A (2005) Mobile genetic elements: the agents
of open source evolution. Nat Rev Microbiol 3:722–32.
Chandler M, Mahillon J: Insertion sequences revisited. In: Mobile DNA II. Edited by
Craig NL, Craigie R, Gellert M, Lambowitz AM. Washington, DC: American Society for
Microbiology; 2002:631–662.
Craig NL, Craigie R, Gellert M, Lambowitz AM (eds): Mobile DNA II. Washington, DC:
ASM Press; 2002.
Dawkins R: The selfish gene, 30th anniversary edition. Oxford University Press, 2007.
Acknowledgments
MGIL and WWLH are Michael Smith Foundation for Health Research (MSFHR)
Trainee awardees and FSLB is a MSFHR Senior Scholar. WWLH and FSLB
were also awarded a Canadian Institutes of Health Research Scholarship and
New Investigator award, respectively. FZ and YX are supported in part
by the National Science Foundation (NSF/DBI-0354771, NSF/ITR-IIS-0407204,
NSF/DBI-0542119, NSF/CCF0621700) and a Distinguished Scholar grant from the
Georgia Cancer Coalition.
Glossary
Compound transposon — A transposable element formed when two IS elements
insert on either side of a non-transposable segment of DNA.
Conjugation — Gene transfer that is mediated by certain plasmids and requires
direct cell contact.
Conjugative transposon — A transposon that encodes functions allowing transfer
of the transposon DNA between donor and recipient bacterial cells.
Cryptic genes — Phenotypically silent DNA sequences, not normally expressed
during the life cycle of the organism.
Genomic island — Clusters of genes in prokaryotic genomes that have evidence of
horizontal origins.
Horizontal gene transfer — Any process in which an organism transfers genetic
material to another cell that is not its offspring.
Insertion sequence (IS) element — A short mobile DNA sequence similar to
a transposon, except that it encodes only the genes needed for its own transposition.
Integrase — An enzyme that is used by phage to integrate one DNA molecule into
another.
Integron — A genetic element that encodes an integrase enzyme, which can assemble
tandem arrays of genes and provide them with a promoter for expression. Integrons
are often contained within other mobile elements, which allows them to be mobilized.
Phage — A virus that infects a prokaryotic organism.
Plasmid — A self-replicating (autonomous) circle of DNA distinct from the
chromosomal genome of bacteria. A plasmid contains genes normally not essential
for cell growth or survival.
Prophage — The dormant stage of a phage life cycle that is usually integrated in
the host genome.
Superintegrons — Integrons that are not linked to a mobile element and often have
much larger gene cassette arrays.
Transduction — Gene transfer that is mediated by a phage.
Transformation — Gene transfer that is mediated by the uptake of naked DNA.
Transposase — An enzyme that promotes cutting of the DNA at the ends of a
transposable element and joining to the DNA molecule into which the element is to
be inserted.
Transposon — A mobile DNA element that can relocate within the genome of its
host.