Computational prediction and characterization of genomic islands: insights into bacterial pathogenicity
- ISSN: 00278424
- DOI: 10.1073/pnas.120571797
Abstract
Genomic islands (GIs), including pathogenicity islands, are commonly defined as clusters of genes in prokaryotic genomes that have probable horizontal origins. These genetic elements have been associated with rapid adaptations in prokaryotes that are of medical, economical or environmental importance, such as pathogen virulence, antibiotic resistance, symbiotic interactions, and notable secondary metabolic capabilities. As the number of genomic sequences increases, the impact of GIs in prokaryotic evolution has become more apparent and detecting these regions using bioinformatics approaches has become an integral part of studying microbial evolution and function. In this dissertation, I describe a novel comparative genomics approach for identifying GIs, called IslandPick, and the application of this method to construct robust datasets that were used to test the accuracy of several previously published GI prediction programs. In addition, I will discuss the features of a new GI web resource, called IslandViewer, which integrates the most accurate GI predictors currently available. Further, the role of several GI and prophage regions and their involvement in virulence in an epidemic Pseudomonas aeruginosa strain that infects cystic fibrosis patients will be described; as well as an observation that recently discovered phage defence elements, CRISPRs, are over-represented within GIs.
COMPUTATIONAL PREDICTION AND
CHARACTERIZATION OF GENOMIC ISLANDS:
INSIGHTS INTO BACTERIAL PATHOGENICITY
by
Morgan Gavel Ira Langille
B.Sc. University of New Brunswick, 2004
B.CS University of New Brunswick, 2004
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
In the
Department of Molecular Biology and Biochemistry
© Morgan Gavel Ira Langille 2009
SIMON FRASER UNIVERSITY
Summer 2009
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without permission of the author.
APPROVAL
Name: Morgan Gavel Ira Langille
Degree: Doctor of Philosophy
Title of Thesis: Computational prediction and characterization of genomic islands: insights into bacterial pathogenicity
Examining Committee:
Chair: Dr. Paul C.H. Li
Associate Professor, Department of Chemistry
Dr. Fiona S.L. Brinkman
Senior Supervisor
Associate Professor of Molecular Biology and Biochemistry
Dr. David L. Baillie
Supervisor
Professor of Molecular Biology and Biochemistry
Dr. Frederic F. Pio
Supervisor
Assistant Professor of Molecular Biology and Biochemistry
Dr. Jack Chen
Internal Examiner
Associate Professor of Molecular Biology and Biochemistry
Dr. Steven Hallam
External Examiner
Assistant Professor of Microbiology and Immunology, University of British Columbia
Date Defended/Approved: Thursday April 16, 2009
Keywords:
bioinformatics; genomic islands; horizontal gene transfer; phylogenomics;
comparative genomics; evolution; bacteria; archaea; pathogenesis; phage
DEDICATION
In memory of my uncle,
Dr. Stephen Kerr.
I wish you were here to read this.
ACKNOWLEDGEMENTS
I would like to express my sincerest thanks to my supervisor, Dr. Fiona
Brinkman, for her support, guidance, and great insight. In addition, I would like to
thank my committee members Dr. David Baillie and Dr. Frederic Pio for their
positive suggestions and guidance. I would like to acknowledge all of the
collaborators from the LES project, including, Drs. Craig Winstanley, Roger
Levesque, Bob Hancock, and Nicholas Thomson. Furthermore, I would like to
thank all of the members of the Brinkman Lab especially Dr. William Hsiao for
sharing his knowledge on genomic islands. In addition, I would like to thank both
the supervisors and students of the Bioinformatics Training Program with a
special acknowledgment to Benjamin Good for the many discussions and honest
opinions on academic life and science.
Lastly, I would like to thank my parents and brothers for always supporting
me, and my wonderful wife Sylvia and son Gavin, who made this journey a truly
enjoyable adventure.
TABLE OF CONTENTS
Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Figures
List of Tables
Glossary
Chapter 1 Introduction
1.1 Horizontal gene transfer
1.2 Mobile genetic elements
1.2.1 Prophage
1.2.2 Integrons
1.2.3 Transposons and IS elements
1.2.4 Genomic islands
1.3 Detection of genomic islands
1.3.1 Sequence composition based methods
1.3.2 Comparative genomics methods
1.3.3 Databases and other computational resources
1.4 Goal of present research
Chapter 2 IslandPick: A comparative genomics approach for genomic island identification
2.1 Introduction
2.2 MicrobeDB
2.3 Identifying genomic islands using a comparative genomics approach
2.4 Automated selection of comparison genomes
2.4.1 Calculating genome distances
2.4.2 Genome selection parameters
2.5 Genomic island predictions using IslandPick
2.6 Developing a negative dataset of GIs
2.7 Discussion
Chapter 3 Evaluating sequence composition based genomic island prediction methods
3.1 Introduction
3.2 Comparison with sequence composition based GI prediction methods
3.3 Comparison with previously published genomic islands
3.4 Comparison of sequence composition based approaches using additional GI datasets constructed with more relaxed criteria
3.5 Discussion
Chapter 4 IslandViewer: An integrated interface for computational identification and visualization of genomic islands
4.1 Introduction
4.2 Implementation
4.3 Selection and integration of genomic island prediction methods
4.4 Features and design of IslandViewer
4.5 Discussion
Chapter 5 The role of genomic islands in the virulent Pseudomonas aeruginosa Liverpool Epidemic Strain
5.1 Introduction
5.2 Genome annotation
5.2.1 Virulence genes
5.2.2 Motility organelles
5.2.3 Phenazine biosynthesis
5.2.4 Lipopolysaccharide (LPS)
5.2.5 Antibiotic Resistance
5.3 Identification of prophage and genomic islands within LES
5.3.1 LES bacteriophage gene clusters
5.3.2 LES genomic islands
5.4 Signature tagged mutagenesis of LESB58
5.4.1 In-vivo analysis of STM mutants having insertions in prophage and genomic islands
5.5 Conclusions
Chapter 6 CRISPRs and their association with genomic islands
6.1 Introduction
6.2 Over representation of CRISPRs within GIs
6.3 GIs and CRISPRs have more phage genes
6.4 Conclusions
Chapter 7 Concluding Remarks
Appendix
Reference List
LIST OF FIGURES
Figure 1.1 Schematic representation of a class 1 integron.
Figure 1.2 Structures of two types of transposons in prokaryotes.
Figure 1.3 Structure of a composite transposon, Tn5.
Figure 1.4 Popularity of the terms “genomic islands” and “pathogenicity islands” in research paper abstracts archived in the PubMed database.
Figure 1.5 A general schematic of the class structure of MGE definitions.
Figure 1.6 Graphical representation of several genomic features associated with GIs.
Figure 2.1 Pipeline of the IslandPick prediction program.
Figure 2.2 A Pseudomonas species tree with overlaid CVTree distances.
Figure 2.3 Effect of IslandPick comparison genome cut-offs on a sample genome tree.
Figure 3.1 Accuracy calculations using IslandPick derived positive and negative datasets.
Figure 4.1 A screenshot of the IslandViewer interface.
Figure 5.1 Circular map of the P. aeruginosa LES genome.
Figure 5.2 Phage clusters identified in LESB58 with significant similarities and positioning of STM mutants after in-vivo screening.
Figure 5.3 GIs identified in LESB58 with significant similarities and positioning of STM mutants after in-vivo screening.
Figure 5.4 Alignment of LESGI-3 and four other previously published GIs in P. aeruginosa.
Figure 5.5 In-vivo competitive index (CI) of four STMs within P. aeruginosa LESB58.
Figure 6.1 Typical structure of a CRISPR system.
LIST OF TABLES
Table 1.1 List of features associated with genomic islands
Table 1.2 GI prediction programs.
Table 1.3 GI databases and other computational resources
Table 3.1 Average number of GI predictions and accuracy measurements of several GI prediction tools.
Table 5.1 P. aeruginosa LESB58 genome statistics
Table 5.2 Motility defect in LES isolates.
Table 5.3 Predicted pseudogenes in P. aeruginosa LESB58.
Table 5.4 Identified genomic islands and prophage regions.
Table 5.5 List of 47 LESB58 virulence associated genes.
Table 6.1 Over-representation of CRISPRs in GIs.
Table 6.2 Over-representation of genes with ‘phage’ annotation in CRISPRs and GIs.
CHAPTER 1 INTRODUCTION
Portions of this chapter have been previously published in the book chapter
“Mobile genetic elements and their prediction”, co-authored by M.G.I. Langille, F.
Zhou, A. Fedynak, W.W.L. Hsiao, Y. Xu, and F.S.L. Brinkman, in Y. Xu and J.P.
Gogarten (eds.), “Computational Methods for Understanding Bacterial and
Archaeal Genomes”, Series on Advances in Bioinformatics and Computational
Biology, Vol. 7. Imperial College Press, London, 2008. ©2008 Imperial College
Press.
1.1 Horizontal gene transfer
Bacteria are the most abundant Domain of life that exists on earth (based
on biomass) (Suttle, 2005). The species we see today are highly diverse,
reflecting adaptations to a wide range of environments over billions of years. One
of the major sources of adaptability for bacteria is the ability to obtain genes
horizontally from other sources, including other prokaryotes, viruses, and even
eukaryotes (Ochman, et al., 2000). Horizontal gene transfer (HGT) can occur by
one of three major mechanisms: transformation, conjugation, and transduction.
Transformation is the process by which bacteria take up naked DNA from
their environment (Griffiths, 1928). This transfer method has been shown to be
naturally present across various taxa from both the Bacteria and Archaea
Domains of life (Lorenz and Wackernagel, 1994). Any cell that is able to take up
naked DNA is considered “competent”. Competence is often an
inducible phenotype that responds to an environmental stimulus, although some
species, such as Neisseria gonorrhoeae and Haemophilus influenzae, are
constitutively competent (Dubnau, 1999). The process of transformation starts with
pose a major problem for treatment of infectious diseases (Rowe-Magnus, et al.,
2002). Furthermore, some bacteria become resistant to multiple antibiotics by
harbouring integrons that have captured multiple antibiotic resistance genes and,
potentially, genes encoding other traits that give the bacteria an adaptive
advantage. Additionally, integrons are often linked with other MGEs, such as
plasmids and transposons, leading to rapid dissemination of such traits within a
population. In 2007, it was estimated that approximately 10% of the partially or
completely sequenced genomes in the Bacteria domain contained integrons
(Boucher, et al., 2007), making them an important player in acquisition and
spread of adaptive traits and antibiotic resistance in bacterial populations.
Integrons consist of three key elements necessary for the capture and
expression of exogenous ORFs: an integrase gene (intI) and a recombination site
(attI), which are necessary for the acquisition of genes, and a promoter (Pc),
which ensures their expression. The intI gene, attI and Pc comprise the 5’
conserved segment (5’CS), and the 3’ conserved segment (3’CS) contains known
genes that confer resistance to various compounds or provide additional
metabolic function (Figure 1.1). IntI
catalyzes the recombination between attI and a recombination site at the 3’ end
of the gene called attC or the 59-base element (59-be). The 59-be consists of a
variable region spanning 45-128 nucleotides in length flanked by imperfect
inverted repeats at the ends designated R’ (GTTRRRY) and R’’ (RYYYAAC),
where R is a purine and Y a pyrimidine. The recombination site in the 59-be
recognized by intI is between the G and T bases of R’. An ORF and its
associated 59-be is termed a gene cassette. These gene cassettes have been
shown to be excised as covalently closed circles that may contain more than one
gene cassette linked together (Collis and Hall, 1992).
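The R’ and R’’ consensus sequences above are written with IUPAC ambiguity codes, so they translate directly into regular expressions. The following is a minimal sketch only: the function names are illustrative and are not taken from any published integron-finding tool. It shows that the two consensus sequences are reverse complements of each other and reports where the crossover point (between the G and T of R’) would fall in a candidate sequence. A consensus match on its own is weak evidence for a genuine 59-be.

```python
import re

# IUPAC codes used by the two consensus sequences (R = purine, Y = pyrimidine).
IUPAC_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}
IUPAC_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "R": "Y", "Y": "R"}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC_REGEX[base] for base in consensus)


def reverse_complement(consensus):
    """Reverse complement of a consensus written with A, C, G, T, R and Y."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(consensus))


def find_core_sites(seq, consensus="GTTRRRY"):
    """Return (start, cut) pairs for every consensus match in seq.

    start is the 0-based index of the leading G; the crossover point lies
    between seq[start] (G) and seq[cut] (T). Overlapping matches are kept
    by searching with a lookahead.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [(m.start(), m.start() + 1) for m in pattern.finditer(seq.upper())]
```

For example, `find_core_sites("AAGTTAGGCTT")` reports one match starting at index 2, with the crossover between positions 2 and 3.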
All integrons characterized to date are classified as either integrons or
superintegrons. Integrons are defined as gene cassettes associated with MGEs
such as insertion sequences, transposons, and conjugative plasmids, which
serve to disseminate genes through mechanisms of HGT. Five classes of
integrons have been described, classified based on sequence homology of their
integrase genes (Mazel, 2006). Class 1 integrons are the most clinically relevant
and are frequently isolated from patients with bacterial infections. Class 1
integrons often confer multi-antibiotic resistance on their hosts and carry gene
cassettes encoding resistance to a wide variety of antibiotics, including all
known β-lactam antibiotics (Mazel, 2006). One such class 1 integron identified
in E. coli contains 8
different antibiotic resistance cassettes including a broad-spectrum β-lactamase
gene of clinical importance (Naas, et al., 2001). Association with MGEs can lead
to rapid dissemination of integrons and their associated gene cassettes through
both intraspecies and interspecies transfer. In support of this, extensive reports
have identified integrons in diverse Gram-negative bacteria and in some Gram-
positives (Hall, et al., 1999; Mazel, 2006).
Superintegrons differ from integrons in that they are chromosomally
located and not linked to MGEs. They also differ in that their cassette arrays can
be quite large; one unique superintegron identified in Vibrio cholerae harbours
over 170 cassettes (Mazel, et al., 1998; Rowe-Magnus, et al., 1999).
Insertion Sequences (IS elements) are similar to autonomous DNA
transposons, in that they encode a transposase, but unlike transposons they do
not encode any genes contributing to the phenotype of the host and are typically
much smaller than transposons (Adhya and Shapiro, 1969; Shapiro, 1969;
Shapiro and Adhya, 1969). As of today, more than 1,500 IS elements have been
identified and they are classified into 20 families, with some families being
subdivided into groups, based on their genetic structures and the sequence
similarities of the encoded transposases (Siguier, et al., 2006). Recent studies
suggest that ~99% of known IS elements in prokaryotes have fewer than 100
copies in their host genomes (Siguier, et al., 2006).
A transposon consists of one or more overlapping genes, one of which
may be a transposase (Chandler and Mahillon, 2002; Mahillon and Chandler,
1998; Siguier, et al., 2006), as shown in Figure 1.2. Additional genes that may
alter the host phenotype, such as antibiotic resistance genes, may follow (Stokes,
et al., 2007). Most transposons carry a pair of terminal inverted repeats (TIRs)
(shorter than 50 bp) at their two termini and are termed TIR transposons
(Figure 1.2A), whereas a non-TIR transposon (Figure 1.2B) does not harbour such
TIR signals at its termini. Linker sequences are located between each terminal
signal and the ORF region.
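A terminal inverted repeat can be checked computationally by comparing the start of an element with the reverse complement of its end. The sketch below is illustrative only: it assumes perfect repeats (real TIRs are frequently imperfect), the function names are hypothetical, and the default cap of 50 bp simply mirrors the length limit mentioned above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element, max_len=50):
    """Length of the longest perfect TIR at the two termini of element.

    The left terminus is compared, base by base, with the reverse
    complement of the right terminus. The search stops at max_len or at
    half the element length, whichever is smaller, so the two arms of the
    repeat cannot overlap.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    length = 0
    while length < limit and element[length] == rc[length]:
        length += 1
    return length
```

An element that returns 0 has no perfect repeat at its termini under these assumptions, which is consistent with, but does not prove, a non-TIR transposon.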
The relocation of transposons could be deleterious to the host as they
may disrupt host genes by inserting into them and may alter the expression of
the neighbouring genes with their endogenous promoters (Chandler and
Mahillon, 2002; Mahillon and Chandler, 1998). Also, homologous recombination
between two transposons contributes to reorganization and deletion of
chromosomal regions in the host genome (Toussaint and Merlin, 2002). After
transposons were first discovered, many studies suggested that they were
able to introduce beneficial mutations into the host genome through insertion and
recombination (Blot, 1994). For example, several studies have shown that
transposons can give a selective advantage to the host in specific environments
by introducing novel gene mutants in E. coli (Lenski, 2004; Naas, et al., 1994;
Zambrano, et al., 1993). By taking advantage of such mutagenesis capabilities,
transposons have been extensively used in genetic engineering to mediate
global insertional mutagenesis of bacteria (Berg, et al., 1984; Ely and Croft,
1982; Rella, et al., 1985; Zink, et al., 1984).
Two adjacent IS elements, plus the intervening DNA sequence, can form a
composite transposon, as shown in Figure 1.3, which may carry its own
protein-encoding genes within the intervening DNA sequence, e.g. the antibiotic
resistance genes in Tn5 (Berg, 1989; Reznikof, 2002) and Tn10 (Haniford, 2002).
Several more
transposons with much more complex structures, e.g. Tn3 (Haniford, 2002) and
Tn7 (Craig, 2002), have also been characterized in prokaryotes.
Conjugative transposons (CTns) are MGEs that have features of
transposons, plasmids and phage (Clewell and Flannagan, 1993; Scott and
Churchward, 1995). As with transposons, conjugative transposons excise and
integrate themselves into the genome and are traditionally named under the
nomenclature of transposons, e.g. Tn916 (Franke and Clewell, 1981) and
Tn1545 (Buu-Hoi and Horodniceanu, 1980; Courvalin and Carlier, 1987).
However, conjugative transposons are similar to plasmids in that they have a
covalently closed circular transfer intermediate that can be transferred by
conjugation. This allows conjugative transposons to integrate at a new site
within the same cell or to transfer between organisms. In contrast to plasmids,
conjugative transposons in their circular form cannot replicate autonomously and
must become integrated into a prokaryotic genome to be maintained (Rice and
Carias, 1994; Scott, et al., 1988). Some conjugative transposons integrate
site-specifically and encode integrases that are highly similar to those of
lambdoid phages (Poyart-Salmeron, et al., 1989; Poyart-Salmeron, et al., 1990),
but they do not form viral particles and therefore are not transferred by
transduction.
Figure 1.3 Structure of a composite transposon, Tn5.
The Tn5 composite transposon contains two IS elements, IS50L and IS50R,
both of which have terminal inverted repeats at their ends (denoted by
triangles). The boxes between these IS elements represent genes that can be
carried by the composite transposon and in some cases are antibiotic
resistance genes.
1.2.4 Genomic islands
In 1990, researchers identified many virulence genes clustered together
on the chromosome of several E. coli strains that were not present in others
(Hacker, et al., 1990). These clusters of genes were thought to have been
horizontally transferred and, because they contained or were associated with
virulence determinants, were referred to as pathogenicity islands (PAIs). Later
studies suggested that other types of islands, besides PAIs, could exist with
genes related to other functions, such as “secretion islands”, antimicrobial
“resistance islands” and “metabolic islands” (Hacker, et al., 1997). GI was then
used as a more general term that referred to any cluster of genes, typically
10-200 kilobases in length, with horizontal origins (Hacker and Kaper, 2000).
Use of the terms “pathogenicity islands” and “genomic islands” has continued to
increase since these terms were first used (Figure 1.4). This definition of
Figure 1.5 A general schematic of the class structure of MGE definitions.
The fairly broad definition of GIs (large genomic regions with probable
horizontal origins) allows several other MGEs to be grouped within GIs and
illustrates how many GI prediction methods can be applied to other MGEs.
GIs share several sequence and structural features that help to distinguish
them from the rest of a given prokaryotic genome (Table 1.1, Figure 1.6).
One of the most pronounced features is that their phyletic patterns differ
from their host genome, resulting in GIs being sporadically distributed (i.e. only
found in some isolates from a given species or strain). Even within a specific
strain, there have been several reports showing that GIs are unstable and have
the ability to sporadically excise (Hochhut, et al., 2001; Middendorf, et al., 2004).
Sequence similarity tools such as BLAST (Altschul, et al., 1997) can be used to
search for genomic regions that are present in one particular species/strain, while
being absent in several related species, as a relatively simplistic method for
identifying GIs. In addition, whole genome sequence alignment tools such as
Mauve (Darling, et al., 2004) can be used to observe conserved genomic regions
(based on alignment of multiple related sequences) surrounding apparent newly
inserted regions, providing some confidence that a particular region is likely a GI.
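The presence/absence idea can be sketched in a few lines. The example assumes that a similarity search has already been run, so that each gene in the query genome (listed in chromosomal order) is annotated with the set of comparison genomes in which a match was found. The data layout, the function name, and the minimum cluster size of three genes are illustrative assumptions, not parameters of any published method, and the sketch treats the chromosome as linear.

```python
def candidate_islands(genes, min_genes=3):
    """Return runs of consecutive query genes that match no comparison genome.

    genes: list of (gene_id, hits) tuples in chromosomal order, where hits
           is the set of comparison genomes containing a match to the gene.
    min_genes: smallest run of unmatched genes reported as a candidate.
    """
    islands, run = [], []
    for gene_id, hits in genes:
        if not hits:
            # Gene is absent from every comparison genome: extend the run.
            run.append(gene_id)
            continue
        # A conserved gene ends the current run.
        if len(run) >= min_genes:
            islands.append(run)
        run = []
    if len(run) >= min_genes:
        islands.append(run)
    return islands
```

Runs reported this way are only candidates: a region that is absent from the comparison genomes may also reflect gene loss in those genomes rather than insertion in the query.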
Because different bacterial lineages may exhibit different genome sequence
compositions (such as G+C content), GIs will often have a sequence composition
that is significantly different from that of their new host genome. Sequence
composition-based GI predictors depend heavily on this feature to
identify islands. The simplest measure of sequence composition bias is G+C
content (%G+C), but oligo-nucleotides of varying lengths (typically 2-9
nucleotides) are being increasingly used (Karlin, 2001; Karlin, et al., 1998;
Sandberg, et al., 2001; Tsirigos and Rigoutsos, 2005; Vernikos and Parkhill,
2006). These measurements are often compared against the average
composition of the entire genome and various methods utilize this feature to
identify HGT and GIs. However, using only sequence composition bias to identify
GIs has several well known flaws. First, highly expressed genes, such as those
within ribosomal protein operons, often have a sequence composition that is
significantly different from the rest of the genome (Karlin, 2001), resulting in false
positive predictions of GIs. Second, any GIs that originated from species with a
similar sequence composition as their current host bacterial genome will not be
easily detectable. Third, mutational pressure acting on a foreign gene may cause
it to adapt to the host genome signature over time in a process termed
comparison genomes that are used in the analysis. For example, the inclusion of
very distant genomes (with extensive rearrangements) in the comparison could
make alignment of genomes difficult and lead to false positive predictions. Using
at least one genome that has more recently diverged (i.e. one with large regions
of conserved synteny) may result in more robust predictions of GIs; however, if
the genomes are too closely related, then GIs that were inserted before the
divergence of the genomes will not be predicted. Again, a comparative genomics
based approach depends on several related genomes having already been
sequenced, and for some genomes, there are no closely related
genomes yet available to perform this comparison. However, with the rapid
increase in sequenced genomes this limitation will continue to diminish and a
comparative genomics approach will likely increase in utility.
1.3.2.1 MobilomeFINDER
When I started my thesis project in 2004, no comparative genomics based
GI prediction method had yet been published. Since then, one other
method, called MobilomeFINDER, has been published (Ou, et al., 2007).
MobilomeFINDER focuses on identifying islands that are associated with
tRNA genes, which GIs often use as integration sites. The method starts by
identifying shared tRNAs among several related genomes and then uses Mauve
to search for GIs within the up- and downstream regions of these orthologous
tRNAs (Ou, et al., 2006). The extra requirement of having a tRNA near the
predicted GI makes this method quite robust; however, not all GIs use tRNAs as
insertion sites, so many GIs are missed (Hsiao, et al., 2005).
In addition to this limitation, MobilomeFINDER requires the manual selection of
both the query and the comparison genomes as input, which may result in
inconsistent selection criteria because users may be unfamiliar with the
phylogenetic distances within different genera.
1.3.3 Databases and other computational resources
In addition to the GI prediction programs listed above, there are several
other computational resources that can be useful in GI research (Table 1.3).
1.3.3.1 IslandPath
IslandPath provides a visual interface to aid researchers in the detection
of GIs (Hsiao, et al., 2003). Each gene in the genome is represented as a small
circle that has a colour assigned to it based on whether it has significant
deviation from the G+C content of the genome. Genes that have unusual
dinucleotide bias are also marked with a strikethrough. In addition, any mobility
genes and tRNAs are marked with additional shapes. The result is a clickable
whole genome graphical view that highlights features that are associated with GIs
and aids manual identification of putative GIs.
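The G+C colouring idea can be illustrated with a minimal sketch. This is not IslandPath's implementation: the cutoff of one standard deviation (`n_sd`), the use of the mean gene G+C as a stand-in for the genome G+C, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_gc_outliers(genes: dict, n_sd: float = 1.0) -> dict:
    """Label each gene 'high', 'low', or 'typical' relative to the mean gene G+C.

    n_sd is a hypothetical cutoff (in standard deviations), not a value
    taken from the text.
    """
    values = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(list(values.values()))
    sd = pstdev(list(values.values()))
    labels = {}
    for name, gc in values.items():
        if gc > mu + n_sd * sd:
            labels[name] = "high"
        elif gc < mu - n_sd * sd:
            labels[name] = "low"
        else:
            labels[name] = "typical"
    return labels
```

Genes labelled "high" or "low" would correspond to the coloured circles in the graphical view; dinucleotide bias and mobility genes are not modelled here.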
1.3.3.2 MOSAIC
MOSAIC is a database that contains pre-computed whole genome
alignments for several bacterial species (Chiapello, et al., 2005). Users can
browse and download conserved and “variable” regions for genomes within the
database, with the variable regions being potential GIs.
1.3.3.3 Islander
Islander is a database of 84 GIs and their tRNA integration sites for 106
genomes (Mantri and Williams, 2004). GI predictions were made by using tRNAs
and tmRNAs predicted by tRNAscan-SE (Lowe and Eddy, 1997) and BRUCE
(Laslett, et al., 2002) in a BLAST search and filtering out regions that do not
contain an integrase gene. GIs can be browsed by GI name, organism name, or
integration site (e.g. all GIs inserted in leucine tRNAs).
1.3.3.4 PAIDB
PAIDB is a database that provides GI information for those regions that
are homologous to previously described pathogenicity islands (PAIs) (Yoon, et
al., 2006). They call these regions PAI-like and any of these regions that also
show sequence composition bias using %G+C are labelled as candidate PAIs
(cPAIs). PAIs can be browsed by species, text searched, or searched with
BLAST.
1.3.3.5 VFDB
VFDB (Virulence Factor Database) contains curated lists of virulence
factors and pathogenicity islands for several species (Yang, et al., 2008). In
addition, a larger number of virulence factor related genes are listed based on
similarity to known virulence factors. These can be browsed by species, text
searched, or searched with BLAST and PSI-BLAST.
Table 1.3 GI databases and other computational resources
Resource Name | Description | Query and Download Options
--- | --- | ---
IslandPath | Aids in GI detection by visualizing multiple features of GIs (dinucleotide bias, mobility genes, tRNAs, etc.) in a single genomic view. | Whole genome graphical view is clickable and provides browsing of gene annotations
MOSAIC | Contains pre-computed whole genome alignments for several bacterial species | Conserved and variable (potential GIs) regions can be browsed and downloaded
Islander | Database of GIs within tRNA and tmRNA integration sites | GIs can be browsed by organism, GI name, or integration site
PAIDB | Contains GIs (identified by %G+C bias) that are homologous to previously described PAIs | Predictions can be browsed by species or by searching with text or BLAST
VFDB | Contains curated and putative (similarity based) virulence factors as well as a small list of curated PAIs | BLAST and PSI-BLAST can be used to search for virulence factors. PAIs can be found using a text search or by browsing species
1.4 Goal of present research
At the onset of my project, there were approximately 200 completely
sequenced prokaryotic genomes and no method that used comparative
genomics to identify GIs had yet been developed. In addition, several sequence
composition based GI prediction methods had been published but, surprisingly, a
thorough comparison of their accuracy had not been conducted. Furthermore,
many of these methods were not user friendly or easily accessible to the
researchers needing to use them for new genome sequencing projects. Lastly,
several studies had previously shown that pathogenic strains of bacteria often
contained GIs that were not present in their non-pathogenic relatives, but few
studies had shown direct evidence that these GIs provided an in-vivo competitive
advantage in the infected host organism.
resource I developed called MicrobeDB), I use stringent but potentially flexible
criteria, with distance cut-offs, to select query genomes that have a sufficient
number of suitably related species or strains to conduct an analysis of GIs.
IslandPick is then used to identify robust datasets of GIs from several genomes
for the primary purpose of creating a benchmark that can be used for evaluating
previously published sequence composition based GI prediction methods. As
additional genome sequences become available, IslandPick will become
increasingly useful for GI prediction and can be applied to those genomes to
expand the benchmark dataset in a consistent and automated fashion.
2.2 MicrobeDB
MicrobeDB, a new in-house database that stores all completely sequenced
genomes from the National Center for Biotechnology Information (NCBI), together
with an application programming interface (API) for accessing these data, was
constructed to aid large-scale bacterial genomic analyses. All
sequenced genomes are downloaded monthly from the NCBI FTP server
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria) and stored locally. Information at the
genome project, replicon (chromosome or plasmid), and gene level is parsed
from these files and stored in a MySQL database (see database schema in
Appendix File 2.1). Each monthly download is given a new version number so
that experiments can be conducted on a stable “snap-shot” of the currently
available genomes and annotations at a given time. Information within the
database can be accessed directly by MySQL queries, or through a novel Perl
API that allows easier access to all of the data.
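The versioned "snap-shot" idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of MySQL and the Perl API, and the table names, column names, organism names, and dates are hypothetical placeholders rather than the actual MicrobeDB schema (which is given in Appendix File 2.1).

```python
import sqlite3

# Hypothetical, simplified schema: every genome project belongs to one
# download version, and every replicon belongs to one genome project.
SCHEMA = """
CREATE TABLE version (version_id INTEGER PRIMARY KEY, download_date TEXT);
CREATE TABLE genome_project (
    gp_id INTEGER PRIMARY KEY,
    version_id INTEGER REFERENCES version(version_id),
    organism TEXT);
CREATE TABLE replicon (
    rep_id INTEGER PRIMARY KEY,
    gp_id INTEGER REFERENCES genome_project(gp_id),
    rep_type TEXT CHECK (rep_type IN ('chromosome', 'plasmid')));
"""


def load_snapshot(conn, version_id, download_date, genomes):
    """Store one download as a new version.

    genomes maps an organism name to a list of replicon types.
    """
    conn.execute("INSERT INTO version VALUES (?, ?)", (version_id, download_date))
    for organism, rep_types in genomes.items():
        cur = conn.execute(
            "INSERT INTO genome_project (version_id, organism) VALUES (?, ?)",
            (version_id, organism),
        )
        gp_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO replicon (gp_id, rep_type) VALUES (?, ?)",
            [(gp_id, rep_type) for rep_type in rep_types],
        )


def chromosomes_in_version(conn, version_id):
    """Organisms with a chromosome in the given snapshot, sorted by name."""
    rows = conn.execute(
        "SELECT g.organism FROM replicon r "
        "JOIN genome_project g ON r.gp_id = g.gp_id "
        "WHERE g.version_id = ? AND r.rep_type = 'chromosome' "
        "ORDER BY g.organism",
        (version_id,),
    )
    return [row[0] for row in rows]
```

Because every query is filtered by a version identifier, an analysis started on one snapshot would keep returning the same genomes after later monthly downloads are loaded.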
Figure 2.2 A Pseudomonas species tree with overlaid CVTree distances.
The species tree was constructed from the conserved genes carB and Omp85 using maximum parsimony. Bootstrap
values are shown on the inner nodes of the tree. The CVTree distances (shown on the leaves of the tree) are pair-wise
distances to P. syringae B728a. This is only one example of several trees that were inspected to confirm that CVTree was
calculating suitable species distances.
from the query genome (Figure 2.3C). Lastly, it is required that there be at least
three suitable comparison genomes for each query genome to be used for further
analysis (Figure 2.3D).
This entire set of cut-offs can be changed to permit prediction of GIs
acquired from different time frames. For example, by increasing the "Minimum
Distance Cutoff" and the "Single Close Genome Cutoff" the period of time that GI
acquisitions are detected is changed by choosing more divergent genomes for
the analysis. Overall, the parameters, in particular the default parameters, were
selected to ensure high precision and confidence in the resulting predictions, so
that they could be used to fairly evaluate the accuracy of several sequence
composition based GI prediction tools (see Chapter 3). The parameters were not
changed to maximize the accuracy scores of any GI prediction tools that were
evaluated; however, default parameters resulted in the highest apparent
accuracy when GI datasets were compared with a curated, literature-based
dataset (see section 3.3 below).
comparative analysis (see section 2.4.2 above). One hundred and seventy three
chromosomes met the requirements for the prediction of GIs and a subset of 134
chromosomes contained GIs while IslandPick did not detect GIs in the other 39
chromosomes (see genome list in Appendix File 2.2). Many of these 39 genomes
may contain GIs that are smaller than 8 kb or have other cases of HGT that were
not targeted by IslandPick. The dataset was further reduced to 118
chromosomes, because a negative dataset could not be predicted for 14
chromosomes and the GI prediction tool SIGI-HMM gave errors on another two
chromosomes (see section 2.6 and Chapter 3). In total, I identified 771 GIs,
comprising 12.4 Mb and ranging in size from 8-105 kb, within 118 chromosomes
from 117 different strains and 22 genera (see Appendix File 2.3). These putative
GIs contained a total of 11,404 annotated genes with an average of 14.8
genes/GI and 97.5 genes/strain (see Appendix File 2.4).
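The counts reported above can be cross-checked with simple arithmetic. The short script below uses only numbers stated in this section; the variable names are mine.

```python
# Counts reported in the text above.
chromosomes_meeting_requirements = 173
chromosomes_with_gis = 134
no_negative_dataset = 14
sigi_hmm_errors = 2
total_gis = 771
total_genes = 11_404
strains = 117

# Derived values.
chromosomes_without_gis = chromosomes_meeting_requirements - chromosomes_with_gis
final_chromosomes = chromosomes_with_gis - no_negative_dataset - sigi_hmm_errors
genes_per_gi = total_genes / total_gis
genes_per_strain = total_genes / strains

print(chromosomes_without_gis)      # 39
print(final_chromosomes)            # 118
print(round(genes_per_gi, 1))       # 14.8
print(round(genes_per_strain, 1))   # 97.5
```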
2.6 Developing a negative dataset of GIs
In order to evaluate the accuracy of several previously developed GI
predictors (see Chapter 3), a dataset of genomic regions that were not likely to
contain GIs was constructed (negative dataset). The IslandPick pipeline was
adapted to identify large genomic regions that were conserved in several
genomes. These regions are likely to form the stable backbone of the genomes
and are unlikely to be acquired by HGT among the strains considered. A multiple
genome alignment of each query genome and all comparison genomes
previously selected (see section 2.4 above) was performed using Mauve with
minimum backbone length and maximum gap size parameters set to 8000 and
300, respectively. The regions that were conserved across all genomes were
extracted from Mauve’s backbone output file. These conserved genomic regions
were identified for the same 134 query genomes that were used for prediction of
GIs. Conserved regions larger than 8000 base pairs could not be identified for 14
of these chromosomes and so these were removed from both the positive and
negative datasets. The resulting negative dataset was about 4 times larger than
the IslandPick dataset of GIs, containing approximately 50.6 Mb over 3770
separate genomic regions (see Appendix File 3.1). The size difference between
the negative and positive datasets was expected since the proportion of a
genome acquired by HGT is normally much smaller than its conserved backbone (Daubin and
Ochman, 2004; Vernikos, et al., 2007; Waack, et al., 2006).
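The extraction of conserved regions can be sketched as below. The sketch assumes a tab-delimited backbone layout in which each row lists a left and right end coordinate per genome, zeros mark a segment that is absent from a genome, and negative coordinates mark the reverse strand; the layout produced by the Mauve version used here may differ, so the parser is an assumption, as are the function names. The maximum gap size parameter (300) is applied by Mauve itself and is not modelled.

```python
def parse_backbone(lines):
    """Parse backbone text into rows of (left, right) pairs, one pair per genome.

    Blank lines and a header line (e.g. "seq0_leftend ...") are skipped.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].lstrip("-").isdigit():
            continue
        nums = [int(x) for x in fields]
        rows.append([(nums[i], nums[i + 1]) for i in range(0, len(nums) - 1, 2)])
    return rows


def conserved_regions(rows, query_index=0, min_length=8000):
    """Return query-genome (start, end) spans present in every genome.

    A span is kept only if it covers at least min_length bases in the query.
    """
    kept = []
    for row in rows:
        # Skip segments missing from any genome (assumed to be coded as 0, 0).
        if any(left == 0 and right == 0 for left, right in row):
            continue
        left, right = row[query_index]
        start, end = sorted((abs(left), abs(right)))
        if end - start + 1 >= min_length:
            kept.append((start, end))
    return sorted(kept)
```

Under these assumptions, a query chromosome for which `conserved_regions` returns an empty list corresponds to the 14 chromosomes for which no negative dataset could be produced.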
2.7 Discussion
I have introduced and outlined IslandPick, a novel automated method for
predicting GIs using comparative genomics. To date, this is the first attempt
to automate genome selection for comparative genomics. I have used
IslandPick, with its stringent default criteria, to generate datasets of GIs and non-
GI regions that can be used to evaluate the accuracy of multiple sequence
composition based GI predictors (see Chapter 3).
Of course, there are some limitations to predicting GIs using comparative
genomics. The choice of genomes for comparison to each query genome can
result in differences in the predicted GIs. IslandPick’s genome selection criteria
use several distance cutoffs to minimize this bias as much as possible (example
given in the next paragraph). GIs could be present in the negative dataset if a GI
inserted before the divergence of all genomes examined. To minimize these in
my datasets, IslandPick requires that at least three comparison genomes be
used for each query genome and that at least one comparison genome is at least
some minimal distance away from the query genome. The number of false
positive GI predictions is minimized by requiring that any putative GI is present
only in the query genome when compared to all comparison genomes.
Therefore, a deletion of the same genomic region would need to occur in three or
more strains for it to be mis-predicted as a GI in my analysis. Similarly, a GI that
inserted into multiple genomes would have to be conserved in all of the diverse
genomes studied, to be improperly placed in the negative dataset. Although
using several rules in the genome selection process results in very stringent
datasets of GI and non-GI regions, it does limit the number of organisms that can
be used by IslandPick. Relaxing the genome selection process by the removal of
some of these cut-offs would allow IslandPick to be applicable to more genomes.
It should be emphasized that IslandPick was not developed to be a GI prediction
tool that would replace sequence-based composition tools, which can be used on
any genome without the requirement of having several other comparative
genomes; rather, the IslandPick approach allows the testing of these tools and in
certain cases can also be used for GI prediction (cases that should increase
notably in the future, as more and more genomes are sequenced).
As an example of the IslandPick approach, when Salmonella enterica
Typhi CT18 is used as a query genome to identify islands using the default
cutoffs, very closely related genomes including S. enterica Typhi Ty2 and S.
enterica Paratyphi A str. ATCC 9150 were excluded from comparison. Therefore,
IslandPick identifies GIs that have inserted after the divergence of S. enterica
Typhi CT18 and the next most related genome that has been sequenced, which
is S. typhimurium LT2. Islands that inserted before the divergence of CT18 and
LT2 would also not be included in the positive dataset, using these stringent
cutoffs. However, IslandPick requires that at least one genome be a certain
distance from the query genome (Shigella dysenteriae Sd197 in this example),
so that these more ancient GIs are not improperly included in the negative
dataset. It is assumed that any sequences shared between the query genome
(e.g. Salmonella enterica Typhi CT18) and the comparative genomes including
those that meet the single distant genome cutoff (e.g. S. dysenteriae Sd197) are
sufficiently stable and can be considered as the conserved genome backbone.
Again, distance cutoffs can be modified in IslandPick to detect islands that are
more ancient or those acquired more recently.
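The selection rules discussed in this section can be sketched in a few lines. This is not the IslandPick source code. It models only the rules stated above (very close genomes are excluded, at least one comparison genome must lie beyond some minimal distance, and at least three comparison genomes are required). The upper bound `max_distance`, the parameter names, the genome names, and every numeric value in the usage example are hypothetical placeholders, not IslandPick defaults or real CVTree distances.

```python
def select_comparison_genomes(distances, min_distance, max_distance,
                              single_distant_cutoff, min_genomes=3):
    """Return comparison genomes sorted by distance, or [] if criteria fail.

    distances maps a genome name to its distance from the query genome.
    """
    # Drop genomes that are too close (or, hypothetically, too distant).
    candidates = {name: d for name, d in distances.items()
                  if min_distance <= d <= max_distance}
    # Require a minimum number of comparison genomes.
    if len(candidates) < min_genomes:
        return []
    # Require at least one sufficiently distant comparison genome.
    if not any(d >= single_distant_cutoff for d in candidates.values()):
        return []
    return sorted(candidates, key=candidates.get)


# Illustrative placeholder distances only.
example = {
    "close_strain": 0.02,
    "relative_1": 0.15,
    "relative_2": 0.20,
    "distant_relative": 0.40,
    "unrelated": 0.90,
}
selected = select_comparison_genomes(example, min_distance=0.10,
                                     max_distance=0.50,
                                     single_distant_cutoff=0.35)
print(selected)  # ['relative_1', 'relative_2', 'distant_relative']
```

Raising the minimum distance in such a scheme removes more of the closest genomes, which shifts detection towards islands acquired earlier.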
In many instances, IslandPick tends to split large islands into smaller
ones, which is probably the result of a few similar genes being identified in one or
more of the comparison genomes. Considering that as an island gets bigger
there is a greater chance of detecting some similarity between the genomic
regions being compared, one would expect that very large GIs might be split into
smaller ones. As indicated in recent research, this limitation could be addressed in
the future by joining islands that are interrupted by only small regions
of low similarity (Azad and Lawrence, 2007).
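One simple way to join split islands is to merge predictions separated by no more than a chosen gap. The sketch below is a generic interval merge, not the method of Azad and Lawrence (2007); the `max_gap` value and the coordinates in the usage example are hypothetical.

```python
def merge_islands(islands, max_gap):
    """Merge inclusive (start, end) islands separated by at most max_gap bases."""
    merged = []
    for start, end in sorted(islands):
        # Number of bases between the previous island and this one
        # (negative if they overlap).
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Placeholder coordinates: the first two islands are 499 bases apart.
print(merge_islands([(1000, 9000), (9500, 20000), (50000, 60000)], max_gap=1000))
# [(1000, 20000), (50000, 60000)]
```

A larger `max_gap` joins more fragments but also risks fusing islands that were acquired independently, so any such value would need to be evaluated against the benchmark datasets.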
Table 3.1 Average number of GI predictions and accuracy measurements of several GI
prediction tools.
Tool | Average number of nucleotides in GIs per genome (kb) | Precision | Recall | Overall Accuracy
--- | --- | --- | --- | ---
SIGI-HMM | 232.7 | 92.3 | 33.0 | 86.3
IslandPath/DIMOB | 170.7 | 85.8 | 35.6 | 86.2
PAI IDA | 163.2 | 68.0 | 32.2 | 83.7
Centroid | 171.3 | 61.3 | 27.6 | 82.4
IslandPath/DINUC | 444.4 | 54.8 | 53.3 | 82.2
Alien Hunter | 1264.8 | 38.0 | 77.0 | 70.8
Literature | 639.4 | 100 | 87.0 | 96.3
3.3 Comparison with previously published genomic islands
Although there is no gold standard dataset of GIs, I wanted to examine
how previously published GIs overlapped with my datasets. Five strains from the
list of 118 had published GIs (Beres, et al., 2002; Hayashi, et al., 2001;
McClelland, et al., 2001; Parkhill, et al., 2001; Perna, et al., 2001). As with the
analysis of the sequence composition based GI predictors, I calculated the
overlap of the published GIs against the positive and negative dataset. I found,
potentially due in part to the similar manual comparative genomics methods
sometimes used to identify GIs in the literature dataset, that the literature GIs had
the most agreement with my datasets (versus the GI predictors evaluated below).
Literature GIs had the highest precision, recall, and overall accuracy of 100%,
87%, and 96%, respectively, when using IslandPick-predicted islands as the test
dataset (Table 3.1).
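The overlap calculation can be illustrated with a minimal sketch. It assumes the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), overall accuracy = (TP+TN)/(TP+FP+TN+FN)) counted in nucleotides, and it assumes that predicted bases falling in neither the positive nor the negative dataset are ignored. The function names and coordinates are hypothetical.

```python
def overlap_bp(a, b):
    """Bases shared between two lists of inclusive (start, end) intervals.

    Intervals within each list are assumed not to overlap one another.
    """
    total = 0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0, min(e1, e2) - max(s1, s2) + 1)
    return total


def covered_bp(intervals):
    """Total number of bases covered by non-overlapping inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals)


def evaluate(predicted, positive, negative):
    """Return (precision, recall, overall accuracy) as fractions."""
    tp = overlap_bp(predicted, positive)   # predicted and in positive dataset
    fp = overlap_bp(predicted, negative)   # predicted but in negative dataset
    fn = covered_bp(positive) - tp         # positive bases that were missed
    tn = covered_bp(negative) - fp         # negative bases correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy


# Placeholder coordinates for illustration only.
print(evaluate(predicted=[(1, 100), (301, 350)],
               positive=[(51, 150)],
               negative=[(201, 400)]))
```

In this toy example half of the scored predicted bases fall in the positive dataset and half of the positive dataset is recovered, so precision and recall are both 0.5, while the large correctly unpredicted negative region raises overall accuracy to about 0.67.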
3.4 Comparison of sequence composition based approaches
using additional GI datasets constructed with more relaxed
criteria.
IslandPick’s parameters can be modified to allow the prediction of GIs with
more ancient origins. Although the inclusion of more ancient GIs could lead to a
more comprehensive dataset, it may result in an increase in false positives since
older evolutionary events are more easily misidentified.
However, I did use two additional “relaxed” sets of parameters to determine the
effect on GI prediction of changing the default parameters. These relaxed
parameters should identify GIs with origins that are more ancient. The first
relaxed set used the same default parameters, except that the "Minimum
Distance Cutoff" was changed to 0.15 and the "Single Close Genome Cutoff"
changed to 0.34. The second set of parameters was even more relaxed by
increasing the “Single Close Genome Cutoff" to 0.20, with all other parameters
being the same as the first relaxed set.
The first “relaxed” dataset had approximately 46% more GIs predicted per
genome, while as expected the negative datasets stayed about the same size
with a 3% increase in the relaxed dataset. Notably, accuracy relative to the
literature dataset went down slightly (see Appendix File 3.5 and Appendix File
3.6), indicating that the IslandPick defaults do most accurately reflect literature-
based GI data. The sequence composition-based tools also all had a relative
the predictions from IslandPath-DIMOB and SIGI-HMM and found that there was
a large increase in recall/sensitivity to 48% (from IslandPath-DIMOB (36%);
SIGI-HMM (33%)) and overall accuracy to 88% (IslandPath-DIMOB (86%);
SIGI-HMM (86%)) while maintaining roughly the same precision/specificity of 86%
(IslandPath-DIMOB (86%); SIGI-HMM (92%)) (data not shown). More analysis of
the differences in sequence composition between true positives and false
positives in this analysis could be insightful.
The results show that all GI predictors had a decrease in overall accuracy
when trying to predict more ancient islands. Considering that sequence
composition based predictors would have trouble detecting significant signals in
older GIs due to amelioration to the host genome, it was not surprising that the
overall accuracy for all tools decreased (Lawrence and Ochman, 1997). Alien
Hunter had the smallest decrease in overall accuracy; however, it still maintained
the lowest precision and overall accuracy for the prediction of this dataset, and
SIGI-HMM still outperformed the other sequence composition-based tools for
predicting these more divergent islands. It is possible that the accuracy of some
of these sequence composition-based tools could be improved by optimizing
their parameters. However, out of all the tools, SIGI-HMM and Centroid were the
only ones with a clearly defined sensitivity/statistical parameter and even for
these there were no recommended settings besides the default. Although
default parameters for all tools are presumably chosen to give the best
overall accuracy, some fine-tuning may improve their results.
are quickly performed on a computer cluster, while all dynamic web pages are
implemented using PHP.
4.3 Selection and integration of genomic island prediction
methods
The inclusion of particular GI prediction methods in IslandViewer was
based on several factors. The most obvious is that I could only consider using
methods that had obtainable software and could be run without manual
intervention. Therefore, GI resources that are simply a database and have
no downloadable software, such as Islander (Mantri and Williams, 2004), could not
be included in IslandViewer. In addition, I did not consider the inclusion of
MobilomeFINDER (Ou, et al., 2007), a tool that uses a comparative genomics
based approach similar to IslandPick, because it requires the manual selection of
comparison genomes (making pre-computed results for all genomes impossible).
However, all of these methods are listed on IslandViewer’s “Resources” page
and users are encouraged to visit their respective websites if interested.
For those tools that did have their software freely available, IslandPath-
DIMOB (Hsiao, et al., 2005) and SIGI-HMM (Waack, et al., 2006) were included
because they were shown to have the highest specificity (86-92%) and overall
accuracy (86%) (Chapter 3). In addition, the automated comparative genomics
method, IslandPick, was included since it provides predictions that are not based
on sequence composition and showed the most agreement with a manually
curated dataset of literature-based GIs. These three methods sometimes predict
the same GIs, but often give slightly different results suggesting that they
Figure 4.1 A screenshot of the IslandViewer interface.
Once the genome of interest is selected it is presented as a circular
genome image with each predicted GI highlighted (different colours for different
tools in the IslandViewer) and is also available as a high-resolution image
suitable for publication. In addition to the predicted GIs for each tool,
IslandViewer highlights any GIs that have been predicted by two or more
methods. The annotations for genes within each GI can be quickly viewed by
hovering over the GI of interest within the image. Clicking on an island jumps to
the corresponding row in a table below the genome image and gives information
such as GI coordinates, links to tables showing genes and annotations within the
GI region, links to external genome viewers at NCBI and Joint Genome Institute
(JGI), and links to IslandPath to allow further examination of GI related features
in the genome of choice. GI predictions may be downloaded in various formats
including Excel, tab-delimited, comma-delimited, FASTA, and GenBank (allowing
easy input into the genome browser and annotation tool Artemis (Rutherford, et
al., 2000)). All datasets and source code are available for download under a
GNU GPL license.
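Highlighting regions predicted by two or more methods amounts to finding genome positions covered by intervals from at least two tools. The sketch below shows one way to compute this with a coverage sweep; it is not IslandViewer's own code, the function name is hypothetical, and the coordinates in the usage example are invented placeholders.

```python
from collections import defaultdict


def regions_with_support(predictions, min_tools=2):
    """Return inclusive (start, end) regions covered by at least min_tools tools.

    predictions maps a tool name to a list of inclusive (start, end) intervals.
    Intervals from the same tool are assumed not to overlap one another.
    """
    # Net change in the number of supporting tools at each position.
    change = defaultdict(int)
    for intervals in predictions.values():
        for start, end in intervals:
            change[start] += 1
            change[end + 1] -= 1

    regions, depth, region_start = [], 0, None
    for pos in sorted(change):
        previous = depth
        depth += change[pos]
        if previous < min_tools <= depth:
            region_start = pos                       # support reached
        elif depth < min_tools <= previous:
            regions.append((region_start, pos - 1))  # support lost
    return regions


# Placeholder coordinates for illustration only.
example = {
    "IslandPick": [(100, 500), (900, 1200)],
    "SIGI-HMM": [(400, 1000)],
    "IslandPath-DIMOB": [(450, 480)],
}
print(regions_with_support(example))               # [(400, 500), (900, 1000)]
print(regions_with_support(example, min_tools=3))  # [(450, 480)]
```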
4.5 Discussion
GI identification is becoming a first critical step in the characterization of a
bacterial genome, due to the growing appreciation for the role of GIs in important
adaptations of interest. Recent research has therefore focused on developing
new computational methods for their prediction. However, these methods tend to
use different approaches and identify different features of GIs. The result is that
the most accurate methods each have high precision, but low recall, leading to
slightly different regions being predicted. Previously, researchers could either
pick a single method or try to manually integrate the results from multiple
methods themselves. In addition, many of these tools did not have their own web
interfaces and often required that the user download and run the program on
their computer. IslandViewer alleviates these concerns by providing a web
interface for three accurate GI prediction methods that were not previously
available through a web interface. By pre-computing GI datasets for all
completed genomes and providing a single submission process for new user
genomes IslandViewer allows researchers access to a user-friendly resource that
can be used as the first step in GI analysis of bacterial genomes. Researchers
are expected to manually inspect any GI predictions shown in
IslandViewer to determine their validity and make more accurate predictions of
their boundaries. IslandViewer aids further analysis of GI predictions by
providing data in various formats that can be used in other bioinformatic tools
such as Artemis, and by providing numerous links to other GI resources.
IslandViewer should be a useful resource for any researcher studying GIs and
microbial genomes.
CHAPTER 5 THE ROLE OF GENOMIC ISLANDS IN THE
VIRULENT PSEUDOMONAS AERUGINOSA
LIVERPOOL EPIDEMIC STRAIN
Portions of this chapter have been previously published in the article “Newly
introduced genomic prophage islands are critical determinants of in-vivo
competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa”,
co-authored by C. Winstanley, M.G.I. Langille, J.L. Fothergill, I. Kukavica-Ibrulj,
C. Paradis-Bleau, F. Sanschagrin, N. R. Thomson, G.L. Winsor, M.A. Quail, N.
Lennard, A. Bignell, L. Clarke, K. Seeger, D. Saunders, D. Harris, J. Parkhill, R.
E.W. Hancock, F.S.L. Brinkman, and R.C. Levesque in Genome Research,
Volume 19, Issue 1 ©2009 by Cold Spring Harbor Laboratory Press
5.1 Introduction
Pseudomonas aeruginosa is a ubiquitous organism distributed widely in
the environment, including the soil and water and in association with various
living host organisms. It is one of the most prevalent causes of opportunistic
infections in humans and is the most common cause of eventually fatal,
persistent respiratory infections in cystic fibrosis (CF) patients. It has been
assumed to owe its versatility to its genetic complexity. Sequencing of four
strains (Lee, et al., 2006; Mathee, et al., 2008; Stover, et al., 2000), and
molecular genetic analysis of others, has revealed an approximately 6-7 Mb
genome with around 5,500 ORFs. Based on comparisons of the first two P.
aeruginosa genomes sequenced, those of strains PAO1 (Stover, et al., 2000) and
PA14 (Lee, et al., 2006) [the latter of which is the most common genotype
encountered in diverse habitats in one study of 240 isolates (Wiehlmann, et al.,
2007)], it was revealed that there is a quite highly conserved core genome
discovery of epidemic strains from the lungs of patients with CF provided an
unprecedented opportunity to address this issue.
The widespread assumption that CF patients acquire only unique strains
of P. aeruginosa from the environment was challenged when molecular typing
was used to demonstrate the spread of a β-lactam-resistant isolate, now known
as the Liverpool Epidemic Strain (LES), at a children’s CF unit in Liverpool, UK
(Cheng, et al., 1996). Subsequent identification of other CF epidemic strains in
the UK (Lewis, et al., 2005; Scott and Pitt, 2004) and Australia (Armstrong, et al.,
2003; O'Carroll, et al., 2004) indicates that transmissible P. aeruginosa strains
make a significant contribution to the infection of patients in some CF centres.
LES is the most frequent clone isolated from CF patients in England and Wales
(Scott and Pitt, 2004) and has also been reported in Scotland (Edenborough, et
al., 2004). In addition, LES can cause superinfection (McCallum, et al., 2001),
exhibits enhanced survival on dry surfaces (Panagea, et al., 2005), and is
associated with greater patient morbidity than other P. aeruginosa strains (Al-
Aloul, et al., 2004). In two unusual cases, transmission of an LES strain occurred
from a CF patient to both non-CF parents, causing significant morbidity and
infections that have persisted (McCallum, et al., 2002), and from a CF patient to
a pet cat (Mohan, et al., 2008). LES isolates, including isolate LESB58, exhibit an
unusual phenotype, characterised by early (in the growth curve) over-expression
of the cell-density-dependent quorum sensing regulon, including virulence-
related secreted factors such as LasA, elastase and pyocyanin (Fothergill, et al.,
2007; Salunkhe, et al., 2005). Furthermore, LESB58 is known to be a biofilm
to sequence and analyse the genome of the earliest archived LES isolate,
LESB58. LESB58 was obtained from a Liverpool CF patient in 1988, eight years
prior to the first published study on the LES (Cheng, et al., 1996). The LESB58
genome was sequenced by the Pathogen Production team at the Sanger
Institute and I led the genome annotation, which included the identification of many
large GIs: five prophage clusters, one defective (pyocin) prophage
cluster, and five non-phage islands. In addition, Roger Levesque’s research group
performed an unbiased signature tagged mutagenesis (STM) study, and
screening in a chronic rat lung infection model. I mapped these STM primer
sequence reads to determine the genes implicated in the pathogenesis of LES.
This study revealed genes from the prophage clusters that strongly impacted on
competitiveness in this chronic infection model, indicating that acquisition of
these prophage genes contributed to the success of the LES strain.
5.2 Genome annotation
I annotated the genome of LESB58, depicted in Figure 5.1 and with
statistics available in Table 5.1, using a combination of automated methods and
manual curation (see next paragraph). The genome is available through the
Pseudomonas Genome Database at www.Pseudomonas.com, which represents
a repository for all completed Pseudomonas genome sequences released
publicly to date (Winsor, et al., 2009).
Coding sequences (CDS) within LES were predicted using Glimmer3
(Delcher, et al., 2007) and were assigned LES locus identifiers consisting of a
“PLES_” prefix followed by five digits that are incremented in multiples of 10 to
resistance is derepression of the class-C chromosomal β-lactamase (PA4110),
and its homolog and those of all of the accessory regulatory genes are present in
the genome. Another major cause of multidrug resistance is derepression of the
expression of particular efflux pumps of which P. aeruginosa has a wide variety.
Mutations in certain efflux pump genes were observed. For example, the positive
regulator of MexEFOprN, mexT (PA2492 homolog), was a pseudogene in
LESB58, while the mexF (PA2494) gene is present but mutated suggesting that
the MexEFOprN efflux system was minimally operative and perhaps not
derepressible in the LES. Similarly, MexZ (PA2020) was also a pseudogene.
However, the major efflux pump contributing to intrinsic and mutational
resistance MexABOprM, and the ancillary system MexCDOprJ were intact. In
other LES isolates exhibiting greater antimicrobial resistance, derepression of
AmpC and mutations in mexR and mexZ, implicated in up-regulation of the
MexAB-OprM and MexXY efflux pumps respectively, have been identified
(Salunkhe, et al., 2005). Of the 31 PAO1 CDSs annotated as functional class
“antibiotic resistance and susceptibility” in the Pseudomonas Genome Database,
only PA2818 (arr), a putative aminoglycoside response regulator, was absent
from the genome of the LES.
5.3 Identification of prophage and genomic islands within LES
Prior to the sequencing of LESB58, subtractive
hybridization had been used to identify several regions that were not present in PAO1 and
to quantify the prevalence of these regions amongst LES and non-LES CF
isolates (Smart, et al., 2006). I refined these novel regions further and identified
several new GIs and prophage regions using IslandPick (see Chapter 2). The
exact boundaries of several of these regions were determined by Craig
Winstanley’s research group, by designing PCR primers reading out from each
terminal region, and sequencing the resultant amplicons (Table 5.4).
5.3.1 LES bacteriophage gene clusters
Isolate LESB58 contained six prophage gene clusters, termed here
prophages 1-6 (Table 5.4; Figure 5.2; Appendix File 5.1), of which four are
absent from strain PAO1. The LES prophage 1 gene cluster was a defective
prophage predicted to encode pyocin R2. In strain PAO1, two gene clusters in
tandem encode pyocin R2 and F2, both of which are predicted to have evolved
from phage tail genes. It has been demonstrated that either can be present or
absent in P. aeruginosa (Ernst, et al., 2003; Nakayama, et al., 2000). The LES
genome carried the pyocin R2 (P2 phage homolog) cluster (PLES06091-
PLES06271) but not the pyocin F2 (phage λ homolog) cluster. It also carried
pyocin S2 (PLES41691).
Figure 5.2 Phage clusters identified in LESB58 with significant similarities and
positioning of STM mutants after in-vivo screening.
[Figure 5.2 image: schematic of the six LESB58 prophage clusters, numbered 1-6, with flanking locus labels (1: PLES 6091 to PLES 6271; 2: PLES 7891 to PLES 8321; 3: PLES 13201 to PLES 13711; 4: PLES 15491 to PLES 15961; 5: PLES 25021 to PLES 25661; 6: PLES 41181 to PLES 41281). Legend: Pyocin R2; Pseudomonas phages F10, D3112, D3, and Pf1; Duplication 1; Duplication 2; STM mutations. Scale bar: 5 kb.]
The LES prophage 2 gene cluster was 42.1 kb long and included 44 CDSs of
which 32 were homologous to the sequenced bacteriophage F10 (Kwan, et al.,
2006), a member of the Siphoviridae family. Where orthologs were detected,
synteny was maintained between the two phage genomes, but matching regions
were interspersed with non-matching CDSs (Appendix File 5.1).
The LES prophage 3 gene cluster was 42.8 kb and included 53 CDSs. A
13.6 kb region of this prophage, comprising 16 CDSs, shared 82.2% identity with
a region of prophage 2 with homology to bacteriophage F10. Much of the rest of
LES prophage 3 was similar to a region of the P. aeruginosa strain 2192
genome. However, LES prophage 3 also contained a 7.5 kb region (11 CDSs)
with 99.8% identity to a region of LES prophage 5. LES prophage 4 shared a
high level of similarity with the transposable phage D3112 (Wang, et al., 2004)
but with some variation, especially at one terminus. LES prophage 5 had
considerable similarity to bacteriophage D3 (Kropinski, 2000), although there was
evidence of substantial genetic rearrangements (Figure 5.2).
The LES prophage 6 gene cluster was similar to the genome of
bacteriophage Pf1 (Hill, et al., 1991). It has been suggested that Pf1 genes might
be important in CF infections: they are up-regulated under conditions of
reduced oxygen supply (Platt, et al., 2008), have been implicated in
augmenting the antimicrobial efficacy of antibiotics (Hagens, et al., 2006),
and play an active role in the activity and adaptation of P. aeruginosa populations
in biofilms (Mooij, et al., 2007; Sauer, et al., 2004; Webb, et al., 2004; Webb, et al.,
2003). However, since most clinical isolates carry Pf1-like phages, these
activities are not restricted to successful CF strains such as the LES (Finnan, et
al., 2004).
Table 5.4 Identified genomic islands and prophage regions.
Region      Integration site    Approx.    Approx.    Number    Sequence      Mobility gene(s)
name        relative to PAO1    start (a)  end (a)    of genes  composition   present
                                                                bias (b)
Prophage 1  PA0611 - PA0649     665561     680385     19        No            None
Prophage 2  PA4138 - PA4139     863875     906018     44        Yes           Integrase
Prophage 3  PA3663 - PA3664     1433756    1476547    53        Yes           Integrase
Prophage 4  PA3463 - PA3464     1684045    1720850    48        No            Transposase
LESGI-1     PA2727 - PA2737     2504700    2551100    31        Yes           Transposases & Integrases
Prophage 5  PA2603 - PA2604     2690450    2740350    65        Yes           Integrase
LESGI-2     PA2593 - PA2594     2751800    2783500    18        No            None
LESGI-3     PA2583 - PA2584     2796836    2907406    107       Yes           Integrase
LESGI-4     PA2217 - PA2229     3392800    3432228    32        Yes           None
Prophage 6  PA1191 - PA1192     4545190    4552788    12        Yes           Integrase
LESGI-5     PA0831 - PA0832     4931528    4960941    26        Yes           Integrase
a The approximate start and end positions are given for those regions without
PCR analysis, except for Prophages 2 and 3 and LESGI-5.
b Sequence composition bias is indicated if the majority of the region was found
to have sequence bias by either Alien Hunter (Vernikos et al., 2006) or the
IslandPath-DIMOB (Hsiao et al., 2005) method.
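The coordinates in Table 5.4 can be cross-checked against the region sizes quoted in the text (42.1 kb for prophage 2, 42.8 kb for prophage 3, roughly 110 kb for LESGI-3). The following is a minimal Python sketch of that check and is not part of the original analysis: the function names are hypothetical, and it assumes the start and end values are inclusive base-pair positions.

```python
# Start, end (bp) and gene counts copied from Table 5.4.
REGIONS = {
    "Prophage 1": (665561, 680385, 19),
    "Prophage 2": (863875, 906018, 44),
    "Prophage 3": (1433756, 1476547, 53),
    "Prophage 4": (1684045, 1720850, 48),
    "LESGI-1": (2504700, 2551100, 31),
    "Prophage 5": (2690450, 2740350, 65),
    "LESGI-2": (2751800, 2783500, 18),
    "LESGI-3": (2796836, 2907406, 107),
    "LESGI-4": (3392800, 3432228, 32),
    "Prophage 6": (4545190, 4552788, 12),
    "LESGI-5": (4931528, 4960941, 26),
}


def region_length_kb(start, end):
    """Length in kb, assuming both coordinates are inclusive (an assumption)."""
    return (end - start + 1) / 1000.0


def summarize(regions):
    """Return (name, length in kb, genes per kb) for each region."""
    rows = []
    for name, (start, end, n_genes) in regions.items():
        length_kb = region_length_kb(start, end)
        rows.append((name, length_kb, n_genes / length_kb))
    return rows


if __name__ == "__main__":
    for name, length_kb, density in summarize(REGIONS):
        print(f"{name:<11} {length_kb:7.1f} kb  {density:.2f} genes/kb")
```

Because several of the coordinates are approximate (footnote a), the computed lengths are only indicative for those regions.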
5.3.2 LES genomic islands
The observed five LESB58 GIs are summarized in Table 5.4, depicted in
Figure 5.3, and described in greater detail in Appendix File 5.1.
Many GIs have been identified in P. aeruginosa strains in previous
studies, including PAGI-1 (Liang, et al., 2001), PAGI-2 and PAGI-3 (Larbig, et
al., 2002), PAGI-4 (Klockgether, et al., 2004), PAGI-5 (Battle, et al., 2008),
PAGI-6 to PAGI-11 (Battle, et al., 2009), PAPI-1 and PAPI-2 (He, et al., 2004), and
pKLC102 (Klockgether, et al., 2004). Only two of the five GIs identified within the
LES strain showed similarity to any previously identified P. aeruginosa island,
with the last 67 kb of the 110 kb LESGI-3 island showing similarity to PAGI-2,
PAGI-3, PAGI-5 and PAPI-1 (Figure 5.4), while LESGI-4 shared 46% identity
with PAGI-1 over its entire length. As previously noted, pKLC102 and the related
PAPI-1 were not found within the LES strain (Wurdemann and Tummler, 2007).
In addition, PAGI-4 and PAGI-6 to PAGI-11 had no significant homologs in
the LESB58 genome.
LESGI-1 was inserted at a tRNA locus and contained phage- and
transposon-related CDSs. However, it also contained several CDSs sharing
similarity with predicted proteins from non-pseudomonads such as the
thermophilic anaerobe Clostridium thermocellum and the marine bacterium
Marinobacter sp. Although mostly matching hypothetical proteins of no known
function, the island included homologs of regulatory proteins, restriction-
modification proteins, an ATPase and a sensor-kinase. This island included
PALES23591, which contains the LES-F9 marker, although it is not unique to
LES isolates (Smart, et al., 2006).
Figure 5.3 GIs identified in LESB58 with significant similarities and positioning of STM mutants after in-vivo screening.
previously demonstrated a competitive advantage over other P. aeruginosa
strains in relevant animal models of infection (Kukavica-Ibrulj, et al., 2008), an
STM analysis was performed on LESB58 by Dr. Roger Levesque’s lab.
Of the 60 LESB58 STM mutants that were attenuated in lung infection, I
was able to map 47 of them to an unambiguous sequence location (Table 5.5).
Six of these genes were also found in a previous STM screen of strain
PAO1 (Table 5.5). DNA sequencing revealed insertions in most known functional
gene classes. These included insertions in genes encoding products or
processes previously implicated in pathogenesis of P. aeruginosa, such as the
type III secretion protein PscH, a haem iron uptake receptor PhuR, TolA, the
fimbrial usher CupA3, the alginate biosynthesis protein MucD, and two
transcriptional regulators PLES27111 and PLES33031. Insertions in genes
involved in the biosynthesis of type III pyoverdine (pvdE) and pyochelin
(PLES07011) were identified, emphasizing the importance of both siderophores.
Table 5.5 List of 47 LESB58 virulence-associated genes.
Identified by PCR-based screening of 9216 STM mutants after passage
through the chronic rat lung agar bead infection model.
STM mutant  Insertion site   PAO1          Putative function / comments
            in LES genome    ortholog (a)
L103T13G    PLES00271        PA0028        Hypothetical protein
L28T5G      PLES03211        PA0325        Putative permease of ABC transporter
L70T18G     PLES03331        PA0336        Nudix hydrolase YgdP
L64T24G     PLES03721        PA0375        Cell division ABC transporter, permease protein FtsX
L52T19T     PLES04001        PA0402        PyrB Aspartate carbamoyltransferase
L114T20G    PLES06181        PA0622        Put. phage tail sheath protein/pyocin R2 (LES prophage 1)
L15T13G     PLES07011        PA4226        Dihydroaeruginoic acid synthetase
L124T1G     PLES08021        None          DNA replication protein DnaC (LES prophage 2)
L114T14G    PLES08731        PA4100        Probable dehydrogenase
L6T19G      PLES08751        PA4098        Probable short-chain dehydrogenase
L113T14T    PLES10401        PA3936        Probable permease of ABC taurine transporter
L124T11G    PLES13181        PA3666        Tetrahydrodipicolinate succinylase
L94T20G     PLES13261        None          Hypothetical protein (LES prophage 3) (b)
L111T2G     PLES19021        PA3166        Chorismate mutase
L14T10G     PLES22061        PA2858        Putative ABC transporter, permease protein
L111T13T    PLES22341        PA2831        Putative zinc carboxypeptidase
L106T24G    PLES23991        PA2705        Hypothetical protein
L52T24G     PLES23991        PA2705        Hypothetical protein
L52T5T      PLES23991        PA2705        Hypothetical protein
L14T9G      PLES24551        PA2650        Putative methyltransferase
L58T23G     PLES25621        None          Putative lytic enzyme (LES prophage 5) (c)
L19T13G     PLES27111        PA2583        Probable sensor/response regulator hybrid
L70T1G      PLES29051        None          PvdE; component of type III pyoverdine locus
L113T14G    PLES31971        PA2130        CupA3, fimbrial usher protein
L110T9G     PLES33001        PA2023        UTP-glucose-1-phosphate uridylyltransferase
L110T14G    PLES33031        PA2020        Probable transcriptional regulator
L13T13G     PLES33821        PA1941        Hypothetical protein
L124T10G    PLES33821        PA1941        Hypothetical protein
L82T13G     PLES34271        PA1897        Putative desaturase
L13T19G     PLES36081        PA1721        Type III export protein PscH
L109T23T    PLES37591        PA1569        Prob. major facilitator superfamily (MFS) transporter
L25T11T     PLES39641        PA1449        Flagellar biosynthetic protein FlhB (d)
L106T19G    PLES41401        PA1181        Conserved hypothetical protein
L54T20T     PLES41751        PA1144        Probable major facilitator superfamily (MFS) transporter
L54T13T     PLES43701        PA0945        PurM, phosphoribosylaminoimidazole synthetase
L57T4G      PLES45041        None          Hypothetical protein (LES GI-5)
L65T15G     PLES45141        PA0829        Probable hydrolase
L19T14G     PLES45311        PA0811        Probable major facilitator superfamily (MFS) transporter
L22T17G     PLES45771        PA0766        Serine protease MucD precursor
L121T13G    PLES46381        PA0692        Hypothetical protein
L64T1G      PLES46641        PA4284        Exodeoxyribonuclease V beta chain
L10T7G      PLES47381        PA4360        Putative chromosome segregation ATPase
L14T13G     PLES50951        PA4710        Putative haem uptake outer membrane receptor PhuR
L20T20G     PLES53911        PA5002        Hypothetical protein
L21T13G     PLES55011        PA5111        Lactoylglutathione lyase
L61T13G     PLES56651        PA5271        Hypothetical protein
L127T13G    PLES57621        PA5367        ABC phosphate transporter membrane component
a Genes previously identified by STM screening of P. aeruginosa strain PAO1 (or present in the
same operon as previously identified genes) are indicated in bold.
b This location was tentatively identified, as it lies within a duplicated region shared with LES
prophage 5.
c Since gene PLES25621 is unlikely to be expressed in a lysogen, the insertion in PLES25621
probably had a polar effect on downstream genes, affecting the expression of PLES25631,
PLES25641 and PLES25651, which are known to be part of LES prophage 5.
d Since the parent strain LESB58 is relatively deficient in swimming motility, which depends on
flagellar function (Table 5.2), it is hypothesized that the recovery of this mutation among the
characterized STM mutants reflects either the importance of residual motility, an alternative
function for FlhB (e.g. in a Type III-like secretion event or adherence), or polar effects
on one of the downstream genes.
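The prophage assignments in the comments column of Table 5.5 can be reproduced from the locus-tag ranges labelled in Figure 5.2. The sketch below is illustrative only, with hypothetical function names. It assumes that PLES locus-tag numbers increase monotonically along the chromosome, so that a simple numeric range test is meaningful, and that the figure labels omit leading zeros (for example, "PLES 6091" is PLES06091).

```python
import re

# Locus-tag ranges per prophage, read from the Figure 5.2 labels.
PROPHAGE_TAG_RANGES = {
    "LES prophage 1": (6091, 6271),
    "LES prophage 2": (7891, 8321),
    "LES prophage 3": (13201, 13711),
    "LES prophage 4": (15491, 15961),
    "LES prophage 5": (25021, 25661),
    "LES prophage 6": (41181, 41281),
}


def tag_number(locus_tag):
    """Extract the numeric part of a tag such as 'PLES06181'."""
    match = re.fullmatch(r"PLES(\d+)", locus_tag)
    if match is None:
        raise ValueError(f"not a PLES locus tag: {locus_tag!r}")
    return int(match.group(1))


def prophage_for(locus_tag):
    """Return the prophage whose tag range contains the locus, or None."""
    number = tag_number(locus_tag)
    for name, (first, last) in PROPHAGE_TAG_RANGES.items():
        if first <= number <= last:
            return name
    return None


if __name__ == "__main__":
    # Insertion sites taken from Table 5.5.
    for tag in ("PLES06181", "PLES08021", "PLES13261", "PLES25621", "PLES07011"):
        print(tag, "->", prophage_for(tag))
```

Under these assumptions the four prophage insertions in Table 5.5 fall inside the expected ranges, while insertions in core genes such as PLES07011 fall outside every prophage range.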
5.4.1 In-vivo analysis of STM mutants having insertions in prophage and
genomic islands
To help explain the successful colonization of CF patients by the LES, Dr.
Levesque’s research group determined the level of attenuation in-vivo for three
STM mutants with insertions in LES prophages 2, 3 and 5 and one STM mutant
with an insertion in the unique LES GI, LESGI-5 (Table 5.4). Each mutant was
first grown in mixed culture with the wild-type LESB58 strain to obtain an
in-vitro competitive index (CI). After 18 hr in BHI broth the in-vitro CI was
around 1.0, confirming that the insertions did not affect in-vitro growth and that
the mutants were not out-competed in-vitro by the wild type. This contrasted
with the results in-vivo, where the mutants were mixed with wild-type LESB58
and grown in the rat lung infection model for 7 days. As depicted in Figure 5.5,
the mutants with insertions in Prophages 2 and 5 showed a severe defect in
growth and maintenance in-vivo, with significant 16- and 58-fold decreases in
CFUs in rat lung tissues (competitive index values of 0.061 and 0.017,
respectively). The mutants in Prophage 3 and LESGI-5 were partially
maintained in lung tissues, with approximately 7-fold decreases in growth in-vivo.
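The relationship between the reported competitive index values and the fold decreases can be made explicit. The sketch below assumes the conventional definition of the competitive index (the mutant to wild-type CFU ratio recovered after competition, divided by the same ratio in the inoculum); the text does not spell out the formula, so this definition, the function names, and the CFU counts in the example are illustrative assumptions rather than the method actually used.

```python
def competitive_index(mutant_out, wt_out, mutant_in, wt_in):
    """Output mutant/wild-type ratio divided by the input ratio.

    This is the conventional definition of a competitive index and is
    assumed here; the CFU counts passed in are supplied by the caller.
    """
    if min(mutant_in, wt_in, wt_out) <= 0:
        raise ValueError("input counts and wild-type output must be positive")
    return (mutant_out / wt_out) / (mutant_in / wt_in)


def fold_decrease(ci):
    """Fold decrease of the mutant relative to wild type for a given CI."""
    if ci <= 0:
        raise ValueError("competitive index must be positive")
    return 1.0 / ci


if __name__ == "__main__":
    # CI values reported in the text for the Prophage 2 and Prophage 5 mutants.
    for label, ci in (("Prophage 2 mutant", 0.061), ("Prophage 5 mutant", 0.017)):
        print(f"{label}: CI = {ci}, fold decrease = {fold_decrease(ci):.1f}")

    # Hypothetical CFU counts chosen only to illustrate the calculation.
    print(competitive_index(mutant_out=61, wt_out=1000, mutant_in=500, wt_in=500))
```

Taking the reciprocal of the reported values gives decreases of a little over 16-fold and 58-fold, which matches the figures quoted above; by the same arithmetic, a 7-fold decrease corresponds to a CI of about 0.14.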