Many microorganisms exhibit high levels of intragenic recombination following horizontal gene transfer events. Furthermore, many microbial genes are subject to strong diversifying selection as part of the pathogenic process. A multiple sequence alignment is an essential starting point for many of the tools, such as phylogenetics, that provide fundamental insights into gene structure and evolution; however, an accurate alignment is not always attainable. In this study, a new analytic approach was developed to better quantify the genetic organization of highly diversified genes whose alleles do not align. This BLAST-based method, denoted BLAST Miner, employs an iterative process that places short segments of highly similar sequence into discrete datasets designated "modules." The relative positions of modules along the length of the genes, and their frequency of occurrence, are used to identify sequence duplications, insertions, and rearrangements. Partial alleles of sof from Streptococcus pyogenes, encoding a surface protein under host immune selection, were analyzed for module content. High-frequency Modules 6 and 13 were identified and examined in depth. Nucleotide sequences corresponding to both modules contain numerous duplications and inverted repeats, whereby many codons form palindromic pairs. Combined with evidence for a strong codon usage bias, the data suggest that Module 6 and 13 sequences are under selection to preserve their nucleic acid secondary structure. The concentration of overlapping tandem and inverted repeats within a small region of DNA is highly suggestive of a mechanistic role for Module 6 and 13 sequences in promoting aberrant recombination. Analysis of pbp2X alleles from Streptococcus pneumoniae, encoding cell wall enzymes that confer antibiotic resistance, supports the broad applicability of this tool in deciphering the genetic organization of highly recombined genes.
BLAST Miner shares with phylogenetics the important predictive quality that leads to the generation of testable hypotheses based on sequence data.
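The idea that module positions and frequencies can expose duplications and rearrangements may be easier to see in a small example. The Python sketch below is not the BLAST Miner implementation. It assumes that segments have already been assigned to modules and are available as hypothetical (allele, module, start position) records. The allele names, coordinates, and function names are invented for illustration, and only the module numbers 1, 6, and 13 echo the text.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical input: one record per short segment already assigned to a module.
# Allele names and start positions are invented for illustration only.
hits = [
    ("alleleA", 1, 0), ("alleleA", 6, 40), ("alleleA", 13, 90),
    ("alleleB", 1, 0), ("alleleB", 13, 45), ("alleleB", 6, 95),
    ("alleleC", 1, 0), ("alleleC", 6, 40), ("alleleC", 6, 80), ("alleleC", 13, 130),
]

def module_order(hits):
    """Return {allele: [module, ...]} with modules ordered by start position."""
    by_allele = defaultdict(list)
    for allele, module, start in hits:
        by_allele[allele].append((start, module))
    return {a: [m for _, m in sorted(v)] for a, v in by_allele.items()}

def module_frequency(orders):
    """Count the number of alleles in which each module occurs at least once."""
    return Counter(m for order in orders.values() for m in set(order))

def duplications(orders):
    """Return {allele: {module: count}} for modules seen more than once in an allele."""
    out = {}
    for allele, order in orders.items():
        repeated = {m: c for m, c in Counter(order).items() if c > 1}
        if repeated:
            out[allele] = repeated
    return out

def rearranged_pairs(order_a, order_b):
    """Return module pairs whose relative order differs between two alleles.

    Only modules that occur exactly once in both alleles are compared, so that
    duplicated modules do not produce ambiguous orderings.
    """
    count_a, count_b = Counter(order_a), Counter(order_b)
    shared = [m for m in order_a if count_a[m] == 1 and count_b[m] == 1]
    pos_b = {m: i for i, m in enumerate(order_b)}
    # Each pair (x, y) has x before y in allele A; flag it if B reverses that.
    return [(x, y) for x, y in combinations(shared, 2) if pos_b[x] > pos_b[y]]

orders = module_order(hits)
print(orders)
print(module_frequency(orders))
print(duplications(orders))
print(rearranged_pairs(orders["alleleA"], orders["alleleB"]))
```

With this toy input, the duplicated module in the third allele and the swapped order of Modules 6 and 13 between the first two alleles are reported without any multiple sequence alignment, which is the property the abstract emphasizes.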
Wertz, J. E., McGregor, K. F., & Bessen, D. E. (2007). Detecting key structural features within highly recombined genes. PLoS Computational Biology, 3(1), 0137–0150. https://doi.org/10.1371/journal.pcbi.0030014