Search schemes enable the efficient identification of all approximate occurrences of a search pattern in a text. Using a bidirectional FM-index, search schemes describe how to explore the search space in such a way that runtime is minimized. Even though in-index matching has an optimal time complexity, relatively expensive random memory access is required for elementary operations on the FM-index. We analyze to what extent in-index matching can be complemented with in-text verification where a candidate occurrence is directly validated in the text using a bit-parallel, pairwise alignment procedure. We find that hybrid in-index/in-text matching can reduce the running time by more than a factor of two, compared to pure in-index matching. We present Columba 1.1, an open-source (AGPL-3.0 license) software tool written in C++ that efficiently implements these ideas. Using a single CPU core, Columba 1.1 can identify, within a maximum edit distance of four, all occurrences of 100 000 Illumina reads (150 bp) in the human reference genome in roughly half a minute. This significantly outperforms existing, state-of-the-art tools.
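To make the idea of in-text verification concrete, the following is a minimal sketch of a bit-parallel check. The abstract does not say which bit-parallel alignment procedure is used, so this sketch assumes Myers' bit-vector algorithm in Hyyrö's formulation as a stand-in. The function names `minEditDistanceInWindow` and `verifyInText` are hypothetical. The sketch is also limited to patterns of at most 64 characters (one machine word), so a 150 bp read would need a multi-word extension that is not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not taken from Columba): returns the smallest edit
// distance between `pattern` and any substring of `window`.
// Assumption: 1 <= pattern.size() <= 64, so one 64-bit word holds a column.
int minEditDistanceInWindow(const std::string& pattern, const std::string& window) {
    const std::size_t m = pattern.size();
    assert(m >= 1 && m <= 64);

    // peq[c] has bit i set when pattern[i] == c.
    std::array<std::uint64_t, 256> peq{};
    for (std::size_t i = 0; i < m; ++i)
        peq[static_cast<unsigned char>(pattern[i])] |= std::uint64_t{1} << i;

    const std::uint64_t top = std::uint64_t{1} << (m - 1);
    std::uint64_t pv = ~std::uint64_t{0};  // vertical +1 deltas of the current column
    std::uint64_t mv = 0;                  // vertical -1 deltas of the current column
    int score = static_cast<int>(m);       // value in the bottom row of the column
    int best = score;

    for (char c : window) {
        const std::uint64_t eq = peq[static_cast<unsigned char>(c)];
        const std::uint64_t xv = eq | mv;
        const std::uint64_t xh = (((eq & pv) + pv) ^ pv) | eq;
        std::uint64_t ph = mv | ~(xh | pv);  // horizontal +1 deltas
        std::uint64_t mh = pv & xh;          // horizontal -1 deltas

        // Update the bottom-row score from the horizontal delta in row m.
        if (ph & top) ++score;
        else if (mh & top) --score;

        // Shifting in a zero bit keeps the top row at 0, so an alignment
        // may start at any position of the window.
        ph <<= 1;
        mh <<= 1;
        pv = mh | ~(xv | ph);
        mv = ph & xv;

        if (score < best) best = score;
    }
    return best;
}

// Hypothetical wrapper: accepts the candidate if the pattern aligns to some
// substring of the window with at most `maxDistance` edits.
bool verifyInText(const std::string& pattern, const std::string& window, int maxDistance) {
    return minEditDistanceInWindow(pattern, window) <= maxDistance;
}
```

In a hybrid scheme, one would presumably call `verifyInText` on the text region around a candidate position reported by the index, padded on each side by the maximum edit distance. That windowing policy is an assumption of this sketch, not a detail stated in the abstract. The appeal of the bit-parallel form is that each text character costs a constant number of word operations over sequential memory, in contrast to the random accesses needed for FM-index operations.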
Renders, L., Depuydt, L., & Fostier, J. (2022). Approximate Pattern Matching Using Search Schemes and In-Text Verification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13347 LNBI, pp. 419–435). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-07802-6_36