Minimally overlapping words for sequence similarity search

15Citations
Citations of this article
23Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Results: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Cite

CITATION STYLE

APA

Frith, M. C., Noé, L., & Kucherov, G. (2020). Minimally overlapping words for sequence similarity search. Bioinformatics, 36(22–23), 5344–5350. https://doi.org/10.1093/bioinformatics/btaa1054

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free