Lightweight BWT construction for very large string collections

  • Bauer M
  • Cox A
  • Rosone G

Abstract

A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. Although our tests were on collections of DNA sequences of uniform length, the algorithms themselves apply to any string collection over any alphabet.
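To make the object being computed concrete, the following is a minimal sketch of what the BWT of a string collection looks like on a toy input. It is not either of the lightweight algorithms described in the abstract: it is a naive construction that materializes and sorts every suffix in memory, which is exactly the cost those algorithms are designed to avoid. The function name `naive_collection_bwt`, the use of `$` as end marker, and the convention that the end marker of an earlier string sorts before that of a later string are assumptions made for illustration.

```python
def naive_collection_bwt(strings, end_marker="$"):
    """Naive BWT of a string collection (illustration only).

    Assumptions: each string is implicitly terminated by its own end
    marker, all end markers sort before every alphabet symbol, and the
    marker of string i sorts before the marker of string j when i < j.
    The end marker must not occur inside any input string.
    """
    entries = []
    for i, s in enumerate(strings):
        if end_marker in s:
            raise ValueError("end marker must not occur in the input")
        for j in range(len(s) + 1):
            # Symbol that precedes the suffix starting at position j;
            # the full string (j == 0) is preceded by its end marker.
            preceding = s[j - 1] if j > 0 else end_marker
            # Python orders a proper prefix before the longer string,
            # which matches an end marker smaller than any symbol.
            # The string index i breaks ties between equal suffixes.
            entries.append(((s[j:], i), preceding))
    entries.sort(key=lambda entry: entry[0])
    return "".join(preceding for _, preceding in entries)


if __name__ == "__main__":
    print(naive_collection_bwt(["ACG", "TCG"]))
```

For m strings of length k this sketch stores m(k + 1) suffixes explicitly, so its memory use grows much faster than the input; it is useful only for checking small examples by hand.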

Author-supplied keywords

  • BWT
  • next-generation sequencing
  • text indexes

Authors

  • Markus J. Bauer

  • Anthony J. Cox

  • Giovanna Rosone