Mining statistically-solid k-mers for accurate NGS error correction

10Citations
Citations of this article
19Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Background: NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f 0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance. Results: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f 0 . To identify the two special subsets of k-mers, we use the z-score of k-mers which measures the number of standard deviations a k-mer's frequency is from the mean. Then these statistically-solid k-mers are used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-art methods, tested on both real and synthetic NGS data sets. Conclusion: The z-score is adequate to distinguish solid k-mers from weak k-mers, particularly useful for pinpointing out solid k-mers having very low frequency. Applying z-score on k-mer can markedly improve the error correction accuracy.

Cite

CITATION STYLE

APA

Zhao, L., Xie, J., Bai, L., Chen, W., Wang, M., Zhang, Z., … Li, J. (2018). Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics, 19. https://doi.org/10.1186/s12864-018-5272-y

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free