Genomics studies increasingly have to deal with datasets that exhibit high variation between the sequenced nucleotide chains. This is most common in metagenomics and polyploid studies, where the biological nature of the samples requires analysis of multiple variants of nearly identical sequences. High variation makes it harder to determine the correct nucleotide sequences and to distinguish signal from noise, so the resulting digital data have higher error rates than can be achieved for samples with low variation. This paper presents an original, purely machine learning-based approach for detecting and potentially correcting those errors. It uses a generic machine learning model that can be applied to different types of sequencing data with minor modifications. As presented in a separate part of this work, these models can be combined with a data-specific selection of error candidates to which the models are applied, yielding more refined error discovery; as shown here, they can also be used independently.
Krachunov, M., Nisheva, M., & Vassilev, D. (2018). Machine learning-driven noise separation in high variation genomics sequencing datasets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11089 LNAI, pp. 173–185). Springer Verlag. https://doi.org/10.1007/978-3-319-99344-7_16