NASCUP: Nucleic Acid Sequence Classification by Universal Probability

Sunyoung Kwon; Gyuwan Kim; Byunghan Lee; Jongsik Chun; Sungroh Yoon; Young Han Kim

Journal Article (Open Access)

IEEE Access (2021), vol. 9, pp. 162779–162791

DOI: 10.1109/ACCESS.2021.3127957

Abstract

Nucleic acid sequence classification is a fundamental task in bioinformatics. As the volume of unlabeled nucleotide sequences grows, fast and accurate large-scale classification has become crucial. In this work, we developed NASCUP, a new classification method that captures the statistical structure of nucleotide sequences using compact context-tree models and universal probability from information theory. A comprehensive experimental study involving nine public databases for functional non-coding RNA, microbial taxonomy, and coding/non-coding RNA classification demonstrates the advantages of NASCUP over widely used alternatives in efficiency, accuracy, and scalability across all datasets considered. NASCUP consistently achieved BLAST-like classification accuracy on several large-scale databases with runtimes reduced by orders of magnitude, and was also applied to other bioinformatics tasks such as outlier detection and synthetic sequence generation.
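
The abstract does not spell out how the context-tree models or the universal probability are constructed, so the following is only a minimal sketch of the general idea of classifying a sequence by the class model that assigns it the highest probability. It substitutes a simpler technique: a fixed-order context model with the Krichevsky-Trofimov (KT) estimator, one model per class. The names `KTContextModel`, `classify`, the context order of 2, and the toy training sequences are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


class KTContextModel:
    """Fixed-order context model scored with the KT estimator (illustrative only)."""

    def __init__(self, order=2):
        self.order = order
        # counts[context][symbol] = times `symbol` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.order, len(seq)):
                self.counts[seq[i - self.order:i]][seq[i]] += 1
        return self

    def log2_prob(self, seq):
        """Log2 probability of `seq` under the trained counts.

        The first `order` symbols have no full context and are skipped.
        """
        total = 0.0
        for i in range(self.order, len(seq)):
            ctx_counts = self.counts.get(seq[i - self.order:i], {})
            n_sym = ctx_counts.get(seq[i], 0)
            n_ctx = sum(ctx_counts.values())
            # KT estimate: (count + 1/2) / (context total + |alphabet| / 2)
            total += math.log2((n_sym + 0.5) / (n_ctx + len(ALPHABET) / 2))
        return total


def classify(models, seq):
    """Return the label whose model gives `seq` the highest probability."""
    return max(models, key=lambda label: models[label].log2_prob(seq))


# Toy data, made up for illustration
models = {
    "at_rich": KTContextModel().fit(["ATATATATAT", "TATATATATA"]),
    "gc_rich": KTContextModel().fit(["GCGCGCGCGC", "CGCGCGCGCG"]),
}
print(classify(models, "ATATAT"))  # at_rich
```

Scoring a query needs only one pass over the sequence per class model, which gives some intuition for why probability-based classifiers can be much faster than alignment-based search, although the runtime figures in the abstract refer to the authors' method and not to this sketch.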

Cite

Kwon, S., Kim, G., Lee, B., Chun, J., Yoon, S., & Kim, Y. H. (2021). NASCUP: Nucleic Acid Sequence Classification by Universal Probability. IEEE Access, 9, 162779–162791. https://doi.org/10.1109/ACCESS.2021.3127957
