Compressing Genomic Sequences by Using Deep Learning

Wenwen Cui; Zhaoyang Yu; Zhuangzhuang Liu; Gang Wang; Xiaoguang Liu

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020) 12396 LNCS 92-104

DOI: 10.1007/978-3-030-61609-0_8

12 Citations

6 Readers

Abstract

Huge amounts of genomic sequence data have been generated with the development of high-throughput sequencing technologies, which poses challenges for data storage, processing, and transmission. Standard compression tools designed for English text do not compress genomic sequences well, so an effective dedicated method is urgently needed. In this paper, we propose a genomic sequence compression algorithm based on a deep learning model and an arithmetic encoder. The deep learning model consists of a convolutional layer followed by an attention-based bi-directional long short-term memory network and predicts the probability of each possible next base in a sequence. The arithmetic encoder uses these probabilities to compress the sequence. We compare the proposed algorithm with various compression approaches, including a state-of-the-art genomic sequence compression algorithm, DeepDNA, on several real-world data sets. The results show that the proposed algorithm converges stably and achieves the best compression performance, up to 3.7 times better than DeepDNA. Furthermore, we conduct ablation experiments to verify the effectiveness and necessity of each part of the model, and we visualize the attention weight matrix to show how much each hidden state contributes to the final prediction. The source code for the model is available on GitHub (https://github.com/viviancui59/Compressing-Genomic-Sequences).
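
The link between predicted probabilities and compressed size can be illustrated with a minimal sketch. It is not the authors' implementation: the convolutional and attention-based BiLSTM predictor is replaced here by a hypothetical adaptive context-count model, and the arithmetic encoder is replaced by its ideal code length, the sum of -log2 p over the bases actually observed. The names `ideal_code_length_bits`, `bits_per_base`, and `context_len` are assumptions introduced for illustration only.

```python
import math
from collections import defaultdict

BASES = "ACGT"  # assumption: input contains only these four symbols


def ideal_code_length_bits(sequence, context_len=2):
    """Ideal arithmetic-coding cost, in bits, of `sequence`.

    The predictor is a simple adaptive count model over the previous
    `context_len` bases. It stands in for a neural predictor and is
    NOT the model described in the abstract.
    """
    # counts[context][base], starting at 1 for every base (Laplace smoothing)
    counts = defaultdict(lambda: {b: 1 for b in BASES})
    total_bits = 0.0
    for i, base in enumerate(sequence):
        context = sequence[max(0, i - context_len):i]
        ctx_counts = counts[context]
        # Predicted probability of the base that actually occurs next.
        p = ctx_counts[base] / sum(ctx_counts.values())
        # An arithmetic coder spends about -log2(p) bits on this base.
        total_bits += -math.log2(p)
        # Update only after coding, so a decoder could mirror the model.
        ctx_counts[base] += 1
    return total_bits


def bits_per_base(sequence, context_len=2):
    """Average cost per base; assumes a non-empty sequence."""
    return ideal_code_length_bits(sequence, context_len) / len(sequence)
```

Under this sketch, a single base with no history costs exactly 2 bits, the same as a fixed 2-bit encoding of four symbols, while a highly repetitive input such as `"ACGT" * 50` costs well under 2 bits per base because the predictions sharpen as counts accumulate. A more accurate predictor lowers the cost in the same way.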

Cite

Cui, W., Yu, Z., Liu, Z., Wang, G., & Liu, X. (2020). Compressing Genomic Sequences by Using Deep Learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12396 LNCS, pp. 92–104). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-61609-0_8
