Big-SeqDB-Gen: A formal and scalable approach for parallel generation of big synthetic sequence databases

Rim Moussa

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016) 9508 61-76

DOI: 10.1007/978-3-319-31409-9_5

Abstract

The recognition that data has great economic value, together with significant hardware advances in low-cost data storage, high-speed networks, and high-performance parallel computing, fosters new research directions on large-scale knowledge discovery from big sequence databases. Many applications involve sequence databases, such as customer shopping sequences, web clickstreams, and biological sequences, and all of them face the big data problem. Fast mining of billions of sequences is undoubtedly a challenge. However, because big data sets are unavailable, it is not possible to assess knowledge discovery algorithms over big sequence databases. For both privacy and security reasons, companies do not disclose their data. On the other hand, existing synthetic sequence generators are not up to the big data challenge. In this paper, we first propose a formal and scalable approach for parallel generation of big synthetic sequence databases. Based on Whitney numbers, the underlying Parallel Sequence Generator (i) creates billions of distinct sequences in parallel and (ii) ensures that injected sequential patterns satisfy user-specified sequence characteristics. Second, we report a scalability and scale-out performance study of the Parallel Sequence Generator for various sequence database sizes and various numbers of Sequence Generators in a shared-nothing cluster of nodes.

Cite

Moussa, R. (2016). Big-SeqDB-Gen: A formal and scalable approach for parallel generation of big synthetic sequence databases. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9508, pp. 61–76). Springer Verlag. https://doi.org/10.1007/978-3-319-31409-9_5
