Feature frequency profiles for automatic sample identification using PySpark

Gregory Zynda; Niall Gaffney; Mehmet Dalkilic; Matthew Vaughn

Conference Proceedings

Proceedings of PyHPC 2015: 5th Workshop on Python for High-Performance and Scientific Computing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis (2015)

DOI: 10.1145/2835857.2835862

Abstract

When the identity of a next-generation sequencing sample is lost, reads or assembled contigs are aligned to a database of known genomes, and the sample is classified as the match with the most hits. However, alignment-based methods are very expensive when dealing with millions of reads and several thousand genomes with homologous sequences. Instead of relying on alignment, samples and references could be compared and classified by their feature frequency profiles (FFPs), which are similar to the word frequency profiles (n-grams) used to compare bodies of text. The FFP is also well suited to a metagenomics setting, where a mixed sample can be reconstructed from a pool of reference profiles using a linear model or optimization techniques. To test the robustness of this method, an assortment of samples will be matched to complete references from NCBI Genome. Since a MapReduce framework is ideal for calculating feature frequencies in parallel, this method will be implemented using the PySpark API and run at scale on Wrangler, an XSEDE system designed for big data analytics.
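The profile comparison described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`kmers`, `ffp`, `js_divergence`, `classify`), the feature length `k`, and the toy sequences are hypothetical, and the Jensen-Shannon divergence is an assumed distance measure, since the abstract does not say how profiles are compared. The counting loop mirrors the MapReduce pattern: emitting k-mers per read is the map step and summing counts per k-mer is the reduce step, which in PySpark would roughly correspond to `flatMap` followed by `reduceByKey`.

```python
from collections import Counter
from math import log2


def kmers(seq, k):
    """Yield every overlapping substring of length k (the features)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def ffp(reads, k):
    """Build a feature frequency profile: k-mer counts normalized to sum to 1."""
    counts = Counter()
    for read in reads:
        # Map step: each read emits its k-mers.
        # Reduce step: counts for identical k-mers are summed.
        counts.update(kmers(read, k))
    total = sum(counts.values())
    if total == 0:
        return {}  # no read was long enough to contain a k-mer
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two profiles.

    Assumed distance measure: 0 for identical profiles,
    1 for profiles that share no features.
    """
    div = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = (pi + qi) / 2
        if pi > 0:
            div += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * log2(qi / mi)
    return div


def classify(sample_reads, references, k):
    """Return the name of the reference whose profile is closest to the sample."""
    sample = ffp(sample_reads, k)
    return min(
        references,
        key=lambda name: js_divergence(sample, ffp([references[name]], k)),
    )


# Toy data, invented for illustration only.
toy_references = {
    "ref_a": "ACGTACGTACGTACGT",
    "ref_b": "TTTTGGGGTTTTGGGG",
}
toy_reads = ["ACGTACG", "GTACGT"]
best_match = classify(toy_reads, toy_references, k=3)
```

Because profiles are compared as frequency vectors, no read is ever aligned to a reference; the cost is dominated by counting, which parallelizes naturally across reads.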

Cite

Zynda, G., Gaffney, N., Dalkilic, M., & Vaughn, M. (2015). Feature frequency profiles for automatic sample identification using PySpark. In Proceedings of PyHPC 2015: 5th Workshop on Python for High-Performance and Scientific Computing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, Inc. https://doi.org/10.1145/2835857.2835862
