Feature frequency profiles for automatic sample identification using PySpark

Abstract

When the identity of a next-generation sequencing sample is lost, its reads or assembled contigs are aligned to a database of known genomes, and the sample is classified as the genome with the most hits. However, alignment-based methods are very expensive when dealing with millions of reads and several thousand genomes that contain homologous sequences. Instead of relying on alignment, samples and references could be compared and classified by their feature frequency profiles (FFPs), which are similar to the word frequency profiles (n-grams) used to compare bodies of text. The FFP is also well suited to a metagenomics setting, where a mixed sample can be reconstructed from a pool of reference profiles using a linear model or optimization techniques. To test the robustness of this method, an assortment of samples will be matched to complete references from NCBI Genome. Since a MapReduce framework is ideal for calculating feature frequencies in parallel, this method will be implemented using the PySpark API and run at scale on Wrangler, an XSEDE system designed for big data analytics.
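
As a rough illustration of the idea in the abstract, the following minimal sketch builds a k-mer frequency profile for a sequence and assigns a sample to the closest reference profile. It is not the authors' implementation: it uses plain Python rather than PySpark, and the function names (`ffp`, `js_divergence`, `classify`), the feature length `k=3`, the choice of Jensen-Shannon divergence as the distance, and the toy sequences are all assumptions made for this example. In a PySpark version, the counting step would presumably map onto `flatMap` and `reduceByKey` over an RDD of reads.

```python
from collections import Counter
import math


def ffp(seq, k=3):
    """Return the normalized k-mer frequency profile of a sequence.

    k=3 is an arbitrary choice for this toy example (assumption).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k: no features to count
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles.

    The distance measure is an assumption; the abstract does not name one.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


def classify(sample, references, k=3):
    """Return the name of the reference whose profile is closest to the sample."""
    sample_profile = ffp(sample, k)
    return min(
        references,
        key=lambda name: js_divergence(sample_profile, ffp(references[name], k)),
    )


# Toy data, invented for illustration only (not real genomes).
references = {
    "ref_a": "ACGTACGTACGTACGT",
    "ref_b": "TTTTGGGGTTTTGGGG",
}
best_match = classify("ACGTACGTAC", references)
print(best_match)
```

No alignment is performed at any point: each sequence is reduced to a dictionary of feature frequencies, so comparing a sample against many references only requires comparing profiles.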

Cite

Zynda, G., Gaffney, N., Dalkilic, M., & Vaughn, M. (2015). Feature frequency profiles for automatic sample identification using PySpark. In Proceedings of PyHPC 2015: 5th Workshop on Python for High-Performance and Scientific Computing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, Inc. https://doi.org/10.1145/2835857.2835862
