DupChecker: A bioconductor package for checking high-throughput genomic data redundancy in meta-analysis

Quanhu Sheng; Yu Shyr; Xi Chen

Journal Article (Open Access)

BMC Bioinformatics (2014) 15(1)

DOI: 10.1186/1471-2105-15-323

Abstract

Background: Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns in datasets. However, when publicly available databases are used for meta-analysis, duplication of samples is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false-positive findings, misleading clustering patterns, model over-fitting, and similar problems in subsequent data analysis.

Results: We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real data example demonstrates the usage and output of the package.

Conclusions: Researchers may not pay enough attention to checking for and removing duplicated samples, and the resulting data contamination can make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression data sets before any data analysis step.
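The fingerprinting idea in the Results can be illustrated with a minimal Python sketch. This is not DupChecker's API (the package itself is written for R/Bioconductor); the function names `md5_fingerprint` and `find_duplicates` are hypothetical, and the sketch assumes each sample is stored as a single raw data file on disk.

```python
import hashlib
from collections import defaultdict


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large raw files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group files by MD5 digest and keep only digests shared by 2+ files.

    Hypothetical helper for illustration; returns {digest: [paths...]}.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[md5_fingerprint(path)].append(str(path))
    return {md5: sorted(files) for md5, files in groups.items() if len(files) > 1}
```

Under these assumptions, two files with identical bytes receive the same fingerprint even if they were deposited under different names or in different datasets. Files that differ in even one byte receive different fingerprints, so a sketch like this would flag only exact copies of raw files, not reprocessed or normalized versions of the same sample.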

Cite

Sheng, Q., Shyr, Y., & Chen, X. (2014). DupChecker: A bioconductor package for checking high-throughput genomic data redundancy in meta-analysis. BMC Bioinformatics, 15(1). https://doi.org/10.1186/1471-2105-15-323
