This article is free to access.
Background: Our experience with running various types of classification on the CAMDA neuroblastoma dataset has led us to the conclusion that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims to point out several factors that may influence the downstream machine learning analysis. In particular, these factors are the type of primary analysis, the type of classifier, and the increased correlation between genes sharing a protein domain. They influence the analysis directly, but the interplay between them may also be important. We have compiled a gene-domain database and used it to examine the differences between genes that share a domain and the rest of the genes in the datasets.

Results: The major findings include increased Spearman's correlation coefficients of counts for pairs of genes that share a domain.

Conclusions: The effect of sharing a domain is likely more a result of real biological co-expression than of sequence similarity and artifacts of mapping and counting. Still, this is more difficult to conclude and needs further research. The effect is interesting in itself, but we also point out some practical aspects in which it may influence RNA sequencing analysis and RNA biomarker use. In particular, it means that a gene signature biomarker set built from RNA-sequencing results should be depleted of genes sharing common domains. Such a depleted set may perform better in classification.

Reviewers: This article was reviewed by Dimitar Vassiliev and Susmita Datta.
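The two ideas in the abstract, comparing Spearman's correlation of counts between gene pairs that do and do not share a domain, and depleting a signature of genes sharing a domain, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy counts, the placeholder domain labels, and the greedy depletion rule are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / (var_x * var_y) ** 0.5


def split_pair_correlations(counts, gene_domains):
    """Group pairwise correlations by whether the two genes share a domain.

    counts: dict gene -> list of per-sample counts (same sample order).
    gene_domains: dict gene -> set of domain identifiers (hypothetical mapping).
    """
    sharing, other = [], []
    for g1, g2 in combinations(sorted(counts), 2):
        rho = spearman(counts[g1], counts[g2])
        shares = bool(gene_domains.get(g1, set()) & gene_domains.get(g2, set()))
        (sharing if shares else other).append(rho)
    return sharing, other


def deplete_signature(ranked_genes, gene_domains):
    """Assumed greedy rule: keep a gene only if it shares no domain
    with any gene already kept, walking the list in ranked order."""
    kept, seen = [], set()
    for gene in ranked_genes:
        domains = gene_domains.get(gene, set())
        if domains & seen:
            continue
        kept.append(gene)
        seen |= domains
    return kept


# Toy data, invented for illustration only (not from the CAMDA dataset).
counts = {
    "geneA": [10, 20, 30, 40, 50],
    "geneB": [12, 25, 31, 45, 70],
    "geneC": [50, 10, 40, 20, 30],
}
gene_domains = {
    "geneA": {"domain_1"},
    "geneB": {"domain_1"},
    "geneC": {"domain_2"},
}

sharing, other = split_pair_correlations(counts, gene_domains)
signature = deplete_signature(["geneA", "geneB", "geneC"], gene_domains)
```

On real data the two correlation groups would be compared as distributions, and the ranked gene list would come from whatever feature selection precedes classification. The greedy rule is only one of several plausible ways to deplete a signature.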
Leśniewska, A., Zyprych-Walczak, J., Szabelska-Beręsewicz, A., & Okoniewski, M. J. (2018). Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures. Biology Direct, 13(1). https://doi.org/10.1186/s13062-018-0205-x