Background: The advent of next-generation RNA sequencing (RNA-seq) has greatly advanced transcriptomic studies, including system-wide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a high-throughput manner from RNA-seq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions. Results: We have developed a preselection method called "Non-negative Matrix Factorization Preselection" (NMFP) which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNA-seq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms. Conclusion: NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. Adding NMFP as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNA-seq data.
Ye, Y., & Li, J. J. (2016). NMFP: A non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data. BMC Genomics, 17(1). https://doi.org/10.1186/s12864-015-2304-8