Abstract
Pseudouridine ($\Psi$) is the most prevalent RNA modification and is formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided pseudouridine-synthase-specific $\Psi$ site recognition is becoming possible. In this paper, we propose an ensemble approach to identifying pseudouridine sites, named EnsemPseU. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) classifiers was used to build our prediction model. In testing, chi-square feature selection improved accuracy by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus. Moreover, when evaluated via 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.
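To make the encoding step concrete, here is a minimal sketch of four of the five encodings (kmer, binary, NCP, and ND) in plain Python. It is not the authors' implementation. The function names, the `ACGU` column order, the default `k=2`, and the NCP triplets (a commonly used convention, with A = (1, 1, 1), C = (0, 1, 0), G = (1, 0, 0), U = (0, 0, 1)) are all assumptions, as is the definition of ND as the running frequency of each nucleotide up to its position. ENAC is omitted for brevity, and the input is assumed to contain only A, C, G, and U (or T).

```python
from itertools import product

ALPHABET = "ACGU"  # assumed column order for the k-mer and one-hot encodings

# Assumed NCP convention: (ring structure, functional group, hydrogen bonding)
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def kmer_encode(seq, k=2):
    """Frequency of every possible k-mer, normalised by the number of windows."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def binary_encode(seq):
    """One-hot encoding: four bits per position."""
    return [1 if base == ref else 0 for base in seq for ref in ALPHABET]


def ncp_encode(seq):
    """Nucleotide chemical property: three bits per position."""
    return [bit for base in seq for bit in NCP[base]]


def nd_encode(seq):
    """Nucleotide density: running frequency of the base seen at each position."""
    seen = dict.fromkeys(ALPHABET, 0)
    densities = []
    for i, base in enumerate(seq, start=1):
        seen[base] += 1
        densities.append(seen[base] / i)
    return densities


def encode(seq, k=2):
    """Concatenate the four encodings into one flat feature vector."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return kmer_encode(seq, k) + binary_encode(seq) + ncp_encode(seq) + nd_encode(seq)
```

For a sequence of length n this yields 4^k + 4n + 3n + n features, so `encode("ACGUU")` has 16 + 40 = 56 entries. Vectors of this kind are what a chi-square feature selection step and the downstream classifiers would consume.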
Bi, Y., Jin, D., & Jia, C. (2020). EnsemPseU: Identifying Pseudouridine Sites with an Ensemble Approach. IEEE Access, 8, 79376–79382. https://doi.org/10.1109/ACCESS.2020.2989469