Predicting protein-binding regions in RNA using nucleotide profiles and compositions

Daesik Choi; Byungkyu Park; Hanju Chae; Wook Lee; Kyungsook Han

Journal Article, Open Access


BMC Systems Biology (2017) 11

DOI: 10.1186/s12918-017-0386-4


Abstract

Background: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use.
Results: We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained on a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed lower performance than the 10-fold cross validation, but the performance remained high (87.6% accuracy and 0.752 MCC). On independent datasets, the model achieved an accuracy of 82.2% and an MCC of 0.656. Testing our model and other state-of-the-art methods on the same dataset showed that our model performed better than the others.
Conclusions: Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. However, a slight performance gain was obtained when the sequence profiles were used together with nucleotide compositions. These are preliminary results of ongoing research, but they demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding.
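The abstract does not give the exact definition of the log-odds profiles, so the following is only a rough sketch of one plausible construction. It assumes that the score of a mono- or di-nucleotide is the base-2 log ratio of its smoothed frequency in binding regions to its frequency in non-binding regions, and that a window is encoded as its per-position scores followed by its nucleotide composition. All function names, the pseudocount, and the toy sequences are hypothetical and may differ from the authors' implementation.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGU"

def kmer_counts(seqs, k):
    """Count overlapping k-mers across a list of RNA sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def log_odds_table(binding, non_binding, k, pseudo=1.0):
    """Assumed scoring: log2(freq in binding / freq in non-binding)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    pos, neg = kmer_counts(binding, k), kmer_counts(non_binding, k)
    # The pseudocount keeps unseen k-mers from producing log(0).
    pos_total = sum(pos[m] for m in kmers) + pseudo * len(kmers)
    neg_total = sum(neg[m] for m in kmers) + pseudo * len(kmers)
    return {
        m: math.log2(((pos[m] + pseudo) / pos_total)
                     / ((neg[m] + pseudo) / neg_total))
        for m in kmers
    }

def profile(seq, table, k):
    """Per-position log-odds scores; unknown k-mers score 0."""
    return [table.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]

def composition(seq):
    """Fraction of each nucleotide in the window."""
    n = len(seq) or 1
    return [seq.count(b) / n for b in ALPHABET]

def features(seq, mono, di):
    """Concatenate mono- and di-nucleotide profiles with composition."""
    return profile(seq, mono, 1) + profile(seq, di, 2) + composition(seq)

# Toy sequences, invented purely for illustration.
binding = ["GGGG", "GGGA"]
non_binding = ["AAAA", "AAAU"]
mono = log_odds_table(binding, non_binding, k=1)
di = log_odds_table(binding, non_binding, k=2)
vector = features("GGAU", mono, di)
print(len(vector), vector)
```

For fixed-length windows, vectors of this kind could be passed to an SVM implementation; the kernel and parameters used by the authors are not stated in the abstract.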


Cite


Choi, D., Park, B., Chae, H., Lee, W., & Han, K. (2017). Predicting protein-binding regions in RNA using nucleotide profiles and compositions. BMC Systems Biology, 11. https://doi.org/10.1186/s12918-017-0386-4
