A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data

Abstract

Background: Most approaches to splice site prediction are based on machine learning techniques. Although these approaches achieve high prediction accuracy, they rely on long sequence windows. They may therefore be unsuitable for predicting novel splice variants from the short sequence reads generated by next-generation sequencing technologies. Furthermore, machine learning techniques require numerically encoded data, and their accuracy varies with the encoding procedure. Splice site prediction using short sequence motifs, without encoding the sequence data, was therefore the motivation for the present study.

Results: An approach for finding associations among nucleotide bases in splice site motifs was developed and then used to determine the appropriate window size. In addition, an approach for predicting donor splice sites using a sum of absolute error criterion is proposed. The proposed approach was compared with commonly used approaches, namely Maximum Entropy Modeling (MEM), Maximal Dependency Decomposition (MDD), Weighted Matrix Method (WMM) and the first-order Markov Model (MM1). It performed on par with MEM and MDD and better than WMM and MM1 in terms of prediction accuracy.

Conclusions: The proposed approach can predict donor splice sites with higher accuracy using short sequence motifs and can therefore serve as a complementary method to the existing approaches. Based on the proposed methodology, a web server was also developed for easy prediction of donor splice sites and is available at http://cabgrid.res.in:8080/sspred.
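To make one of the baseline methods named in the abstract concrete, the following is a minimal sketch of a weight-matrix style scorer in the spirit of WMM. It is not the authors' implementation and not the proposed sum of absolute error approach. The toy 9-mer windows, the pseudocount, and the function names `position_frequencies` and `wmm_score` are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(seqs, pseudocount=1.0):
    """Per-position base frequencies for equal-length sequences.

    The pseudocount (an assumption of this sketch) avoids zero frequencies.
    """
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs

def wmm_score(seq, true_freqs, false_freqs):
    """Log-odds score; positive values favour the true-site model."""
    return sum(
        math.log(true_freqs[i][b] / false_freqs[i][b])
        for i, b in enumerate(seq)
    )

# Made-up toy windows: 3 exon bases, the GT dinucleotide, 4 intron bases.
true_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
false_sites = ["TTCGTCCAT", "ACTGTTTCA", "GGAGTCCTG", "CTTGTACCC"]

true_freqs = position_frequencies(true_sites)
false_freqs = position_frequencies(false_sites)

score_true = wmm_score("CAGGTAAGT", true_freqs, false_freqs)
score_false = wmm_score("TTCGTCCAT", true_freqs, false_freqs)
```

A window would be called a donor site when its score exceeds a chosen threshold. Because each position is scored independently, a scorer of this kind cannot capture associations between positions, which is the kind of dependency the study examines.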

Figures

  • Figure 1 Heat maps of TSS and FSS. Heat maps of (a) TSS and (b) FSS were generated from the corresponding association matrices. The association matrices were built from 20 positions (10 at the exon end and 10 at the intron start, excluding GT). Because each position corresponds to four indicator variables, each heat map is of order 80 × 80 units; units 29–40 correspond to the 3 bp at the exon end and units 41–64 to the 6 bp at the intron start. A distinct association pattern exists among the positions around the conserved dinucleotide GT in TSS, whereas no such pattern is present in FSS.
  • Figure 2 Percentage of similarity within and between TSS and FSS. Panels: [...] (b) within FSS, (c) TSS with FSS, (d) FSS with TSS. [Remainder of caption truncated in source.]
  • Table 1 Threshold values and estimates of AUC-ROC for the proposed approach under different window sizes
  • Figure 3 ROC curves for the proposed approach on balanced data with different window lengths (WL).
  • Table 2 Number of non-redundant TSS and FSS sequences under different degrees of imbalance
  • Figure 4 ROC and PR curves for the proposed prediction approach. (a) ROC curves and (b) PR curves are plotted using the sensitivity and specificity obtained from the test sets of 10-fold cross-validation under different degrees of imbalance. The red curve corresponds to the balanced data. The green, blue and purple curves correspond to the datasets with the degrees of imbalance indicated in the legend. The legends for the PR curves are the same as those for the ROC curves.
  • Figure 5 ROC and PR curves for different splice site prediction approaches using the HS3D dataset. (A) ROC curves and (B) PR curves for the proposed (SAE) and the other considered approaches in the prediction of donor splice sites are plotted for (a) the balanced dataset and for imbalanced datasets with unequal numbers of TSS and FSS, i.e., (b) 1960 and 5000, (c) 1960 and 10000 and (d) 1960 and 15000, respectively.
  • Table 3 Estimates of AUC-ROC and AUC-PR for the proposed approach (non-redundant case)
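
The unit arithmetic in the Figure 1 caption can be checked with a short sketch. It assumes 1-based positions, four indicator units per position, and a base order of A, C, G, T within each position. The caption does not state the column order, and the helper names `one_hot` and `unit_range` are hypothetical.

```python
BASES = "ACGT"  # assumed column order within each position

def one_hot(seq):
    """Flatten a sequence into four indicator values per position."""
    return [1 if base == b else 0 for base in seq for b in BASES]

def unit_range(first_pos, last_pos):
    """1-based inclusive unit range covered by 1-based positions."""
    return (4 * (first_pos - 1) + 1, 4 * last_pos)

n_positions = 20            # 10 exon-end + 10 intron-start positions (GT excluded)
n_units = 4 * n_positions   # order of the association matrix

# Last 3 exon positions are positions 8-10 of the 20.
exon_last3 = unit_range(8, 10)
# First 6 intron positions after GT are positions 11-16 of the 20.
intron_first6 = unit_range(11, 16)

encoded = one_hot("CAGGTAAGTA" * 2)  # made-up 20-base window
```

Under these assumptions `n_units` is 80, `exon_last3` is `(29, 40)` and `intron_first6` is `(41, 64)`, matching the ranges given in the caption.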

Cite

Meher, P. K., Sahu, T. K., Rao, A. R., & Wahi, S. D. (2014). A statistical approach for 5’ splice site prediction using short sequence motifs and without encoding sequence data. BMC Bioinformatics, 15(1). https://doi.org/10.1186/s12859-014-0362-6
