Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis.
- PubMed: 9322041
Abstract
Translation in eukaryotes does not always start at the first AUG in an mRNA, implying that context information also plays a role. This makes prediction of translation initiation sites a non-trivial task, especially when analysing EST and genome data where the entire mature mRNA sequence is not known. In this paper, we employ artificial neural networks to predict which AUG triplet in an mRNA sequence is the start codon. The trained networks correctly classified 88% of Arabidopsis and 85% of vertebrate AUG triplets. We find that our trained neural networks use a combination of local start codon context and global sequence information. Furthermore, analysis of false predictions shows that AUGs in frame with the actual start codon are more frequently selected than out-of-frame AUGs, suggesting that our networks use reading frame detection. A number of conflicts between neural network predictions and database annotations are analysed in detail, leading to identification of possible database errors.
Anders Gorm Pedersen and Henrik Nielsen
Center for Biological Sequence Analysis
The Technical University of Denmark, Building 206
DK-2800 Lyngby, Denmark
Keywords: translation initiation, start codon, Kozak box, neural networks, signal peptides
Introduction
The choice of start codon in eukaryotes depends on position as well as on context. Usually, translational initiation takes place at the first occurrence of the triplet AUG in an mRNA, but in some cases an AUG further downstream is selected. This is explained by the so-called scanning hypothesis, which states that the small subunit of the ribosome binds at the capped 5’-end of the mRNA and subsequently scans the sequence until the first start codon in a suitable context is found (Kozak 1983; 1984; Cigan & Donahue 1987; Joshi 1987; Kozak 1989). It has been reported that downstream AUGs are used as start codons in fewer than 10 % of investigated eukaryotic mRNAs (Kozak 1989; Yoon & Donahue 1992). Previous analyses of start codon contexts found the consensus of eukaryotic translation initiation sites to be GCCACCaugG (Kozak 1984; 1987), but further analyses have demonstrated that the pattern varies between different groups of eukaryotes (Cavener 1987; Lütcke et al. 1987; Joshi 1987; Cigan & Donahue 1987; Yamauchi 1991; Cavener & Ray 1991) and that these differences are statistically significant (Pedersen & Nielsen 1997). Specifically, all vertebrates that have been investigated have similar start codon contexts, as do the two monocots rice and corn, while several other eukaryotic species have significantly different signals (Pedersen & Nielsen 1997).
Anders Gorm Pedersen: Phone: (+45) 45 25 24 84; Fax: (+45) 45 93 48 08; email: gorm@cbs.dtu.dk
Henrik Nielsen: Phone: (+45) 45 25 24 70; email: hnielsen@cbs.dtu.dk
Since fewer than 10 % of all eukaryotic mRNAs reportedly utilize downstream AUGs as start codons, it should be possible to predict translation initiation sites with more than 90 % accuracy simply by selecting the first AUG, provided that complete and error-free mRNA sequences are available. This, however, is very rarely the case in sequence analysis. We find that even when great care is taken to extract GenBank nucleotide data that is annotated as being equivalent to mature mRNA, almost 40 % of the sequences contain upstream AUGs. This problem is exacerbated when using unannotated genome data, and when analysing expressed sequence tags (ESTs). ESTs are partial, single-pass cDNA sequences that generally represent the complement of mRNAs in the cell, but that, due to the very nature of the technology, usually contain more errors (Boguski, Lowe, & Tolstoshev 1993; Boguski & Tolstoshev 1994; Cooke et al. 1996; Benson et al. 1997). Thus, uncertainties can exist regarding which end of an mRNA the EST corresponds to; it is not always known whether the entire 5’ (or 3’) end is represented in the EST; the sequence can potentially be contaminated with vector sequence; and the automated single-pass sequencing results in a higher error rate than is found in normal genome data.
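
The "first-AUG" baseline discussed above can be illustrated with a minimal Python sketch. It is not part of the method presented in this paper; the function names and the toy sequence are hypothetical and chosen only to show how a missing 5’ end changes the answer the rule gives.

```python
from typing import List, Optional


def find_augs(mrna: str) -> List[int]:
    """Return 0-based positions of the A in every AUG triplet (all frames)."""
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA alphabets
    positions = []
    idx = seq.find("AUG")
    while idx != -1:
        positions.append(idx)
        idx = seq.find("AUG", idx + 1)
    return positions


def first_aug_rule(mrna: str) -> Optional[int]:
    """Baseline predictor: pick the first AUG, or None if there is none."""
    positions = find_augs(mrna)
    return positions[0] if positions else None


# Toy example (hypothetical sequence): the same transcript, complete and
# with its 5' end missing, as could happen for an EST.
full_mrna = "GGCACCAUGGCUAAAUGCC"
truncated = full_mrna[7:]
```

On the complete toy sequence the rule selects the AUG at position 6; on the truncated copy that AUG is no longer intact, so the rule selects the downstream AUG instead. This is the failure mode that motivates a context-based predictor.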
These problems make the prediction of translation initiation sites a non-trivial task. In this paper we present a method for prediction of start codons that is based on the use of artificial neural networks. The results presented here are preliminary and we are still in the process of developing the method, but we find the current performance to be convincing. The method does not require any knowledge of
believe it can be useful in connection with analysis of EST
data and incompletely annotated genome sequences.
Methods
Data
Extraction All data were extracted from GenBank, release 95 (Benson et al. 1997). We extracted a vertebrate group consisting of sequences from Bos taurus (cow), Gallus gallus (chicken), Homo sapiens (man), Mus musculus (mouse), Oryctolagus cuniculus (rabbit), Rattus norvegicus (rat), Sus scrofa (pig), and Xenopus laevis (African clawed frog). We have previously shown that these vertebrates have similar start codon contexts (Pedersen & Nielsen 1997). Additionally, we have chosen a data set showing large deviation from vertebrates, Arabidopsis thaliana (thale cress, a dicot plant).
Nuclear genes with an annotated start codon were selected. The sequences were processed in the following way: all sequences were “spliced” by removing possible introns and joining the remaining exon parts. From the resulting data set, sequences containing at least 10 nucleotides upstream of the initiation point and at least 150 nucleotides downstream (relative to the A in AUG) were selected. All sequences containing non-nucleotide symbols in the interval mentioned above (typically due to incomplete sequencing) were excluded.
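
The selection criteria above can be sketched in Python as follows. This is an illustrative reading, not the extraction code used in this study: the function name is hypothetical, the start position is assumed to be the 0-based index of the A in the annotated AUG within the already spliced sequence, and "150 nucleotides downstream" is assumed to mean 150 positions after that A.

```python
def passes_selection(spliced: str, start: int,
                     upstream: int = 10, downstream: int = 150) -> bool:
    """Check the length and alphabet criteria for one spliced sequence.

    start is the 0-based index of the A in the annotated start codon.
    """
    lo = start - upstream
    hi = start + downstream  # last position that must exist (assumption)
    if lo < 0 or hi >= len(spliced):
        return False  # not enough flanking sequence
    window = spliced[lo:hi + 1].upper()
    # Reject ambiguity codes such as N inside the required interval.
    return set(window) <= set("ACGTU")


# Toy sequences (hypothetical): 10 nt upstream, ATG, then 148 more nt.
good = "C" * 10 + "ATG" + "C" * 148
short_upstream = "C" * 9 + "ATG" + "C" * 148
ambiguous = "C" * 10 + "ATG" + "N" + "C" * 147
```

Only symbols inside the required interval are checked, so an ambiguous base further out does not exclude a sequence under this reading.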
Redundancy All sequence databases are redundant due to the presence of genes belonging to gene families, homologous genes from different organisms, and sequences submitted to the database more than once. Unless this redundancy is reduced before performing statistical analysis, the result will be biased towards the over-represented sequences, and the performance of prediction methods will be overestimated (Sander & Schneider 1991; Hobohm et al. 1992). We performed a very thorough reduction of redundancy using algorithm 2 of Hobohm et al. (1992) and a novel method for finding a similarity cut-off that we have described elsewhere (Pedersen & Nielsen 1997). Briefly, this method is based on performing all pairwise alignments for a data set, fitting the resulting Smith-Waterman scores to an extreme value distribution (Altschul et al. 1994), and choosing a value above which there are more observations than expected from the distribution.
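
The greedy idea behind algorithm 2 of Hobohm et al. (1992) can be sketched in Python as below. This is a simplified illustration, not the code used in this study: the pairs of "similar" sequences are taken as given (in practice they would be the pairs whose alignment score exceeds the chosen cut-off), and the function name and the tie-breaking rule are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def hobohm2(n_items: int, similar_pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Repeatedly remove the item with the most remaining neighbours.

    Items are numbered 0..n_items-1. Returns the indices that are kept,
    none of which is similar to any other kept item.
    """
    neighbours: Dict[int, Set[int]] = {i: set() for i in range(n_items)}
    for a, b in similar_pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    while neighbours:
        # Most neighbours first; ties go to the lowest index (assumption).
        worst = max(neighbours, key=lambda i: (len(neighbours[i]), -i))
        if not neighbours[worst]:
            break  # no similar pairs remain
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return sorted(neighbours)
```

For example, with five sequences where sequence 0 is similar to 1, 2 and 3, and 3 is similar to 4, sequence 0 is removed first, then 3, leaving 1, 2 and 4.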
The sizes of the redundancy-reduced data sets were: 3312 vertebrate sequences and 523 Arabidopsis thaliana sequences. These data sets are available from the authors upon request.
Neural Networks
The neural networks used in this study were of the feed-forward type, and had three layers of neurons (Hertz, Krogh, & Palmer 1991). They were written in the FORTRAN programming language by Søren Brunak, and have previously been used for several other prediction purposes [e.g., (Brunak, Engelbrecht, & Knudsen 1990; 1991; Hansen et al. 1995)]. Inputs were presented to the networks by encoding the DNA sequence into a binary string, using a coding scheme where each nucleotide is represented by 4 binary digits: A=0001, C=0010, G=0100, and T=1000 (sparse encoding). The output layer consisted of two neurons: one predicting whether the central position in the window was the A in a start codon AUG, the other predicting whether it was the A in a non-start codon AUG. The output of the network was interpreted by believing the output neuron with the highest score (the “winner takes all” approach). Neural network performance was estimated using the Matthews correlation coefficient (Matthews 1975).
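
The input encoding, the output rule and the performance measure described above can be sketched in Python. This is an illustration only and not the FORTRAN implementation: the function names are hypothetical, ties between the two output neurons are assumed to count as "non-start", and the coefficient is assumed to be 0 when its denominator vanishes.

```python
import math
from typing import List

# Sparse encoding as given in the text: four binary digits per nucleotide.
SPARSE = {"A": (0, 0, 0, 1), "C": (0, 0, 1, 0),
          "G": (0, 1, 0, 0), "T": (1, 0, 0, 0)}


def encode_window(window: str) -> List[int]:
    """Turn a sequence window into the flat binary input vector."""
    bits: List[int] = []
    for base in window.upper().replace("U", "T"):
        bits.extend(SPARSE[base])
    return bits


def winner_takes_all(start_score: float, non_start_score: float) -> bool:
    """Believe the output neuron with the highest score."""
    return start_score > non_start_score


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

The coefficient ranges from -1 (every prediction wrong) to +1 (every prediction right). Unlike the plain percentage of correct predictions, it is not inflated when one class, here the non-start AUGs, is much more common than the other.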
Prediction of Signal Peptides
In order to test our method for prediction of start codons, we have combined it with a method for prediction of signal peptides in amino acid sequences: the SignalP server (Nielsen et al. 1997). This method uses a combination of neural networks to predict the presence of signal peptides and the location of their cleavage sites.
SignalP returns three scores from every position in the sequence: a cleavage site score (C-score) from networks trained to recognise cleavage sites, a signal peptide score (S-score) from networks trained to distinguish between signal peptide and non-signal peptide positions, and a combined cleavage site score (Y-score), which optimises the prediction of cleavage site location by combining the C-score with the derivative of the S-score. Discrimination between signal peptides and N-terminals of non-secretory proteins is performed using the maximal value of one of the three scores or the mean value of the S-score (from the N-terminus to the position with maximal Y-score). Each network ensemble has a specific threshold value for each of these measures.
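
As a small illustration of the mean S-score measure, the Python sketch below takes per-position S-scores and Y-scores as plain lists. It does not call SignalP or reproduce its code: the function name and the toy scores are hypothetical, and the position with maximal Y-score is assumed to be included in the mean.

```python
from typing import Sequence


def mean_s_score(s_scores: Sequence[float], y_scores: Sequence[float]) -> float:
    """Mean S-score from the N-terminus to the position of maximal Y-score."""
    # Index of the maximal Y-score (first one if there are ties).
    cut = max(range(len(y_scores)), key=lambda i: y_scores[i])
    prefix = s_scores[:cut + 1]  # inclusive of that position (assumption)
    return sum(prefix) / len(prefix)


# Toy scores (hypothetical): high S-scores up to a Y-score peak at index 2.
s_toy = [0.9, 0.8, 0.7, 0.2, 0.1]
y_toy = [0.1, 0.2, 0.6, 0.3, 0.1]
```

The resulting value would then be compared with the threshold specific to the network ensemble; no threshold is hard-coded here because the text above does not give one.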
Results and Discussion
As mentioned, it should be possible to predict translation initiation sites at better than 90 % accuracy if one has access to entire error-free mRNA sequences. However, when we analysed our data sets with the purpose of extracting sequences corresponding to mature mRNAs, we found that only about 10 % (387 out of 3312) of the sequences in the vertebrate set had sufficient annotation for this purpose. (In the remaining cases the exact in vivo transcriptional startpoints and upstream splice sites have not been determined.) Further analysis of the resulting vertebrate mRNAs demonstrated that almost 40 % (150 out of 387) contained one or more upstream AUGs. Thus, it was only possible to use the simple “first-AUG” rule in the remain-


