The mechanism by which protein-coding portions of eukaryotic genes came to be separated by long non-coding stretches of DNA, and the purpose for this perplexing arrangement, have remained unresolved fundamental biological problems for three decades. We report here a plausible solution to this problem based on analysis of open reading frame (ORF) length constraints in the genomes of nine diverse species. If primordial nucleic acid sequences were random in sequence, functional proteins that are innately long would not be encoded due to the frequent occurrence of stop codons. The best possible way that a long protein-coding sequence could have been derived was by evolving a split-structure from the random DNA (or RNA) sequence. Results of the systematic analyses of nine complete genome sequences presented here suggests that perhaps the major underlying structural features of split-genes have evolved due to the indigenous occurrence of split protein-coding genes in primordial random nucleotide sequence. The results also suggest that intron-rich genes containing short exons may have been the original form of genes intrinsically occurring in random DNA, and that intron-poor genes containing long exons were perhaps derived from the original intron-rich genes.
Regulapati, R., Bhasi, A., Singh, C. K., & Senapathy, P. (2008). Origination of the split structure of spliceosomal genes from random genetic sequences. PLoS ONE, 3(10). https://doi.org/10.1371/journal.pone.0003456