Statistical analysis of over-represented words in human promoter sequences

Abstract

The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, we extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many do not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
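As a rough illustration of the word-counting idea described in the abstract, the following Python sketch counts eight-letter words in a set of sequences and scores one word with a z-score and a normal-approximation P-value. The abstract does not specify the background model, so the sketch assumes independent nucleotides with frequencies estimated from the input and a binomial variance that ignores word self-overlap; this is a simplification and not necessarily the authors' method. All function names are hypothetical, and the two extra statistical controls mentioned in the abstract are not reproduced here.

```python
from collections import Counter
from math import erfc, sqrt


def count_words(seqs, k=8):
    """Count every k-letter word (overlapping windows) made only of A, C, G, T."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            if set(w) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[w] += 1
    return counts


def base_freqs(seqs):
    """Estimate single-nucleotide frequencies from the sequences themselves."""
    c = Counter(ch for s in seqs for ch in s.upper() if ch in "ACGT")
    total = sum(c.values())
    return {b: c[b] / total for b in "ACGT"}


def z_score(word, observed, n_positions, freqs):
    """z-score under an ASSUMED independent-nucleotide (binomial) background.

    n_positions is the number of windows examined. Word self-overlap,
    which changes the true variance, is ignored in this sketch.
    """
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    expected = n_positions * p
    variance = n_positions * p * (1.0 - p)
    if variance == 0.0:
        return 0.0  # degenerate background; no meaningful score
    return (observed - expected) / sqrt(variance)


def upper_tail_p(z):
    """One-sided P-value from the standard normal approximation."""
    return 0.5 * erfc(z / sqrt(2.0))


if __name__ == "__main__":
    # Toy sequences standing in for promoter regions (made-up data).
    pprs = ["ACGTACGTACGTTTGACGTCAAACGT", "TTGACGTCATTGACGTCAGGCC"]
    counts = count_words(pprs)
    freqs = base_freqs(pprs)
    n = sum(counts.values())
    word, obs = counts.most_common(1)[0]
    z = z_score(word, obs, n, freqs)
    print(word, obs, z, upper_tail_p(z))
```

On real data, the normal approximation can be unreliable for rare words, which is presumably one reason the abstract emphasizes additional controls beyond the P-value.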

Cite

Mariño-Ramírez, L., Spouge, J. L., Kanga, G. C., & Landsman, D. (2004). Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Research, 32(3), 949–958. https://doi.org/10.1093/nar/gkh246
