Statistical analysis of over-represented words in human promoter sequences

Leonardo Mariño-Ramírez; John L. Spouge; Gavin C. Kanga; David Landsman

Journal Article, Open Access

Nucleic Acids Research (2004) 32(3) 949-958

DOI: 10.1093/nar/gkh246

96 Citations

50 Readers

Abstract

The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, this study extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many have no described function. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
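
The word-counting and z-score step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which background model the authors used, so the code below assumes the simplest one: bases drawn independently at their observed frequencies, with a binomial variance that ignores word self-overlap. All function names are hypothetical, and the two extra statistical controls mentioned in the abstract are not reproduced here.

```python
import math
from collections import Counter

def count_words(seqs, k=8):
    """Count every overlapping k-letter word across the sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip windows containing N or any other non-ACGT symbol.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts

def base_frequencies(seqs):
    """Observed single-base frequencies, used as the background model."""
    counts = Counter()
    for seq in seqs:
        counts.update(c for c in seq.upper() if c in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def z_score(word, observed, n_positions, freqs):
    """z-score of a word count under the ASSUMED independent-base model."""
    p = math.prod(freqs[b] for b in word)   # probability of the word at one position
    expected = n_positions * p
    variance = n_positions * p * (1 - p)    # binomial; ignores word self-overlap
    return (observed - expected) / math.sqrt(variance)

def upper_tail_p(z):
    """One-sided P-value from the standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Toy input standing in for promoter regions (not real data).
toy_seqs = ["ACGTACGTAC", "TTTTTTTTTT"]
toy_counts = count_words(toy_seqs)
toy_freqs = base_frequencies(toy_seqs)
toy_positions = sum(toy_counts.values())
for word, observed in toy_counts.most_common(3):
    z = z_score(word, observed, toy_positions, toy_freqs)
    print(word, observed, z, upper_tail_p(z))
```

With real promoter sequences in place of `toy_seqs`, words with large positive z-scores would be candidates for over-representation and words with large negative z-scores candidates for under-representation. This sketch covers only the normal-approximation P-value.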

Cite

APA

Mariño-Ramírez, L., Spouge, J. L., Kanga, G. C., & Landsman, D. (2004). Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Research, 32(3), 949–958. https://doi.org/10.1093/nar/gkh246
