A census of human transcription factors: function, expression and evolution.

by Juan M Vaquerizas, Sarah K Kummerfeld, Sarah A Teichmann, Nicholas M Luscombe
Nature Reviews Genetics (2009)

Abstract

Transcription factors are key cellular components that control gene expression: their activities determine how cells function and respond to the environment. Currently, there is great interest in research into human transcriptional regulation. However, surprisingly little is known about these regulators themselves. For example, how many transcription factors does the human genome contain? How are they expressed in different tissues? Are they evolutionarily conserved? Here, we present an analysis of 1,391 manually curated sequence-specific DNA-binding transcription factors, their functions, genomic organization and evolutionary conservation. Much remains to be explored, but this study provides a solid foundation for future investigations to elucidate regulatory mechanisms underlying diverse mammalian biological processes.

Available from discovery.ucl.ac.uk

Cellular life must recognize and respond appropriately to diverse internal and external stimuli. By ensuring the correct expression of specific genes, the transcriptional regulatory system plays a central part in controlling many biological processes, ranging from cell cycle progression1 and maintenance of intracellular metabolic and physiological balance, to cellular differentiation and developmental time courses2–4. Numerous diseases arise from a breakdown in the regulatory system: transcription factors (TFs) are overrepresented among oncogenes5, and a third of human developmental disorders have been attributed to dysfunctional TFs6. Furthermore, alterations in the activity and regulatory specificity of TFs are likely to be a major source for phenotypic diversity and evolutionary adaptation7–9. Indeed, increased sophistication of the transcriptional regulatory system seems to have been a principal requirement for the emergence of metazoan life10–13.
Much of our basic knowledge of transcriptional regulation derives from molecular biological and genetic investigations. Diverse arrays of proteins are crucial for successful transcription by RNA polymerase in eukaryotic cells. These proteins include general transcription factors, co-factors, histones and chromatin remodelling proteins. In addition, a host of sequence-specific DNA-binding TFs direct transcription initiation to specific promoters14.
The availability of complete genome sequences and the development of high-throughput experimental techniques in the past decade have provided, and continue to provide, complementary information describing the function and organization of these regulatory systems on an unprecedented scale. Computational studies have reported TF repertoires by searching for genes containing DNA-binding domains either across all completely sequenced genomes15, or for individual organisms and phylogenetic groups, including bacteria (such as Escherichia coli16 and Bacillus subtilis17), fungi18 (including Saccharomyces cerevisiae19), animals (including Caenorhabditis elegans20, Drosophila melanogaster21 and Mus musculus22) and plants23 (such as Arabidopsis thaliana24).
For humans, the initial analyses of the complete genome sequence estimated the presence of 200 to 300 component genes for the basic transcriptional machinery, and 2,000 to 3,000 sequence-specific DNA-binding TFs25,26. The automated annotation in the Gene Ontology (GO) database27 (available at Gene Ontology Home), which is based on mapping InterPro28 DNA-binding domains, currently predicts 1,052 TF genes; of these, only 62 have been experimentally verified for both DNA-binding and regulatory functions (Supplementary information S1 (PDF)). The DBD database predicts 1,508 human loci as TFs15. It automatically annotates sequence-specific DNA-binding TFs for all publicly available completely sequenced genomes based on a set of hidden Markov models of DNA-binding domain families from the Pfam and SUPERFAMILY databases. Several computational studies have examined individual mammalian TF families in detail, but only a few have attempted to identify the full complement of human TFs29,30.

Juan M. Vaquerizas*, Sarah K. Kummerfeld‡§, Sarah A. Teichmann‡ and Nicholas M. Luscombe*||
*EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.
‡MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK.
§Present address: Department of Bioinformatics, Genentech Inc., South San Francisco, California 94080, USA.
||EMBL-Heidelberg Gene Expression Unit, Meyerhofstrasse 1, Heidelberg D-69117, Germany.
Correspondence to J.M.V. or N.M.L.
e-mails: jvaquerizas@ebi.ac.uk; luscombe@ebi.ac.uk
doi:10.1038/nrg2538
Published online 10 March 2009
252 | April 2009 | Volume 10 | www.nature.com/reviews/genetics
© 2009 Macmillan Publishers Limited. All rights reserved

General transcription factor
One of a group of proteins that are essential for transcription from a eukaryotic promoter. They are involved in the formation of the pre-initiation complex and the recruitment of RNA polymerase.
Co-factor
A protein or small molecule that modulates the activity of an enzyme or of another protein complex.
Histone
A small highly conserved basic protein, found in the chromatin of all eukaryotic cells. Histones associate with DNA to form nucleosomes.
Chromatin remodelling protein
A protein that mediates transient changes in chromatin accessibility by modifying the methylation or acetylation status of histones or the methylation status of cytosine residues in DNA.
Gene Ontology (GO)
A widely used classification system of gene functions and other gene attributes that uses a controlled vocabulary.
InterPro
A database of conserved protein families, domains and motifs that can be used to annotate amino acid sequences. The presence of a protein domain is often indicative of a particular molecular function.
SELEX
A procedure to identify protein ligands. For DNA-binding proteins, the protein is mixed with a pool of double-stranded oligonucleotides that contain a random core of nucleotides flanked by specific sequences. The protein–DNA complex is recovered, the oligonucleotides amplified by PCR and sequenced to reveal the binding specificity of the protein.
Orthologues
Loci in two species that are derived from a common ancestral locus by a speciation event. This is different from paralogous members of a gene family that are derived from duplication events.

Some previous studies of TF repertoires, particularly those in large genomes, may contain misleading predictions for several reasons. Most of these studies depended on identifying genes that are homologous to previously characterized regulators; however, there are technical limitations to sequence search methods, and algorithms can sometimes output false positive hits. Moreover, even among true positives, some DNA-binding domains also exist in non-TF proteins, making these domains unreliable markers of sequence-specific DNA-binding functionality. As a result of these difficulties, we still lack a comprehensive characterization of the human TF repertoire.
Here, we overcome some of these difficulties by focusing on a precise definition of sequence-specific DNA-binding regulators, which are among the best-defined protein domains. We also minimize prediction errors by manually examining each locus that encodes a potential DNA-binding function. In doing so, we present a comprehensive and high-quality census of TFs in the human genome. As most of these TFs have not been experimentally characterized for regulatory function, we evaluate their tissue-specific expression, genomic distribution and evolutionary conservation. Together, these results provide a solid foundation for further systematic characterization of human TFs in their biological context, through traditional molecular approaches and also using genomics techniques, such as chromatin immunoprecipitation, protein-binding microarrays and high-throughput SELEX.
Identifying the TF repertoire
To identify the repertoire of TFs in the human genome, we define a class of proteins that bind DNA in a sequence-specific manner, but are not enzymatic and do not form part of the core initiation complex. First, we assembled a list of DNA-binding domains and families from the InterPro database (release 17). For each entry, we examined the description and associated literature to assess its sequence-specific DNA-binding capabilities, which resulted in an accurate list of 347 domains and families (Supplementary information S1 (PDF), S2 (.txt file)). We then extracted 4,610 proteins from the International Protein Index (IPI) database31 that show a significant match with these selected DNA-binding domains. This group of proteins mapped to 1,960 human genomic loci in the Ensembl Genome Browser database (release 51)32.
Next, we manually inspected each locus and grouped the loci according to our confidence in their TF functionality (Supplementary information S1 (PDF); the full data set is found in Supplementary information S3 (.txt file)): at the highest level, probable TFs have experimental evidence for regulatory function in any mammalian organism or have an equivalent protein domain arrangement; possible TFs contain non-promiscuous InterPro DNA-binding domains that are never found in non-TFs, but for which we do not have further functional evidence; and unlikely TFs comprise predicted genes, genes containing promiscuous InterPro DNA-binding domains or genes with an established molecular function other than transcription (such as nucleoporins, threonine phosphatases or splicing factors). Finally, we also included 27 curated probable TFs from other sources, such as GO or TRANSFAC33; these TFs contain undefined DNA-binding domains, and were therefore missed using the above procedure.
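The three confidence groups described above can be pictured as a small decision rule. The sketch below is only an illustration of that wording and not the authors' curation procedure, which was manual. The `Locus` record, its field names and the order in which the criteria are checked are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    """Hypothetical curation record; every field name is illustrative."""
    regulatory_evidence_in_mammal: bool = False   # experimental evidence of regulation
    equivalent_domain_arrangement: bool = False   # same domain layout as a known TF
    nonpromiscuous_dbd_only: bool = False         # DNA-binding domains never seen in non-TFs
    predicted_gene: bool = False                  # gene model is only a prediction
    promiscuous_dbd: bool = False                 # domain also occurs in non-TF proteins
    other_established_function: bool = False      # e.g. nucleoporin, splicing factor


def classify(locus: Locus) -> str:
    """Return 'probable', 'possible' or 'unlikely' for one locus."""
    # Assumption: positive functional evidence takes precedence over everything else.
    if locus.regulatory_evidence_in_mammal or locus.equivalent_domain_arrangement:
        return "probable"
    # Assumption: any exclusion criterion outranks a bare domain match.
    if (locus.predicted_gene or locus.promiscuous_dbd
            or locus.other_established_function):
        return "unlikely"
    if locus.nonpromiscuous_dbd_only:
        return "possible"
    # No supporting information at all: treated as unlikely in this sketch.
    return "unlikely"


if __name__ == "__main__":
    print(classify(Locus(regulatory_evidence_in_mammal=True)))
    print(classify(Locus(nonpromiscuous_dbd_only=True)))
    print(classify(Locus(promiscuous_dbd=True)))
```

The text does not say how conflicting criteria were resolved, so the precedence used here is a guess; a real curation workflow would also record the supporting literature for each decision.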
This resulted in a high-confidence data set of 1,391 genomic loci (~6% of the total number of protein-coding genes) that encode TFs, which we will focus on for the remainder of this Analysis article, and a further 216 loci representing possible TFs (see Supplementary information S3 (.txt file) for the data set). Estimates of the coverage of our approach range from 85% to 94% (Supplementary information S1 (PDF)), suggesting an upper bound of ~1,700–1,900 TF-coding genes in the human genome.
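The upper bound quoted above can be checked with simple arithmetic. The text does not spell out the calculation (the details are in Supplementary information S1), so the sketch below rests on one assumption: that the probable and possible loci are added together and divided by the coverage. Under that reading the result is roughly 1,710 to 1,890, which matches the quoted range, whereas the 1,391 probable loci alone would give only about 1,480 to 1,640. The function and variable names are illustrative.

```python
def implied_total(identified: int, coverage: float) -> float:
    """Total implied if `identified` items are a fraction `coverage` of the whole."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return identified / coverage


# Figures taken from the text.
PROBABLE_TFS = 1391
POSSIBLE_TFS = 216
COVERAGE_RANGE = (0.85, 0.94)

# Assumption: both confidence groups count towards the upper bound.
identified = PROBABLE_TFS + POSSIBLE_TFS

# Higher coverage implies fewer missed loci, hence the lower estimate.
low_estimate = implied_total(identified, max(COVERAGE_RANGE))
high_estimate = implied_total(identified, min(COVERAGE_RANGE))

if __name__ == "__main__":
    print(f"{identified} loci -> {low_estimate:.0f} to {high_estimate:.0f} TF-coding genes")
```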
Despite the care that we have taken in compiling this data set, there are a few possible sources of inaccuracies. Our method depends heavily on the content of the InterPro database, and the ability of the search algorithms to detect these domains in protein sequences. The repertoire should be updated when new InterPro entries for newly discovered DNA-binding domains, or refinements of existing ones, and more sensitive search methods become available. In addition, the annotation of the human genome is still in a state of flux, especially in the annotation of genes, so part of the repertoire will be affected by new releases of the genome. Finally, our manual curation depends on the existing literature about each gene, and our own annotations will need to be updated as new findings are reported. The repertoire will be improved as these underlying data sources are updated. Overall, we expect these limitations to be small compared with the improvements that our data provide over previous resources.
Limited knowledge of TF functions
We gauged the extent of our current knowledge about the regulatory function of these TFs by assessing PubMed abstracts and annotations in the GO database. The literature analysis (FIG. 1a), based on the number of times a TF is cited in an abstract, shows an uneven distribution of information that is biased towards those TFs involved in diseases. Three TFs, including the tumour suppressor p53, have accumulated more citations than all other TFs.
Further analysis using the GO database (FIG. 1b) showed that most human TFs are unannotated, indicating that they remain uncharacterized. In fact, when we inspect the source of these annotations, it is evident that most observations are inferred from studies in other organisms and may not apply directly to human orthologues. Of the assigned regulatory functions, control of developmental processes (such as tissue and organ development), cellular processes (for example,