An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing.
- DOI: 10.1073/pnas.0409882102
- PubMed: 15778292
Abstract
With the recent completion of a high-quality sequence of the human genome, the challenge is now to understand the functional elements that it encodes. Comparative genomic analysis offers a powerful approach for finding such elements by identifying sequences that have been highly conserved during evolution. Here, we propose an initial strategy for detecting such regions by generating low-redundancy sequence from a collection of 16 eutherian mammals, beyond the 7 for which genome sequence data are already available. We show that such sequence can be accurately aligned to the human genome and used to identify most of the highly conserved regions. Although not a long-term substitute for generating high-quality genomic sequences from many mammalian species, this strategy represents a practical initial approach for rapidly annotating the most evolutionarily conserved sequences in the human genome, providing a key resource for the systematic study of human genome function.
Elliott H. Margulies*†, Jade P. Vinson†‡, NISC Comparative Sequencing Program*§¶, Webb Miller, David B. Jaffe‡, Kerstin Lindblad-Toh‡, Jean L. Chang‡, Eric D. Green*§, Eric S. Lander‡, James C. Mullikin*§**, and Michele Clamp‡**
*Genome Technology Branch and §NISC, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892; ‡Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02141; and Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802
Contributed by Eric S. Lander, December 30, 2004
comparative genomics | genome sequencing | genome analysis | phylogenetics | mammalian evolution
Comprehensive identification of functional elements in the
human genome represents a central and ambitious goal in
genomics (1). We currently have only rudimentary knowledge
about such elements (apart from protein-coding sequences), and
it is thus impossible to identify them directly from the human
genome sequence. A powerful and unbiased approach for detecting
candidates for such functionally important sequences is
to compare orthologous regions from multiple related species to
identify those regions that are evolving slowly and are thus likely
to be under purifying selection. The crucible of evolution is a
very sensitive assay for function: Selection will robustly reject
mutations that decrease the fitness of a mammal to 99.9% of
normal (2), whereas such a decrease is undetectable in typical
laboratory tests.
The first opportunity to compare entire mammalian genomes
came with the sequencing of the mouse (3) and subsequently the
rat (4) genomes. Strikingly, sequence comparisons between the
human genome and either rodent genome revealed that about 5% of
each of these genomes appears to be under purifying selection.
Specifically, this analysis involved comparing (i) the distribution
of calculated conservation scores for bases (assessed in small
windows) across the entire genome with (ii) the distribution of
the conservation scores for bases within transposable element
fossils predating the divergence of humans and rodents (called
ancestral repeats, which are thought to be nonfunctional and
thus evolving at the background rate of neutral evolution). The
former distribution showed a clear excess of bases with
higher-than-average conservation scores, corresponding to about 5% of
the genome. These results were surprising because it had been
tacitly assumed that the predominant functional sequences in the
mammalian genome were those directly encoding proteins, but
these account for 2% of the genome. The full nature of the
remaining 3% of the genome remains a mystery; presumably,
they include gene-regulatory elements, RNA genes,
chromosomal structural elements, and other as-yet-unknown functional
elements.
Although the human–rodent sequence comparisons allow an
overall estimate of the amount of the human genome that is
functionally important (more precisely, under purifying
selection), such analyses are inadequate for accurately identifying
most of the functional elements. Some regions can be clearly
identified as under strong constraint, such as ‘‘ultra-conserved
sequences’’ with 100% identity over hundreds of bases across
several species (5). However, most regions have intermediate
conservation scores; some of these are functional elements and
some represent the right tail of the distribution of neutrally
evolving sequences. These alternatives can be distinguished by
analyzing sequences from additional species (6), so that
functional sequences stand out against the background of neutral
evolution. In particular, the signal-to-noise ratio increases as one
expands the comparison to an evolutionary tree with more
species and longer total branch length.
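The gain in signal-to-noise from a longer tree can be illustrated with a toy calculation. The sketch below is not the conservation scoring used in this work. It assumes a simple Poisson substitution model with independent sites, under which a neutral base stays unchanged across a tree of total branch length D (in substitutions per site) with probability e^(-D). The function name and the example values of D are illustrative assumptions.

```python
import math


def prob_unchanged_by_chance(total_branch_length: float, window_length: int = 1) -> float:
    """Chance that a neutral window shows no substitution anywhere on the tree.

    Assumes a Poisson substitution process, independent sites, and branch
    lengths measured in expected substitutions per neutral site.
    """
    return math.exp(-total_branch_length * window_length)


if __name__ == "__main__":
    # Illustrative total branch lengths; larger trees leave fewer neutral
    # windows looking perfectly conserved purely by chance.
    for d_total in (0.45, 1.0, 2.0, 4.0):
        p = prob_unchanged_by_chance(d_total, window_length=8)
        print(f"D = {d_total:<4}  P(8 neutral bases all unchanged) = {p:.3g}")
```

Under these assumptions the chance of a perfectly conserved neutral window falls exponentially with total branch length, which is one way to see why adding species sharpens the contrast between constrained and neutral sequence.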
How many mammals must be sampled for identifying
functional elements in the mammalian genome? The answer depends
on the precise goal(s) being pursued. Several studies have
investigated this issue (7–10).
For example, Kellis et al. (10) described the ability to perform
systematic identification of gene-regulatory elements in yeast,
consisting of weakly conserved six-base sequences that occur
multiple times in the genome. They extrapolated that similar
results could be obtained for the human genome with sequence
data from species constituting an evolutionary tree that provides
a total branch length D ≈ 4.
More generally, Eddy (11) considered the identification of
individual sequence elements. He reported formulas for
calculating the number N of mammalian species related by an
evolutionary tree with equal branches of length d that would be
needed to detect a given type of element, as a function of the
element’s length L, the conservation rate among the species,
and the desired false-positive and false-negative rates.
Considering highly conserved sequences (with each base evolving at a
rate 20% of the neutral rate), his results show that elements
with L ≈ 50 bases can be detected with only human and mouse
sequences (total branch length D ≈ 0.45). Detection of elements
Abbreviations: NISC, National Institutes of Health Intramural Sequencing Center; BAC,
bacterial artificial chromosome; MCS, multispecies conserved sequence.
†E.H.M. and J.P.V. contributed equally to this work.
¶National Institutes of Health Intramural Sequencing Center (NISC) Comparative Sequencing
Program: Leadership provided by Robert W. Blakesley, Gerard G. Bouffard, Nancy F.
Hansen, Baishali Maskeri, Pamela J. Thomas, and Jennifer C. McDowell.
**To whom correspondence may be addressed. E-mail: mullikin@mail.nih.gov or
mclamp@broad.mit.edu.
© 2005 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0409882102 | PNAS | March 29, 2005 | vol. 102 | no. 13 | 4795–4800 | GENETICS
about eight bases in length would require
a total branch length D ≈ 4 (for example, 40 species each with
d = 0.1 from a common root). Detection of single bases under
purifying selection (L = 1) could be achieved with a total branch
length D ≈ 32 (for example, 320 species at d = 0.1 from a
common root).
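The star-phylogeny examples in this paragraph follow from simple arithmetic: with N species each at distance d from a common root, the total branch length is D = N × d. The following is a minimal sketch of that relationship only. It is not Eddy's detection formulas, which also depend on element length, conservation rate, and error rates. The function names are illustrative.

```python
import math


def star_tree_branch_length(n_species: int, d: float) -> float:
    """Total branch length D of a star tree: N branches, each of length d."""
    return n_species * d


def species_needed(target_total: float, d: float) -> int:
    """Smallest N for which a star tree with branches of length d reaches D."""
    # The small tolerance guards against floating-point noise in the division.
    return math.ceil(target_total / d - 1e-9)


if __name__ == "__main__":
    for target in (4, 32):
        n = species_needed(target, 0.1)
        print(f"D = {target}: {n} species at d = 0.1")
```

Real phylogenies are not star-shaped, so shared internal branches mean that more species are generally needed to reach the same total branch length.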
Based on such analyses, a reasonable starting point would be
to obtain sequence from a set of mammals that provides a total
branch length D ≈ 4, with the aim of identifying functional
elements eight bases or more in length. Fig. 1 shows one possible
choice of species, displayed in an evolutionary tree that indicates
the phylogenetic relationships and branch lengths. The species
are divided into three sets: seven mammals for which
high-redundancy genomic sequence has already been generated (total
branch length D ≈ 0.95); eight additional mammals (set 1) that
increase the branch length to D ≈ 2.4; and eight further
mammals (set 2) that increase the total to D ≈ 3.8. Ultimately,
it would be desirable to have high-quality near-complete
genomic sequence for all 16 of these additional mammals.
However, this would require at least 8-fold sequence redundancy
of each genome, or nearly 400 gigabases (Gb) of raw sequence.
Given current capacities and costs, such an effort would require
a large investment of resources and a considerable period of
time.
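The figure of nearly 400 Gb can be checked with back-of-the-envelope arithmetic. The sketch below assumes a typical mammalian genome size of about 3 Gb, a value that the text does not state and that varies by species. The function name is illustrative.

```python
def raw_sequence_gb(n_genomes: int, redundancy: float, genome_size_gb: float = 3.0) -> float:
    """Raw sequence (in Gb) needed to cover n_genomes at a given fold redundancy.

    genome_size_gb defaults to 3.0, an assumed typical mammalian genome size.
    """
    return n_genomes * redundancy * genome_size_gb


if __name__ == "__main__":
    total = raw_sequence_gb(n_genomes=16, redundancy=8)
    print(f"16 genomes at 8-fold redundancy: about {total:.0f} Gb of raw sequence")
```

With these assumed inputs the total is 384 Gb, in line with the estimate above.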
We thus sought to explore an initial approach to obtain a
substantial portion of the information at lower cost and in less
time. Specifically, we investigated the utility of generating
lower-redundancy sequence of each genome. Simple mathematical
modeling (12) predicts that roughly 2-fold average redundancy
should cover about 86% ($1 - e^{-2}$) of bases in each mammalian
genome, thereby providing considerable (albeit incomplete)
data. Increasing the amount of sequencing to about 8-fold
average redundancy would increase the proportion of each
genome covered to more than 99% ($1 - e^{-8}$) as well as enhance the
continuity and accuracy of the resulting assembled sequences,
but the associated costs would be roughly 4-fold greater for a
modest gain in coverage.
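The coverage figures in this paragraph come from the expression 1 − e^(−c), where c is the average redundancy. A minimal sketch of that calculation follows. It assumes an idealized model in which reads land uniformly at random, so that the number of reads covering a base is Poisson-distributed with mean c. It ignores cloning bias, repeats, and read-end effects. The function name is illustrative.

```python
import math


def expected_coverage(redundancy: float) -> float:
    """Expected fraction of bases covered at least once at a given redundancy.

    Under the idealized Poisson model, a base is missed with probability
    exp(-redundancy), so it is covered with probability 1 - exp(-redundancy).
    """
    return 1.0 - math.exp(-redundancy)


if __name__ == "__main__":
    for c in (1, 2, 4, 8):
        print(f"{c}-fold redundancy: {expected_coverage(c):.2%} of bases covered")
```

Under this model, going from 2-fold to 8-fold redundancy quadruples the sequencing effort while raising expected coverage from roughly 86% to just under 100%, which is the trade-off described above.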
Before embarking on a low-redundancy sequencing strategy,
it is essential to demonstrate the practical utility of such data for
comparing mammalian genomes. We thus sought to investigate
two key questions. First, can low-redundancy sequence be
accurately and completely positioned relative to the orthologous
sequence in the human genome, thereby allowing fine-scale
alignment and meaningful comparative analyses? Second, are
the resulting alignments sufficient for identifying the most highly
conserved sequences in the human genome? To investigate such
issues, we directly compare the performance of low-redundancy
sequence versus high-quality finished sequence.
Fig. 1. Phylogenetic tree of eutherian mammals proposed for genome sequencing. Various sets of eutherian mammals are shown: set 0 consisting of seven
species for which high-redundancy genomic sequence is already available (black), set 1 consisting of an additional eight species proposed for sequencing (red),
and set 2 consisting of a further eight species proposed for sequencing (blue). The tree also shows a marsupial (purple), for which genomic sequence is available,
and a monotreme (gray). The Inset table lists each species, the branch length (divergence) relative to human (in substitutions per site), the increase in total branch
length (D) provided by adding each species to those above, and the total branch length provided by that species combined with those above. Details about the
phylogenetic tree and the associated branch lengths are provided in the supporting information, which is published on the PNAS web site and at
www.nisc.nih.gov/data.


