Initial sequencing and analysis of the human genome.
Nature (2001)
- PubMed: 11237011
Abstract
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
International Human Genome Sequencing Consortium*
* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.
The rediscovery of Mendel's laws of heredity in the opening weeks of
the 20th century (refs 1–3) sparked a scientific quest to understand the
nature and content of genetic information that has propelled
biology for the last hundred years. The scientific progress made
falls naturally into four main phases, corresponding roughly to the
four quarters of the century. The first established the cellular basis of
heredity: the chromosomes. The second defined the molecular basis
of heredity: the DNA double helix. The third unlocked the informational
basis of heredity, with the discovery of the biological mechanism
by which cells read the information contained in genes and with
the invention of the recombinant DNA technologies of cloning and
sequencing by which scientists can do the same.
The last quarter of a century has been marked by a relentless drive
to decipher first genes and then entire genomes, spawning the field
of genomics. The fruits of this work already include the genome
sequences of 599 viruses and viroids, 205 naturally occurring
plasmids, 185 organelles, 31 eubacteria, seven archaea, one
fungus, two animals and one plant.
Here we report the results of a collaboration involving 20 groups
from the United States, the United Kingdom, Japan, France,
Germany and China to produce a draft sequence of the human
genome. The draft genome sequence was generated from a physical
map covering more than 96% of the euchromatic part of the human
genome and, together with additional sequence in public databases,
it covers about 94% of the human genome. The sequence was
produced over a relatively short period, with coverage rising from
about 10% to more than 90% over roughly fifteen months. The
sequence data have been made available without restriction and
updated daily throughout the project. The task ahead is to produce a
finished sequence, by closing all gaps and resolving all ambiguities.
Already about one billion bases are in final form and the task of
bringing the vast majority of the sequence to this standard is now
straightforward and should proceed rapidly.
The sequence of the human genome is of interest in several
respects. It is the largest genome to be extensively sequenced so far,
being 25 times as large as any previously sequenced genome and
eight times as large as the sum of all such genomes. It is the first
vertebrate genome to be extensively sequenced. And, uniquely, it is
the genome of our own species.
Much work remains to be done to produce a complete finished
sequence, but the vast trove of information that has become
available through this collaborative effort allows a global perspective
on the human genome. Although the details will change as the
sequence is finished, many points are already clear.
- The genomic landscape shows marked variation in the distribution
  of a number of features, including genes, transposable
  elements, GC content, CpG islands and recombination rate. This
  gives us important clues about function. For example, the
  developmentally important HOX gene clusters are the most repeat-poor
  regions of the human genome, probably reflecting the very complex
  coordinate regulation of the genes in the clusters.
- There appear to be about 30,000–40,000 protein-coding genes in
  the human genome, only about twice as many as in worm or fly.
  However, the genes are more complex, with more alternative
  splicing generating a larger number of protein products.
- The full set of proteins (the 'proteome') encoded by the human
  genome is more complex than those of invertebrates. This is due in
  part to the presence of vertebrate-specific protein domains and
  motifs (an estimated 7% of the total), but more to the fact that
  vertebrates appear to have arranged pre-existing components into a
  richer collection of domain architectures.
- Hundreds of human genes appear likely to have resulted from
  horizontal transfer from bacteria at some point in the vertebrate
  lineage. Dozens of genes appear to have been derived from
  transposable elements.
- Although about half of the human genome derives from
  transposable elements, there has been a marked decline in the overall
  activity of such elements in the hominid lineage. DNA transposons
  appear to have become completely inactive and long-terminal
  repeat (LTR) retroposons may also have done so.
- The pericentromeric and subtelomeric regions of chromosomes
  are filled with large recent segmental duplications of sequence from
  elsewhere in the genome. Segmental duplication is much more
  frequent in humans than in yeast, fly or worm.
- Analysis of the organization of Alu elements explains the
  longstanding mystery of their surprising genomic distribution, and
  suggests that there may be strong selection in favour of preferential
  retention of Alu elements in GC-rich regions and that these 'selfish'
  elements may benefit their human hosts.
- The mutation rate is about twice as high in male as in female
  meiosis, showing that most mutation occurs in males.
- Cytogenetic analysis of the sequenced clones confirms suggestions
  that large GC-poor regions are strongly correlated with 'dark
  G-bands' in karyotypes.
- Recombination rates tend to be much higher in distal regions
  (around 20 megabases (Mb)) of chromosomes and on shorter
  chromosome arms in general, in a pattern that promotes the
  occurrence of at least one crossover per chromosome arm in each
  meiosis.
- More than 1.4 million single nucleotide polymorphisms (SNPs)
  in the human genome have been identified. This collection should
  allow the initiation of genome-wide linkage disequilibrium
  mapping of the genes in the human population.
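Several of the points above rest on simple sequence statistics. As a rough illustration only, the Python sketch below computes GC content, a CpG observed/expected ratio, and the share of mutations arising in males implied by a given male-to-female mutation-rate ratio. The function names are hypothetical, and the CpG formula is a commonly used definition assumed here for illustration, not necessarily the one applied in this paper.

```python
def gc_content(seq: str) -> float:
    """Fraction of A/C/G/T bases that are G or C (case-insensitive)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Assumed definition for illustration; returns 0.0 if C or G is absent.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def male_mutation_share(male_to_female_ratio: float) -> float:
    """Share of new mutations arising in males for a given rate ratio."""
    return male_to_female_ratio / (1.0 + male_to_female_ratio)


if __name__ == "__main__":
    toy = "ACGTCGCGAATT"  # made-up sequence, not real genomic data
    print(gc_content(toy), cpg_obs_exp(toy), male_mutation_share(2.0))
```

With a male-to-female ratio of about two, as stated above, the last function gives roughly two thirds of mutations arising in males, which is the arithmetic behind "most mutation occurs in males".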
In this paper, we start by presenting background information on
the project and describing the generation, assembly and evaluation
of the draft genome sequence. We then focus on an initial analysis of
the sequence itself: the broad chromosomal landscape; the repeat
elements and the rich palaeontological record of evolutionary and
biological processes that they provide; the human genes and
proteins and their differences and similarities with those of other
860 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com© 2001 Macmillan Magazines Ltd
Genome Sequencing Centres (Listed in order of total genomic
sequence contributed, with a partial list of personnel. A full list of
contributors at each centre is available as Supplementary
Information.)
Whitehead Institute for Biomedical Research, Center for Genome
Research: Eric S. Lander¹*, Lauren M. Linton¹, Bruce Birren¹*,
Chad Nusbaum¹*, Michael C. Zody¹*, Jennifer Baldwin¹, Keri Devon¹,
Ken Dewar¹, Michael Doyle¹, William FitzHugh¹*, Roel Funke¹,
Diane Gage¹, Katrina Harris¹, Andrew Heaford¹, John Howland¹,
Lisa Kann¹, Jessica Lehoczky¹, Rosie LeVine¹, Paul McEwan¹,
Kevin McKernan¹, James Meldrim¹, Jill P. Mesirov¹*, Cher Miranda¹,
William Morris¹, Jerome Naylor¹, Christina Raymond¹, Mark Rosetti¹,
Ralph Santos¹, Andrew Sheridan¹, Carrie Sougnez¹,
Nicole Stange-Thomann¹, Nikola Stojanovic¹, Aravind Subramanian¹
& Dudley Wyman¹
The Sanger Centre: Jane Rogers², John Sulston²*, Rachael Ainscough²,
Stephan Beck², David Bentley², John Burton², Christopher Clee²,
Nigel Carter², Alan Coulson², Rebecca Deadman², Panos Deloukas²,
Andrew Dunham², Ian Dunham², Richard Durbin²*, Lisa French²,
Darren Grafham², Simon Gregory², Tim Hubbard²*, Sean Humphray²,
Adrienne Hunt², Matthew Jones², Christine Lloyd², Amanda McMurray²,
Lucy Matthews², Simon Mercer², Sarah Milne², James C. Mullikin²*,
Andrew Mungall², Robert Plumb², Mark Ross², Ratna Shownkeen²
& Sarah Sims²
Washington University Genome Sequencing Center:
Robert H. Waterston³*, Richard K. Wilson³, LaDeana W. Hillier³*,
John D. McPherson³, Marco A. Marra³, Elaine R. Mardis³,
Lucinda A. Fulton³, Asif T. Chinwalla³*, Kymberlie H. Pepin³,
Warren R. Gish³, Stephanie L. Chissoe³, Michael C. Wendl³,
Kim D. Delehaunty³, Tracie L. Miner³, Andrew Delehaunty³,
Jason B. Kramer³, Lisa L. Cook³, Robert S. Fulton³,
Douglas L. Johnson³, Patrick J. Minx³ & Sandra W. Clifton³
US DOE Joint Genome Institute: Trevor Hawkins⁴, Elbert Branscomb⁴,
Paul Predki⁴, Paul Richardson⁴, Sarah Wenning⁴, Tom Slezak⁴,
Norman Doggett⁴, Jan-Fang Cheng⁴, Anne Olsen⁴, Susan Lucas⁴,
Christopher Elkin⁴, Edward Uberbacher⁴ & Marvin Frazier⁴
Baylor College of Medicine Human Genome Sequencing Center:
Richard A. Gibbs⁵*, Donna M. Muzny⁵, Steven E. Scherer⁵,
John B. Bouck⁵*, Erica J. Sodergren⁵, Kim C. Worley⁵*,
Catherine M. Rives⁵, James H. Gorrell⁵, Michael L. Metzker⁵,
Susan L. Naylor⁶, Raju S. Kucherlapati⁷, David L. Nelson
& George M. Weinstock⁸
RIKEN Genomic Sciences Center: Yoshiyuki Sakaki⁹, Asao Fujiyama⁹,
Masahira Hattori⁹, Tetsushi Yada⁹, Atsushi Toyoda⁹, Takehiko Itoh⁹,
Chiharu Kawagoe⁹, Hidemi Watanabe⁹, Yasushi Totoki⁹ & Todd Taylor⁹
Genoscope and CNRS UMR-8030: Jean Weissenbach¹⁰, Roland Heilig¹⁰,
William Saurin¹⁰, Francois Artiguenave¹⁰, Philippe Brottier¹⁰,
Thomas Bruls¹⁰, Eric Pelletier¹⁰, Catherine Robert¹⁰
& Patrick Wincker¹⁰
GTC Sequencing Center: Douglas R. Smith¹¹, Lynn Doucette-Stamm¹¹,
Marc Rubenfield¹¹, Keith Weinstock¹¹, Hong Mei Lee¹¹ & JoAnn Dubois¹¹
Department of Genome Analysis, Institute of Molecular
Biotechnology: André Rosenthal¹², Matthias Platzer¹²,
Gerald Nyakatura¹², Stefan Taudien¹² & Andreas Rump¹²
Beijing Genomics Institute/Human Genome Center:
Huanming Yang¹³, Jun Yu¹³, Jian Wang¹³, Guyang Huang¹⁴ & Jun Gu¹⁵
Multimegabase Sequencing Center, The Institute for Systems
Biology: Leroy Hood¹⁶, Lee Rowen¹⁶, Anup Madan¹⁶ & Shizen Qin¹⁶
Stanford Genome Technology Center: Ronald W. Davis¹⁷,
Nancy A. Federspiel¹⁷, A. Pia Abola¹⁷ & Michael J. Proctor¹⁷
Stanford Human Genome Center: Richard M. Myers¹⁸, Jeremy Schmutz¹⁸,
Mark Dickson¹⁸, Jane Grimwood¹⁸ & David R. Cox¹⁸
University of Washington Genome Center: Maynard V. Olson¹⁹,
Rajinder Kaul¹⁹ & Christopher Raymond¹⁹
Department of Molecular Biology, Keio University School of
Medicine: Nobuyoshi Shimizu²⁰, Kazuhiko Kawasaki²⁰
& Shinsei Minoshima²⁰
University of Texas Southwestern Medical Center at Dallas:
Glen A. Evans²¹†, Maria Athanasiou²¹ & Roger Schultz²¹
University of Oklahoma's Advanced Center for Genome
Technology: Bruce A. Roe²², Feng Chen²² & Huaqin Pan²²
Max Planck Institute for Molecular Genetics: Juliane Ramser²³,
Hans Lehrach²³ & Richard Reinhardt²³
Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome
Center: W. Richard McCombie²⁴, Melissa de la Bastide²⁴
& Neilay Dedhia²⁴
GBF – German Research Centre for Biotechnology:
Helmut Blöcker²⁵, Klaus Hornischer²⁵ & Gabriele Nordsiek²⁵
* Genome Analysis Group (listed in alphabetical order, also
includes individuals listed under other headings):
Richa Agarwala²⁶, L. Aravind²⁶, Jeffrey A. Bailey²⁷, Alex Bateman²,
Serafim Batzoglou¹, Ewan Birney²⁸, Peer Bork²⁹,³⁰, Daniel G. Brown¹,
Christopher B. Burge³¹, Lorenzo Cerutti²⁸, Hsiu-Chuan Chen²⁶,
Deanna Church²⁶, Michele Clamp², Richard R. Copley³⁰,
Tobias Doerks²⁹,³⁰, Sean R. Eddy³², Evan E. Eichler²⁷,
Terrence S. Furey³³, James Galagan¹, James G. R. Gilbert²,
Cyrus Harmon³⁴, Yoshihide Hayashizaki³⁵, David Haussler³⁶,
Henning Hermjakob²⁸, Karsten Hokamp³⁷, Wonhee Jang²⁶,
L. Steven Johnson³², Thomas A. Jones³², Simon Kasif³⁸,
Arek Kaspryzk²⁸, Scot Kennedy³⁹, W. James Kent⁴⁰, Paul Kitts²⁶,
Eugene V. Koonin²⁶, Ian Korf³, David Kulp³⁴, Doron Lancet⁴¹,
Todd M. Lowe⁴², Aoife McLysaght³⁷, Tarjei Mikkelsen³⁸,
John V. Moran⁴³, Nicola Mulder²⁸, Victor J. Pollara¹,
Chris P. Ponting⁴⁴, Greg Schuler²⁶, Jörg Schultz³⁰, Guy Slater²⁸,
Arian F. A. Smit⁴⁵, Elia Stupka²⁸, Joseph Szustakowki³⁸,
Danielle Thierry-Mieg²⁶, Jean Thierry-Mieg²⁶, Lukas Wagner²⁶,
John Wallis³, Raymond Wheeler³⁴, Alan Williams³⁴, Yuri I. Wolf²⁶,
Kenneth H. Wolfe³⁷, Shiaw-Pyng Yang³ & Ru-Fang Yeh³¹
Scientific management: National Human Genome Research
Institute, US National Institutes of Health: Francis Collins⁴⁶*,
Mark S. Guyer⁴⁶, Jane Peterson⁴⁶, Adam Felsenfeld⁴⁶*
& Kris A. Wetterstrand⁴⁶; Office of Science, US Department of
Energy: Aristides Patrinos⁴⁷; The Wellcome Trust:
Michael J. Morgan⁴⁸
organisms; and the history of genomic segments. (Comparisons
are drawn throughout with the genomes of the budding yeast
Saccharomyces cerevisiae, the nematode worm Caenorhabditis
elegans, the fruitfly Drosophila melanogaster and the mustard weed
Arabidopsis thaliana; we refer to these for convenience simply as
yeast, worm, fly and mustard weed.) Finally, we discuss applications
of the sequence to biology and medicine and describe next steps in
the project. A full description of the methods is provided as
Supplementary Information on Nature's web site
(http://www.nature.com).
We recognize that it is impossible to provide a comprehensive
analysis of this vast dataset, and thus our goal is to illustrate the
range of insights that can be gleaned from the human genome and
thereby to sketch a research agenda for the future.
Background to the Human Genome Project
The Human Genome Project arose from two key insights that
emerged in the early 1980s: that the ability to take global views of
genomes could greatly accelerate biomedical research, by allowing
researchers to attack problems in a comprehensive and unbiased
fashion; and that the creation of such global views would require a
communal effort in infrastructure building, unlike anything
previously attempted in biomedical research. Several key projects
helped to crystallize these insights, including:
(1) The sequencing of the bacterial viruses ΦX174 (refs 4, 5) and
lambda (ref. 6), the animal virus SV40 (ref. 7) and the human
mitochondrion (ref. 8) between 1977 and 1982. These projects proved
the feasibility of assembling small sequence fragments into complete
genomes, and showed the value of complete catalogues of genes and
other functional elements.
(2) The programme to create a human genetic map to make it
possible to locate disease genes of unknown function based solely on
their inheritance patterns, launched by Botstein and colleagues in
1980 (ref. 9).
(3) The programmes to create physical maps of clones covering the
yeast (ref. 10) and worm (ref. 11) genomes to allow isolation of genes
and regions based solely on their chromosomal position, launched by
Olson and Sulston in the mid-1980s.
(4) The development of random shotgun sequencing of complementary
DNA fragments for high-throughput gene discovery by
Schimmel (ref. 12) and Schimmel and Sutcliffe (ref. 13), later dubbed
expressed sequence tags (ESTs) and pursued with automated
sequencing by Venter and others (refs 14–20).
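Item (1) above turns on the idea of assembling small sequence fragments into a longer sequence. A minimal Python sketch of that idea is given below: a toy greedy merge that repeatedly joins the two fragments with the longest suffix-to-prefix overlap. The function names, the `min_len` threshold and the example reads are all hypothetical, and the approach ignores sequencing errors, repeats and reverse-complement reads, so it should not be read as the method used by any of the projects described here.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge reads greedily by longest overlap until none remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Made-up overlapping fragments of a made-up sequence.
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

The greedy choice keeps the sketch short; it can pick the wrong merge when fragments contain repeated sequence, which is one reason that genome-scale assembly relies on additional information such as maps of clones.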
The idea of sequencing the entire human genome was first
proposed in discussions at scientific meetings organized by the
US Department of Energy and others from 1984 to 1986 (refs 21,
22). A committee appointed by the US National Research Council
endorsed the concept in its 1988 report (ref. 23), but recommended a
broader programme, to include: the creation of genetic, physical
and sequence maps of the human genome; parallel efforts in key
model organisms such as bacteria, yeast, worms, flies and mice; the
development of technology in support of these objectives; and
research into the ethical, legal and social issues raised by human
genome research. The programme was launched in the US as a joint
effort of the Department of Energy and the National Institutes of
Health. In other countries, the UK Medical Research Council and
the Wellcome Trust supported genomic research in Britain; the
Centre d'Etude du Polymorphisme Humain and the French Muscular
Dystrophy Association launched mapping efforts in France;
government agencies, including the Science and Technology Agency
and the Ministry of Education, Science, Sports and Culture,
supported genomic research efforts in Japan; and the European
Community helped to launch several international efforts, notably the
programme to sequence the yeast genome. By late 1990, the Human
Genome Project had been launched, with the creation of genome
centres in these countries. Additional participants subsequently
joined the effort, notably in Germany and China. In addition, the
Human Genome Organization (HUGO) was founded to provide a
forum for international coordination of genomic research. Several
books (refs 24–26) provide a more comprehensive discussion of the
genesis of the Human Genome Project.
Through 1995, work progressed rapidly on two fronts (Fig. 1).
The first was construction of genetic and physical maps of the
human and mouse genomes (refs 27–31), providing key tools for
identification of disease genes and anchoring points for genomic
sequence. The second was sequencing of the yeast (ref. 32) and
worm (ref. 33) genomes, as
1984 1990 1991 1992 1993 1994 1995 1996 1997 1998 2000 20011999
Bacterial genome sequencing
H. flu E. coli 39 species
S. cerevisiae sequencing
C. elegans sequencing
D. melanogaster sequencing
A. thaliana sequencing
Microsatellites
ESTs
cDNA sequencing
Genetic maps
Physical maps
Genetic maps
Physical maps
Genomic sequencing
cDNA sequencing
Genomic sequencing
Full length
ESTs Full length
SNPs
Microsatellites
Pilot project,15%
Chromosome 22 Chromosome 21
Working
draft, 90%
SNPs
Pilot
sequencing
Finishing, ~100%
Discussion and debate
in scientific community
NRC report
O
th
er
o
rg
an
is
m
s
M
ou
se
H
um
an
Figure 1 Timeline of large-scale genomic analyses. Shown are selected components of
work on several non-vertebrate model organisms (red), the mouse (blue) and the human
(green) from 1990; earlier projects are described in the text. SNPs, single nucleotide
polymorphisms; ESTs, expressed sequence tags.
© 2001 Macmillan Magazines Ltd
well as targeted regions of mammalian genomes (refs 34–37). These
projects showed that large-scale sequencing was feasible and developed
the two-phase paradigm for genome sequencing. In the first, 'shotgun',
phase, the genome is divided into appropriately sized segments and
each segment is covered to a high degree of redundancy (typically,
eight- to tenfold) through the sequencing of randomly selected
subfragments. The second is a 'finishing' phase, in which sequence
gaps are closed and remaining ambiguities are resolved through
directed analysis. The results also showed that complete genomic
sequence provided information about genes, regulatory regions and
chromosome structure that was not readily obtainable from cDNA
studies alone.
In 1995, genome scientists considered a proposal (ref. 38) that would
have involved producing a draft genome sequence of the human genome in
a first phase and then returning to finish the sequence in a second
phase. After vigorous debate, it was decided that such a plan was
premature for several reasons. These included the need first to prove
that high-quality, long-range finished sequence could be produced from
most parts of the complex, repeat-rich human genome; the sense that
many aspects of the sequencing process were still rapidly evolving;
and the desirability of further decreasing costs.
Instead, pilot projects were launched to demonstrate the feasibility
of cost-effective, large-scale sequencing, with a target completion
date of March 1999. The projects successfully produced finished
sequence with 99.99% accuracy and no gaps (ref. 39). They also
introduced bacterial artificial chromosomes (BACs; ref. 40), a new
large-insert cloning system that proved to be more stable than the
cosmids and yeast artificial chromosomes (YACs; ref. 41) that had been
used previously. The pilot projects drove the maturation and
convergence of sequencing strategies, while producing 15% of the human
genome sequence. With successful completion of this phase, the human
genome sequencing effort moved into full-scale production in March
1999.
The idea of first producing a draft genome sequence was revived at
this time, both because the ability to finish such a sequence was no
longer in doubt and because there was great hunger in the scientific
community for human sequence data. In addition, some scientists
favoured prioritizing the production of a draft genome sequence over
regional finished sequence because of concerns about commercial plans
to generate proprietary databases of human sequence that might be
subject to undesirable restrictions on use (refs 42–44).
The consortium focused on an initial goal of producing, in a first
production phase lasting until June 2000, a draft genome sequence
covering most of the genome. Such a draft genome sequence, although
not completely finished, would rapidly allow investigators to begin to
extract most of the information in the human sequence. Experiments
showed that sequencing clones covering about 90% of the human genome
to a redundancy of about four- to fivefold ('half-shotgun' coverage;
see Box 1) would accomplish this (refs 45, 46). The draft genome
sequence goal has been achieved, as described below.
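The trade-off between redundancy and completeness can be made concrete
with a simple idealized model. The sketch below assumes that reads fall
uniformly at random along a clone (a Poisson approximation that ignores
repeats and cloning bias), so that the chance that a given base is
missed at c-fold coverage is exp(-c). The function name and the chosen
coverage values are illustrative assumptions, not taken from the paper.

```python
import math


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases hit by no read, assuming reads land
    uniformly at random (Poisson approximation; an idealization)."""
    return math.exp(-coverage)


# Compare half-shotgun (about 4- to 5-fold) with full-shotgun
# (about 8- to 10-fold) redundancy.
for c in (4, 5, 8, 10):
    missed = fraction_uncovered(c)
    print(f"{c:>2}-fold coverage: ~{missed:.4%} of bases unsampled")
```

Under these assumptions, four- to fivefold coverage leaves on the order
of 1% of the bases in a sequenced clone unsampled, whereas eight- to
tenfold coverage leaves well under 0.1%.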
The second sequence production phase is now under way. Its aims are
to achieve full-shotgun coverage of the existing clones during 2001,
to obtain clones to fill the remaining gaps in the physical map, and
to produce a finished sequence (apart from regions that cannot be
cloned or sequenced with currently available techniques) no later
than 2003.
Strategic issues
Hierarchical shotgun sequencing
Soon after the invention of DNA sequencing methods (refs 47, 48), the
shotgun sequencing strategy was introduced (refs 49–51); it has
remained the fundamental method for large-scale genome sequencing
(refs 52–54) for the past 20 years. The approach has been refined and
extended to make it more efficient. For example, improved protocols
for fragmenting and cloning DNA allowed construction of shotgun
libraries with more uniform representation. The practice of sequencing
from both ends of double-stranded clones ('double-barrelled' shotgun
sequencing) was introduced by Ansorge and others (ref. 37) in 1990,
allowing the use of 'linking information' between sequence fragments.
The application of shotgun sequencing was also extended by applying it
to larger and larger DNA molecules: from plasmids (<4 kilobases (kb))
to cosmid clones (40 kb; ref. 37), to artificial chromosomes cloned in
bacteria and yeast (100–500 kb; ref. 55) and bacterial genomes
(1–2 megabases (Mb); ref. 56). In principle, a genome of arbitrary
size may be directly sequenced by the shotgun method, provided that it
contains no repeated sequence and can be uniformly sampled at random.
The genome can then be assembled using the simple computer science
technique of 'hashing' (in which one detects overlaps by consulting an
alphabetized look-up table of all k-letter words in the data).
Mathematical analysis of the expected number of gaps as a function of
coverage is similarly straightforward (ref. 57).
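The look-up-table idea can be illustrated in a few lines. The sketch
below indexes every k-letter word of each read in a hash table (a
Python dict, which plays the same role as an alphabetized table) and
reports pairs of reads that share at least one word as candidate
overlaps. The toy reads, the function names and the choice k = 8 are
illustrative assumptions; the sketch ignores reverse-complement strands
and sequencing errors, and a real assembler would go on to align and
verify each candidate.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-letter word to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_overlaps(reads, k=8):
    """Return pairs of reads that share at least one k-letter word."""
    pairs = set()
    for read_ids in kmer_index(reads, k).values():
        for a, b in combinations(sorted(read_ids), 2):
            pairs.add((a, b))
    return pairs


# Toy reads: r1 and r2 overlap by 14 bases; r3 is unrelated.
reads = {
    "r1": "ACCGTAAATGGGCTGATCATGCTTAAA",
    "r2": "TGATCATGCTTAAACCCTGTGCATCCTACTG",
    "r3": "CAGTCAGTCAGTTTTTGGGG",
}
print(candidate_overlaps(reads))
```

Choosing k is a trade-off: a word longer than the true overlap finds
nothing, whereas a very short word produces many chance matches.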
Practical difficulties arise because of repeated sequences and cloning
bias. Small amounts of repeated sequence pose little problem for
shotgun sequencing. For example, one can readily assemble typical
bacterial genomes (about 1.5% repeat) or the euchromatic portion of
the fly genome (about 3% repeat). By contrast, the human genome is
filled (>50%) with repeated sequences, including interspersed repeats
derived from transposable elements, and long genomic regions that have
been duplicated in tandem, palindromic or dispersed fashion (see
below). These include large duplicated segments (50–500 kb) with high
sequence identity (98–99.9%), at which mispairing during recombination
creates deletions responsible for genetic syndromes. Such features
complicate the assembly of a correct and finished genome sequence.
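To see why repeats are troublesome for a word-based overlap test, the
toy sketch below counts the k-letter words that occur at more than one
position in a sequence. The sequences, the repeat unit and k = 6 are
invented for illustration; real repeats are far longer and only
approximately identical.

```python
from collections import Counter


def repeated_kmers(seq, k):
    """Return the k-letter words that occur more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


# A short sequence in which every 6-letter word is unique.
unique_seq = "ACGTTGCAAGCTTAGGCTAACG"

# The same flanking sequence with two copies of an invented repeat unit.
repeat_unit = "GATTACAGATTC"
repeat_seq = "ACGTTGCA" + repeat_unit + "AGCTTAGG" + repeat_unit + "CTAACG"

print(len(repeated_kmers(unique_seq, 6)))
print(sorted(repeated_kmers(repeat_seq, 6)))
```

Every word reported for the second sequence maps to more than one
location, so a read made up of such words cannot be placed from word
matches alone.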
Figure 2 Idealized representation of the hierarchical shotgun sequencing strategy. A
library is constructed by fragmenting the target genome and cloning it into a large-
fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments
represented in the library are then organized into a physical map and individual BAC
clones are selected and sequenced by the random shotgun strategy. Finally, the clone
sequences are assembled to reconstruct the sequence of the genome.
There are two approaches for sequencing large repeat-rich genomes. The
first is a whole-genome shotgun sequencing approach, as has been used
for the repeat-poor genomes of viruses, bacteria and flies, using
linking information and computational
analysis to attempt to avoid misassemblies. The second is the
'hierarchical shotgun sequencing' approach (Fig. 2), also referred to
as 'map-based', 'BAC-based' or 'clone-by-clone'. This approach
involves generating and organizing a set of large-insert clones
(typically 100–200 kb each) covering the genome and separately
performing shotgun sequencing on appropriately chosen clones. Because
the sequence information is local, the issue of long-range
misassembly is eliminated and the risk of short-range misassembly is
reduced. One caveat is that some large-insert clones may suffer
rearrangement, although this risk can be reduced by appropriate
quality-control measures involving clone fingerprints (see below).
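The step of choosing clones from a physical map can be sketched as a
greedy interval cover. This is only an illustration under simplifying
assumptions (exact clone positions on a one-dimensional map, and
invented clone names and coordinates in kb); it is not the
consortium's clone-selection procedure.

```python
def tiling_path(clones, start, end):
    """Greedily pick a small set of clones covering [start, end).

    clones: list of (name, left, right) map positions (hypothetical).
    Returns the chosen clone names; raises ValueError if the map has
    a gap that no clone spans.
    """
    chosen = []
    position = start
    remaining = sorted(clones, key=lambda c: c[1])
    i = 0
    while position < end:
        best = None
        # Among clones starting at or before the current position,
        # keep the one that reaches furthest to the right.
        while i < len(remaining) and remaining[i][1] <= position:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= position:
            raise ValueError(f"no clone covers map position {position}")
        chosen.append(best[0])
        position = best[2]
    return chosen


# Invented clones of roughly 100-200 kb on a 500 kb stretch of map.
clones = [
    ("A", 0, 150),
    ("B", 100, 260),
    ("C", 120, 200),
    ("D", 240, 400),
    ("E", 380, 500),
]
print(tiling_path(clones, 0, 500))
```

Each selected clone would then be shotgun sequenced on its own, which
is what keeps the assembly problem local.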
The two methods are likely to entail similar costs for producing
finished sequence of a mammalian genome. The hierarchical approach has
a higher initial cost than the whole-genome approach, owing to the
need to create a map of clones (about 1% of the total cost of
sequencing) and to sequence overlaps between clones. On the other
hand, the whole-genome approach is likely to require much greater work
and expense in the final stage of producing a finished sequence,
because of the challenge of resolving misassemblies. Both methods must
also deal with cloning biases, resulting in under-representation of
some regions in either large-insert or small-insert clone libraries.
There was lively scientific debate over whether the human genome
sequencing effort should employ whole-genome or hierarchical shotgun
sequencing. Weber and Myers (ref. 58) stimulated these discussions
with a specific proposal for a whole-genome shotgun approach, together
with an analysis suggesting that the method could work and be more
efficient. Green (ref. 59) challenged these conclusions and argued
that the potential benefits did not outweigh the likely risks.
In the end, we concluded that the human genome sequencing effort
should employ the hierarchical approach for several reasons. First, it
was prudent to use the approach for the first project to sequence a
repeat-rich genome. With the hierarchical approach, the ultimate
frequency of misassembly in the finished product would probably be
lower than with the whole-genome approach, in which it would be more
difficult to identify regions in which the assembly was incorrect.
Second, it was prudent to use the approach in dealing with an outbred
organism, such as the human. In the whole-genome shotgun method,
sequence would necessarily come from two different copies of the human
genome. Accurate sequence assembly could be complicated by sequence
variation between these two copies: both SNPs (which occur at a rate
of 1 per 1,300 bases) and larger-scale structural heterozygosity
(which has been documented in human chromosomes). In the hierarchical
shotgun method, each large-insert clone is derived from a single
haplotype.
Third, the hierarchical method would be better able to deal with
inevitable cloning biases, because it would more readily allow
targeting of additional sequencing to under-represented regions. And
fourth, it was better suited to a project shared among members of a
diverse international consortium, because it allowed work and
responsibility to be easily distributed. As the ultimate goal has
always been to create a high-quality, finished sequence to serve as a
foundation for biomedical research, we reasoned that the advantages of
this more conservative approach outweighed the additional cost, if
any.
A biotechnology company, Celera Genomics, has chosen to incorporate
the whole-genome shotgun approach into its own efforts to sequence the
human genome. Its plan (refs 60, 61) uses a mixed strategy that
combines some coverage from whole-genome shotgun data generated by the
company with the publicly available hierarchical shotgun data
generated by the International Human Genome Sequencing Consortium. If
the raw sequence reads from the whole-genome shotgun component are
made available, it may be possible to evaluate the extent to which the
sequence of the human genome can be assembled without the need for
clone-based information. Such analysis may help to refine sequencing
strategies for other large genomes.
Technology for large-scale sequencing
Sequencing the human genome depended on many technological
improvements in the production and analysis of sequence data. Key
innovations were developed both within and outside the Human Genome
Project. Laboratory innovations included four-colour
fluorescence-based sequence detection (ref. 62), improved fluorescent
dyes (refs 63–66), dye-labelled terminators (ref. 67), polymerases
specifically designed for sequencing (refs 68–70), cycle sequencing
(ref. 71) and capillary gel electrophoresis (refs 72–74). These
studies contributed to substantial improvements in the automation,
quality and throughput of collecting raw DNA sequence (refs 75, 76).
There were also important advances in the development of software
packages for the analysis of sequence data. The PHRED software package
(refs 77, 78) introduced the concept of assigning a 'base-quality
score' to each base, on the basis of the probability of an erroneous
call. These quality scores make it possible to monitor raw data
quality and also assist in determining whether two similar sequences
truly overlap. The PHRAP computer package
(http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) then
systematically assembles the sequence data using the base-quality
scores. The program assigns 'assembly-quality scores' to each base in
the assembled sequence, providing an objective criterion to guide
sequence finishing. The quality scores were based on and validated by
extensive experimental data.
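The link between a base-quality score and an error probability can be
written down directly. The sketch below uses the standard Phred
definition, Q = -10 log10(P), where P is the estimated probability that
a base call is wrong. The function names are illustrative, and this is
not the PHRED program itself, which estimates P from features of the
sequencing trace.

```python
import math


def phred_quality(p_error: float) -> float:
    """Phred-scaled quality for an estimated error probability."""
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse mapping: error probability implied by a quality score."""
    return 10.0 ** (-quality / 10.0)


# 99.99% accuracy corresponds to one error in 10,000 bases.
print(phred_quality(1e-4))
print(error_probability(20))
```

On this scale, the 99.99% accuracy quoted for finished sequence
corresponds to a quality score of 40, and a score of 20 to one
expected error in 100 bases.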
Another key innovation for scaling up sequencing was the
development by several centres of automated methods for sample
preparation. This typically involved creating new biochemical
protocols suitable for automation, followed by construction of
appropriate robotic systems.
Coordination and public data sharing
The Human Genome Project adopted two important principles with regard
to human sequencing. The first was that the collaboration would be
open to centres from any nation. Although potentially less efficient,
in a narrow economic sense, than a centralized approach involving a
few large factories, the inclusive approach was strongly favoured
because we felt that the human genome sequence is the common heritage
of all humanity and the work should transcend national boundaries, and
we believed that scientific progress was best assured by a diversity
of approaches. The collaboration was coordinated through periodic
international meetings (referred to as 'Bermuda meetings' after the
venue of the first three gatherings) and regular telephone
conferences. Work was shared flexibly among the centres, with some
groups focusing on particular chromosomes and others contributing in a
genome-wide fashion.
The second principle was rapid and unrestricted data release. The
centres adopted a policy that all genomic sequence data should be made
publicly available without restriction within 24 hours of assembly
(refs 79, 80). Pre-publication data releases had been pioneered in
mapping projects in the worm (ref. 11) and mouse genomes
(refs 30, 81) and were prominently adopted in the sequencing of the
worm, providing a direct model for the human sequencing efforts. We
believed that scientific progress would be most rapidly advanced by
immediate and free availability of the human genome sequence. The
explosion of scientific work based on the publicly available sequence
data in both academia and industry has confirmed this judgement.
Generating the draft genome sequence
Generating a draft sequence of the human genome involved three
steps: selecting the BAC clones to be sequenced, sequencing them
and assembling the individual sequenced clones into an overall draft
genome sequence. A glossary of terms related to genome sequencing
and assembly is provided in Box 1.
The draft genome sequence is a dynamic product, which is
regularly updated as additional data accumulate en route to the
articles
864 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com© 2001 Macmillan Magazines Ltd
`hierarchical shotgun sequencing' approach (Fig. 2), also referred
to as `map-based', `BAC-based' or `clone-by-clone'. This approach
involves generating and organizing a set of large-insert clones
(typically 100±200 kb each) covering the genome and separately
performing shotgun sequencing on appropriately chosen clones.
Because the sequence information is local, the issue of long-range
misassembly is eliminated and the risk of short-range misassembly
is reduced. One caveat is that some large-insert clones may suffer
rearrangement, although this risk can be reduced by appropriate
quality-control measures involving clone ®ngerprints (see below).
The two methods are likely to entail similar costs for producing
®nished sequence of a mammalian genome. The hierarchical
approach has a higher initial cost than the whole-genome approach,
owing to the need to create a map of clones (about 1% of the total
cost of sequencing) and to sequence overlaps between clones. On
the other hand, the whole-genome approach is likely to require
much greater work and expense in the ®nal stage of producing a
®nished sequence, because of the challenge of resolving misassem-
blies. Both methods must also deal with cloning biases, resulting in
under-representation of some regions in either large-insert or
small-insert clone libraries.
There was lively scienti®c debate over whether the human
genome sequencing effort should employ whole-genome or hier-
archical shotgun sequencing. Weber and Myers
58
stimulated these
discussions with a speci®c proposal for a whole-genome shotgun
approach, together with an analysis suggesting that the method
could work and be more ef®cient. Green
59
challenged these conclu-
sions and argued that the potential bene®ts did not outweigh the
likely risks.
In the end, we concluded that the human genome sequencing
effort should employ the hierarchical approach for several reasons.
First, it was prudent to use the approach for the ®rst project to
sequence a repeat-rich genome. With the hierarchical approach, the
ultimate frequency of misassembly in the ®nished product would
probably be lower than with the whole-genome approach, in which
it would be more dif®cult to identify regions in which the assembly
was incorrect.
Second, it was prudent to use the approach in dealing with an
outbred organism, such as the human. In the whole-genome shot-
gun method, sequence would necessarily come from two different
copies of the human genome. Accurate sequence assembly could be
complicated by sequence variation between these two copiesÐboth
SNPs (which occur at a rate of 1 per 1,300 bases) and larger-scale
structural heterozygosity (which has been documented in human
chromosomes). In the hierarchical shotgun method, each large-
insert clone is derived from a single haplotype.
Third, the hierarchical method would be better able to deal with
inevitable cloning biases, because it would more readily allow
targeting of additional sequencing to under-represented regions.
And fourth, it was better suited to a project shared among members
of a diverse international consortium, because it allowed work and
responsibility to be easily distributed. As the ultimate goal has
always been to create a high-quality, ®nished sequence to serve as a
foundation for biomedical research, we reasoned that the advan-
tages of this more conservative approach outweighed the additional
cost, if any.
A biotechnology company, Celera Genomics, has chosen to
incorporate the whole-genome shotgun approach into its own
efforts to sequence the human genome. Their plan (refs 60, 61) uses a
mixed strategy that combines some coverage from whole-genome
shotgun data generated by the company with the publicly available
hierarchical shotgun data generated by the International Human
Genome Sequencing Consortium. If the raw sequence reads from the
whole-genome shotgun component are made available, it may be
possible to evaluate the extent to which the sequence of the human
genome can be assembled without the need for clone-based
information. Such analysis may help to refine sequencing strategies
for other large genomes.
Technology for large-scale sequencing
Sequencing the human genome depended on many technological
improvements in the production and analysis of sequence data. Key
innovations were developed both within and outside the Human
Genome Project. Laboratory innovations included four-colour
fluorescence-based sequence detection (ref. 62), improved fluorescent
dyes (refs 63–66), dye-labelled terminators (ref. 67), polymerases
specifically designed for sequencing (refs 68–70), cycle sequencing
(ref. 71) and capillary gel electrophoresis (refs 72–74). These studies
contributed to substantial improvements in the automation, quality
and throughput of collecting raw DNA sequence (refs 75, 76). There
were also important advances in the development of software packages
for the analysis of sequence data. The PHRED software package
(refs 77, 78) introduced the concept of assigning a 'base-quality score'
to each base, on the basis of the probability of an erroneous call.
These quality scores make it possible to monitor raw data quality and
also assist in determining whether two similar sequences truly
overlap. The PHRAP computer package
(http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) then
systematically assembles the sequence data using the base-quality
scores. The program assigns 'assembly-quality scores' to each base in
the assembled sequence, providing an objective criterion to guide
sequence finishing. The quality scores were based on and validated by
extensive experimental data.
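The relationship between a quality score and its error probability can be sketched in a few lines. The conversion below follows the definition given in Box 1 (a score of X corresponds to an error probability of roughly 10^(-X/10)), and the coverage helper follows the Box 1 convention that a 'high-quality base' has a score of at least 20. The function names and example scores are hypothetical; this is not the PHRED or PHRAP implementation.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a quality score q, using p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: quality score for an error probability p."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def high_quality_coverage(reads, clone_length, threshold=20):
    """Average number of high-quality bases per clone position.

    reads: iterable of per-read lists of quality scores (hypothetical input).
    """
    high_quality = sum(1 for scores in reads for q in scores if q >= threshold)
    return high_quality / clone_length


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(q, phred_to_error_prob(q))
```

A score of 20 therefore corresponds to about 99% accuracy and a score of 30 to about 99.9%, matching the figures quoted in the glossary.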
Another key innovation for scaling up sequencing was the
development by several centres of automated methods for sample
preparation. This typically involved creating new biochemical
protocols suitable for automation, followed by construction of
appropriate robotic systems.
Coordination and public data sharing
The Human Genome Project adopted two important principles
with regard to human sequencing. The first was that the collaboration
would be open to centres from any nation. Although potentially
less efficient, in a narrow economic sense, than a centralized
approach involving a few large factories, the inclusive approach
was strongly favoured because we felt that the human genome
sequence is the common heritage of all humanity and the work
should transcend national boundaries, and we believed that
scientific progress was best assured by a diversity of approaches.
The collaboration was coordinated through periodic international
meetings (referred to as 'Bermuda meetings' after the venue of the
first three gatherings) and regular telephone conferences. Work was
shared flexibly among the centres, with some groups focusing on
particular chromosomes and others contributing in a genome-wide
fashion.
The second principle was rapid and unrestricted data release. The
centres adopted a policy that all genomic sequence data should be
made publicly available without restriction within 24 hours of
assembly (refs 79, 80). Pre-publication data releases had been
pioneered in mapping projects in the worm (ref. 11) and mouse
genomes (refs 30, 81) and were prominently adopted in the sequencing
of the worm, providing a direct model for the human sequencing
efforts. We believed that scientific progress would be most rapidly
advanced by immediate and free availability of the human genome
sequence. The explosion of scientific work based on the publicly
available sequence data in both academia and industry has confirmed
this judgement.
Generating the draft genome sequence
Generating a draft sequence of the human genome involved three
steps: selecting the BAC clones to be sequenced, sequencing them
and assembling the individual sequenced clones into an overall draft
genome sequence. A glossary of terms related to genome sequencing
and assembly is provided in Box 1.
The draft genome sequence is a dynamic product, which is
regularly updated as additional data accumulate en route to the
ultimate goal of a completely finished sequence. The results below
are based on the map and sequence data available on 7 October
2000, except as otherwise noted. At the end of this section, we
provide a brief update of key data.
Clone selection
The hierarchical shotgun method involves the sequencing of
overlapping large-insert clones spanning the genome. For the Human
Genome Project, clones were largely chosen from eight large-insert
libraries containing BAC or P1-derived artificial chromosome
(PAC) clones (Table 1; refs 82–88). The libraries were made by
partial digestion of genomic DNA with restriction enzymes.
Together, they represent around 65-fold coverage (redundant
sampling) of the genome. Libraries based on other vectors, such as
cosmids, were also used in early stages of the project.
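The coverage figure can be checked approximately from the clone counts and average insert sizes listed in Table 1. The sketch below uses the eight rows whose clone counts sum to the 'Total of top eight libraries' entry (1,482,502) and assumes a haploid genome size of about 3.2 Gb; that genome size and the variable names are assumptions introduced here for illustration.

```python
# Assumed haploid genome size in kb (~3.2 Gb); not taken from this section.
GENOME_SIZE_KB = 3_200_000

# (total clones in library, average insert size in kb), from Table 1.
LIBRARIES = {
    "Caltech B": (74_496, 120),
    "Caltech C": (263_040, 125),
    "Caltech D1": (162_432, 129),
    "RPCI-1": (115_200, 110),
    "RPCI-3": (75_513, 115),
    "RPCI-4": (105_251, 116),
    "RPCI-5": (142_773, 115),
    "RPCI-11": (543_797, 178),
}


def fold_coverage(libraries, genome_size_kb):
    """Total cloned DNA divided by genome size (redundant sampling)."""
    total_insert_kb = sum(n * size for n, size in libraries.values())
    return total_insert_kb / genome_size_kb


if __name__ == "__main__":
    print(round(fold_coverage(LIBRARIES, GENOME_SIZE_KB), 1))
```

With these inputs the estimate comes out at roughly 65-fold, consistent with the figure stated above; a different assumed genome size would shift it proportionally.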
The libraries (Table 1) were prepared from DNA obtained from
anonymous human donors in accordance with US Federal Regulations
for the Protection of Human Subjects in Research
(45CFR46) and following full review by an Institutional Review
Board. Briefly, the opportunity to donate DNA for this purpose was
broadly advertised near the two laboratories engaged in library
Box 1
Genome glossary
Sequence
Raw sequence: Individual unassembled sequence reads, produced
by sequencing of clones containing DNA inserts.
Paired-end sequence: Raw sequence obtained from both ends of a
cloned insert in any vector, such as a plasmid or bacterial artificial
chromosome.
Finished sequence: Complete sequence of a clone or genome, with
an accuracy of at least 99.99% and no gaps.
Coverage (or depth): The average number of times a nucleotide is
represented by a high-quality base in a collection of random raw
sequence. Operationally, a 'high-quality base' is defined as one with an
accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Full shotgun coverage: The coverage in random raw sequence
needed from a large-insert clone to ensure that it is ready for finishing;
this varies among centres but is typically 8–10-fold. Clones with full
shotgun coverage can usually be assembled with only a handful of gaps
per 100 kb.
Half shotgun coverage: Half the amount of full shotgun coverage
(typically, 4–5-fold random coverage).
Clones
BAC clone: Bacterial artificial chromosome vector carrying a genomic
DNA insert, typically 100–200 kb. Most of the large-insert clones
sequenced in the project were BAC clones.
Finished clone: A large-insert clone that is entirely represented by
finished sequence.
Full shotgun clone: A large-insert clone for which full shotgun
sequence has been produced.
Draft clone: A large-insert clone for which roughly half-shotgun
sequence has been produced. Operationally, the collection of draft
clones produced by each centre was required to have an average
coverage of fourfold for the entire set and a minimum coverage of
threefold for each clone.
Predraft clone: A large-insert clone for which some shotgun
sequence is available, but which does not meet the standards for
inclusion in the collection of draft clones.
Contigs and scaffolds
Contig: The result of joining an overlapping collection of sequences or
clones.
Scaffold: The result of connecting contigs by linking information from
paired-end reads from plasmids, paired-end reads from BACs, known
messenger RNAs or other sources. The contigs in a scaffold are ordered
and oriented with respect to one another.
Fingerprint clone contigs: Contigs produced by joining clones
inferred to overlap on the basis of their restriction digest fingerprints.
Sequenced-clone layout: Assignment of sequenced clones to the
physical map of fingerprint clone contigs.
Initial sequence contigs: Contigs produced by merging overlapping
sequence reads obtained from a single clone, in a process called
sequence assembly.
Merged sequence contigs: Contigs produced by taking the initial
sequence contigs contained in overlapping clones and merging those
found to overlap. These are also referred to simply as 'sequence contigs'
where no confusion will result.
Sequence-contig scaffolds: Scaffolds produced by connecting
sequence contigs on the basis of linking information.
Sequenced-clone contigs: Contigs produced by merging overlapping
sequenced clones.
Sequenced-clone-contig scaffolds: Scaffolds produced by joining
sequenced-clone contigs on the basis of linking information.
Draft genome sequence: The sequence produced by combining
the information from the individual sequenced clones (by creating
merged sequence contigs and then employing linking information to
create scaffolds) and positioning the sequence along the physical map of
the chromosomes.
N50 length: A measure of the contig length (or scaffold length)
containing a 'typical' nucleotide. Specifically, it is the maximum length L
such that 50% of all nucleotides lie in contigs (or scaffolds) of size at
least L.
Computer programs and databases
PHRED: A widely used computer program that analyses raw sequence
to produce a 'base call' with an associated 'quality score' for each
position in the sequence. A PHRED quality score of X corresponds to an
error probability of approximately $10^{-X/10}$. Thus, a PHRED quality
score of 30 corresponds to 99.9% accuracy for the base call in the raw
read.
PHRAP: A widely used computer program that assembles raw
sequence into sequence contigs and assigns to each position in the
sequence an associated 'quality score', on the basis of the PHRED
scores of the raw sequence reads. A PHRAP quality score of X
corresponds to an error probability of approximately $10^{-X/10}$. Thus, a
PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in
the assembled sequence.
GigAssembler: A computer program developed during this project
for merging the information from individual sequenced clones into a draft
genome sequence.
Public sequence databases: The three coordinated international
sequence databases: GenBank, the EMBL data library and DDBJ.
Map features
STS: Sequence tagged site, corresponding to a short (typically less
than 500 bp) unique genomic locus for which a polymerase chain
reaction assay has been developed.
EST: Expressed sequence tag, obtained by performing a single raw
sequence read from a random complementary DNA clone.
SSR: Simple sequence repeat, a sequence consisting largely of a
tandem repeat of a specific k-mer (such as $(\mathrm{CA})_{15}$). Many
SSRs are polymorphic and have been widely used in genetic mapping.
SNP: Single nucleotide polymorphism, or a single nucleotide position in
the genome sequence for which two or more alternative alleles are
present at appreciable frequency (traditionally, at least 1%) in the human
population.
Genetic map: A genome map in which polymorphic loci are
positioned relative to one another on the basis of the frequency with
which they recombine during meiosis. The unit of distance is
centimorgans (cM), denoting a 1% chance of recombination.
Radiation hybrid (RH) map: A genome map in which STSs are
positioned relative to one another on the basis of the frequency with
which they are separated by radiation-induced breaks. The frequency is
assayed by analysing a panel of human–hamster hybrid cell lines, each
produced by lethally irradiating human cells and fusing them with
recipient hamster cells such that each carries a collection of human
chromosomal fragments. The unit of distance is centirays (cR), denoting
a 1% chance of a break occurring between two loci.
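The N50 definition in the glossary translates directly into a short computation. The sketch below is one straightforward way to evaluate it for a list of contig (or scaffold) lengths; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Largest L such that contigs of size >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases until half
    # of the total is reached; the contig reached at that point sets L.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (kb): total is 100, so half is 50.
    print(n50([10, 20, 30, 40]))
```

In the example, contigs of at least 30 kb hold 70 of the 100 kb, whereas contigs of at least 40 kb hold only 40 kb, so the N50 length is 30 kb.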
construction. Volunteers of diverse backgrounds were accepted on a
first-come, first-taken basis. Samples were obtained after discussion
with a genetic counsellor and written informed consent. The
samples were made anonymous as follows: the sampling laboratory
stripped all identifiers from the samples, applied random numeric
labels, and transferred them to the processing laboratory, which
then removed all labels and relabelled the samples. All records of the
labelling were destroyed. The processing laboratory chose samples
at random from which to prepare DNA and immortalized cell lines.
Around 5–10 samples were collected for every one that was
eventually used. Because no link was retained between donor and
DNA sample, the identity of the donors for the libraries is not
known, even by the donors themselves. A more complete description
can be found at
http://www.nhgri.nih.gov/Grant_info/Funding/Statements/RFA/human_subjects.html.
During the pilot phase, centres showed that sequence-tagged sites
(STSs) from previously constructed genetic and physical maps
could be used to recover BACs from specific regions. As sequencing
expanded, some centres continued this approach, augmented with
additional probes from flow sorting of chromosomes to obtain
long-range coverage of specific chromosomes or chromosomal
regions (refs 89–94).
For the large-scale sequence production phase, a genome-wide
physical map of overlapping clones was also constructed by
systematic analysis of BAC clones representing 20-fold coverage of the
human genome (ref. 86). Most clones came from the first three sections
of the RPCI-11 library, supplemented with clones from sections of the
RPCI-13 and CalTech D libraries (Table 1). DNA from each BAC
clone was digested with the restriction enzyme HindIII, and the sizes
of the resulting fragments were measured by agarose gel
electrophoresis. The pattern of restriction fragments provides a
'fingerprint' for each BAC, which allows different BACs to be
distinguished and the degree of overlaps to be assessed. We used these
restriction-fragment fingerprints to determine clone overlaps, and
thereby assembled the BACs into fingerprint clone contigs.
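The idea of inferring overlap from shared restriction fragments can be illustrated with a toy comparison. The sketch below counts fragments from two clones whose measured sizes agree within a tolerance, using a simple greedy pairing of sorted sizes. The tolerance, the fragment sizes and the function names are hypothetical, and the scoring actually used to build the map (ref. 86) is not reproduced here.

```python
def shared_fragments(fp_a, fp_b, tolerance=0.02):
    """Greedily pair fragments whose sizes agree within a relative tolerance.

    fp_a, fp_b: lists of fragment sizes (bp). Each fragment is used at most once.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def overlap_fraction(fp_a, fp_b, tolerance=0.02):
    """Shared fragments as a fraction of the smaller fingerprint."""
    smaller = min(len(fp_a), len(fp_b))
    if smaller == 0:
        return 0.0
    return shared_fragments(fp_a, fp_b, tolerance) / smaller


if __name__ == "__main__":
    # Hypothetical HindIII fragment sizes (bp) for two BACs.
    bac1 = [1200, 2500, 4100, 7300, 9800]
    bac2 = [2510, 4080, 7300, 11000, 15000]
    print(shared_fragments(bac1, bac2), overlap_fraction(bac1, bac2))
```

A real fingerprint comparison also has to account for chance matches between unrelated clones, which is why a raw shared-fragment count like this one is only a starting point.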
The fingerprint clone contigs were positioned along the chromosomes
by anchoring them with STS markers from existing genetic
and physical maps. Fingerprint clone contigs were tied to specific
STSs initially by probe hybridization and later by direct search of the
sequenced clones. To localize fingerprint clone contigs that did not
contain known markers, new STSs were generated and placed onto
chromosomes (ref. 95). Representative clones were also positioned by
fluorescence in situ hybridization (FISH) (ref. 86 and C. McPherson,
unpublished).
We selected clones from the fingerprint clone contigs for sequencing
according to various criteria. Fingerprint data were reviewed
(refs 86, 90) to evaluate overlaps and to assess clone fidelity (to
bias against rearranged clones; refs 83, 96). STS content information
and BAC end sequence information were also used (refs 91, 92). Where
possible, we tried to select a minimally overlapping set spanning a
region. However, because the genome-wide physical map was constructed
concurrently with the sequencing, continuity in many regions was
low in early stages. These small fingerprint clone contigs were
nonetheless useful in identifying validated, nonredundant clones
Table 1 Key large-insert genome-wide libraries

Library name* | GenBank abbreviation | Vector type | Source DNA | Library segment or plate numbers | Enzyme digest | Average insert size (kb) | Total number of clones in library | Number of fingerprinted clones† | BAC-end sequence (ends/clones/clones with both ends sequenced)‡ | Number of clones in genome layout§ | Sequenced clones used in construction of the draft genome sequence: number‖ | Sequenced clones used: total bases (Mb)¶ | Sequenced clones used: fraction of total from library
Caltech B | CTB | BAC | 987SK cells | All | HindIII | 120 | 74,496 | 16 | 2/1/1 | 528 | 518 | 66.7 | 0.016
Caltech C | CTC | BAC | Human sperm | All | HindIII | 125 | 263,040 | 144 | 21,956/14,445/7,255 | 621 | 606 | 88.4 | 0.021
Caltech D1 (CITB-H1) | CTD | BAC | Human sperm | All | HindIII | 129 | 162,432 | 49,833 | 403,589/226,068/156,631 | 1,381 | 1,367 | 185.6 | 0.043
Caltech D2 (CITB-E1) | | BAC | Human sperm | All | | | | | | | | |
 | | | | 2,501–2,565 | EcoRI | 202 | 24,960 | | | | | |
 | | | | 2,566–2,671 | EcoRI | 182 | 46,326 | | | | | |
 | | | | 3,000–3,253 | EcoRI | 142 | 97,536 | | | | | |
RPCI-1 | RP1 | PAC | Male, blood | All | MboI | 110 | 115,200 | 3,388 | | 1,070 | 1,053 | 117.7 | 0.028
RPCI-3 | RP3 | PAC | Male, blood | All | MboI | 115 | 75,513 | | | 644 | 638 | 68.5 | 0.016
RPCI-4 | RP4 | PAC | Male, blood | All | MboI | 116 | 105,251 | | | 889 | 881 | 95.5 | 0.022
RPCI-5 | RP5 | PAC | Male, blood | All | MboI | 115 | 142,773 | | | 1,042 | 1,033 | 116.5 | 0.027
RPCI-11 | RP11 | BAC | Male, blood | All | | 178 | 543,797 | 267,931 | 379,773/243,764/134,110 | 19,405 | 19,145 | 3,165.0 | 0.743
 | | | | 1 | EcoRI | 164 | 108,499 | | | | | |
 | | | | 2 | EcoRI | 168 | 109,496 | | | | | |
 | | | | 3 | EcoRI | 181 | 109,657 | | | | | |
 | | | | 4 | EcoRI | 183 | 109,382 | | | | | |
 | | | | 5 | MboI | 196 | 106,763 | | | | | |
Total of top eight libraries | | | | | | | 1,482,502 | 321,312 | 805,320/484,278/297,997 | 25,580 | 25,241 | 3,903.9 | 0.916
Total all libraries | | | | | | | | 354,510 | 812,594/488,017/100,775 | 30,445 | 29,298 | 4,260.5 | 1

* For the CalTech libraries (ref. 82), see http://www.tree.caltech.edu/lib_status.html; for RPCI libraries (ref. 83), see http://www.chori.org/bacpac/home.htm.
† For the FPC map and fingerprinting (refs 84–86), see http://genome.wustl.edu/gsc/human/human_database.shtml.
‡ The number of raw BAC end sequences (ends/clones/clones with both ends sequenced) available for use in human genome sequencing. Typically, for clones in which sequence was obtained from both ends, more than 95% of both end sequences contained at least 100 bp of nonrepetitive sequence. BAC-end sequencing of RPCI-11 and of the CalTech libraries was done at The Institute for Genomic Research, the California Institute of Technology and the University of Washington High Throughput Sequencing Center. The sources for the Table were http://www.ncbi.nlm.nih.gov/genome/clone/BESstat.shtml and refs 87, 88.
§ These are the clones in the sequenced-clone layout map (http://genome.wustl.edu/gsc/human/Mapping/index.shtml) that were pre-draft, draft or finished.
‖ The number of sequenced clones used in the assembly. This number is less than that in the previous column owing to removal of a small number of obviously contaminated, combined or duplicated projects; in addition, not all of the clones from completed chromosomes 21 and 22 were included here because only the available finished sequence from those chromosomes was used in the assembly.
¶ The number reported is the total sequence from the clones indicated in the previous column. Potential overlap between clones was not removed here, but Ns were excluded.
© 2001 Macmillan Magazines Ltd
®rst-come, ®rst-taken basis. Samples were obtained after discussion
with a genetic counsellor and written informed consent. The
samples were made anonymous as follows: the sampling laboratory
stripped all identi®ers from the samples, applied random numeric
labels, and transferred them to the processing laboratory, which
then removed all labels and relabelled the samples. All records of the
labelling were destroyed. The processing laboratory chose samples
at random from which to prepare DNA and immortalized cell lines.
Around 5±10 samples were collected for every one that was
eventually used. Because no link was retained between donor and
DNA sample, the identity of the donors for the libraries is not
known, even by the donors themselves. A more complete descrip-
tion can be found at http://www.nhgri.nih.gov/Grant_info/Fund-
ing/Statements/RFA/human_subjects.html.
During the pilot phase, centres showed that sequence-tagged sites
(STSs) from previously constructed genetic and physical maps
could be used to recover BACs from speci®c regions. As sequencing
expanded, some centres continued this approach, augmented with
additional probes from ¯ow sorting of chromosomes to obtain
long-range coverage of speci®c chromosomes or chromosomal
regions
89±94
.
For the large-scale sequence production phase, a genome-wide
physical map of overlapping clones was also constructed by sys-
tematic analysis of BAC clones representing 20-fold coverage of the
human genome
86
. Most clones came from the ®rst three sections of
the RPCI-11 library, supplemented with clones from sections of the
RPCI-13 and CalTech D libraries (Table 1). DNA from each BAC
clone was digested with the restriction enzyme HindIII, and the sizes
of the resulting fragments were measured by agarose gel electro-
phoresis. The pattern of restriction fragments provides a `®nger-
print' for each BAC, which allows different BACs to be distinguished
and the degree of overlaps to be assessed. We used these restriction-
fragment ®ngerprints to determine clone overlaps, and thereby
assembled the BACs into ®ngerprint clone contigs.
The ®ngerprint clone contigs were positioned along the chromo-
somes by anchoring them with STS markers from existing genetic
and physical maps. Fingerprint clone contigs were tied to speci®c
STSs initially by probe hybridization and later by direct search of the
sequenced clones. To localize ®ngerprint clone contigs that did not
contain known markers, new STSs were generated and placed onto
chromosomes
95
. Representative clones were also positioned by ¯uor-
escence in situ hybridization (FISH) (ref. 86 and C. McPherson,
unpublished).
We selected clones from the ®ngerprint clone contigs for sequen-
cing according to various criteria. Fingerprint data were
reviewed
86,90
to evaluate overlaps and to assess clone ®delity (to
bias against rearranged clones
83,96
). STS content information and
BAC end sequence information were also used
91,92
. Where possible,
we tried to select a minimally overlapping set spanning a region.
However, because the genome-wide physical map was constructed
concurrently with the sequencing, continuity in many regions was
low in early stages. These small ®ngerprint clone contigs were
nonetheless useful in identifying validated, nonredundant clones
articles
866 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
Table 1 Key large-insert genome-wide libraries
Library name* GenBank
abbreviation
Vector
type
Source DNA Library
segment or
plate
numbers
Enzyme
digest
Average
insert size
(kb)
Total number
of clones in
library
Number of
®ngerprinted
clones²
BAC-end
sequence
(ends/clones/
clones with
both ends
sequenced)³
Number of
clones in
genome
layout§
Sequenced clones used in
construction of the draft genome
sequence
Numberk Total bases
(Mb)¶
Fraction of
total from
library
Caltech B CTB BAC 987SK cells All HindIII 120 74,496 16 2/1/1 528 518 66.7 0.016
Caltech C CTC BAC Human
sperm
All HindIII 125 263,040 144 21,956/
14,445/
7,255
621 606 88.4 0.021
Caltech D1
(CITB-H1)
CTD BAC Human
sperm
All HindIII 129 162,432 49,833 403,589/
226,068/
156,631
1,381 1,367 185.6 0.043
Caltech D2
(CITB-E1)
BAC Human
sperm
All
2,501±2,565 EcoRI 202 24,960
2,566±2,671 EcoRI 182 46,326
3,000±3,253 EcoRI 142 97,536
RPCI-1 RP1 PAC Male, blood All MboI 110 115,200 3,388 1,070 1,053 117.7 0.028
RPCI-3 RP3 PAC Male, blood All MboI 115 75,513 644 638 68.5 0.016
RPCI-4 RP4 PAC Male, blood All MboI 116 105,251 889 881 95.5 0.022
RPCI-5 RP5 PAC Male, blood All MboI 115 142,773 1,042 1,033 116.5 0.027
RPCI-11 RP11 BAC Male, blood All 178 543,797 267,931 379,773/
243,764/
134,110
19,405 19,145 3,165.0 0.743
1 EcoRI 164 108,499
2 EcoRI 168 109,496
3 EcoRI 181 109,657
4 EcoRI 183 109,382
5 MboI 196 106,763
Total of top
eight libraries
1,482,502 321,312 805,320/
484,278/
297,997
25,580 25,241 3,903.9 0.916
Total all libraries 354,510 812,594/
488,017/
100,775
30,445 29,298 4,260.5 1
* For the CalTech libraries (ref. 82), see http://www.tree.caltech.edu/lib_status.html; for RPCI libraries (ref. 83), see http://www.chori.org/bacpac/home.htm.
† For the FPC map and fingerprinting (refs 84–86), see http://genome.wustl.edu/gsc/human/human_database.shtml.
‡ The number of raw BAC end sequences (clones/ends/clones with both ends sequenced) available for use in human genome sequencing. Typically, for clones in which sequence was obtained from both ends, more than 95% of both end sequences contained at least 100 bp of nonrepetitive sequence. BAC-end sequencing of RPCI-11 and of the CalTech libraries was done at The Institute for Genomic Research, the California Institute of Technology and the University of Washington High Throughput Sequencing Center. The sources for the Table were http://www.ncbi.nlm.nih.gov/genome/clone/BESstat.shtml and refs 87, 88.
§ These are the clones in the sequenced-clone layout map (http://genome.wustl.edu/gsc/human/Mapping/index.shtml) that were pre-draft, draft or finished.
∥ The number of sequenced clones used in the assembly. This number is less than that in the previous column owing to removal of a small number of obviously contaminated, combined or duplicated projects; in addition, not all of the clones from completed chromosomes 21 and 22 were included here because only the available finished sequence from those chromosomes was used in the assembly.
¶ The number reported is the total sequence from the clones indicated in the previous column. Potential overlap between clones was not removed here, but Ns were excluded.
© 2001 Macmillan Magazines Ltd
that were used to 'seed' the sequencing of new regions. The small fingerprint clone contigs were extended or merged with others as the map matured.
The clones that make up the draft genome sequence therefore do not constitute a minimally overlapping set; there is overlap and redundancy in places. The cost of using suboptimal overlaps was justified by the benefit of earlier availability of the draft genome sequence data. Minimizing the overlap between adjacent clones would have required completing the physical map before undertaking large-scale sequencing. In addition, the overlaps between BAC clones provide a rich collection of SNPs. More than 1.4 million SNPs have already been identified from clone overlaps and other sequence comparisons (ref. 97).
Because the sequencing project was shared among twenty centres in six countries, it was important to coordinate selection of clones across the centres. Most centres focused on particular chromosomes or, in some cases, larger regions of the genome. We also maintained a clone registry to track selected clones and their progress. In later phases, the global map provided an integrated view of the data from all centres, facilitating the distribution of effort to maximize coverage of the genome. Before performing extensive sequencing on a clone, several centres routinely examined an initial sample of 96 raw sequence reads from each subclone library to evaluate possible overlap with previously sequenced clones.
Sequencing
The selected clones were subjected to shotgun sequencing. Although the basic approach of shotgun sequencing is well established, the details of implementation varied among the centres. For example, there were differences in the average insert size of the shotgun libraries, in the use of single-stranded or double-stranded cloning vectors, and in sequencing from one end or both ends of each insert. Centres differed in the fluorescent labels employed and in the degree to which they used dye-primers or dye-terminators. The sequence detectors included both slab gel- and capillary-based devices. Detailed protocols are available on the web sites of many of the individual centres (URLs can be found at www.nhgri.nih.gov/genome_hub). The extent of automation also varied greatly among the centres, with the most aggressive automation efforts resulting in factory-style systems able to process more than 100,000 sequencing reactions in 12 hours (Fig. 3). In addition, centres differed in the amount of raw sequence data typically obtained for each clone (so-called half-shotgun, full shotgun and finished sequence). Sequence information from the different centres could be directly integrated despite this diversity, because the data were analysed by a common computational procedure. Raw sequence traces were processed and assembled with the PHRED and PHRAP software packages (refs 77, 78; P. Green, unpublished). All assembled contigs of more than 2 kb were deposited in public databases within 24 hours of assembly.
The overall sequencing output rose sharply during production (Fig. 4). Following installation of new sequence detectors beginning in June 1999, sequencing capacity and output rose approximately eightfold in eight months to nearly 7 million samples processed per month, with little or no drop in success rate (ratio of useable reads to attempted reads). By June 2000, the centres were producing raw sequence at a rate equivalent to onefold coverage of the entire human genome in less than six weeks. This corresponded to a continuous throughput exceeding 1,000 nucleotides per second, 24 hours per day, seven days per week. This scale-up resulted in a concomitant increase in the sequence available in the public databases (Fig. 4).
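As a rough check of the throughput figures above, the following sketch converts a sustained rate of 1,000 nucleotides per second into the time needed for onefold coverage. The genome size of 3.2 Gb is an assumed round value chosen for illustration (it is not stated in this passage), and the function names are hypothetical.

```python
# Back-of-envelope check of the throughput claim.
# GENOME_SIZE_BP is an assumed round value (3.2 Gb), not a figure from this passage.
GENOME_SIZE_BP = 3.2e9
SECONDS_PER_DAY = 24 * 60 * 60


def days_for_onefold(rate_nt_per_s, genome_bp=GENOME_SIZE_BP):
    """Days of continuous output needed to produce one genome-equivalent of raw sequence."""
    return genome_bp / rate_nt_per_s / SECONDS_PER_DAY


def rate_for_onefold(days, genome_bp=GENOME_SIZE_BP):
    """Sustained rate (nucleotides per second) that yields onefold coverage in `days`."""
    return genome_bp / (days * SECONDS_PER_DAY)


days = days_for_onefold(1000)
print(f"{days:.1f} days, or {days / 7:.1f} weeks, at 1,000 nt/s")
print(f"{rate_for_onefold(42):.0f} nt/s needed to finish in exactly six weeks")
```

Under this assumption, 1,000 nucleotides per second gives onefold coverage in about 37 days, which is consistent with "less than six weeks".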
A version of the draft genome sequence was prepared on the basis of the map and sequence data available on 7 October 2000. For this version, the mapping effort had assembled the fingerprinted BACs into 1,246 fingerprint clone contigs. The sequencing effort had sequenced and assembled 29,298 overlapping BACs and other large-insert clones (Table 2), comprising a total length of 4.26 gigabases (Gb). This resulted from around 23 Gb of underlying raw shotgun sequence data, or about 7.5-fold coverage averaged across the genome (including both draft and finished sequence). The various contributions to the total amount of sequence deposited in the HTGS division of GenBank are given in Table 3.
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 867
Figure 3 The automated production line for sample preparation at the Whitehead
Institute, Center for Genome Research. The system consists of custom-designed factory-
style conveyor belt robots that perform all functions from purifying DNA from bacterial
cultures through setting up and purifying sequencing reactions.
[Figure 4 plot. Vertical axis: Sequence (Mb), 0 to 5,000. Horizontal axis: Month, Jan-96 to Oct-00. Series: Finished; Unfinished (draft and pre-draft).]
Figure 4 Total amount of human sequence in the High Throughput Genome Sequence (HTGS) division of GenBank. The total is the sum of finished sequence (red) and unfinished (draft plus predraft) sequence (yellow).
Table 2 Total genome sequence from the collection of sequenced clones, by sequence status
Sequence    Number of   Total clone   Average number of   Average     Total amount of
status      clones      length (Mb)   sequence reads      sequence    raw sequence (Mb)
                                      per kb*             depth†
Finished    8,277       897           20–25               8–12        9,085
Draft       18,969      3,097         12                  4.5         13,395
Predraft    2,052       267           6                   2.5         667
Total                                                                 23,147
* The average number of reads per kb was estimated based on information provided by each sequencing centre. This number differed among sequencing centres, based on the actual protocols used.
† The average depth in high quality bases (≥99% accuracy) was estimated from information provided by each sequencing centre. The average varies among the centres, and the number may vary considerably for clones with the same sequencing status. For draft clones in the public databases (keyword: HTGS_draft), the number can be computed from the quality scores listed in the database entry.
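The figures in Table 2 can be cross-checked against the totals quoted in the text (29,298 clones, 4.26 Gb of clone length, about 23 Gb of raw sequence). The sketch below does this with values transcribed from the table; the variable and function names are illustrative only. The ratio of raw sequence to clone length is shown as a crude proxy and is not the same quantity as the average depth in high-quality bases reported in the table.

```python
# Rows transcribed from Table 2:
# (status, number of clones, total clone length in Mb, raw sequence in Mb)
TABLE2 = [
    ("Finished", 8277, 897, 9085),
    ("Draft", 18969, 3097, 13395),
    ("Predraft", 2052, 267, 667),
]


def totals(rows):
    """Column sums: (clones, total clone length in Mb, raw sequence in Mb)."""
    clones = sum(row[1] for row in rows)
    length_mb = sum(row[2] for row in rows)
    raw_mb = sum(row[3] for row in rows)
    return clones, length_mb, raw_mb


def raw_per_clone_length(rows):
    """Raw sequence divided by clone length, per status (a crude redundancy proxy)."""
    return {status: raw / length for status, _, length, raw in rows}


clones, length_mb, raw_mb = totals(TABLE2)
print(clones, length_mb, raw_mb)
for status, ratio in raw_per_clone_length(TABLE2).items():
    print(f"{status}: {ratio:.1f}-fold raw sequence per unit clone length")
```

The column sums reproduce the 23,147 Mb total in the table and the clone count and total length given in the text.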
By agreement among the centres, the collection of draft clones produced by each centre was required to have fourfold average sequence coverage, with no clone below threefold. (For this purpose, sequence coverage was defined as the average number of times that each base was independently read with a base-quality score corresponding to at least 99% accuracy.) We attained an overall average of 4.5-fold coverage across the genome for draft clones. A few of the sequenced clones fell below the minimum of threefold sequence coverage or have not been formally designated by centres as meeting draft standards; these are referred to as predraft (Table 2). Some of these are clones that span remaining gaps in the draft genome sequence and were in the process of being sequenced on 7 October 2000; a few are old submissions from centres that are no longer active.
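The coverage definition above can be made concrete with a small sketch. It assumes the standard Phred convention, in which a quality score Q corresponds to an error probability of 10^(-Q/10), so that 99% accuracy corresponds to Q = 20. The function names and the toy reads are hypothetical, and the calculation is a simplified reading of the definition rather than the centres' actual procedure.

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score implied by an error probability p."""
    return -10 * math.log10(p)


def high_quality_coverage(read_qualities, clone_length, min_q=20):
    """Average number of times each base of the clone is read at quality >= min_q.

    read_qualities: list of reads, each a list of per-base Phred scores.
    clone_length: clone length in bases (assumed known).
    """
    hq_bases = sum(1 for read in read_qualities for q in read if q >= min_q)
    return hq_bases / clone_length


# Toy example: two 4-base reads over a 4-base "clone".
reads = [[30, 30, 10, 25], [20, 19, 40, 40]]
print(error_to_phred(0.01))             # quality score for 99% accuracy
print(high_quality_coverage(reads, 4))  # 6 high-quality bases over 4 positions
```

Counting only bases at or above the threshold is what distinguishes this measure from raw read depth: low-quality read ends add to the raw total but not to the coverage figure.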
The lengths of the initial sequence contigs in the draft clones vary as a function of coverage, but half of all nucleotides reside in initial sequence contigs of at least 21.7 kb (see below). Various properties of the draft clones can be assessed from instances in which there was substantial overlap between a draft clone and a finished (or nearly finished) clone. By examining the sequence alignments in the overlap regions, we estimated that the initial sequence contigs in a draft sequence clone cover an average of about 96% of the clone and are separated by gaps with an average size of about 500 bp.
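The statement that half of all nucleotides reside in contigs of at least a given length describes a statistic often called the N50 length. A minimal sketch of how it can be computed from a list of contig lengths follows; the function name and the example lengths are illustrative and are not data from the paper.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half of the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Toy example (lengths in kb): the two longest contigs hold 70 of 100 kb.
print(n50([40, 30, 20, 10]))
```

Because the statistic weights contigs by the bases they contain, it is much less sensitive to large numbers of very short contigs than a simple mean or median length.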
Although the main emphasis was on producing a draft genome sequence, the centres also maintained sequence finishing activities during this period, leading to a twofold increase in finished sequence from June 1999 to June 2000 (Fig. 4). The total amount of human sequence in this final form stood at more than 835 Mb on 7 October 2000, or more than 25% of the human genome. This includes the finished sequences of chromosomes 21 and 22 (refs 93, 94). As centres have begun to shift from draft to finished sequencing in the last quarter of 2000, the production of finished sequence has increased to an annualized rate of 1 Gb per year and is continuing to rise.
In addition to sequencing large-insert clones, three centres generated a large collection of random raw sequence reads from whole-genome shotgun libraries (Table 4; ref. 98). These 5.77 million successful sequences contained 2.4 Gb of high-quality bases; this corresponds to about 0.75-fold coverage and would be statistically expected to include about 50% of the nucleotides in the human genome (data available at http://snp.cshl.org/data). The primary objective of this work was to discover SNPs, by comparing these random raw sequences (which came from different individuals) with the draft genome sequence. However, many of these raw sequences were obtained from both ends of plasmid clones and thereby also provided valuable 'linking' information that was used in sequence assembly. In addition, the random raw sequences provide sequence coverage of about half of the nucleotides not yet represented in the sequenced large-insert clones; these can be used as probes for portions of the genome not yet recovered.
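The link between 0.75-fold coverage and roughly half of the genome being sampled can be illustrated with the usual idealized model, in which reads land uniformly at random and the number of reads covering a base is approximately Poisson distributed. This model and the 3.2 Gb genome size are assumptions made for the sketch; the text says only that the figure is "statistically expected".

```python
import math

# Assumed round genome size for illustration (3.2 Gb); not stated in this passage.
GENOME_SIZE_BP = 3.2e9


def fold_coverage(total_bases, genome_bp=GENOME_SIZE_BP):
    """Average number of times each base is sampled."""
    return total_bases / genome_bp


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit at least once under a Poisson model.

    P(base not covered) = exp(-coverage), so P(covered) = 1 - exp(-coverage).
    """
    return 1.0 - math.exp(-coverage)


c = fold_coverage(2.4e9)
print(f"coverage: {c:.2f}-fold")
print(f"expected fraction covered: {expected_fraction_covered(c):.1%}")
```

Under these assumptions the expected fraction is 1 - e^(-0.75), or about 53%, in line with the "about 50%" quoted above.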
Assembly of the draft genome sequence
We then set out to assemble the sequences from the individual large-insert clones into an integrated draft sequence of the human genome. The assembly process had to resolve problems arising from the draft nature of much of the sequence, from the variety of clone sources, and from the high fraction of repeated sequences in the human genome. This process involved three steps: filtering, layout and merging.
The entire data set was filtered uniformly to eliminate contamination from nonhuman sequences and other artefacts that had not already been removed by the individual centres. (Information about contamination was also sent back to the centres, which are updating the individual entries in the public databases.) We also identified instances in which the sequence data from one BAC clone was substantially contaminated with sequence data from another (human or nonhuman) clone. The problems were resolved in most instances; 231 clones remained unresolved, and these were eliminated from the assembly reported here. Instances of lower levels of cross-contamination (for example, a single 96-well microplate misassigned to the wrong BAC) are more difficult to detect; some undoubtedly remain and may give rise to small spurious sequence contigs in the draft genome sequence. Such issues are readily resolved as the clones progress towards finished sequence, but they necessitate some caution in certain applications of the current data.
The sequenced clones were then associated with specific clones on the physical map to produce a 'layout'. In principle, sequenced clones that correspond to fingerprinted BACs could be directly assigned by name to fingerprint clone contigs on the fingerprint-based physical map. In practice, however, laboratory mixups occasionally resulted in incorrect assignments. To eliminate such problems, sequenced clones were associated with the fingerprint clone contigs in the physical map by using the sequence data to calculate a
Table 3 Total human sequence deposited in the HTGS division of GenBank
Sequencing centre                                                      Total human  Finished human
                                                                     sequence (kb)   sequence (kb)
Whitehead Institute, Center for Genome Research*                         1,196,888          46,560
The Sanger Centre*                                                         970,789         284,353
Washington University Genome Sequencing Center*                            765,898         175,279
US DOE Joint Genome Institute                                              377,998          78,486
Baylor College of Medicine Human Genome Sequencing Center                  345,125          53,418
RIKEN Genomic Sciences Center                                              203,166          16,971
Genoscope                                                                   85,995          48,808
GTC Sequencing Center                                                       71,357           7,014
Department of Genome Analysis, Institute of Molecular Biotechnology         49,865          17,788
Beijing Genomics Institute/Human Genome Center                              42,865           6,297
Multimegabase Sequencing Center; Institute for Systems Biology              31,241           9,676
Stanford Genome Technology Center                                           29,728           3,530
The Stanford Human Genome Center and Department of Genetics                 28,162           9,121
University of Washington Genome Center                                      24,115          14,692
Keio University                                                             17,364          13,058
University of Texas Southwestern Medical Center at Dallas                   11,670           7,028
University of Oklahoma Advanced Center for Genome Technology                10,071           9,155
Max Planck Institute for Molecular Genetics                                  7,650           2,940
GBF – German Research Centre for Biotechnology                               4,639           2,338
Cold Spring Harbor Laboratory Lita Annenberg Hazen Genome Center             4,338           2,104
Other                                                                       59,574          35,911
Total                                                                    4,338,224         842,027
Total human sequence deposited in GenBank by members of the International Human Genome Sequencing Consortium, as of 8 October 2000. The amount of total sequence (finished plus draft plus predraft) is shown in the second column and the amount of finished sequence is shown in the third column. Total sequence differs from totals in Tables 1 and 2 because of inclusion of padding characters and of some clones not used in assembly. HTGS, high throughput genome sequence.
* These three centres produced an additional 2.4 Gb of raw plasmid paired-end reads (see Table 4), consisting of 0.99 Gb from Whitehead Institute, 0.66 Gb from The Sanger Centre and 0.75 Gb from Washington University.
Table 4 Plasmid paired-end reads
                  Total reads deposited*   Read pairs†   Size range of inserts (kb)
Random-sheared    3,227,685                1,155,284     1.8–6
Enzyme digest     2,539,222                761,010       0.8–4.7
Total             5,766,907                1,916,294
The plasmid paired-end reads used a mixture of DNA from a set of 24 samples from the DNA Polymorphism Discovery Resource (http://locus.umdnj.edu/nigms/pdr.html). This set of 24 anonymous US residents contains samples from European-Americans, African-Americans, Mexican-Americans, Native Americans and Asian-Americans, although the ethnicities of the individual samples are not identified. Informed consent to contribute samples to the DNA Polymorphism Discovery Resource was obtained from all 450 individuals who contributed samples. Samples from the European-American, African-American and Mexican-American individuals came from NHANES (http://www.cdc.gov/nchs/nhanes.htm); individuals were recontacted to obtain their consent for the Resource project. New samples were obtained from Asian-Americans whose ancestry was from a variety of East and South Asian countries. New samples were also obtained for the Native Americans; tribal permission was obtained first, and then individual consents. See http://www.nhgri.nih.gov/Grant_info/Funding/RFA/discover_polymorphisms.html and ref. 98.
* Reflects data deposited with and released by The SNP Consortium (see http://snp.cshl.org/data).
† Read pairs represents the number of cases in which sequence from both ends of a genomic cloned fragment was determined and used in this study as linking information.
© 2001 Macmillan Magazines Ltd
produced by each centre was required to have fourfold average
sequence coverage, with no clone below threefold. (For this pur-
pose, sequence coverage was de®ned as the average number of times
that each base was independently read with a base-quality score
corresponding to at least 99% accuracy.) We attained an overall
average of 4.5-fold coverage across the genome for draft clones. A
few of the sequenced clones fell below the minimum of threefold
sequence coverage or have not been formally designated by centres
as meeting draft standards; these are referred to as predraft (Table 2).
Some of these are clones that span remaining gaps in the draft
genome sequence and were in the process of being sequenced on 7
October 2000; a few are old submissions from centres that are no
longer active.
The lengths of the initial sequence contigs in the draft clones vary
as a function of coverage, but half of all nucleotides reside in initial
sequence contigs of at least 21.7 kb (see below). Various properties
of the draft clones can be assessed from instances in which there was
substantial overlap between a draft clone and a ®nished (or nearly
®nished) clone. By examining the sequence alignments in the
overlap regions, we estimated that the initial sequence contigs in a
draft sequence clone cover an average of about 96% of the clone and
are separated by gaps with an average size of about 500 bp.
Although the main emphasis was on producing a draft genome
sequence, the centres also maintained sequence ®nishing activities
during this period, leading to a twofold increase in ®nished
sequence from June 1999 to June 2000 (Fig. 4). The total amount
of human sequence in this ®nal form stood at more than 835 Mb on
7 October 2000, or more than 25% of the human genome. This
includes the ®nished sequences of chromosomes 21 and 22 (refs 93,
94). As centres have begun to shift from draft to ®nished sequencing
in the last quarter of 2000, the production of ®nished sequence has
increased to an annualized rate of 1 Gb per year and is continuing to
rise.
In addition to sequencing large-insert clones, three centres
generated a large collection of random raw sequence reads from
whole-genome shotgun libraries (Table 4; ref. 98). These 5.77
million successful sequences contained 2.4 Gb of high-quality
bases; this corresponds to about 0.75-fold coverage and would be
statistically expected to include about 50% of the nucleotides in the
human genome (data available at http://snp.cshl.org/data). The
primary objective of this work was to discover SNPs, by comparing
these random raw sequences (which came from different individ-
uals) with the draft genome sequence. However, many of these raw
sequences were obtained from both ends of plasmid clones and
thereby also provided valuable `linking' information that was used
in sequence assembly. In addition, the random raw sequences
provide sequence coverage of about half of the nucleotides not yet
represented in the sequenced large-insert clones; these can be used
as probes for portions of the genome not yet recovered.
Assembly of the draft genome sequence
We then set out to assemble the sequences from the individual large-
insert clones into an integrated draft sequence of the human
genome. The assembly process had to resolve problems arising
from the draft nature of much of the sequence, from the variety of
clone sources, and from the high fraction of repeated sequences in
the human genome. This process involved three steps: ®ltering,
layout and merging.
The entire data set was ®ltered uniformly to eliminate contam-
ination from nonhuman sequences and other artefacts that had not
already been removed by the individual centres. (Information about
contamination was also sent back to the centres, which are updating
the individual entries in the public databases.) We also identi®ed
instances in which the sequence data from one BAC clone was
substantially contaminated with sequence data from another
(human or nonhuman) clone. The problems were resolved in
most instances; 231 clones remained unresolved, and these were
eliminated from the assembly reported here. Instances of lower
levels of cross-contamination (for example, a single 96-well micro-
plate misassigned to the wrong BAC) are more dif®cult to detect;
some undoubtedly remain and may give rise to small spurious
sequence contigs in the draft genome sequence. Such issues are
readily resolved as the clones progress towards ®nished sequence,
but they necessitate some caution in certain applications of the
current data.
The sequenced clones were then associated with speci®c clones on
the physical map to produce a `layout'. In principle, sequenced
clones that correspond to ®ngerprinted BACs could be directly
assigned by name to ®ngerprint clone contigs on the ®ngerprint-
based physical map. In practice, however, laboratory mixups occa-
sionally resulted in incorrect assignments. To eliminate such pro-
blems, sequenced clones were associated with the ®ngerprint clone
contigs in the physical map by using the sequence data to calculate a
articles
868 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
Table 3 Total human sequence deposited in the HTGS division of GenBank
Sequencing centre Total human
sequence (kb)
Finished human
sequence (kb)
Whitehead Institute, Center for Genome Research* 1,196,888 46,560
The Sanger Centre* 970,789 284,353
Washington University Genome Sequencing Center* 765,898 175,279
US DOE Joint Genome Institute 377,998 78,486
Baylor College of Medicine Human Genome Sequencing
Center
345,125 53,418
RIKEN Genomic Sciences Center 203,166 16,971
Genoscope 85,995 48,808
GTC Sequencing Center 71,357 7,014
Department of Genome Analysis, Institute of Molecular
Biotechnology
49,865 17,788
Beijing Genomics Institute/Human Genome Center 42,865 6,297
Multimegabase Sequencing Center; Institute for Systems
Biology
31,241 9,676
Stanford Genome Technology Center 29,728 3,530
The Stanford Human Genome Center and Department of
Genetics
28,162 9,121
University of Washington Genome Center 24,115 14,692
Keio University 17,364 13,058
University of Texas Southwestern Medical Center at Dallas 11,670 7,028
University of Oklahoma Advanced Center for Genome
Technology
10,071 9,155
Max Planck Institute for Molecular Genetics 7,650 2,940
GBF ± German Research Centre for Biotechnology 4,639 2,338
Cold Spring Harbor Laboratory Lita Annenberg Hazen
Genome Center
4,338 2,104
Other 59,574 35,911
Total 4,338,224 842,027
.............................................................................................................................................................................
Total human sequence deposited in GenBank by members of the International Human Genome
Sequencing Consortium, as of 8 October 2000.The amount of total sequence (®nished plus draft
plus predraft) is shown in the second column and the amount of ®nished sequence is shown in
the third column. Total sequence differs from totals in Tables 1 and 2 because of inclusion of
padding characters and of some clones not used in assembly. HTGS, high throughput genome
sequence.
*These three centres produced an additional 2.4 Gb of raw plasmid paired-end reads (see Table 4),
consisting of 0.99 Gb from Whitehead Institute, 0.66 Gb from The Sanger Centre and 0.75 Gb from
Washington University.
Table 4 Plasmid paired-end reads
Total reads deposited* Read pairs² Size range of inserts
(kb)
Random-sheared 3,227,685 1,155,284 1.8±6
Enzyme digest 2,539,222 761,010 0.8±4.7
Total 5,766,907 1,916,294
.............................................................................................................................................................................
The plasmid paired-end reads used a mixture of DNA from a set of 24 samples from the DNA
Polymorphism Discovery Resource (http://locus.umdnj.edu/nigms/pdr.html). This set of 24 anon-
ymous US residents contains samples from European-Americans, African-Americans, Mexican-
Americans, Native Americans and Asian-Americans, although the ethnicities of the individual
samples are not identi®ed. Informed consent to contribute samples to the DNA Polymorphism
Discovery Resource was obtained from all 450 individuals who contributed samples. Samples from
the European-American, African-American and Mexican-American individuals came from NHANES
(http://www.cdc.gov/nchs/nhanes.htm); individuals were recontacted to obtain their consent for
the Resource project. New samples were obtained from Asian-Americans whose ancestry was
from a variety of East and South Asian countries. New samples were also obtained for the Native
Americans; tribal permission was obtained ®rst, and then individual consents. See http://
www.nhgri.nih.gov/Grant_info/Funding/RFA/discover_polymorphisms.html and ref. 98.
*Re¯ects data deposited with and released by The SNP Consortium (see http://snp.cshl.org/data).
² Read pairs represents the number of cases in which sequence from both ends of a genomic
cloned fragment was determined and used in this study as linking information.
© 2001 Macmillan Magazines Ltd
The remaining 39 contigs containing 0.3% of the sequence were not
positioned at all.
We then merged the sequences from overlapping sequenced
clones (Fig. 6), using the computer program GigAssembler (ref. 104). The
program considers nearby sequenced clones, detects overlaps
between the initial sequence contigs in these clones, merges the
overlapping sequences and attempts to order and orient the
sequence contigs. It begins by aligning the initial sequence contigs
from one clone with those from other clones in the same fingerprint
clone contig on the basis of length of alignment, per cent identity of
the alignment, position in the sequenced clone layout and other
factors. Alignments are limited to one end of each initial sequence
contig for partially overlapping contigs or to both ends of an initial
sequence contig contained entirely within another; this eliminates
internal alignments that may reflect repeated sequence or possible
misassembly (Fig. 6b). Beginning with the highest scoring pairs,
initial sequence contigs are then integrated to produce 'merged
sequence contigs' (usually referred to simply as 'sequence contigs').
The program refines the arrangement of the clones within the
fingerprint clone contig on the basis of the extent of sequence
overlap between them and then rebuilds the sequence contigs. Next,
the program selects a sequence path through the sequence contigs
(Fig. 6c). It tries to use the highest quality data by preferring longer
initial sequence contigs and avoiding the first and last 250 bases of
initial sequence contigs where possible. Finally, it attempts to order
and orient the sequence contigs by using additional information,
including sequence data from paired-end plasmid and BAC reads,
known messenger RNAs and ESTs, as well as additional linking
information provided by centres. The sequence contigs are thereby
linked together to create 'sequence-contig scaffolds' (Fig. 6d). The
process also joins overlapping sequenced clones into sequenced-clone
contigs and links sequenced-clone contigs to form
sequenced-clone-contig scaffolds. A fingerprint clone contig may contain
several sequenced-clone contigs, because bridging clones remain
to be sequenced. The assembly contained 4,884 sequenced-clone
articles
870 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
[Figure 7 diagram labels: Fingerprint clone contig; Pick clones for sequencing; Sequenced-clone contig; Sequenced clone A; Sequenced clone B; Sequence to at least draft coverage; Initial sequence contig; Merge data; Merged sequence contig; Order and orient with mRNA, paired end reads, other information; Sequence-contig scaffold; Sequenced-clone-contig scaffold]
Figure 7 Levels of clone and sequence coverage. A 'fingerprint clone contig' is
assembled by using the computer program FPC (refs 84, 451) to analyse the restriction enzyme
digestion patterns of many large-insert clones. Clones are then selected for sequencing to
minimize overlap between adjacent clones. For a clone to be selected, all of its restriction
enzyme fragments (except the two vector-insert junction fragments) must be shared with
at least one of its neighbours on each side in the contig. Once these overlapping clones
have been sequenced, the set is a 'sequenced-clone contig'. When all selected clones
from a fingerprint clone contig have been sequenced, the sequenced-clone contig will be
the same as the fingerprint clone contig. Until then, a fingerprint clone contig may contain
several sequenced-clone contigs. After individual clones (for example, A and B) have been
sequenced to draft coverage and the clones have been mapped, the data are analysed by
GigAssembler (Fig. 6), producing merged sequence contigs from initial sequence contigs,
and linking these to form sequence-contig scaffolds (see Box 1).
Table 5 The draft genome sequence
             Sequence from clones (kb)               Sequence from contigs (kb)
Chromosome   Finished     Draft        Pre-draft     Contigs containing   Deep coverage      Draft/predraft
             clones       clones       clones        finished clones      sequence contigs   sequence contigs
All          826,441      1,734,995    131,476       958,922              840,815            893,175
1            50,851       149,027      12,356        61,001               78,773             72,461
2            46,909       167,439      7,210         53,775               81,569             86,214
3            22,350       152,840      11,057        26,959               79,649             79,638
4            15,914       134,973      17,261        19,096               66,165             82,887
5            37,973       129,581      2,160         48,895               61,387             59,431
6            75,312       76,082       6,696         93,458               28,204             36,428
7            94,845       47,328       4,047         103,188              14,434             28,597
8            14,538       102,484      7,236         16,659               47,198             60,400
9            18,401       77,648       10,864        24,030               42,653             40,230
10           16,889       99,181       11,066        21,421               54,054             51,662
11           13,162       111,092      4,352         16,145               65,147             47,314
12           32,156       84,653       7,651         37,519               43,995             42,946
13           16,818       68,983       7,136         22,191               38,319             32,429
14           58,989       27,370       565           78,302               3,267              5,355
15           2,739        67,453       3,211         3,112                34,758             35,533
16           22,987       48,997       1,143         27,751               20,892             24,484
17           29,881       36,349       6,600         33,531               14,671             24,628
18           5,128        65,284       2,352         6,656                40,947             25,160
19           28,481       26,568       369           32,228               7,188              16,003
20           54,217       5,302        976           56,534               1,065              2,896
21           33,824       0            0             33,824               0                  0
22           33,786       0            0             33,786               0                  0
X            77,630       45,100       4,941         83,796               14,056             29,820
Y            18,169       3,221        363           20,222               333                1,198
NA           2,434        1,858        844           2,446                122                2,568
UL           2,056        6,182        1,020         2,395                1,969              4,894
The table presents summary statistics for the draft genome sequence over the entire genome and by individual chromosome. NA, clones that could not be placed into the sequenced clone layout. UL,
clones that could be placed in the layout, but that could not reliably be placed on a chromosome. First three columns, data from finished clones, draft clones and predraft clones. The last three columns
break the data down according to the type of sequence contig. Contigs containing finished clones represent sequence contigs that consist of finished sequence plus any (small) extensions from merged
sequence contigs that arise from overlap with flanking draft clones. Deep coverage sequence contigs include sequence from two or more overlapping unfinished clones; they consist of roughly full shotgun
coverage and thus are longer than the average unfinished sequence contig. Draft/predraft sequence contigs are all of the other sequence contigs in unfinished clones. Thus, the draft genome sequence
consists of approximately one-third finished sequence, one-third deep coverage sequence and one-third draft/pre-draft coverage sequence. In all of the statistics, we count only nonoverlapping bases in
the draft genome sequence.
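The 'approximately one-third' statement in the Table 5 legend can be checked directly from the genome-wide 'All' row. The short Python sketch below is only an illustration of that arithmetic; the variable names are ours and the figures are copied from the table.

```python
# Genome-wide totals (kb) from the 'All' row of Table 5.
clone_kb = {"finished": 826_441, "draft": 1_734_995, "predraft": 131_476}
contig_kb = {
    "contigs containing finished clones": 958_922,
    "deep coverage sequence contigs": 840_815,
    "draft/predraft sequence contigs": 893_175,
}

# Both breakdowns describe the same nonoverlapping bases, so they share a total.
total_kb = sum(contig_kb.values())
assert total_kb == sum(clone_kb.values())

# Share of the draft genome sequence in each contig category.
fractions = {name: kb / total_kb for name, kb in contig_kb.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

Each of the three contig categories accounts for between 31% and 36% of the total.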
contigs in 942 fingerprint clone contigs.
The hierarchy of contigs is summarized in Fig. 7. Initial sequence
contigs are integrated to create merged sequence contigs, which are
then linked to form sequence-contig scaffolds. These scaffolds reside
within sequenced-clone contigs, which in turn reside within
fingerprint clone contigs.
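The merging of initial sequence contigs, 'beginning with the highest scoring pairs', can be illustrated with a toy Python sketch. This is not GigAssembler. It assumes error-free sequence, a single strand and exact end-to-end matches, scores a pair by overlap length alone, and ignores the clone layout and contained contigs. The names `end_overlap`, `greedy_merge` and `min_overlap` are hypothetical.

```python
def end_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of at least `min_overlap` bases exists.
    Only contig ends are compared, so internal matches are never used.
    """
    for k in range(min(len(a), len(b)), max(min_overlap, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_merge(contigs: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest end overlap."""
    contigs = list(contigs)
    while True:
        best = (0, -1, -1)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = end_overlap(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # no remaining pair overlaps: stop
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]


initial = ["ACGTTGCA", "TGCAGGTC", "GGTCAATT"]
merged_contigs = greedy_merge(initial)
print(merged_contigs)
```

In this example the three short 'initial sequence contigs' overlap end to end and collapse into a single merged contig. A real assembler must also tolerate sequencing errors, handle both strands and use identity and layout position in the score, as the text describes.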
The draft genome sequence
The result of the assembly process is an integrated draft sequence of
the human genome. Several features of the draft genome sequence
are reported in Tables 5–7, including the proportion represented by
finished, draft and predraft categories. The Tables also show the
numbers and lengths of different types of contig, for each
chromosome and for the genome as a whole.
The contiguity of the draft genome sequence at each level is an
important feature. Two commonly used statistics have significant
drawbacks for describing contiguity. The 'average length' of a contig
is deflated by the presence of many small contigs comprising only a
small proportion of the genome, whereas the 'length-weighted
average length' is inflated by the presence of large segments of
finished sequence. Instead, we chose to describe the contiguity as a
property of the 'typical' nucleotide. We used a statistic called the
'N50 length', defined as the largest length L such that 50% of all
nucleotides are contained in contigs of size at least L.
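The N50 definition above translates directly into a few lines of code. The Python sketch below is a minimal illustration, not the consortium's software; the function name `n50_length` and the example lengths are made up for the purpose.

```python
def n50_length(contig_lengths):
    """Largest length L such that contigs of size >= L hold at least 50% of all bases."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths is empty")


# Hypothetical contig lengths (arbitrary units): many small contigs, two large ones.
lengths = [2, 2, 2, 3, 3, 4, 8, 8]

mean_length = sum(lengths) / len(lengths)
weighted_mean = sum(x * x for x in lengths) / sum(lengths)
n50 = n50_length(lengths)
print(mean_length, weighted_mean, n50)
```

In this made-up example the plain average is pulled down by the small contigs, whereas the N50 length reports the contig size in which the 'typical' nucleotide sits.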
The continuity of the draft genome sequence reported here and
the effectiveness of assembly can be readily seen from the following:
half of all nucleotides reside within an initial sequence contig of at
least 21.7 kb, a sequence contig of at least 82 kb, a sequence-contig
scaffold of at least 274 kb, a sequenced-clone contig of at least 826 kb
and a fingerprint clone contig of at least 8.4 Mb (Tables 6, 7). The
cumulative distributions for each of these measures of contiguity
are shown in Fig. 8, in which the N50 values for each measure can be
seen as the value at which the cumulative distributions cross 50%.
We have also estimated the size of each chromosome, by estimating
the gap sizes (see below) and the extent of missing heterochromatic
sequence (refs 93, 94, 105–108; Table 8). This is undoubtedly an
oversimplification and does not adequately take into account the sequence status
of each chromosome. Nonetheless, it provides a useful way to relate
the draft sequence to the chromosomes.
Quality assessment
The draft genome sequence already covers the vast majority of the
genome, but it remains an incomplete, intermediate product that is
regularly updated as we work towards a complete finished sequence.
The current version contains many gaps and errors. We therefore
sought to evaluate the quality of various aspects of the current draft
genome sequence, including the sequenced clones themselves, their
assignment to a position in the fingerprint clone contigs, and the
assembly of initial sequence contigs from the individual clones into
sequence-contig scaffolds.
Nucleotide accuracy is reflected in a PHRAP score assigned to
each base in the draft genome sequence and available to users
through the Genome Browsers (see below) and public database
entries. A summary of these scores for the unfinished portion of the
genome is shown in Table 9. About 91% of the unfinished draft
genome sequence has an error rate of less than 1 per 10,000 bases
(PHRAP score > 40), and about 96% has an error rate of less than 1
in 1,000 bases (PHRAP > 30). These values are based only on the
quality scores for the bases in the sequenced clones; they do not
reflect additional confidence in the sequences that are represented in
overlapping clones. The finished portion of the draft genome
sequence has an error rate of less than 1 per 10,000 bases.
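The correspondence between PHRAP scores and error rates quoted above follows the usual logarithmic quality scale, in which a score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below assumes that convention, which is consistent with the two thresholds given in the text; the function names are ours.

```python
import math


def error_probability(quality: float) -> float:
    """Per-base error probability implied by a quality score (assumed scale: Q = -10 log10 p)."""
    return 10 ** (-quality / 10)


def quality_score(error_prob: float) -> float:
    """Quality score implied by a per-base error probability (inverse of the above)."""
    return -10 * math.log10(error_prob)


for q in (30, 40):
    p = error_probability(q)
    print(f"score {q}: about 1 error per {round(1 / p):,} bases")
```

Under this assumption a score of 40 corresponds to 1 error per 10,000 bases and a score of 30 to 1 error per 1,000 bases, matching the thresholds in the text.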
Individual sequenced clones. We assessed the frequency of
misassemblies, which can occur when the assembly program PHRAP
joins two nonadjacent regions in the clone into a single initial
sequence contig. The frequency of misassemblies depends heavily
on the depth and quality of coverage of each clone and the nature of
the underlying sequence; thus it may vary among genomic regions
and among individual centres. Most clone misassemblies are readily
corrected as coverage is added during finishing, but they may have
been propagated into the current version of the draft genome
sequence and they justify caution for certain applications.
We estimated the frequency of misassembly by examining
instances in which there was substantial overlap between a draft
clone and a finished clone. We studied 83 Mb of such overlaps,
involving about 9,000 initial sequence contigs. We found 5.3
instances per Mb in which the alignment of an initial sequence
contig to the finished sequence failed to extend to within 200 bases
Table 6 Clone level contiguity of the draft genome sequence
             Sequenced-clone contigs     Sequenced-clone-contig scaffolds   Fingerprint clone contigs with sequence
Chromosome   Number   N50 length (kb)    Number   N50 length (kb)           Number   N50 length (kb)
All          4,884    826                2,191    2,279                     942      8,398
1            453      650                197      1,915                     106      3,537
2            348      1,028              127      3,140                     52       10,628
3            409      672                201      1,550                     73       5,077
4            384      606                163      1,659                     41       6,918
5            385      623                164      1,642                     48       5,747
6            292      814                98       3,292                     17       24,680
7            224      1,074              86       3,527                     29       20,401
8            292      542                115      1,742                     43       6,236
9            143      1,242              78       2,411                     21       29,108
10           179      1,097              105      1,952                     16       30,284
11           224      887                89       3,024                     31       9,414
12           196      1,138              76       2,717                     28       9,546
13           128      1,151              56       3,257                     13       25,256
14           54       3,079              27       8,489                     14       22,128
15           123      797                56       2,095                     19       8,274
16           159      620                92       1,317                     57       2,716
17           138      831                58       2,138                     43       2,816
18           137      709                47       2,572                     24       4,887
19           159      569                79       1,200                     51       1,534
20           42       2,318              20       6,862                     9        23,489
21           5        28,515             5        28,515                    5        28,515
22           11       23,048             11       23,048                    11       23,048
X            325      572                181      1,082                     143      1,436
Y            27       1,539              20       3,290                     8        5,135
UL           47       227                40       281                       40       281
Number and size of sequenced-clone contigs, sequenced-clone-contig scaffolds and those fingerprint clone contigs (see Box 1) that contain sequenced clones; some small fingerprint clone contigs do not
as yet have associated sequence. UL, fingerprint clone contigs that could not reliably be placed on a chromosome. These length estimates are from the draft genome sequence, in which gaps between
sequence contigs are arbitrarily represented with 100 Ns and gaps between sequence clone contigs with 50,000 Ns for 'bridged gaps' and 100,000 Ns for 'unbridged gaps'. These arbitrary values differ
minimally from empirical estimates of gap size (see text), and using the empirically derived estimates would change the N50 lengths presented here only slightly. For unfinished chromosomes, the N50 length
ranges from 1.5 to 3 times the arithmetic mean for sequenced-clone contigs, 1.5 to 3 times for sequenced-clone-contig scaffolds, and 1.5 to 6 times for fingerprint clone contigs with sequence.
of the end of the contig, suggesting a possible false join in the
assembly of the initial sequence contig. In about half of these cases,
the potential misassembly involved fewer than 400 bases, suggesting
that a single raw sequence read may have been incorrectly joined. We
found 1.9 instances per Mb in which the alignment showed an
internal gap, again suggesting a possible misassembly; and 0.5
instances per Mb in which the alignment indicated that two initial
sequence contigs that overlapped by at least 150 bp had not been
merged by PHRAP. Finally, there were another 0.9 instances per Mb
with various other problems. This gives a total of 8.6 instances per
Mb of possible misassembly, with about half being relatively small
issues involving a few hundred bases.
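The total of 8.6 instances per Mb is the sum of the four categories listed above. The Python sketch below reproduces that sum and derives two figures that are our own arithmetic rather than values reported in the text: the implied number of flagged instances across the 83 Mb surveyed, and the average spacing between them.

```python
# Possible misassemblies per Mb, by category, as quoted in the text.
rates_per_mb = {
    "alignment ends more than 200 bases short of the contig end": 5.3,
    "internal gap in the alignment": 1.9,
    "overlapping contigs (>= 150 bp) left unmerged": 0.5,
    "various other problems": 0.9,
}

total_rate = sum(rates_per_mb.values())       # instances per Mb

overlap_mb = 83                               # Mb of draft/finished overlap studied
implied_instances = total_rate * overlap_mb   # derived here, not reported in the text
mean_spacing_kb = 1_000 / total_rate          # average kb between flagged instances

print(f"{total_rate:.1f} per Mb, ~{implied_instances:.0f} instances, "
      f"one per ~{mean_spacing_kb:.0f} kb")
```

On these figures a flagged instance occurs on average roughly once every 116 kb of unfinished sequence, which is why the caveats about possible overstatement and understatement matter for interpreting the estimate.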
Some of the potential problems might not result from
misassembly, but might reflect sequence polymorphism in the population,
small rearrangements during growth of the large-insert clones,
regions of low-quality sequence or matches between segmental
duplications. Thus, the frequency of misassemblies may be
overstated. On the other hand, the criteria for recognizing overlap
between draft and finished clones may have eliminated some
misassemblies.
Layout of the sequenced clones. We assessed the accuracy of the
layout of sequenced clones onto the fingerprinted clone contigs by
calculating the concordance between the positions assigned to a
sequenced clone on the basis of in silico digestion and the position
assigned on the basis of BAC end sequence data. The positions
agreed in 98% of cases in which independent assignments could be
made by both methods. The results were also compared with well
studied regions containing both finished and draft genome
sequence. These results indicated that sequenced clone order in
the fingerprint map was reliable to within about half of one clone
length (~100 kb).
A direct test of the layout is also provided by the draft genome
sequence assembly itself. With extensive coverage of the genome, a
correctly placed clone should usually (although not always) show
sequence overlap with its neighbours in the map. We found only 421
instances of 'singleton' clones that failed to overlap a neighbouring
clone. Close examination of the data suggests that most of these are
correctly placed, but simply do not yet overlap an adjacent
sequenced clone. About 150 clones appeared to be candidates for
being incorrectly placed.
Alignment of the fingerprint clone contigs. The alignment of the
fingerprint clone contigs with the chromosomes was based on the
radiation hybrid, YAC and genetic maps of STSs. The positions of
most of the STSs in the draft genome sequence were consistent with
these previous maps, but the positions of about 1.7% differed from
one or more of them. Some of these disagreements may be due to
errors in the layout of the sequenced clones or in the underlying
[Figure 8 plots. Panel a, clone level continuity: cumulative percentage (0 to 100) against size (kb, 0 to 5,000); curves for sequenced clones, sequenced-clone contigs, sequenced-clone-contig scaffolds and fingerprint clone contigs. Panel b, sequence level continuity: cumulative percentage (0 to 100) against size (kb, 0 to 1,000); curves for initial sequence contigs, sequence contigs and sequence-contig scaffolds.]
Figure 8 Cumulative distributions of several measures of clone level contiguity and
sequence contiguity. The figures represent the proportion of the draft genome sequence
contained in contigs of at most the indicated size. a, Clone level contiguity. The clones
have a tight size distribution with an N50 of ~160 kb (corresponding to 50% on the
cumulative distribution). Sequenced-clone contigs represent the next level of continuity,
and are linked by mRNA sequences or pairs of BAC end sequences to yield the
sequenced-clone-contig scaffolds. The underlying contiguity of the layout of sequenced
clones against the fingerprinted clone contigs is only partially shown at this scale.
b, Sequence contiguity. The input fragments have low continuity (N50 = 21.7 kb). After
merging, the sequence contigs grow to an N50 length of about 82 kb. After linking,
sequence-contig scaffolds with an N50 length of about 274 kb are created.
Figure 9 Overview of features of draft human genome. The Figure shows the
occurrences of twelve important types of feature across the human genome. Large
grey blocks represent centromeres and centromeric heterochromatin (size not precisely to
scale). Each of the feature types is depicted in a track, from top to bottom as follows. (1)
Chromosome position in Mb. (2) The approximate positions of Giemsa-stained
chromosome bands at the 800 band resolution. (3) Level of coverage in the draft genome
sequence. Red, areas covered by finished clones; yellow, areas covered by predraft
sequence. Regions covered by draft sequenced clones are in orange, with darker shades
reflecting increasing shotgun sequence coverage. (4) GC content. Percentage of bases in
a 20,000 base window that are C or G. (5) Repeat density. Red line, density of SINE class
repeats in a 100,000-base window; blue line, density of LINE class repeats in a
100,000-base window. (6) Density of SNPs in a 50,000-base window. The SNPs were detected by
sequencing and alignments of random genomic reads. Some of the heterogeneity in SNP
density reflects the methods used for SNP discovery. Rigorous analysis of SNP density
requires comparing the number of SNPs identified to the precise number of bases
surveyed. (7) Non-coding RNA genes. Brown, functional RNA genes such as tRNAs,
snoRNAs and rRNAs; light orange, RNA pseudogenes. (8) CpG islands. Green ticks
represent regions of ~200 bases with CpG levels significantly higher than in the genome
as a whole, and GC ratios of at least 50%. (9) Exofish ecores. Regions of homology with
the pufferfish T. nigroviridis (ref. 292) are blue. (10) ESTs with at least one intron when aligned
against genomic DNA are shown as black tick marks. (11) The starts of genes predicted by
Genie or Ensembl are shown as red ticks. The starts of known genes from the RefSeq
database (ref. 110) are shown in blue. (12) The names of genes that have been uniquely located
in the draft genome sequence, characterized and named by the HGM Nomenclature
Committee. Known disease genes from the OMIM database are red, other genes blue.
This Figure is based on an earlier version of the draft genome sequence than analysed in
the text, owing to production constraints. We are aware of various errors in the Figure,
including omissions of some known genes and misplacements of others. Some genes are
mapped to more than one location, owing to errors in assembly, close paralogues or
pseudogenes. Manual review was performed to select the most likely location in these
cases and to correct other regions. For updated information, see http://genome.ucsc.edu/
and http://www.ensembl.org/.
© 2001 Macmillan Magazines Ltd
Table 7 Sequence level contiguity of the draft genome sequence
            Initial sequence contigs    Sequence contigs            Sequence-contig scaffolds
Chromosome  Number    N50 length (kb)   Number    N50 length (kb)   Number    N50 length (kb)
All         396,913   21.7              149,821   81.9              87,757    274.3
1           37,656    16.5              12,256    59.1              5,457     278.4
2           32,280    19.9              13,228    57.3              6,959     248.5
3           38,848    15.6              15,098    37.7              8,964     167.4
4           28,600    16.0              13,152    33.0              7,402     158.9
5           30,096    20.4              10,689    72.9              6,378     241.2
6           17,472    43.6              5,547     180.3             2,554     485.0
7           12,733    86.4              4,562     335.7             2,726     591.3
8           19,042    18.1              8,984     38.2              4,631     198.9
9           15,955    20.1              6,226     55.6              3,766     216.2
10          21,762    18.7              9,126     47.9              6,886     133.0
11          29,723    14.3              8,503     40.0              4,684     193.2
12          22,050    19.1              8,422     63.4              5,526     217.0
13          13,737    21.7              5,193     70.5              2,659     300.1
14          4,470     161.4             829       1,371.0           541       2,009.5
15          13,134    15.3              5,840     30.3              3,229     149.7
16          10,297    34.4              4,916     119.5             3,337     356.3
17          10,369    22.9              4,339     90.6              2,616     248.9
18          16,266    15.3              4,461     51.4              2,540     216.1
19          6,009     38.4              2,503     134.4             1,551     375.5
20          2,884     108.6             511       1,346.7           312       813.8
21          103       340.0             5         28,515.3          5         28,515.3
22          526       113.9             11        23,048.1          11        23,048.1
X           11,062    58.8              4,607     218.6             2,610     450.7
Y           557       154.3             140       1,388.6           106       1,439.7
UL          1,282     21.4              613       46.0              297       166.4
This Table is similar to Table 6 but shows the number and N50 length for various types of sequence contig (see Box 1). See legend to Table 6 concerning treatment of gaps. For sequence contigs in the draft
genome sequence, the N50 length ranges from 1.7 to 5.5 times the arithmetic mean for initial sequence contigs, 2.5 to 8.2 times for merged sequence contigs, and 6.1 to 10 times for sequence-contig
scaffolds.
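The relationship between N50 length and arithmetic mean noted in this legend can be illustrated with a short sketch. The Python below assumes the common definition of N50 (the largest length L such that contigs of length L or more hold at least half of all bases); the paper's own definition is given in Box 1, which is not reproduced here. The contig lengths are invented for illustration and are not taken from the draft assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: the largest length L such that contigs of
    length >= L together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kb (illustrative only).
toy_contigs_kb = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs_kb)
toy_mean = sum(toy_contigs_kb) / len(toy_contigs_kb)
print(toy_n50, toy_mean, toy_n50 / toy_mean)
```

In this toy set the two longest contigs hold half of the bases, so the N50 is twice the mean. The same effect, where a few long contigs dominate, is why the N50 lengths in the Table exceed the corresponding arithmetic means.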
Table 8 Chromosome size estimates
                          FCC gaps‡              SCC gaps‖              Sequence gaps#          Heterochromatin  Total estimated    Previously
Chromosome*  Sequenced    Number  Total bases    Number  Total bases    Number   Total bases    and short arm    chromosome size    estimated
             bases† (Mb)          in gaps§ (Mb)          in gaps¶ (Mb)           in gaps☆ (Mb)  adjustments**    (including         chromosome
                                                                                                (Mb)             artefactual        size‡‡ (Mb)
                                                                                                                 duplication in
                                                                                                                 draft genome
                                                                                                                 sequence)†† (Mb)
All          2,692.9      897     152.0          4,076   142.7          145,514  80.6           212              3,289              3,286
1            212.2        104     17.7           347     12.1           11,803   6.5            30               279                263
2            221.6        50      8.5            296     10.4           12,880   7.1            3                251                255
3            186.2        71      12.1           336     11.8           14,689   8.1            3                221                214
4            168.1        39      6.6            343     12.0           12,768   7.1            3                197                203
5            169.7        46      7.8            337     11.8           10,304   5.7            3                198                194
6            158.1        15      2.6            275     9.6            5,225    2.9            3                176                183
7            146.2        27      4.6            195     6.8            4,338    2.4            3                163                171
8            124.3        41      7.0            249     8.7            8,692    4.8            3                148                155
9            106.9        19      3.2            122     4.3            6,083    3.4            22               140                145
10           127.1        14      2.4            163     5.7            8,947    5.0            3                143                144
11           128.6        29      4.9            193     6.8            8,279    4.6            3                148                144
12           124.5        26      4.4            168     5.9            8,226    4.6            3                142                143
13           92.9         12      2.0            115     4.0            5,065    2.8            16               118                114
14           86.9         13      2.2            40      1.4            775      0.4            16               107                109
15           73.4         18      3.1            104     3.6            5,717    3.2            17               100                106
16           73.1         55      9.4            102     3.6            4,757    2.6            15               104                98
17           72.8         41      7.0            95      3.3            4,261    2.4            3                88                 92
18           72.9         22      3.7            113     4.0            4,324    2.4            3                86                 85
19           55.4         49      8.3            108     3.8            2,344    1.3            3                72                 67
20           60.5         7       1.2            33      1.2            469      0.3            3                66                 72
21           33.8         4       0.1            0       0.0            0        0.0            11               45                 50
22           33.8         10      1.0            0       0.0            0        0.0            13               48                 56
X            127.7        141     24.0           182     6.4            4,282    2.4            3                163                164
Y            21.8         6       1.0            19      0.7            113      0.1            27               51                 59
NA           5.1          0       0              134     0.0            577      0.3            0                0                  0
UL           9.3          38      0              7       0.0            566      0.3            0                0                  0
* NA, sequenced clones that could not be associated with fingerprint clone contigs. UL, clone contigs that could not be reliably placed on a chromosome.
† Total number of bases in the draft genome sequence, excluding gaps. Total length of scaffold (including gaps contained within clones) is 2.916 Gb.
‡ Gaps between those fingerprint clone contigs that contain sequenced clones, excluding gaps for centromeres.
§ For unfinished chromosomes, we estimate an average size of 0.17 Mb per FCC gap, based on retrospective estimates of the clone coverage of chromosomes 21 and 22. Gap estimates for chromosomes 21 and 22 are taken from refs 93, 94.
‖ Gaps between sequenced-clone contigs within a fingerprint clone contig.
¶ For unfinished chromosomes, we estimate sequenced clone gaps at 0.035 Mb each, based on evaluation of a sample of these gaps.
# Gaps between two sequence contigs within a sequenced-clone contig.
☆ We estimate the average number of bases in sequence gaps from alignments of the initial sequence contigs of unfinished clones (see text) and extrapolation to the whole chromosome.
** Including adjustments for estimates of the sizes of the short arms of the acrocentric chromosomes 13, 14, 15, 21 and 22 (ref. 105), estimates for the centromere and heterochromatic regions of chromosomes 1, 9 and 16 (refs 106, 107) and estimates of 3 Mb for the centromere and 24 Mb for telomeric heterochromatin for the Y chromosome (ref. 108).
†† The sum of the five lengths in the preceding columns. This is an overestimate, because the draft genome sequence contains some artefactual sequence owing to the inability to merge all underlying sequence contigs correctly. The total amount of artefactual duplication varies among chromosomes; the overall amount is estimated by computational analysis to be about 100 Mb, or about 3% of the total length given, yielding a total estimated size of about 3,200 Mb for the human genome.
‡‡ Including heterochromatic regions and acrocentric short arm(s) (ref. 105).
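As a check on how the columns of Table 8 combine, the sketch below recomputes three of its rows in Python. It assumes only what the footnotes state: FCC gaps are taken as 0.17 Mb each and SCC gaps as 0.035 Mb each on unfinished chromosomes, and the total is the sum of the five length columns. The tuple layout and variable names are illustrative choices, and agreement is expected only to within rounding because the published figures are rounded.

```python
FCC_GAP_MB = 0.17   # average size per FCC gap, from the Table 8 footnote
SCC_GAP_MB = 0.035  # average size per SCC gap, from the Table 8 footnote

# chromosome: (sequenced Mb, FCC gaps, FCC Mb, SCC gaps, SCC Mb,
#              sequence-gap Mb, heterochromatin/short-arm Mb, tabulated total Mb)
rows = {
    "1": (212.2, 104, 17.7, 347, 12.1, 6.5, 30, 279),
    "2": (221.6, 50, 8.5, 296, 10.4, 7.1, 3, 251),
    "7": (146.2, 27, 4.6, 195, 6.8, 2.4, 3, 163),
}

for name, (seq, n_fcc, fcc_mb, n_scc, scc_mb, gap_mb, adj, total) in rows.items():
    est_fcc = n_fcc * FCC_GAP_MB          # gap count times assumed gap size
    est_scc = n_scc * SCC_GAP_MB
    summed = seq + fcc_mb + scc_mb + gap_mb + adj  # the five length columns
    print(f"chr{name}: FCC {est_fcc:.1f} Mb, SCC {est_scc:.1f} Mb, "
          f"sum {summed:.1f} Mb (table: {total} Mb)")

# Genome-wide: tabulated total minus the estimated artefactual duplication.
genome_total_mb = 3289
artefactual_mb = 100
print(genome_total_mb - artefactual_mb,
      round(100 * artefactual_mb / genome_total_mb, 1))
```

The last line gives 3,189 Mb and 3.0%, in line with the footnote's figures of about 3,200 Mb and about 3%.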
fingerprint map. However, many involve STSs that have been
localized on only one or two of the previous maps or that occur
as isolated discrepancies in conflict with several flanking STSs.
Many of these cases are probably due to errors in the previous
maps (with error rates for individual maps estimated at 1–2%; ref. 100).
Others may be due to incorrect assignment of the STSs to the draft
genome sequence (by the electronic polymerase chain reaction
(e-PCR) computer program) or to database entries that contain
sequence data from more than one clone (owing to
cross-contamination).
Graphical views of the independent data sets were particularly
useful in detecting problems with order or orientation (Fig. 5).
Areas of conflict were reviewed and corrected if supported by the
underlying data. In the version discussed here, there were 41
sequenced clones falling in 14 sequenced-clone contigs with STS
content information from multiple maps that disagreed with the
flanking clones or sequenced-clone contigs; the placement of these
clones thus remains suspect. Four of these instances suggest errors
in the fingerprint map, whereas the others suggest errors in the
layout of sequenced clones. These cases are being investigated and
will be corrected in future versions.
Assembly of the sequenced clones. We assessed the accuracy of the
assembly by using a set of 148 draft clones comprising 22.4 Mb for
which finished sequence subsequently became available (ref. 104). The
initial sequence contigs lack information about order and orientation,
and GigAssembler attempts to use linking data to infer such
information as far as possible (ref. 104). Starting with initial sequence
contigs that were unordered and unoriented, the program placed
90% of the initial sequence contigs in the correct orientation and
85% in the correct order with respect to one another. In a separate
test, GigAssembler was tested on simulated draft data produced
from finished sequence on chromosome 22 and similar results were
obtained.
Some problems remain at all levels. First, errors in the initial
sequence contigs persist in the merged sequence contigs built from
them and can cause difficulties in the assembly of the draft genome
sequence. Second, GigAssembler may fail to merge some over-
lapping sequences because of poor data quality, allelic differences or
misassemblies of the initial sequence contigs; this may result in
apparent local duplication of a sequence. We have estimated by
various methods the amount of such artefactual duplication in the
assembly from these and other sources to be about 100 Mb. On the
other hand, nearby duplicated sequences may occasionally be incor-
rectly merged. Some sequenced clones remain incorrectly placed on
the layout, as discussed above, and others (, 0.5%) remain unplaced.
The fingerprint map has undoubtedly failed to resolve some closely
related duplicated regions, such as the Williams region and several
highly repetitive subtelomeric and pericentric regions (see below).
Detailed examination and sequence finishing may be required to
sort out these regions precisely, as has been done with chromosome
Y (ref. 89). Finally, small sequenced-clone contigs with limited or no STS
landmark content remain difficult to place. Full utilization of
the higher resolution radiation hybrid map (the TNG map) may
help in this (ref. 95). Future targeted FISH experiments and increased map
continuity will also facilitate positioning of these sequences.
Genome coverage
We next assessed the nature of the gaps within the draft genome
sequence, and attempted to estimate the fraction of the human
genome not represented within the current version.
Gaps in draft genome sequence coverage. There are three types of
gap in the draft genome sequence: gaps within unfinished
sequenced clones; gaps between sequenced-clone contigs, but
within fingerprint clone contigs; and gaps between fingerprint
clone contigs. The first two types are relatively straightforward to
close simply by performing additional sequencing and finishing on
already identified clones. Closing the third type may require screening
of additional large-insert clone libraries and possibly new
technologies for the most recalcitrant regions. We consider these
three cases in turn.
We estimated the size of gaps within draft clones by studying
instances in which there was substantial overlap between a draft
clone and a finished clone, as described above. The average gap size
in these draft sequenced clones was 554 bp, although the precise
estimate was sensitive to certain assumptions in the analysis.
Assuming that the sequence gaps in the draft genome sequence
are fairly represented by this sample, about 80 Mb or about 3%
(likely range 2–4%) of sequence may lie in the 145,514 gaps within
draft sequenced clones.
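The estimate above is a simple multiplication, sketched below in Python. The gap count and mean gap size are the values quoted in the text. The variable names are illustrative, and the choice of the 2,692.9 Mb of sequenced bases from Table 8 as the denominator for the percentage is an assumption, since the text does not state which denominator was used.

```python
n_sequence_gaps = 145_514     # gaps within draft sequenced clones
mean_gap_bp = 554             # average gap size from draft/finished overlaps
sequenced_bases_mb = 2_692.9  # Table 8, "All" row (assumed denominator)

total_gap_mb = n_sequence_gaps * mean_gap_bp / 1e6
gap_fraction = total_gap_mb / sequenced_bases_mb

print(f"{total_gap_mb:.1f} Mb in sequence gaps")
print(f"{gap_fraction:.1%} relative to the sequenced bases")
```

The product, about 80.6 Mb, matches the sequence-gap total given in Table 8.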
The gaps between sequenced-clone contigs but within fingerprint
clone contigs are more difficult to evaluate directly, because the
draft genome sequence flanking many of the gaps is often not
precisely aligned with the fingerprinted clones. However, most are
much smaller than a single BAC. In fact, nearly three-quarters of
these gaps are bridged by one or more individual BACs, as indicated
by linking information from BAC end sequences. We measured the
sizes of a subset of gaps directly by examining restriction fragment
fingerprints of overlapping clones. A study of 157 'bridged' gaps and
55 'unbridged' gaps gave an average gap size of 25 kb. Allowing for the
possibility that these gaps may not be fully representative and that
some restriction fragments are not included in the calculation, a more
conservative estimate of gap size would be 35 kb. This would indicate
that about 150 Mb or 5% of the human genome may reside in the
4,076 gaps between sequenced-clone contigs. This sequence should
be readily obtained as the clones spanning them are sequenced.
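The effect of the two per-gap sizes on the total can be seen in a short calculation. The Python sketch below uses the gap count and the two gap sizes quoted in the text; the function name is a hypothetical helper introduced for illustration.

```python
N_SCC_GAPS = 4_076  # gaps between sequenced-clone contigs


def scc_gap_total_mb(gap_kb, n_gaps=N_SCC_GAPS):
    """Total sequence in SCC gaps (Mb) for a given average gap size (kb)."""
    return n_gaps * gap_kb / 1_000


for label, gap_kb in [("measured average", 25), ("conservative estimate", 35)]:
    print(f"{label}: {gap_kb} kb per gap -> {scc_gap_total_mb(gap_kb):.1f} Mb")
```

With 35 kb per gap the total is about 142.7 Mb, which the text rounds to about 150 Mb. The measured 25 kb average would give about 101.9 Mb.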
The size of the gaps between fingerprint clone contigs was
estimated by comparing the fingerprint maps to the essentially
completed chromosomes 21 and 22. The analysis shows that the
fingerprinted BAC clones in the global database cover 97–98% of
the sequenced portions of those chromosomes (ref. 86). The published
sequences of these chromosomes also contain a few small gaps (5
and 11, respectively) amounting to some 1.6% of the euchromatic
sequence, and do not include the heterochromatic portion. This
suggests that the gaps between contigs in the fingerprint map
contain about 4% of the euchromatic genome. Experience with
closure of such gaps on chromosomes 20 and 7 suggests that many
of these gaps are less than one clone in length and will be closed by
clones from other libraries. However, recovery of sequence from
these gaps represents the most challenging aspect of producing a
complete finished sequence of the human genome.
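One way to read the figure of about 4% is as the sum of two shortfalls: the part of the sequenced portions not covered by fingerprinted BACs, and the gaps remaining in the published sequences. The Python sketch below shows that additive reading. It is an interpretation offered for illustration, not a calculation given in the text, and the function name is hypothetical.

```python
def estimated_missing_fraction(bac_coverage, published_gap_fraction=0.016):
    """Additive estimate of the euchromatic fraction outside fingerprint contigs.

    bac_coverage: fraction of the sequenced portion covered by fingerprinted BACs.
    published_gap_fraction: fraction of euchromatin in gaps of the published sequence.
    """
    return (1 - bac_coverage) + published_gap_fraction


for coverage in (0.97, 0.98):
    missing = estimated_missing_fraction(coverage)
    print(f"coverage {coverage:.0%}: about {missing:.1%} not in fingerprint contigs")
```

Under this reading the two ends of the coverage range give 4.6% and 3.6%, which bracket the stated value of about 4%.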
As another measure of the representation of the BAC libraries,
Riethman (ref. 109) has found BAC or cosmid clones that link to telomeric
half-YACs or to the telomeric sequence itself for 40 of the 41
non-satellite telomeres. Thus, the fingerprint map appears to have no
substantial gaps in these regions. Many of the pericentric regions are
also represented, but analysis is less complete here (see below).
Representation of random raw sequences. In another approach to
measuring coverage, we compared a collection of random raw
sequence reads to the existing draft genome sequence. In principle,
Table 9 Distribution of PHRAP scores in the draft genome sequence
PHRAP score    Percentage of bases in the draft genome sequence
0–9            0.6
10–19          1.3
20–29          2.2
30–39          4.8
40–49          8.1
50–59          8.7
60–69          9.0
70–79          12.1
80–89          17.3
>90            35.9
PHRAP scores are a logarithmically based representation of the error probability. A PHRAP score of
X corresponds to an error probability of $10^{-X/10}$. Thus, PHRAP scores of 20, 30 and 40 correspond to
accuracy of 99%, 99.9% and 99.99%, respectively. PHRAP scores are derived from quality
scores of the underlying sequence reads used in sequence assembly. See
http://www.genome.washington.edu/UWGC/analysistools/phrap.htm.
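The conversion described in the legend can be written out directly. The Python sketch below applies the stated formula and then sums the Table 9 percentages for bins starting at a score of 40 or more. The function name and the list layout are illustrative, and the top bin is entered with a lower bound of 90 as an approximation of the open-ended last row.

```python
def phrap_error_probability(score):
    """Error probability for a PHRAP score X: 10 ** (-X / 10)."""
    return 10 ** (-score / 10)


for score in (20, 30, 40):
    accuracy = 1 - phrap_error_probability(score)
    print(score, f"{accuracy:.2%}")

# Table 9 as (lowest score in bin, percentage of bases); the last entry
# stands for the open-ended top bin.
table9 = [(0, 0.6), (10, 1.3), (20, 2.2), (30, 4.8), (40, 8.1),
          (50, 8.7), (60, 9.0), (70, 12.1), (80, 17.3), (90, 35.9)]

pct_at_least_40 = sum(pct for low, pct in table9 if low >= 40)
print(f"{pct_at_least_40:.1f}% of bases fall in bins with PHRAP score 40 or more")
```

On these figures about 91.1% of bases in the draft genome sequence have a PHRAP score of at least 40, that is, an estimated accuracy of 99.99% or better.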
© 2001 Macmillan Magazines Ltd
localized on only one or two of the previous maps or that occur
as isolated discrepancies in con¯ict with several ¯anking STSs.
Many of these cases are probably due to errors in the previous
maps (with error rates for individual maps estimated at 1±2%
100
).
Others may be due to incorrect assignment of the STSs to the draft
genome sequence (by the electronic polymerase chain reaction
(e-PCR) computer program) or to database entries that contain
sequence data from more than one clone (owing to cross-
contamination).
Graphical views of the independent data sets were particularly
useful in detecting problems with order or orientation (Fig. 5).
Areas of con¯ict were reviewed and corrected if supported by the
underlying data. In the version discussed here, there were 41
sequenced clones falling in 14 sequenced-clone contigs with STS
content information from multiple maps that disagreed with the
¯anking clones or sequenced-clone contigs; the placement of these
clones thus remains suspect. Four of these instances suggest errors
in the ®ngerprint map, whereas the others suggest errors in the
layout of sequenced clones. These cases are being investigated and
will be corrected in future versions.
Assembly of the sequenced clones. We assessed the accuracy of the
assembly by using a set of 148 draft clones comprising 22.4 Mb for
which ®nished sequence subsequently became available
104
. The
initial sequence contigs lack information about order and orienta-
tion, and GigAssembler attempts to use linking data to infer such
information as far as possible
104
. Starting with initial sequence
contigs that were unordered and unoriented, the program placed
90% of the initial sequence contigs in the correct orientation and
85% in the correct order with respect to one another. In a separate
test, GigAssembler was tested on simulated draft data produced
from ®nished sequence on chromosome 22 and similar results were
obtained.
Some problems remain at all levels. First, errors in the initial
sequence contigs persist in the merged sequence contigs built from
them and can cause difficulties in the assembly of the draft genome
sequence. Second, GigAssembler may fail to merge some overlapping
sequences because of poor data quality, allelic differences or
misassemblies of the initial sequence contigs; this may result in
apparent local duplication of a sequence. We have estimated by
various methods the amount of such artefactual duplication in the
assembly from these and other sources to be about 100 Mb. On the
other hand, nearby duplicated sequences may occasionally be
incorrectly merged. Some sequenced clones remain incorrectly placed
on the layout, as discussed above, and others (< 0.5%) remain
unplaced. The fingerprint map has undoubtedly failed to resolve some
closely related duplicated regions, such as the Williams region and
several highly repetitive subtelomeric and pericentric regions (see
below). Detailed examination and sequence finishing may be required
to sort out these regions precisely, as has been done with
chromosome Y (ref. 89). Finally, small sequenced-clone contigs with
limited or no STS landmark content remain difficult to place. Full
utilization of the higher resolution radiation hybrid map (the TNG
map) may help in this (ref. 95). Future targeted FISH experiments
and increased map continuity will also facilitate positioning of
these sequences.
Genome coverage
We next assessed the nature of the gaps within the draft genome
sequence, and attempted to estimate the fraction of the human
genome not represented within the current version.
Gaps in draft genome sequence coverage. There are three types of
gap in the draft genome sequence: gaps within unfinished
sequenced clones; gaps between sequenced-clone contigs, but
within fingerprint clone contigs; and gaps between fingerprint
clone contigs. The first two types are relatively straightforward to
close simply by performing additional sequencing and finishing on
already identified clones. Closing the third type may require
screening of additional large-insert clone libraries and possibly
new technologies for the most recalcitrant regions. We consider
these three cases in turn.
We estimated the size of gaps within draft clones by studying
instances in which there was substantial overlap between a draft
clone and a finished clone, as described above. The average gap size
in these draft sequenced clones was 554 bp, although the precise
estimate was sensitive to certain assumptions in the analysis.
Assuming that the sequence gaps in the draft genome sequence
are fairly represented by this sample, about 80 Mb or about 3%
(likely range 2–4%) of sequence may lie in the 145,514 gaps within
draft sequenced clones.
The gaps between sequenced-clone contigs but within fingerprint
clone contigs are more difficult to evaluate directly, because the
draft genome sequence flanking many of the gaps is often not
precisely aligned with the fingerprinted clones. However, most are
much smaller than a single BAC. In fact, nearly three-quarters of
these gaps are bridged by one or more individual BACs, as indicated
by linking information from BAC end sequences. We measured the
sizes of a subset of gaps directly by examining restriction fragment
fingerprints of overlapping clones. A study of 157 'bridged' gaps and
55 'unbridged' gaps gave an average gap size of 25 kb. Allowing for the
possibility that these gaps may not be fully representative and that
some restriction fragments are not included in the calculation, a more
conservative estimate of gap size would be 35 kb. This would indicate
that about 150 Mb or 5% of the human genome may reside in the
4,076 gaps between sequenced-clone contigs. This sequence should
be readily obtained as the clones spanning them are sequenced.
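
The arithmetic behind the first two gap estimates can be retraced with a short sketch. The gap counts and mean sizes are the figures quoted above; the 3,200 Mb denominator is an assumption borrowed from the genome size estimate given later in this section, and the function name is illustrative only.

```python
# Gap counts and mean gap sizes as quoted in the text.
# GENOME_MB is an assumption: the total genome size estimated later in the section.
GENOME_MB = 3200


def total_gap_mb(n_gaps, mean_gap_bp):
    """Total sequence contained in n_gaps gaps of the given mean size, in Mb."""
    return n_gaps * mean_gap_bp / 1e6


# Gaps within draft sequenced clones (mean size 554 bp).
within_clone_mb = total_gap_mb(145_514, 554)
# Gaps between sequenced-clone contigs (conservative mean size 35 kb).
between_contig_mb = total_gap_mb(4_076, 35_000)

for label, mb in (("within draft clones", within_clone_mb),
                  ("between sequenced-clone contigs", between_contig_mb)):
    print(f"{label}: {mb:.1f} Mb, {100 * mb / GENOME_MB:.1f}% of {GENOME_MB} Mb")
```

Both totals land close to the rounded values in the text (about 80 Mb and about 150 Mb); the percentages depend on the genome size assumed for the denominator.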
The size of the gaps between fingerprint clone contigs was
estimated by comparing the fingerprint maps to the essentially
completed chromosomes 21 and 22. The analysis shows that the
fingerprinted BAC clones in the global database cover 97–98% of
the sequenced portions of those chromosomes (ref. 86). The published
sequences of these chromosomes also contain a few small gaps (5
and 11, respectively) amounting to some 1.6% of the euchromatic
sequence, and do not include the heterochromatic portion. This
suggests that the gaps between contigs in the fingerprint map
contain about 4% of the euchromatic genome. Experience with
closure of such gaps on chromosomes 20 and 7 suggests that many
of these gaps are less than one clone in length and will be closed by
clones from other libraries. However, recovery of sequence from
these gaps represents the most challenging aspect of producing a
complete finished sequence of the human genome.
As another measure of the representation of the BAC libraries,
Riethman (ref. 109) has found BAC or cosmid clones that link to
telomeric half-YACs or to the telomeric sequence itself for 40 of
the 41 non-satellite telomeres. Thus, the fingerprint map appears to
have no substantial gaps in these regions. Many of the pericentric
regions are also represented, but analysis is less complete here
(see below).
Representation of random raw sequences. In another approach to
measuring coverage, we compared a collection of random raw
sequence reads to the existing draft genome sequence. In principle,
the fraction of reads matching the draft genome sequence should
provide an estimate of genome coverage. In practice, the comparison
is complicated by the need to allow for repeat sequences, the
imperfect sequence quality of both the raw sequence and the draft
genome sequence, and the possibility of polymorphism. Nonetheless,
the analysis provides a reasonable view of the extent to which the
genome is represented in the draft genome sequence and the public
databases.
Table 9 Distribution of PHRAP scores in the draft genome sequence
PHRAP score   Percentage of bases in the draft genome sequence
0–9            0.6
10–19          1.3
20–29          2.2
30–39          4.8
40–49          8.1
50–59          8.7
60–69          9.0
70–79         12.1
80–89         17.3
>90           35.9
PHRAP scores are a logarithmically based representation of the error
probability. A PHRAP score of X corresponds to an error probability of
$10^{-X/10}$. Thus, PHRAP scores of 20, 30 and 40 correspond to
accuracy of 99%, 99.9% and 99.99%, respectively. PHRAP scores are
derived from quality scores of the underlying sequence reads used in
sequence assembly. See
http://www.genome.washington.edu/UWGC/analysistools/phrap.htm.
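
The score-to-accuracy relationship stated in the note to Table 9 can be written out as a minimal sketch. The function names and the dictionary holding the Table 9 percentages are illustrative choices and are not part of PHRAP itself.

```python
def phrap_error_probability(score):
    """Error probability for a PHRAP score X, i.e. 10^(-X/10)."""
    return 10 ** (-score / 10)


def phrap_accuracy_percent(score):
    """Per-base accuracy (%) implied by a PHRAP score."""
    return 100 * (1 - phrap_error_probability(score))


# Percentage of draft bases per score bin, copied from Table 9.
# Keys are the lower bound of each bin; the top bin is listed as >90 in the table.
TABLE_9 = {0: 0.6, 10: 1.3, 20: 2.2, 30: 4.8, 40: 8.1,
           50: 8.7, 60: 9.0, 70: 12.1, 80: 17.3, 90: 35.9}

for score in (20, 30, 40):
    print(f"PHRAP {score}: accuracy {phrap_accuracy_percent(score):.2f}%")

# Share of bases in bins starting at score 40 or above (error rate of 1 in 10,000 or lower).
share_q40_plus = sum(pct for low, pct in TABLE_9.items() if low >= 40)
print(f"bases with PHRAP score of 40 or more: {share_q40_plus:.1f}%")
```

Summing the Table 9 bins in this way gives the share of draft bases at or above a chosen quality threshold.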
We compared the raw sequence reads against both the sequences
used in the construction of the draft genome sequence and all of
GenBank using the BLAST computer program. Of the 5,615 raw
sequence reads analysed (each containing at least 100 bp of
contiguous non-repetitive sequence), 4,924 had a match of ≥ 97%
identity with a sequenced clone, indicating that 88 ± 1.5% of the
genome was represented in sequenced clones. The estimate is
subject to various uncertainties. Most serious is the proportion of
repeat sequence in the remainder of the genome. If the unsequenced
portion of the genome is unusually rich in repeated sequence,
we would underestimate its size (although the excess would be
comprised of repeated sequence).
We examined those raw sequences that failed to match by
comparing them to the other publicly available sequence resources.
Fifty (0.9%) had matches in public databases containing cDNA
sequences, STSs and similar data. An additional 276 (or 43% of the
remaining raw sequence) had matches to the whole-genome shotgun
reads discussed above (consistent with the idea that these reads
cover about half of the genome).
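
The read-matching percentages above follow directly from the counts given in the text, as the sketch below shows. The variable names are illustrative. The binomial standard error is included only for comparison: the text does not say how its ± 1.5% interval was derived, so the two should not be equated.

```python
import math

# Counts quoted in the text.
total_reads = 5615
matched_clones = 4924      # >= 97% identity with a sequenced clone
matched_cdna_sts = 50      # matched cDNA, STS and similar public data
matched_wgs = 276          # matched whole-genome shotgun reads

# Fraction of reads represented in sequenced clones.
coverage = matched_clones / total_reads

# Simple binomial sampling error on that fraction (an assumption of this sketch).
binomial_se = math.sqrt(coverage * (1 - coverage) / total_reads)

# Reads still unmatched after the cDNA/STS comparison.
remaining = total_reads - matched_clones - matched_cdna_sts
wgs_share_of_remaining = matched_wgs / remaining

# Fraction of reads with a match in any of the three resources.
combined = (matched_clones + matched_cdna_sts + matched_wgs) / total_reads

print(f"coverage in sequenced clones: {100 * coverage:.1f}%")
print(f"binomial standard error:      {100 * binomial_se:.2f}%")
print(f"WGS share of remaining reads: {100 * wgs_share_of_remaining:.0f}%")
print(f"combined public resources:    {100 * combined:.1f}%")
```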
We also examined the extent of genome coverage by aligning the
cDNA sequences for genes in the RefSeq dataset (ref. 110) to the
draft genome sequence. We found that 88% of the bases of these cDNAs
could be aligned to the draft genome sequence at high stringency (at
least 98% identity). (A few of the alignments with either the random
raw sequence reads or the cDNAs may be to a highly similar region
in the genome, but such matches should affect the estimate of
genome coverage by considerably less than 1%, based on the
estimated extent of duplication within the genome (see below).)
These results indicate that about 88% of the human genome is
represented in the draft genome sequence and about 94% in the
combined publicly available sequence databases. The figure of 88%
agrees well with our independent estimates above that about 3%,
5% and 4% of the genome reside in the three types of gap in the draft
genome sequence.
Finally, a small experimental check was performed by screening a
large-insert clone library with probes corresponding to 16 of the
whole-genome shotgun reads that failed to match the draft genome
sequence. Five hybridized to many clones from different fingerprint
clone contigs and were discarded as being repetitive. Of the
remaining eleven, two fell within sequenced clones (presumably
within sequence gaps of the first type), eight fell in fingerprint clone
contigs but between sequenced clones (gaps of the second type) and
one failed to identify clones in the fingerprint map (gaps of the third
type) but did identify clones in another large-insert library.
Although these numbers are small, they are consistent with the
view that much of the remaining genome sequence lies within
already identified clones in the current map.
Estimates of genome and chromosome sizes. Informed by this
analysis of genome coverage, we proceeded to estimate the sizes of
the genome and each of the chromosomes (Table 8). Beginning with
the current assigned sequence for each chromosome, we corrected
for the known gaps on the basis of their estimated sizes (see
above). We attempted to account for the sizes of centromeres and
heterochromatin, neither of which is well represented in the draft
sequence. Finally, we corrected for around 100 Mb of artefactual
duplication in the assembly. We arrived at a total human genome
size estimate of around 3,200 Mb, which compares favourably with
previous estimates based on DNA content.
We also independently estimated the size of the euchromatic
portion of the genome by determining the fraction of the 5,615
random raw sequences that matched the finished portion of
the human genome (whose total length is known with greater
precision). Twenty-nine per cent of these raw sequences found a
match among 835 Mb of nonredundant finished sequence. This
leads to an estimate of the euchromatic genome size of 2.9 Gb. This
agrees reasonably with the prediction above based on the length of
the draft genome sequence (Table 8).
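
The reasoning behind the 2.9 Gb figure can be made explicit with a small sketch. It assumes that the random reads sample the euchromatic genome uniformly, so that the fraction of reads matching the finished sequence approximates the fraction of the genome that is finished; the variable names are illustrative.

```python
# Figures quoted in the text.
finished_mb = 835            # nonredundant finished sequence, in Mb
fraction_matching = 0.29     # share of random reads matching finished sequence

# If reads sample the genome uniformly, fraction_matching ~ finished_mb / genome_mb,
# so the genome size follows by division.
euchromatic_mb = finished_mb / fraction_matching

print(f"estimated euchromatic genome size: {euchromatic_mb / 1000:.1f} Gb")
```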
Update. The results above reflect the data on 7 October 2000. New
data are continually being added, with improvements being made to
the physical map, new clones being sequenced to close gaps and
draft clones progressing to full shotgun coverage and finishing. The
draft genome sequence will be regularly reassembled and publicly
released.
Currently, the physical map has been refined such that the
number of fingerprint clone contigs has fallen from 1,246 to 965;
this reflects the elimination of some artefactual contigs and the
closure of some gaps. The sequence coverage has risen such that
90% of the human genome is now represented in the sequenced
clones and more than 94% is represented in the combined publicly
available sequence databases. The total amount of finished sequence
is now around 1 Gb.
Broad genomic landscape
What biological insights can be gleaned from the draft sequence? In
this section, we consider very large-scale features of the draft
genome sequence: the distribution of GC content, CpG islands
and recombination rates, and the repeat content and gene content of
the human genome. The draft genome sequence makes it possible to
integrate these features and others at scales ranging from individual
nucleotides to collections of chromosomes. Unless noted, all
analyses were conducted on the assembled draft genome sequence
described above.
Figure 10 Screen shot from UCSC Draft Human Genome Browser. See
http://genome.ucsc.edu/.
Figure 11 Screen shot from the Genome Browser of Project Ensembl. See
http://www.ensembl.org.
Figure 9 provides a high-level view of the contents of the draft
genome sequence, at a scale of about 3.8 Mb per centimetre. Of
course, navigating information spanning nearly ten orders of
magnitude requires computational tools to extract the full value.
We have created and made freely available various 'Genome
Browsers'. Browsers were developed and are maintained by the
University of California at Santa Cruz (Fig. 10) and the EnsEMBL
project of the European Bioinformatics Institute and the Sanger
Centre (Fig. 11). Additional browsers have been created; URLs are
listed at www.nhgri.nih.gov/genome_hub. These web-based computer
tools allow users to view an annotated display of the draft genome
sequence, with the ability to scroll along the chromosomes and
zoom in or out to different scales. They include: the nucleotide
sequence, sequence contigs, clone contigs, sequence coverage and
finishing status, local GC content, CpG islands, known STS markers
from previous genetic and physical maps, families of repeat
sequences, known genes, ESTs and mRNAs, predicted genes, SNPs
and sequence similarities with other organisms (currently the
pufferfish Tetraodon nigroviridis). These browsers will be updated
as the draft genome sequence is refined and corrected as additional
annotations are developed.
In addition to using the Genome Browsers, one can download
from these sites the entire draft genome sequence together with the
annotations in a computer-readable format. The sequences of the
underlying sequenced clones are all available through the public
sequence databases. URLs for these and other genome websites are
listed in Box 2. A larger list of useful URLs can be found at
www.nhgri.nih.gov/genome_hub. An introduction to using the
draft genome sequence, as well as associated databases and
analytical tools, is provided in an accompanying paper (ref. 111).
In addition, the human cytogenetic map has been integrated with
the draft genome sequence as part of a related project. The BAC
Resource Consortium (ref. 103) established dense connections between
the maps using more than 7,500 sequenced large-insert clones that had
been cytogenetically mapped by FISH; the average density of the
map is 2.3 clones per Mb. Although the precision of the integration
is limited by the resolution of FISH, the links provide a powerful
tool for the analysis of cytogenetic aberrations in inherited diseases
and cancer. These cytogenetic links can also be accessed through the
Genome Browsers.
Long-range variation in GC content
The existence of GC-rich and GC-poor regions in the human
genome was first revealed by experimental studies involving density
gradient separation, which indicated substantial variation in
average GC content among large fragments. Subsequent studies have
indicated that these GC-rich and GC-poor regions may have
different biological properties, such as gene density, composition
of repeat sequences, correspondence with cytogenetic bands and
recombination rate (refs 112–117). Many of these studies were
indirect, owing to the lack of sufficient sequence data.
The draft genome sequence makes it possible to explore the
variation in GC content in a direct and global manner. Visual
inspection (Fig. 9) confirms that local GC content undergoes
substantial long-range excursions from its genome-wide average
of 41%. If the genome were drawn from a uniform distribution of
GC content, the local GC content in a window of size n bp should
be $41 \pm \sqrt{(41)(59)/n}\,\%$. Fluctuations would be modest, with the
standard deviation being halved as the window size is quadrupled;
for example, 0.70%, 0.35%, 0.17% and 0.09% for windows of size 5,
20, 80 and 320 kb.
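
The expected fluctuation under a uniform model is the binomial standard deviation of a proportion, expressed in percentage points. A minimal sketch (the function name is illustrative) recovers the four values quoted above:

```python
import math


def uniform_gc_sd_percent(window_bp, mean_gc_percent=41.0):
    """Standard deviation (in % GC) of a window of window_bp independent bases,
    each G or C with probability mean_gc_percent / 100."""
    p = mean_gc_percent
    return math.sqrt(p * (100.0 - p) / window_bp)


for kb in (5, 20, 80, 320):
    print(f"{kb:>3} kb window: SD = {uniform_gc_sd_percent(kb * 1000):.2f}%")
```

Because the window size enters under a square root, each fourfold increase in window size halves the standard deviation.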
The draft genome sequence, however, contains many regions with
much more extreme variation. There are huge regions (> 10 Mb)
with GC content far from the average. For example, the most distal
48 Mb of chromosome 1p (from the telomere to about STS marker
D1S3279) has an average GC content of 47.1%, and chromosome 13
has a 40-Mb region (roughly between STS marker A005X38 and
stsG30423) with only 36% GC content. There are also examples of
large shifts in GC content between adjacent multimegabase regions.
For example, the average GC content on chromosome 17q is 50%
for the distal 10.3 Mb but drops to 38% for the adjacent 3.9 Mb.
There are regions of less than 300 kb with even wider swings in GC
content, for example, from 33.1% to 59.3%.
Figure 12 Histogram of GC content of 20-kb windows in the draft genome sequence.
(x axis: GC content, 20–70; y axis: number of 20-kb windows, 0–12,000.)
Box 2
Sources of publicly available sequence data and other relevant
genomic information
http://genome.ucsc.edu/
University of California at Santa Cruz
Contains the assembly of the draft genome sequence used in this paper and updates
http://genome.wustl.edu/gsc/human/Mapping/
Washington University
Contains links to clone and accession maps of the human genome
http://www.ensembl.org
EBI/Sanger Centre
Allows access to DNA and protein sequences with automatic baseline annotation
http://www.ncbi.nlm.nih.gov/genome/guide/
NCBI
Views of chromosomes and maps and loci with links to other NCBI resources
http://www.ncbi.nlm.nih.gov/genemap99/
Gene map 99: contains data and viewers for radiation hybrid maps of EST-based STSs
http://compbio.ornl.gov/channel/index.html
Oak Ridge National Laboratory
Java viewers for human genome data
http://hgrep.ims.u-tokyo.ac.jp/
RIKEN and the University of Tokyo
Gives an overview of the entire human genome structure
http://snp.cshl.org/
The SNP Consortium
Includes a variety of ways to query for SNPs in the human genome
http://www.ncbi.nlm.nih.gov/Omim/
Online Mendelian Inheritance in Man
Contains information about human genes and disease
http://www.nhgri.nih.gov/ELSI/ and http://www.ornl.gov/hgmis/elsi/elsi.html
NHGRI and DOE
Contains information, links and articles on a wide range of social, ethical and legal issues
Long-range variation in GC content is evident not just from
extreme outliers, but throughout the genome. The distribution of
average GC content in 20-kb windows across the draft genome
sequence is shown in Fig. 12. The spread is 15-fold larger than
predicted by a uniform process. Moreover, the standard deviation
barely decreases as window size increases by successive factors of
four: 5.9%, 5.2%, 4.9% and 4.6% for windows of size 5, 20, 80 and
320 kb. The distribution is also notably skewed, with 58% below the
average and 42% above the average of 41%, with a long tail of
GC-rich regions.
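
The 15-fold figure can be checked for 20-kb windows by dividing the observed standard deviation by the value expected under a uniform model. The sketch below assumes the binomial expectation for a genome-wide average of 41% GC; the observed values are those quoted in the text, and the names are illustrative.

```python
import math

MEAN_GC = 41.0  # genome-wide average GC content (%), from the text


def uniform_sd_percent(window_bp, mean_gc=MEAN_GC):
    """Expected SD (in % GC) of a window under a uniform, independent-base model."""
    return math.sqrt(mean_gc * (100.0 - mean_gc) / window_bp)


# Observed standard deviations (%) quoted in the text, keyed by window size in kb.
OBSERVED_SD = {5: 5.9, 20: 5.2, 80: 4.9, 320: 4.6}

ratios = {kb: sd / uniform_sd_percent(kb * 1000) for kb, sd in OBSERVED_SD.items()}
for kb, ratio in ratios.items():
    print(f"{kb:>3} kb: observed/expected = {ratio:.1f}")
```

The ratio grows with window size because the observed spread barely shrinks while the uniform expectation halves with each fourfold increase.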
Bernardi and colleagues (refs 118, 119) proposed that the long-range
variation in GC content may reflect that the genome is composed of a
mosaic of compositionally homogeneous regions that they dubbed
'isochores'. They suggested that the skewed distribution is
composed of five normal distributions, corresponding to five distinct
types of isochore (L1, L2, H1, H2 and H3, with GC contents of
< 38%, 38–42%, 42–47%, 47–52% and > 52%, respectively).
We studied the draft genome sequence to see whether strict
isochores could be identi®ed. For example, the sequence was
divided into 300-kb windows, and each window was subdivided
into 20-kb subwindows. We calculated the average GC content for
each window and subwindow, and investigated how much of the
variance in the GC content of subwindows across the genome can be
statistically `explained' by the average GC content in each window.
About three-quarters of the genome-wide variance among 20-kb
windows can be statistically explained by the average GC content of
300-kb windows that contain them, but the residual variance among
subwindows (standard deviation, 2.4%) is still far too large to be
consistent with a homogeneous distribution. In fact, the hypothesis
of homogeneity could be rejected for each 300-kb window in the
draft genome sequence.
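The variance decomposition described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the draft sequence: it assumes the subwindow GC fractions are already computed and listed in genome order, requires at least one full window, and uses a hypothetical function name.

```python
from statistics import mean, pvariance

def variance_explained(sub_gc, per_window):
    """Split the variance of subwindow GC into between-window and within-window parts.

    sub_gc: GC fractions of consecutive subwindows.
    per_window: subwindows per window (15 for 20-kb subwindows in 300-kb windows).
    Returns (fraction of variance explained by window means, residual standard deviation).
    """
    n_full = len(sub_gc) // per_window * per_window   # drop a trailing partial window
    values = sub_gc[:n_full]
    groups = [values[i:i + per_window] for i in range(0, n_full, per_window)]
    total = pvariance(values)
    # Residual of each subwindow from the mean of the window that contains it.
    residuals = [x - mean(g) for g in groups for x in g]
    within = mean(r * r for r in residuals)
    explained = 1 - within / total if total > 0 else 0.0
    return explained, within ** 0.5
```

As a consistency check on the figures quoted in the text, a residual standard deviation of 2.4% against a total of 5.2% for 20-kb windows gives 1 - (2.4/5.2)^2, or about 0.79, which agrees with 'about three-quarters' of the variance being explained.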
Similar results were obtained with other window and subwindow
sizes. Some of the local heterogeneity in GC content is attributable to
transposable element insertions (see below). Such repeat elements
typically have a higher GC content than the surrounding sequence,
with the effect being strongest for the most recent insertions.
These results rule out a strict notion of isochores as compositionally
homogeneous. Instead, there is substantial variation at
many different scales, as illustrated in Fig. 13. Although isochores
do not appear to merit the prefix 'iso', the genome clearly does
contain large regions of distinctive GC content, and it is likely to be
worth redefining the concept so that the genome can be rigorously
partitioned into regions. In the absence of a precise
definition, we will loosely refer to such regions as 'GC content
domains' in the discussion below.
Fickett et al. (ref. 120) have explored a model in which the underlying
preference for a particular GC content drifts continuously throughout
the genome, an approach that bears further examination.
Churchill (ref. 121) has proposed that the boundaries between GC content
domains can in some cases be predicted by a hidden Markov model,
with one state representing a GC-rich region and one representing
an AT-rich region. We found that this approach tended to identify
only very short domains of less than a kilobase (data not shown),
but variants of this approach deserve further attention.
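A two-state hidden Markov model of the kind described above can be sketched with a generic Viterbi decoder. This is not Churchill's model or the implementation tested for this report: the emission probabilities, the stay probability and the function name are all illustrative assumptions.

```python
import math

def viterbi_gc_domains(seq, gc_rich=0.55, gc_poor=0.35, stay=0.999):
    """Most likely state path ('H' = GC-rich, 'L' = AT-rich) for a DNA string.

    gc_rich, gc_poor: assumed probability of emitting G or C in each state.
    stay: assumed probability of remaining in the same state at each base.
    """
    states = ("H", "L")
    p_gc = {"H": gc_rich, "L": gc_poor}
    log_stay, log_switch = math.log(stay), math.log(1 - stay)

    def emit(state, base):
        if base in "GC":
            return math.log(p_gc[state])
        if base in "AT":
            return math.log(1 - p_gc[state])
        return 0.0  # ambiguous base such as N: treated as uninformative

    seq = seq.upper()
    if not seq:
        return ""
    score = {s: math.log(0.5) + emit(s, seq[0]) for s in states}
    back = []  # back[i][s] = best previous state given state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            other = "L" if s == "H" else "H"
            stay_score = score[s] + log_stay
            switch_score = score[other] + log_switch
            if stay_score >= switch_score:
                new_score[s], pointers[s] = stay_score, s
            else:
                new_score[s], pointers[s] = switch_score, other
            new_score[s] += emit(s, base)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

In this sketch the stay probability sets the expected domain length (about 1/(1 - stay) bases), so a value close to 1 suppresses very short domains; how such a parameter should be chosen for real data is left open here.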
The correlation between GC content domains and various
biological properties is of great interest, and this is likely to be the
most fruitful route to understanding the basis of variation in GC
content. As described below, we confirm the existence of strong
correlations with both repeat content and gene density. Using the
integration between the draft genome sequence and the cytogenetic
map described above, it is possible to confirm a statistically
significant correlation between GC content and Giemsa bands
(G-bands). For example, 98% of large-insert clones mapping to the
darkest G-bands are in 200-kb regions of low GC content (average
37%), whereas more than 80% of clones mapping to the lightest
G-bands are in regions of high GC content (average 45%) (ref. 103).
Estimated band locations can be seen in Fig. 9 and viewed in the context of
other genome annotation at http://genome.ucsc.edu/goldenPath/mapPlots/
and http://genome.ucsc.edu/goldenPath/hgTracks.html.
CpG islands
A related topic is the distribution of so-called CpG islands across the
genome. The dinucleotide CpG is notable because it is greatly
under-represented in human DNA, occurring at only about one-fifth
of the roughly 4% frequency that would be expected by simply
multiplying the typical fraction of Cs and Gs (0.21 × 0.21). The
deficit occurs because most CpG dinucleotides are methylated on
the cytosine base, and spontaneous deamination of methyl-C
residues gives rise to T residues. (Spontaneous deamination of
ordinary cytosine residues gives rise to uracil residues that are
readily recognized and repaired by the cell.) As a result,
methyl-CpG dinucleotides steadily mutate to TpG dinucleotides. However,
the genome contains many 'CpG islands' in which CpG dinucleotides
are not methylated and occur at a frequency closer to that
predicted by the local GC content. CpG islands are of particular
interest because many are associated with the 5′ ends of genes
(refs 122–127).
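The expected CpG frequency and the degree of depletion can be made concrete with a short sketch. It assumes the sequence is a plain string and uses the common observed/expected ratio (CpG count × length, divided by the product of the C and G counts); the function name is hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from base composition.

    Expected count = (number of C) * (number of G) / length.
    Returns None if the ratio is undefined (no C or no G).
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n < 2 or c == 0 or g == 0:
        return None
    observed = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    expected = c * g / n
    return observed / expected

# Arithmetic from the text: expected genome-wide CpG frequency, and one-fifth of it.
expected_frequency = 0.21 * 0.21          # about 0.044, i.e. roughly 4%
depleted_frequency = expected_frequency / 5  # about 0.009, i.e. under 1%
```

A ratio near 1 indicates that CpG occurs about as often as base composition predicts, whereas the genome-wide depletion described above corresponds to a ratio of roughly 0.2.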
We searched the draft genome sequence for CpG islands. Ideally,
they should be defined by directly testing for the absence of cytosine
methylation, but that was not practical for this report. There are
articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 877
Figure 13 Variation in GC content at various scales. The GC content in subregions of a
100-Mb region of chromosome 1 is plotted, starting at about 83 Mb from the beginning of
the draft genome sequence. This region is AT-rich overall. Top, the GC content of the
entire 100-Mb region analysed in non-overlapping 20-kb windows. Middle, GC content of
the first 10 Mb, analysed in 2-kb windows. Bottom, GC content of the first 1 Mb, analysed
in 200-bp windows. At this scale, gaps in the sequence can be seen.
© 2001 Macmillan Magazines Ltd
various computer programs that attempt to identify CpG islands on
the basis of primary sequence alone. These programs differ in some
important respects (such as how aggressively they subdivide long
CpG-containing regions), and the precise correspondence with
experimentally undermethylated islands has not been validated.
Nevertheless, there is a good correlation, and computational ana-
lysis thus provides a reasonable picture of the distribution of CpG
islands in the genome.
To identify CpG islands, we used the definition proposed by
Gardiner-Garden and Frommer (ref. 128) and embodied in a computer
program. We searched the draft genome sequence for CpG islands,
using both the full sequence and the sequence masked to eliminate
repeat sequences. The number of regions satisfying the definition of
a CpG island was 50,267 in the full sequence and 28,890 in the
repeat-masked sequence. The difference reflects the fact that some
repeat elements (notably Alu) are GC-rich. Although some of these
repeat elements may function as control regions, it seems unlikely
that most of the apparent CpG islands in repeat sequences are
functional. Accordingly, we focused on those in the non-repeated
sequence. The count of 28,890 CpG islands is reasonably close to the
previous estimate of about 35,000 (ref. 129, as modified by ref. 130).
Most of the islands are short, with 60–70% GC content (Table 10).
More than 95% of the islands are less than 1,800 bp long, and more
than 75% are less than 850 bp. The longest CpG island (on
chromosome 10) is 36,619 bp long, and 322 are longer than 3,000
bp. Some of the larger islands contain ribosomal pseudogenes,
although RNA genes and pseudogenes account for only a small
proportion of all islands (<0.5%). The smaller islands are consistent
with their previously hypothesized function, but the role of
these larger islands is uncertain.
The density of CpG islands varies substantially among some of
the chromosomes. Most chromosomes have 5–15 islands per Mb,
with a mean of 10.5 islands per Mb. However, chromosome Y has an
unusually low 2.9 islands per Mb, and chromosomes 16, 17 and 22
have 19–22 islands per Mb. The extreme outlier is chromosome 19,
with 43 islands per Mb. Similar trends are seen when considering the
percentage of bases contained in CpG islands. The relative density of
CpG islands correlates reasonably well with estimates of relative
gene density on these chromosomes, based both on previous
mapping studies involving ESTs (Fig. 14) and on the distribution
of gene predictions discussed below.
Comparison of genetic and physical distance
The draft genome sequence makes it possible to compare genetic
and physical distances and thereby to explore variation in the rate of
recombination across the human chromosomes. We focus here on
large-scale variation. Finer variation is examined in an accompanying
paper (ref. 131).
The genetic and physical maps are integrated by 5,282 polymorphic
loci from the Marshfield genetic map (ref. 102), whose positions
are known in terms of centimorgans (cM) and Mb along the
chromosomes. Figure 15 shows the comparison of the draft
genome sequence for chromosome 12 with the male, female and
sex-averaged maps. One can calculate the approximate ratio of cM
per Mb across a chromosome (reflected in the slopes in Fig. 15) and
the average recombination rate for each chromosome arm.
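The ratio of cM per Mb can be computed from marker positions as in the following sketch. It assumes each marker is given as a (physical position in Mb, genetic position in cM) pair; the function names are hypothetical, and any marker values passed in for illustration are not taken from the Marshfield map.

```python
def recombination_rate(markers):
    """Average cM per Mb between the first and last marker by physical position.

    markers: iterable of (position_mb, position_cm) pairs.
    """
    ordered = sorted(markers)
    (mb0, cm0), (mb1, cm1) = ordered[0], ordered[-1]
    if mb1 == mb0:
        raise ValueError("markers must span a non-zero physical distance")
    return (cm1 - cm0) / (mb1 - mb0)

def interval_rates(markers):
    """Local slope between adjacent markers, as (midpoint in Mb, cM per Mb) pairs.

    Markers sharing the same physical position are skipped.
    """
    ordered = sorted(markers)
    rates = []
    for (mb0, cm0), (mb1, cm1) in zip(ordered, ordered[1:]):
        if mb1 > mb0:
            rates.append(((mb0 + mb1) / 2, (cm1 - cm0) / (mb1 - mb0)))
    return rates
```

Restricting the markers to one chromosome arm before calling recombination_rate gives a per-arm average. A negative local slope from interval_rates would flag a discordant marker, that is, one whose genetic and physical orders disagree.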
Two striking features emerge from analysis of these data. First, the
average recombination rate increases as the length of the chromosome
arm decreases (Fig. 16). Long chromosome arms have an
average recombination rate of about 1 cM per Mb, whereas the
shortest arms are in the range of 2 cM per Mb. A similar trend has
been seen in the yeast genome (refs 132, 133), despite the fact that the physical
scale is nearly 200 times as small. Moreover, experimental studies
have shown that lengthening or shortening yeast chromosomes
results in a compensatory change in recombination rate (ref. 132).
The second observation is that the recombination rate tends to be
suppressed near the centromeres and higher in the distal portions
of most chromosomes, with the increase largely in the terminal
Table 10 Number of CpG islands by GC content
GC content   Number       Percentage   Nucleotides   Percentage of
of island    of islands   of islands   in islands    nucleotides in islands
Total        28,890       100          19,818,547    100
>80%         22           0.08         5,916         0.03
70–80%       5,884        20           3,111,965     16
60–70%       18,779       65           13,110,924    66
50–60%       4,205        15           3,589,742     18
Potential CpG islands were identified by searching the draft genome sequence one base at a time,
scoring each dinucleotide (+17 for GC, −1 for others) and identifying maximally scoring segments.
Each segment was then evaluated to determine GC content (≥50%), length (>200) and ratio of
observed proportion of GC dinucleotides to the expected proportion on the basis of the GC content
of the segment (>0.60), using a modification of a program developed by G. Micklem (personal
communication).
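The procedure in the Table 10 note can be sketched as below. This is a simplified illustration, not the modified Micklem program used for this report. The note prints the scored dinucleotide as 'GC'; the sketch assumes the CpG dinucleotide (C followed by G) is meant and takes the target as a parameter so that either reading can be tried. All function names are hypothetical.

```python
def candidate_segments(seq, target="CG", hit=17, miss=-1):
    """Yield (start, end) of maximally scoring runs, with end exclusive.

    Each dinucleotide scores +hit if it equals target, otherwise miss. A run
    opens at a target dinucleotide seen while no run is open, closes when the
    running score falls to zero or below (or the sequence ends), and is
    trimmed to end at its highest-scoring point.
    """
    seq = seq.upper()
    score = best = 0
    start = best_end = None
    for i in range(len(seq) - 1):
        step = hit if seq[i:i + 2] == target else miss
        if start is None:
            if step > 0:
                start, score, best, best_end = i, step, step, i + 2
            continue
        score += step
        if score > best:
            best, best_end = score, i + 2
        if score <= 0:
            yield start, best_end
            start = None
    if start is not None:
        yield start, best_end

def is_cpg_island(segment, min_gc=0.50, min_len=200, min_ratio=0.60):
    """Apply the three thresholds quoted in the note to one upper-case segment."""
    n = len(segment)
    c, g = segment.count("C"), segment.count("G")
    if n <= min_len or c == 0 or g == 0:
        return False
    gc_content = (c + g) / n
    ratio = segment.count("CG") * n / (c * g)  # observed / expected CpG
    return gc_content >= min_gc and ratio > min_ratio

def find_cpg_islands(seq):
    """Candidate segments that pass the GC content, length and ratio thresholds."""
    seq = seq.upper()
    return [(s, e) for s, e in candidate_segments(seq) if is_cpg_island(seq[s:e])]
```

With a reward of +17 and a penalty of −1, a run survives gaps of up to about 17 non-target dinucleotides per target dinucleotide, which is why the thresholds are applied afterwards to discard long, sparse segments.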
Figure 14 Number of CpG islands per Mb for each chromosome (horizontal axis), plotted against the
number of genes per Mb (vertical axis; the number of genes was taken from GeneMap98 (ref. 100)).
Chromosomes 16, 17, 22 and particularly 19 are clear outliers, with a density of CpG
islands that is even greater than would be expected from the high gene counts for these
four chromosomes.
Figure 15 Distance in cM along the genetic map of chromosome 12 (vertical axis, measured from the
centromere) plotted against position in Mb in the draft genome sequence (horizontal axis, spanning
12p and 12q). Female, male and sex-averaged maps are
shown. Female recombination rates are much higher than male recombination rates. The
increased slopes at either end of the chromosome reflect the increased rates of
recombination per Mb near the telomeres. Conversely, the flatter slope near the
centromere shows decreased recombination there, especially in male meiosis. This is
typical of the other chromosomes as well (see http://genome.ucsc.edu/goldenPath/mapPlots).
Discordant markers may be map, marker placement or assembly errors.
20–35 Mb. The increase is most pronounced in the male meiotic
map. The effect can be seen, for example, from the higher slope at
both ends of chromosome 12 (Fig. 15). Regional and sex-specific
effects have been observed for chromosome 21 (refs 110, 134).
Why is recombination higher on smaller chromosome arms? A
higher rate would increase the likelihood of at least one crossover
during meiosis on each chromosome arm, as is generally observed
in human chiasmata counts (ref. 135). Crossovers are believed to be
necessary for normal meiotic disjunction of homologous chromosome
pairs in eukaryotes. An extreme example is provided by the pseudoautosomal
regions on chromosomes Xp and Yp, which pair during male
meiosis; this physical region of only 2.6 Mb has a genetic length of
50 cM (corresponding to 20 cM per Mb), with the result that a
crossover is virtually assured.
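The arithmetic behind the pseudoautosomal example can be written out explicitly. The second line relies on the standard definition of genetic distance (1 Morgan is one expected crossover per transmitted chromatid, and each crossover involves two of the four chromatids), which is background not stated in the text above.

```latex
\[
\frac{50\ \text{cM}}{2.6\ \text{Mb}} \approx 19\ \text{cM per Mb}
\quad (\text{about } 20\ \text{cM per Mb}),
\]
\[
50\ \text{cM} = 0.5\ \text{Morgan}
\;\Longrightarrow\;
0.5\ \text{crossovers per chromatid}
\;=\; \tfrac{1}{2} \times 1\ \text{crossover per meiosis}.
\]
```

On this reading, a genetic length of 50 cM corresponds to an average of one crossover in the region per meiosis, and the rate is roughly 20 times the value of about 1 cM per Mb quoted for long chromosome arms.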
Mechanistically, the increased rate of recombination on shorter
chromosome arms could be explained if, once an initial recombination
event occurs, additional nearby events are blocked by positive
crossover interference on each arm. Evidence from yeast mutants in
which interference is abolished shows that interference plays a key
role in distributing a limited number of crossovers among the
various chromosome arms in yeast (ref. 136). An alternative possibility is
that a checkpoint mechanism scans for and enforces the presence of
at least one crossover on each chromosome arm.
Variation in recombination rates along chromosomes and
between the sexes is likely to reflect variation in the initiation of
meiosis-induced double-strand breaks (DSBs) that initiate recombination.
DSBs in yeast have been associated with open
chromatin (refs 137, 138), rather than with specific DNA sequence motifs.
With the availability of the draft genome sequence, it should be
possible to explore in an analogous manner whether variation
in human recombination rates reflects systematic differences in
chromosome accessibility during meiosis.
Repeat content of the human genome
A puzzling observation in the early days of molecular biology was
that genome size does not correlate well with organismal complexity.
For example, Homo sapiens has a genome that is 200 times as
large as that of the yeast S. cerevisiae, but 200 times as small as that of
Amoeba dubia (refs 139, 140). This mystery (the C-value paradox) was largely
resolved with the recognition that genomes can contain a large
quantity of repetitive sequence, far in excess of that devoted to
protein-coding genes (reviewed in refs 140, 141).
In the human, coding sequences comprise less than 5% of the
genome (see below), whereas repeat sequences account for at least
50% and probably much more. Broadly, the repeats fall into five
classes: (1) transposon-derived repeats, often referred to as interspersed
repeats; (2) inactive (partially) retroposed copies of cellular
genes (including protein-coding genes and small structural RNAs),
usually referred to as processed pseudogenes; (3) simple sequence
repeats, consisting of direct repetitions of relatively short k-mers
such as (A)n, (CA)n or (CGG)n; (4) segmental duplications, consisting
of blocks of around 10–300 kb that have been copied from
one region of the genome into another region; and (5) blocks of
tandemly repeated sequences, such as at centromeres, telomeres,
the short arms of acrocentric chromosomes and ribosomal gene
clusters. (These regions are intentionally under-represented in the
draft genome sequence and are not discussed here.)
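Of the five classes, simple sequence repeats are the easiest to illustrate in code. The sketch below finds tandem runs of a short unit such as (A)n, (CA)n or (CGG)n with a regular expression; the maximum unit length, the minimum copy number and the function name are illustrative assumptions, not thresholds used in this analysis.

```python
import re

def find_simple_repeats(seq, max_unit=3, min_copies=5):
    """Return (start, end, unit, copies) for exact tandem runs of a short unit.

    The lazy quantifier tries the shortest unit first, so a run of A's is
    reported with unit 'A' rather than 'AA' or 'AAA'.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits
```

This matches only perfect repeats; real simple sequence repeats often carry mismatches, which an exact pattern of this kind would split into separate runs or miss.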
Repeats are often described as 'junk' and dismissed as uninteresting.
However, they actually represent an extraordinary trove of
information about biological processes. The repeats constitute a
rich palaeontological record, holding crucial clues about evolutionary
events and forces. As passive markers, they provide assays
for studying processes of mutation and selection. It is possible to
recognize cohorts of repeats 'born' at the same time and to follow
their fates in different regions of the genome or in different species.
As active agents, repeats have reshaped the genome by causing
ectopic rearrangements, creating entirely new genes, modifying and
reshuffling existing genes, and modulating overall GC content. They
also shed light on chromosome structure and dynamics, and
provide tools for medical genetic and population genetic studies.
The human is the first repeat-rich genome to be sequenced, and
so we investigated what information could be gleaned from this
majority component of the human genome. Although some of the
general observations about repeats were suggested by previous
studies, the draft genome sequence provides the first comprehensive
view, allowing some questions to be resolved and new mysteries to
emerge.
Transposon-derived repeats
Most human repeat sequence is derived from transposable
elements (refs 142, 143). We can currently recognize about 45% of the
genome as belonging to this class. Much of the remaining
'unique' DNA must also be derived from ancient transposable
element copies that have diverged too far to be recognized as
such. To describe our analyses of interspersed repeats, it is necessary
briefly to review the relevant features of human transposable
elements.
Classes of transposable elements. In mammals, almost all trans-
posable elements fall into one of four types (Fig. 17), of which three
transpose through RNA intermediates and one transposes directly
as DNA. These are long interspersed elements (LINEs), short
interspersed elements (SINEs), LTR retrotransposons and DNA
transposons.
LINEs are one of the most ancient and successful inventions in
eukaryotic genomes. In humans, these transposons are about 6 kb
long, harbour an internal polymerase II promoter and encode two
open reading frames (ORFs). Upon translation, a LINE RNA
assembles with its own encoded proteins and moves to the nucleus,
where an endonuclease activity makes a single-stranded nick and
the reverse transcriptase uses the nicked DNA to prime reverse
transcription from the 3′ end of the LINE RNA. Reverse transcription
frequently fails to proceed to the 5′ end, resulting in many
truncated, nonfunctional insertions. Indeed, most LINE-derived
repeats are short, with an average size of 900 bp for all LINE1 copies,
and a median size of 1,070 bp for copies of the currently active
LINE1 element (L1Hs). New insertion sites are flanked by a small
Figure 16 Rate of recombination (cM per Mb) averaged across the euchromatic portion of each
chromosome arm plotted against the length of the chromosome arm in Mb. For large
chromosomes, the average recombination rates are very similar, but as chromosome arm
length decreases, average recombination rates rise markedly.
© 2001 Macmillan Magazines Ltd
map. The effect can be seen, for example, from the higher slope at
both ends of chromosome 12 (Fig. 15). Regional and sex-speci®c
effects have been observed for chromosome 21 (refs 110, 134).
Why is recombination higher on smaller chromosome arms? A
higher rate would increase the likelihood of at least one crossover
during meiosis on each chromosome arm, as is generally observed
in human chiasmata counts
135
. Crossovers are believed to be
necessary for normal meiotic disjunction of homologous chromo-
some pairs in eukaryotes. An extreme example is the pseudoauto-
somal regions on chromosomes Xp and Yp, which pair during male
meiosis; this physical region of only 2.6 Mb has a genetic length of
50 cM (corresponding to 20 cM per Mb), with the result that a
crossover is virtually assured.
Mechanistically, the increased rate of recombination on shorter
chromosome arms could be explained if, once an initial recombina-
tion event occurs, additional nearby events are blocked by positive
crossover interference on each arm. Evidence from yeast mutants in
which interference is abolished shows that interference plays a key
role in distributing a limited number of crossovers among the
various chromosome arms in yeast (ref. 136). An alternative possibility is
that a checkpoint mechanism scans for and enforces the presence of
at least one crossover on each chromosome arm.
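One way to see how an obligate crossover could produce the pattern in Fig. 16 is a toy model in which every arm carries at least one crossover (50 cM) per meiosis, while longer arms otherwise recombine at a fixed background rate. The background rate of 1 cM per Mb used below is an illustrative assumption, not a value fitted to the data, and the function name is hypothetical.

```python
def expected_rate(arm_length_mb: float, background_cm_per_mb: float = 1.0) -> float:
    """Average rate (cM per Mb) when each arm has a genetic length of at least 50 cM."""
    genetic_length_cm = max(50.0, background_cm_per_mb * arm_length_mb)
    return genetic_length_cm / arm_length_mb


for length_mb in (20, 40, 80, 160):
    print(f"{length_mb:>4} Mb arm: {expected_rate(length_mb):.2f} cM per Mb")
```

Under these assumptions the rate is flat for arms longer than 50 Mb and rises as 50/L for shorter arms, which is qualitatively the shape seen in Fig. 16. The model does not distinguish between interference and a checkpoint as the underlying cause.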
Variation in recombination rates along chromosomes and
between the sexes is likely to reflect variation in the formation of
meiosis-induced double-strand breaks (DSBs) that initiate
recombination. DSBs in yeast have been associated with open
chromatin (refs 137, 138), rather than with specific DNA sequence motifs.
With the availability of the draft genome sequence, it should be
possible to explore in an analogous manner whether variation
in human recombination rates reflects systematic differences in
chromosome accessibility during meiosis.
Repeat content of the human genome
A puzzling observation in the early days of molecular biology was
that genome size does not correlate well with organismal complexity.
For example, Homo sapiens has a genome that is 200 times as large as
that of the yeast S. cerevisiae, but only one two-hundredth the size of
that of Amoeba dubia (refs 139, 140). This mystery (the C-value paradox) was largely
resolved with the recognition that genomes can contain a large
quantity of repetitive sequence, far in excess of that devoted to
protein-coding genes (reviewed in refs 140, 141).
In the human, coding sequences comprise less than 5% of the
genome (see below), whereas repeat sequences account for at least
50% and probably much more. Broadly, the repeats fall into five
classes: (1) transposon-derived repeats, often referred to as
interspersed repeats; (2) inactive (partially) retroposed copies of cellular
genes (including protein-coding genes and small structural RNAs),
usually referred to as processed pseudogenes; (3) simple sequence
repeats, consisting of direct repetitions of relatively short k-mers
such as (A)n, (CA)n or (CGG)n; (4) segmental duplications,
consisting of blocks of around 10–300 kb that have been copied from
one region of the genome into another region; and (5) blocks of
tandemly repeated sequences, such as at centromeres, telomeres,
the short arms of acrocentric chromosomes and ribosomal gene
clusters. (These regions are intentionally under-represented in the
draft genome sequence and are not discussed here.)
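Of the five classes, simple sequence repeats are the easiest to illustrate in code. The following is a minimal sketch, not the method used for the draft sequence: it uses a regular-expression back-reference to find tandem runs of a short unit. The function name, the thresholds (units of up to 3 bases, at least 4 copies) and the example sequence are all assumptions made for illustration.

```python
import re


def find_simple_repeats(sequence: str, max_unit: int = 3, min_copies: int = 4):
    """Return (start, unit, copies) for tandem runs of short k-mers."""
    hits = []
    # A unit of 1..max_unit bases, followed by at least (min_copies - 1) more copies.
    # The lazy quantifier makes the shortest possible unit win, so AAAA is (A)4.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Made-up sequence containing an (AC)n, a (CGG)n and an (A)n run.
example = "TTG" + "AC" * 5 + "AGGT" + "CGG" * 4 + "T" + "A" * 7 + "T"
for start, unit, copies in find_simple_repeats(example):
    print(start, f"({unit}){copies}")
```

Note that the reported unit depends on the phase in which the run starts, so a (CA)n repeat may be reported as (AC)n.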
Repeats are often described as 'junk' and dismissed as uninteresting.
However, they actually represent an extraordinary trove of
information about biological processes. The repeats constitute a
rich palaeontological record, holding crucial clues about evolu-
tionary events and forces. As passive markers, they provide assays
for studying processes of mutation and selection. It is possible to
recognize cohorts of repeats 'born' at the same time and to follow
their fates in different regions of the genome or in different species.
As active agents, repeats have reshaped the genome by causing
ectopic rearrangements, creating entirely new genes, modifying and
reshuffling existing genes, and modulating overall GC content. They
also shed light on chromosome structure and dynamics, and
provide tools for medical genetic and population genetic studies.
The human is the first repeat-rich genome to be sequenced, and
so we investigated what information could be gleaned from this
majority component of the human genome. Although some of the
general observations about repeats were suggested by previous
studies, the draft genome sequence provides the first comprehensive
view, allowing some questions to be resolved and new mysteries to
emerge.
Transposon-derived repeats
Most human repeat sequence is derived from transposable
elements (refs 142, 143). We can currently recognize about 45% of the
genome as belonging to this class. Much of the remaining
'unique' DNA must also be derived from ancient transposable
element copies that have diverged too far to be recognized as
such. To describe our analyses of interspersed repeats, it is necessary
briefly to review the relevant features of human transposable
elements.
Classes of transposable elements. In mammals, almost all trans-
posable elements fall into one of four types (Fig. 17), of which three
transpose through RNA intermediates and one transposes directly
as DNA. These are long interspersed elements (LINEs), short
interspersed elements (SINEs), LTR retrotransposons and DNA
transposons.
LINEs are one of the most ancient and successful inventions in
eukaryotic genomes. In humans, these transposons are about 6 kb
long, harbour an internal polymerase II promoter and encode two
open reading frames (ORFs). Upon translation, a LINE RNA
assembles with its own encoded proteins and moves to the nucleus,
where an endonuclease activity makes a single-stranded nick and
the reverse transcriptase uses the nicked DNA to prime reverse
transcription from the 3′ end of the LINE RNA. Reverse transcription
frequently fails to proceed to the 5′ end, resulting in many
truncated, nonfunctional insertions. Indeed, most LINE-derived
repeats are short, with an average size of 900 bp for all LINE1 copies,
and a median size of 1,070 bp for copies of the currently active
LINE1 element (L1Hs). New insertion sites are flanked by a small
articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 879
[Figure 16 plot: recombination rate (cM per Mb, axis 0–3) against length of chromosome arm (Mb, axis 0–160).]
Figure 16 Rate of recombination averaged across the euchromatic portion of each
chromosome arm plotted against the length of the chromosome arm in Mb. For large
chromosomes, the average recombination rates are very similar, but as chromosome arm
length decreases, average recombination rates rise markedly.
© 2001 Macmillan Magazines Ltd
target site duplication of 7–20 bp. The LINE machinery is believed
to be responsible for most reverse transcription in the genome,
including the retrotransposition of the non-autonomous SINEs (ref. 144)
and the creation of processed pseudogenes (refs 145, 146). Three distantly
related LINE families are found in the human genome: LINE1,
LINE2 and LINE3. Only LINE1 is still active.
SINEs are wildly successful freeloaders on the backs of LINE
elements. They are short (about 100–400 bp), harbour an internal
polymerase III promoter and encode no proteins. These
non-autonomous transposons are thought to use the LINE machinery
for transposition. Indeed, most SINEs 'live' by sharing the 3′ end
with a resident LINE element (ref. 144). The promoter regions of all known
SINEs are derived from tRNA sequences, with the exception of a
single monophyletic family of SINEs derived from the signal
recognition particle component 7SL. This family, which also does
not share its 3′ end with a LINE, includes the only active SINE in the
human genome: the Alu element. By contrast, the mouse has both
tRNA-derived and 7SL-derived SINEs. The human genome con-
tains three distinct monophyletic families of SINEs: the active Alu,
and the inactive MIR and Ther2/MIR3.
LTR retroposons are flanked by long terminal direct repeats that
contain all of the necessary transcriptional regulatory elements. The
autonomous elements (retrotransposons) contain gag and pol
genes, which encode a protease, reverse transcriptase, RNase H
and integrase. Exogenous retroviruses seem to have arisen from
endogenous retrotransposons by acquisition of a cellular envelope
gene (env) (ref. 147). Transposition occurs through the retroviral
mechanism, with reverse transcription occurring in a cytoplasmic virus-like
particle, primed by a tRNA (in contrast to the nuclear location and
chromosomal priming of LINEs). Although a variety of LTR
retrotransposons exist, only the vertebrate-specific endogenous
retroviruses (ERVs) appear to have been active in the mammalian
genome. Mammalian retroviruses fall into three classes (I–III),
each comprising many families with independent origins. Most
(85%) of the LTR retroposon-derived 'fossils' consist only of an
isolated LTR, with the internal sequence having been lost by
homologous recombination between the flanking LTRs.
DNA transposons resemble bacterial transposons, having terminal
inverted repeats and encoding a transposase that binds near the
inverted repeats and mediates mobility through a 'cut-and-paste'
mechanism. The human genome contains at least seven major
classes of DNA transposon, which can be subdivided into many
families with independent origins (ref. 148; see RepBase,
http://www.girinst.org/~server/repbase.html). DNA transposons tend to have
short life spans within a species. This can be explained by contrasting
the modes of transposition of DNA transposons and LINE
elements. LINE transposition tends to involve only functional
elements, owing to the cis-preference by which LINE proteins
assemble with the RNA from which they were translated. By
contrast, DNA transposons cannot exercise a cis-preference: the
encoded transposase is produced in the cytoplasm and, when it
returns to the nucleus, it cannot distinguish active from inactive
elements. As inactive copies accumulate in the genome, transposition
becomes less efficient. This checks the expansion of any DNA
transposon family and in due course causes it to die out. To survive,
DNA transposons must eventually move by horizontal transfer
to virgin genomes, and there is considerable evidence for such
transfer (refs 149–153).
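The argument that a family without cis-preference eventually dies out can be illustrated with a deliberately crude toy model. Everything in the sketch below is an assumption made for illustration: the function name, the parameter values, the treatment of each transposition event as adding one copy, and the fixed rate at which active copies are inactivated by mutation. It is not a model taken from the paper or from the cited literature.

```python
def simulate_family(active=10.0, inactive=0.0, rate=0.5, decay=0.1, steps=200):
    """Toy model of a DNA transposon family without cis-preference.

    Returns a list of (active_copies, inactive_copies) after each step.
    """
    history = []
    for _ in range(steps):
        total = active + inactive
        fraction_active = active / total
        # Only active copies make transposase, so new insertions scale with them.
        new_copies = rate * active
        # The transposase cannot tell templates apart, so only a share of the
        # new insertions are copies of active elements.
        new_active = new_copies * fraction_active
        new_inactive = new_copies - new_active
        # Mutation inactivates a fixed share of the active copies per step.
        inactivated = decay * active
        active += new_active - inactivated
        inactive += new_inactive + inactivated
        history.append((active, inactive))
    return history


history = simulate_family()
peak_active = max(a for a, _ in history)
final_active, final_inactive = history[-1]
print(f"peak active copies: {peak_active:.0f}")
print(f"final active copies: {final_active:.2g}, inactive fossils: {final_inactive:.0f}")
```

With these arbitrary parameters the number of active copies first rises, then peaks and collapses as the active fraction falls, leaving only inactive fossils. A cis-preference would correspond to keeping `fraction_active` at 1 for the purposes of copying, in which case the active pool would keep growing.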
Transposable elements employ different strategies to ensure their
evolutionary survival. LINEs and SINEs rely almost exclusively on
vertical transmission within the host genome (ref. 154; but see refs 148,
155). DNA transposons are more promiscuous, requiring relatively
frequent horizontal transfer. LTR retroposons use both strategies,
with some being long-term active residents of the human genome
(such as members of the ERVL family) and others having only short
residence times.
Classes of interspersed repeat in the human genome

Class                     Type            Diagram labels         Length       Copy number  Fraction of genome
LINEs                     Autonomous      ORF1, ORF2 (pol), AAA  6–8 kb       850,000      21%
SINEs                     Non-autonomous  A, B, AAA              100–300 bp   1,500,000    13%
Retrovirus-like elements  Autonomous      gag, pol, (env)        6–11 kb      450,000      8%
                          Non-autonomous  (gag)                  1.5–3 kb
DNA transposon fossils    Autonomous      transposase            2–3 kb       300,000      3%
                          Non-autonomous                         80–3,000 bp

Figure 17 Almost all transposable elements in mammals fall into one of four classes. See text for details.
Table 11 Number of copies and fraction of genome for classes of interspersed repeat

                          Number of   Total number of bases  Fraction of the  Number of
                          copies      in the draft genome    draft genome     families
                          (× 1,000)   sequence (Mb)          sequence (%)     (subfamilies)
SINEs                     1,558       359.6                  13.14
  Alu                     1,090       290.1                  10.60            1 (~20)
  MIR                     393         60.1                   2.20             1 (1)
  MIR3                    75          9.3                    0.34             1 (1)
LINEs                     868         558.8                  20.42
  LINE1                   516         462.1                  16.89            1 (~55)
  LINE2                   315         88.2                   3.22             1 (2)
  LINE3                   37          8.4                    0.31             1 (2)
LTR elements              443         227.0                  8.29
  ERV-class I             112         79.2                   2.89             72 (132)
  ERV(K)-class II         8           8.5                    0.31             10 (20)
  ERV (L)-class III       83          39.5                   1.44             21 (42)
  MaLR                    240         99.8                   3.65             1 (31)
DNA elements              294         77.6                   2.84
  hAT group
    MER1-Charlie          182         38.1                   1.39             25 (50)
    Zaphod                13          4.3                    0.16             4 (10)
  Tc-1 group
    MER2-Tigger           57          28.0                   1.02             12 (28)
    Tc2                   4           0.9                    0.03             1 (5)
    Mariner               14          2.6                    0.10             4 (5)
  PiggyBac-like           2           0.5                    0.02             10 (20)
  Unclassified            22          3.2                    0.12             7 (7)
Unclassified              3           3.8                    0.14             3 (4)
Total interspersed
repeats                               1,226.8                44.83

The number of copies and base pair contributions of the major classes and subclasses of
transposable elements in the human genome. Data extracted from a RepeatMasker analysis of
the draft genome sequence (RepeatMasker version 09092000, sensitive settings, using RepBase
Update 5.08). In calculating percentages, RepeatMasker excluded the runs of Ns linking the contigs
in the draft genome sequence. In the last column, separate consensus sequences in the repeat
databases are considered subfamilies, rather than families, when the sequences are closely related
or related through intermediate subfamilies.
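As a quick sanity check on Table 11, the subclass rows can be summed and compared with the class totals. The sketch below transcribes the values from the table; the data layout, the helper name and the 0.15-Mb rounding tolerance are choices made here for illustration.

```python
# Each entry: class -> ((copies x 1,000, Mb, %), [member rows in the same format]).
table_11 = {
    "SINEs": ((1558, 359.6, 13.14),
              [(1090, 290.1, 10.60), (393, 60.1, 2.20), (75, 9.3, 0.34)]),
    "LINEs": ((868, 558.8, 20.42),
              [(516, 462.1, 16.89), (315, 88.2, 3.22), (37, 8.4, 0.31)]),
    "LTR elements": ((443, 227.0, 8.29),
                     [(112, 79.2, 2.89), (8, 8.5, 0.31), (83, 39.5, 1.44), (240, 99.8, 3.65)]),
    "DNA elements": ((294, 77.6, 2.84),
                     [(182, 38.1, 1.39), (13, 4.3, 0.16), (57, 28.0, 1.02), (4, 0.9, 0.03),
                      (14, 2.6, 0.10), (2, 0.5, 0.02), (22, 3.2, 0.12)]),
}
unclassified = (3, 3.8, 0.14)


def check_class(total, members, mb_tolerance=0.15):
    """True if member rows add up to the class total, allowing for rounding in Mb."""
    copies_ok = sum(m[0] for m in members) == total[0]
    mb_ok = abs(sum(m[1] for m in members) - total[1]) <= mb_tolerance
    return copies_ok and mb_ok


for name, (total, members) in table_11.items():
    print(name, "consistent" if check_class(total, members) else "inconsistent")

total_mb = sum(total[1] for total, _ in table_11.values()) + unclassified[1]
total_pct = sum(total[2] for total, _ in table_11.values()) + unclassified[2]
print(f"{total_mb:.1f} Mb, {total_pct:.2f}% of the draft genome sequence")
```

The class totals, together with the unclassified row, reproduce the reported 1,226.8 Mb and 44.83% for all interspersed repeats.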
have suggested that small deletions occur at a rate that is 75-fold
higher in flies than in mammals; the half-life of such nonfunctional
DNA is estimated at 12 Myr for flies and 800 Myr for mammals (ref. 167).
The rate of large deletions has not been systematically compared,
but seems likely also to differ markedly.
(3) Whereas in the human two repeat families (LINE1 and Alu)
account for 60% of all interspersed repeat sequence, the other
organisms have no dominant families. Instead, the worm, fly and
mustard weed genomes all contain many transposon families, each
consisting of typically hundreds to thousands of elements. This
difference may be explained by the observation that the vertically
transmitted, long-term residential LINE and SINE elements represent
75% of interspersed repeats in the human genome, but only 5–25%
in the other genomes. In contrast, the horizontally transmitted
and shorter-lived DNA transposons represent only a small portion
of all interspersed repeats in humans (6%) but a much larger
fraction in fly, mustard weed and worm (25%, 49% and 87%,
respectively). These features of the human genome are probably
general to all mammals. The relative lack of horizontally transmitted
elements may have its origin in the well developed immune system
of mammals, as horizontal transfer requires infectious vectors, such
as viruses, against which the immune system guards.
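Two of the numerical claims in points (2) and (3) can be reproduced directly. The human percentages follow from the Mb column of Table 11, and the half-lives imply very different retention of nonfunctional DNA if loss is treated as simple exponential decay. The decay model, the 100-Myr horizon and the function names below are illustrative assumptions.

```python
# Mb of draft sequence per repeat class, transcribed from Table 11.
mb = {"LINE1": 462.1, "Alu": 290.1, "LINEs": 558.8, "SINEs": 359.6, "DNA elements": 77.6}
total_interspersed_mb = 1226.8


def share(*names):
    """Percentage of all interspersed repeat sequence contributed by the named classes."""
    return 100.0 * sum(mb[name] for name in names) / total_interspersed_mb


def fraction_remaining(elapsed_myr, half_life_myr):
    """Fraction of nonfunctional DNA left, assuming exponential loss by small deletions."""
    return 0.5 ** (elapsed_myr / half_life_myr)


print(f"LINE1 + Alu: {share('LINE1', 'Alu'):.0f}% of interspersed repeats")
print(f"LINEs + SINEs: {share('LINEs', 'SINEs'):.0f}%")
print(f"DNA transposons: {share('DNA elements'):.0f}%")
print(f"left after 100 Myr, fly: {fraction_remaining(100, 12):.1%}")
print(f"left after 100 Myr, mammal: {fraction_remaining(100, 800):.1%}")
```

The shares come out close to the 60%, 75% and 6% quoted in the text. Under the decay assumption, almost all nonfunctional DNA inserted 100 Myr ago would be gone from a fly genome, whereas most of it would still be present in a mammalian genome.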
We also looked for differences among mammals, by comparing
the transposons in the human and mouse genomes. As with the
human genome, care is required in calibrating the substitution clock
for the mouse genome. There is considerable evidence that the rate
of substitution per Myr is higher in rodent lineages than in the
hominid lineages (refs 139, 168, 169). In fact, we found clear evidence for
different rates of substitution by examining families of transposable
elements whose insertions predate the divergence of the human and
mouse lineages. In an analysis of 22 such families, we found that the
substitution level was an average of 1.7-fold higher in mouse than
human (not shown). (This is likely to be an underestimate because
of an ascertainment bias against the most diverged copies.) The
faster clock in mouse is also evident from the fact that the ancient
LINE2 and MIR elements, which transposed before the mammalian
radiation and are readily detectable in the human genome, cannot
be readily identified in available mouse genomic sequence (Fig. 18).
We used the best available estimates to calibrate substitution
levels and time (ref. 169). The substitution rate in mouse relative to
human varied from about 1.7-fold higher over the past 100 Myr to
about 2.6-fold higher over the past 25 Myr.
The analysis shows that, although the overall density of the four
transposon types in human and mouse is similar, the age distribu-
tion is strikingly different (Fig. 18). Transposon activity in the
mouse genome has not undergone the decline seen in humans and
proceeds at a much higher rate. In contrast to their possible
extinction in humans, LTR retroposons are alive and well in the
mouse with such representatives as the active IAP family and
putatively active members of the long-lived ERVL and MaLR
families. LINE1 and a variety of SINEs are quite active. These
evolutionary findings are consistent with the empirical observations
that new spontaneous mutations are 30 times more likely to be
caused by LINE insertions in mouse than in human (~3% versus
0.1%) (ref. 170) and 60 times more likely to be caused by transposable
elements in general. It is estimated that around 1 in 600 mutations
in human are due to transpositions, whereas 10% of mutations in
mouse are due to transpositions (mostly IAP insertions).
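The two fold-differences quoted here follow directly from the stated fractions. The short check below uses exact rational arithmetic; the variable names are illustrative.

```python
from fractions import Fraction

# Share of new spontaneous mutations caused by LINE insertions.
line_fraction_mouse = Fraction(3, 100)    # about 3%
line_fraction_human = Fraction(1, 1000)   # 0.1%

# Share of mutations caused by transpositions of any kind.
te_fraction_mouse = Fraction(10, 100)     # 10%
te_fraction_human = Fraction(1, 600)      # around 1 in 600

line_ratio = line_fraction_mouse / line_fraction_human
te_ratio = te_fraction_mouse / te_fraction_human
print(f"LINE insertions: {line_ratio}-fold; all transposable elements: {te_ratio}-fold")
```

Because the inputs are themselves approximate ("about 3%", "around 1 in 600"), the resulting ratios should be read as rounded estimates.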
The contrast between human and mouse suggests that the
explanation for the decline of transposon activity in humans may
lie in some fundamental difference between hominids and rodents.
Population structure and dynamics would seem to be likely sus-
pects. Rodents tend to have large populations, whereas hominid
populations tend to be small and may undergo frequent bottle-
necks. Evolutionary forces affected by such factors include inbreed-
ing and genetic drift, which might affect the persistence of active
transposable elements (ref. 171). Studies in additional mammalian lineages
may shed light on the forces responsible for the differences in the
activity of transposable elements (ref. 172).
Variation in the distribution of repeats. We next explored varia-
tion in the distribution of repeats across the draft genome sequence,
by calculating the repeat density in windows of various sizes across
the genome. There is striking variation at smaller scales.
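A windowed density calculation of this kind can be sketched in a few lines. The function below is a minimal illustration rather than the pipeline used for the draft sequence: it assumes the repeat annotations are available as non-overlapping half-open intervals, and the function name, coordinates and window size are hypothetical.

```python
def repeat_density(repeat_intervals, region_length, window, step=None):
    """Fraction of each window covered by repeats.

    repeat_intervals: iterable of (start, end) half-open coordinates, assumed
    to be non-overlapping (for example, merged repeat annotations).
    Returns a list of (window_start, density) tuples.
    """
    step = step or window
    intervals = sorted(repeat_intervals)
    densities = []
    for w_start in range(0, region_length - window + 1, step):
        w_end = w_start + window
        covered = 0
        for start, end in intervals:
            if start >= w_end:
                break  # intervals are sorted, so none of the rest can overlap
            covered += max(0, min(end, w_end) - max(start, w_start))
        densities.append((w_start, covered / window))
    return densities


# Hypothetical annotations on a 1,000-bp region, for illustration only.
repeats = [(0, 150), (200, 480), (900, 1000)]
for w_start, density in repeat_density(repeats, region_length=1000, window=250):
    print(w_start, f"{density:.0%}")
```

Passing a `step` smaller than `window` gives overlapping windows, which is one way to look at the same region at several scales. The inner loop rescans all intervals for every window, which is fine for a sketch but would need an index or a sweep for genome-scale data.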
Some regions of the genome are extraordinarily dense in repeats.
The prizewinner appears to be a 525-kb region on chromosome
Xp11, with an overall transposable element density of 89%. This
region contains a 200-kb segment with 98% density, as well as a
segment of 100 kb in which LINE1 sequences alone comprise 89% of
the sequence. In addition, there are regions of more than 100 kb
with extremely high densities of Alu (>56% at three loci, including
one on 7q11 with a 50-kb stretch of >61% Alu) and the ancient
transposons MIR (>15% on chromosome 1p36) and LINE2
(>18% on chromosome 22q12).
In contrast, some genomic regions are nearly devoid of repeats.
The absence of repeats may be a sign of large-scale cis-regulatory
elements that cannot tolerate being interrupted by insertions. The
four regions with the lowest density of interspersed repeats in the
human genome are the four homeobox gene clusters, HOXA,
HOXB, HOXC and HOXD (Fig. 21). Each locus contains regions
of around 100 kb containing less than 2% interspersed repeats.
Ongoing sequence analysis of the four HOX clusters in mouse, rat
and baboon shows a similar absence of transposable elements, and
reveals a high density of conserved noncoding elements (K. Dewar
and B. Birren, manuscript in preparation). The presence of a
complex collection of regulatory regions may explain why indivi-
dual HOX genes carried in transgenic mice fail to show proper
regulation.
It may be worth investigating other repeat-poor regions, such as a
region on chromosome 8q21 (1.5% repeat over 63 kb) containing a
gene encoding a homeodomain zinc-finger protein (homologous to
mouse pID 9663936), a region on chromosome 1p36 (5% repeat
over 100 kb) with no obvious genes and a region on chromosome
18q22 (4% over 100 kb) containing three genes of unknown func-
tion (among which is KIAA0450). It will be interesting to see
whether the homologous regions in the mouse genome have
[Figure 21 diagram: tracks for chr 22 and chr 2, with the HoxD cluster marked; scale bar 100 kb.]
Figure 21 Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed
repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster,
which contains a collection of genes with complex, interrelated regulation.
[Figure 20 plot: proportion of interspersed repeats (%, axis 0–100) for human, fly, worm and mustard weed, binned by substitution level: <5% (youngest), <10%, <15%, <20%, <25%, >25% (oldest).]
Figure 20 Comparison of the age of interspersed repeats in eukaryotic genomes. The
copies of repeats were pooled by their nucleotide substitution level from the consensus.
higher in ¯ies than in mammals; the half-life of such nonfunctional
DNA is estimated at 12 Myr for ¯ies and 800 Myr for mammals
167
.
The rate of large deletions has not been systematically compared,
but seems likely also to differ markedly.
(3) Whereas in the human two repeat families (LINE1 and Alu)
account for 60% of all interspersed repeat sequence, the other
organisms have no dominant families. Instead, the worm, ¯y and
mustard weed genomes all contain many transposon families, each
consisting of typically hundreds to thousands of elements. This
difference may be explained by the observation that the vertically
transmitted, long-term residential LINE and SINE elements repre-
sent 75% of interspersed repeats in the human genome, but only 5±
25% in the other genomes. In contrast, the horizontally transmitted
and shorter-lived DNA transposons represent only a small portion
of all interspersed repeats in humans (6%) but a much larger
fraction in ¯y, mustard weed and worm (25%, 49% and 87%,
respectively). These features of the human genome are probably
general to all mammals. The relative lack of horizontally transmitted
elements may have its origin in the well developed immune system
of mammals, as horizontal transfer requires infectious vectors, such
as viruses, against which the immune system guards.
We also looked for differences among mammals, by comparing
the transposons in the human and mouse genomes. As with the
human genome, care is required in calibrating the substitution clock
for the mouse genome. There is considerable evidence that the rate
of substitution per Myr is higher in rodent lineages than in the
hominid lineages
139,168,169
. In fact, we found clear evidence for
different rates of substitution by examining families of transposable
elements whose insertions predate the divergence of the human and
mouse lineages. In an analysis of 22 such families, we found that the
substitution level was an average of 1.7-fold higher in mouse than
human (not shown). (This is likely to be an underestimate because
of an ascertainment bias against the most diverged copies.) The
faster clock in mouse is also evident from the fact that the ancient
LINE2 and MIR elements, which transposed before the mammalian
radiation and are readily detectable in the human genome, cannot
be readily identi®ed in available mouse genomic sequence (Fig. 18).
We used the best available estimates to calibrate substitution
levels and time
169
. The ratio of substitution rates varied from about
1.7-fold higher over the past 100 Myr to about 2.6-fold higher over
the past 25 Myr.
The analysis shows that, although the overall density of the four
transposon types in human and mouse is similar, the age distribu-
tion is strikingly different (Fig. 18). Transposon activity in the
mouse genome has not undergone the decline seen in humans and
proceeds at a much higher rate. In contrast to their possible
extinction in humans, LTR retroposons are alive and well in the
mouse with such representatives as the active IAP family and
putatively active members of the long-lived ERVL and MaLR
families. LINE1 and a variety of SINEs are quite active. These
evolutionary ®ndings are consistent with the empirical observations
that new spontaneous mutations are 30 times more likely to be
caused by LINE insertions in mouse than in human (,3% versus
0.1%)
170
and 60 times more likely to be caused by transposable
elements in general. It is estimated that around 1 in 600 mutations
in human are due to transpositions, whereas 10% of mutations in
mouse are due to transpositions (mostly IAP insertions).
The contrast between human and mouse suggests that the
explanation for the decline of transposon activity in humans may
lie in some fundamental difference between hominids and rodents.
Population structure and dynamics would seem to be likely sus-
pects. Rodents tend to have large populations, whereas hominid
populations tend to be small and may undergo frequent bottle-
necks. Evolutionary forces affected by such factors include inbreed-
ing and genetic drift, which might affect the persistence of active
transposable elements
171
. Studies in additional mammalian lineages
may shed light on the forces responsible for the differences in the
activity of transposable elements
172
.
Variation in the distribution of repeats. We next explored varia-
tion in the distribution of repeats across the draft genome sequence,
by calculating the repeat density in windows of various sizes across
the genome. There is striking variation at smaller scales.
Some regions of the genome are extraordinarily dense in repeats.
The prizewinner appears to be a 525-kb region on chromosome
Xp11, with an overall transposable element density of 89%. This
region contains a 200-kb segment with 98% density, as well as a
segment of 100 kb in which LINE1 sequences alone comprise 89% of
the sequence. In addition, there are regions of more than 100 kb
with extremely high densities of Alu (. 56% at three loci, including
one on 7q11 with a 50-kb stretch of . 61% Alu) and the ancient
transposons MIR (. 15% on chromosome 1p36) and LINE2
(. 18% on chromosome 22q12).
In contrast, some genomic regions are nearly devoid of repeats.
The absence of repeats may be a sign of large-scale cis-regulatory
elements that cannot tolerate being interrupted by insertions. The
four regions with the lowest density of interspersed repeats in the
human genome are the four homeobox gene clusters, HOXA,
HOXB, HOXC and HOXD (Fig. 21). Each locus contains regions
of around 100 kb containing less than 2% interspersed repeats.
Ongoing sequence analysis of the four HOX clusters in mouse, rat
and baboon shows a similar absence of transposable elements, and
reveals a high density of conserved noncoding elements (K. Dewar
and B. Birren, manuscript in preparation). The presence of a
complex collection of regulatory regions may explain why indivi-
dual HOX genes carried in transgenic mice fail to show proper
regulation.
It may be worth investigating other repeat-poor regions, such as a
region on chromosome 8q21 (1.5% repeat over 63 kb) containing a
gene encoding a homeodomain zinc-finger protein (homologous to
mouse pID 9663936), a region on chromosome 1p36 (5% repeat
over 100 kb) with no obvious genes and a region on chromosome
18q22 (4% over 100 kb) containing three genes of unknown function
(among which is KIAA0450). It will be interesting to see
whether the homologous regions in the mouse genome have
Figure 21 Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed
repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster,
which contains a collection of genes with complex, interrelated regulation.
[Figure 21 panel labels: chr 22; chr 2; HoxD cluster; scale bar, 100 kb.]
Figure 20 Comparison of the age of interspersed repeats in eukaryotic genomes. The
copies of repeats were pooled by their nucleotide substitution level from the consensus.
[Figure 20 plot: bars for human, fly, worm and mustard weed; y axis, proportion of interspersed repeats (%), 0–100; substitution-level classes <5%, <10%, <15%, <20%, <25% and >25%, running from youngest to oldest.]
genomes (refs 184–187). By studying sets of repeat elements belonging to a
common cohort, one can directly measure nucleotide substitution
rates in different regions of the genome. We find strong evidence
that the pattern of neutral substitution differs as a function of local
GC content (Fig. 27). Because the results are observed in repetitive
elements throughout the genome, the variation in the pattern of
nucleotide substitution seems likely to be due to differences in the
underlying mutational process rather than to selection.
The effect can be seen most clearly by focusing on the substitution
process $\gamma \leftrightarrow \alpha$, where $\gamma$ denotes GC or CG base pairs and $\alpha$ denotes
AT or TA base pairs. If K is the equilibrium constant in the direction
of $\alpha$ base pairs (defined by the ratio of the forward and reverse
rates), then the equilibrium GC content should be $1/(1 + K)$. Two
observations emerge.
First, there is a regional bias in substitution patterns. The
equilibrium constant varies as a function of local GC content: $\gamma$
base pairs are more likely to mutate towards $\alpha$ base pairs in AT-rich
regions than in GC-rich regions. For the analysis in Fig. 27, the
equilibrium constant K is 2.5, 1.9 and 1.2 when the draft genome
sequence is partitioned into three bins with average GC content of
37, 43 and 50%, respectively. This bias could be due to a reported
tendency for GC-rich regions to replicate earlier in the cell cycle
than AT-rich regions and for guanine pools, which are limiting for
DNA replication, to become depleted late in the cell cycle, thereby
resulting in a small but significant shift in substitution towards $\alpha$
base pairs (refs 186, 188). Another theory proposes that many substitutions
are due to differences in DNA repair mechanisms, possibly related
to transcriptional activity and thereby to gene density and GC
content (refs 185, 189, 190).
There is also an absolute bias in substitution patterns resulting in
directional pressure towards lower GC content throughout the
human genome. The genome is not at equilibrium with respect to
the pattern of nucleotide substitution: the expected equilibrium GC
content corresponding to the values of K above is 29, 35 and 44% for
regions with average GC contents of 37, 43 and 50%, respectively.
Recent observations on SNPs (ref. 190) confirm that the mutation pattern
in GC-rich DNA is biased towards $\alpha$ base pairs; it should be possible
to perform similar analyses throughout the genome with the
availability of 1.4 million SNPs (refs 97, 191). On the basis solely of nucleotide
substitution patterns, the GC content would be expected to be about
7% lower throughout the genome.
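The relation between K and the equilibrium GC content can be checked with a short calculation. If GC pairs are lost at rate u and regained at rate v, then K = u/v, and at equilibrium the GC fraction f satisfies f·u = (1 − f)·v, which gives f = 1/(1 + K). The sketch below applies this to the three K values quoted above; the function name `equilibrium_gc` and the dictionary layout are illustrative choices only.

```python
def equilibrium_gc(k):
    """Equilibrium GC fraction 1/(1 + K) for equilibrium constant K."""
    return 1.0 / (1.0 + k)


# Observed average GC content (%) of each bin -> K, as quoted in the text.
bins = {37: 2.5, 43: 1.9, 50: 1.2}

for observed, k in bins.items():
    expected = 100 * equilibrium_gc(k)
    print(f"observed {observed}% GC, K = {k}: "
          f"equilibrium {expected:.1f}% GC "
          f"(deficit {observed - expected:.1f} points)")
```

Because the K values are quoted to two significant figures, the percentages computed this way can differ by about a point from the published 29, 35 and 44%; the difference presumably reflects rounding of K, which is an assumption rather than something the text states.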
What accounts for the higher GC content? One possible explanation
is that in GC-rich regions, a considerable fraction of the
nucleotides is likely to be under functional constraint owing to
the high gene density. Selection on coding regions and regulatory
CpG islands may maintain the higher-than-predicted GC content.
Another is that throughout the rest of the genome, a constant influx
of transposable elements tends to increase GC content (Fig. 28).
Young repeat elements clearly have a higher GC content than their
surrounding regions, except in extremely GC-rich regions. Moreover,
repeat elements clearly shift with age towards a lower GC
content, closer to that of the neighbourhood in which they reside.
Much of the 'non-repeat' DNA in AT-rich regions probably consists
of ancient repeats that are not detectable by current methods and
that have had more time to approach the local equilibrium value.
The repeats can also be used to study how the mutation process is
affected by the immediately adjacent nucleotide. Such 'context
effects' will be discussed elsewhere (A. Kas and A. F. A. Smit,
unpublished results).
Fast living on chromosome Y. The pattern of interspersed repeats
can be used to shed light on the unusual evolutionary history of
chromosome Y. Our analysis shows that the genetic material on
[Figure 27 plot: x axis, type of substitution (G→A/C→T, A→G/T→C, G→T/C→A, A→C/T→G, A→T/T→A, G→C/C→G); y axis, substitution level in 18% diverged sequence (%), 0–20; series, 37% GC, 43% GC and 50% GC; also labelled, average background nucleotide composition.]
Figure 27 Substitution patterns in interspersed repeats differ as a function of GC content.
We collected all copies of five DNA transposons (Tigger1, Tigger2, Charlie3, MER1 and
HSMAR2), chosen for their high copy number and well defined consensus sequences.
DNA transposons are optimal for the study of neutral substitutions: they do not segregate
into subfamilies with diagnostic differences, presumably because they are short-lived and
new active families do not evolve in a genome (see text). Duplicates and close paralogues
resulting from duplication after transposition were eliminated. The copies were grouped
on the basis of GC content of the flanking 1,000 bp on both sides and aligned to the
consensus sequence (representing the state of the copy at integration). Recursive efforts
using parameters arising from this study did not change the alignments significantly.
Alignments were inspected by hand, and obvious misalignments caused by insertions and
duplications were eliminated. Substitutions (n 80,000) were counted for each position
in the consensus, excluding those in CpG dinucleotides, and a substitution frequency
matrix was defined. From the matrices for each repeat (which corresponded to different
ages), a single rate matrix was calculated for these bins of GC content (<40% GC, 40–47%
GC and >47% GC). Data are shown for a repeat with an average divergence (in
non-CpG sites) of 18% in 43% GC content (the repeat has slightly higher divergence in
AT-rich DNA and lower in GC-rich DNA). From the rate matrix, we calculated log-likelihood
matrices with different entropies (divergence levels), which are theoretically optimal for
alignments of neutrally diverged copies to their common ancestral state (A. Kas and
A. F. A. Smit, unpublished). These matrices are in use by the RepeatMasker program.
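The counting step in this legend (substitutions tallied against the consensus, CpG sites excluded, copies binned by flanking GC content) can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the column-by-column alignment representation with `-` for gaps in the copy, and the requirement that the consensus row is ungapped are all hypothetical.

```python
from collections import Counter


def count_substitutions(consensus, copy):
    """Tally consensus->copy base pairs at non-CpG consensus positions.

    Assumes an ungapped consensus row and an equally long copy row in
    which deletions are marked '-'. Identities (a == b) are tallied too,
    so the result can be normalised into a frequency matrix.
    """
    if len(consensus) != len(copy):
        raise ValueError("aligned rows must have equal length")
    consensus, copy = consensus.upper(), copy.upper()

    # Flag both positions of every CpG dinucleotide in the consensus.
    in_cpg = [False] * len(consensus)
    for i in range(len(consensus) - 1):
        if consensus[i] == "C" and consensus[i + 1] == "G":
            in_cpg[i] = in_cpg[i + 1] = True

    counts = Counter()
    for i, (a, b) in enumerate(zip(consensus, copy)):
        if in_cpg[i] or a not in "ACGT" or b not in "ACGT":
            continue  # skip CpG sites, gaps and ambiguous bases
        counts[(a, b)] += 1
    return counts


def gc_bin(flanking):
    """Assign a copy to one of the legend's bins from its flanking sequence."""
    flanking = flanking.upper()
    gc = 100 * sum(base in "GC" for base in flanking) / len(flanking)
    if gc < 40:
        return "<40% GC"
    if gc <= 47:
        return "40-47% GC"
    return ">47% GC"


# Hypothetical toy alignment: the CG at positions 1-2 is ignored.
print(count_substitutions("ACGTTA", "ATGTCA"))
print(gc_bin("ATATATATGC"))
```

Summing such tallies over all copies in a GC bin would give one substitution frequency matrix per bin; deriving the rate matrix and log-likelihood scoring matrices from those counts is a further step not attempted here.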
[Figure 28 plot: x axis, GC content of genome DNA (%), bins <36, 39, 43, 47, 51 and >54; y axis, GC content of feature (%), 30–65; series, all DNA, young interspersed repeats (<10% divergence), all interspersed repeats and non-repeat DNA.]
Figure 28 Interspersed repeats tend to diminish the differences between GC bins, despite
the fact that GC-rich transposable elements (specifically Alu) accumulate in GC-rich DNA,
and AT-rich elements (LINE1) in AT-rich DNA. The GC content of particular components of
the sequence (repeats, young repeats and non-repeat sequence) was calculated as a
function of overall GC content.
chromosome Y is unusually young, probably owing to a high
tolerance for gain of new material by insertion and loss of old
material by deletion. Several lines of evidence support this picture.
For example, LINE elements on chromosome Y are on average much
younger than those on autosomes (not shown). Similarly, MaLR-family
retroposons on chromosome Y are younger than those on
autosomes, with the representation of subfamilies showing a strong
inverse correlation with the age of the subfamily. Moreover, chromosome
Y has a relative over-representation of the younger retroviral
class II (ERVK) and a relative under-representation of the
primarily older class III (ERVL) compared with other chromosomes.
Overall, chromosome Y seems to maintain a youthful
appearance by rapid turnover.
Interspersed repeats on chromosome Y can also be used to
estimate the relative mutation rates, $\alpha_m$ and $\alpha_f$, in the male and
female germlines. Chromosome Y always resides in males, whereas
chromosome X resides in females twice as often as in males. The
substitution rates, $\mu_Y$ and $\mu_X$, on these two chromosomes should
thus be in the ratio $\mu_Y : \mu_X = \alpha_m : (\alpha_m + 2\alpha_f)/3$, provided that one
considers equivalent neutral sequences. Several authors have estimated
the mutation rate in the male germline to be fivefold higher
than in the female germline, by comparing the rates of evolution of
X- and Y-linked genes in humans and primates. However, Page and
colleagues (ref. 192) have challenged these estimates as too high. They
studied a 39-kb region that is apparently devoid of genes and resides
within a large segmental duplication from X to Y that occurred 3–4
Myr ago in the human lineage. On the basis of phylogenetic analysis
of the sequence on human Y and human, chimp and gorilla X, they
obtained a much lower estimate of $\mu_Y : \mu_X = 1.36$, corresponding to
$\alpha_m : \alpha_f = 1.7$. They suggested that the other estimates may have been
higher because they were based on much longer evolutionary
periods or because the genes studied may have been under selection.
Our database of human repeats provides a powerful resource for
addressing this question. We identified the repeat elements from
recent subfamilies (effectively, birth cohorts dating from the past
50 Myr) and measured the substitution rates for subfamily members
on chromosomes X and Y (Fig. 29). There is a clear linear relationship
with a slope of $\mu_Y : \mu_X = 1.57$ corresponding to $\alpha_m : \alpha_f = 2.1$. The
estimate is in reasonable agreement with that of Page et al., although
it is based on much more total sequence (360 kb on Y, 1.6 Mb on X)
and a much longer time period. In particular, the discrepancy with
earlier reports is not explained by recent changes in the human
lineage. Various theories have been proposed for the higher mutation
rate in the male germline, including the greater number of cell
divisions in the formation of sperm than eggs and different repair
mechanisms in sperm and eggs.
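The conversion between the Y:X substitution ratio and the male:female mutation ratio follows directly from the relation given above. Writing R for the Y:X ratio and a for the male:female ratio, R = 3a/(a + 2), so a = 2R/(3 − R). The sketch below simply evaluates these two expressions; the function names are illustrative and nothing here reproduces the regression behind Fig. 29.

```python
def male_to_female_ratio(y_to_x):
    """Solve R = 3a/(a + 2) for a, the male:female mutation rate ratio."""
    if not 0 < y_to_x < 3:
        # R approaches 3 as the male rate dominates; larger values have no solution.
        raise ValueError("Y:X ratio must lie strictly between 0 and 3")
    return 2 * y_to_x / (3 - y_to_x)


def y_to_x_ratio(alpha):
    """Expected Y:X substitution ratio for a male:female ratio alpha."""
    return 3 * alpha / (alpha + 2)


for label, ratio in [("Page and colleagues", 1.36), ("repeat subfamilies", 1.57)]:
    print(f"{label}: Y:X = {ratio} -> male:female = {male_to_female_ratio(ratio):.2f}")

# A fivefold male excess would imply a considerably steeper Y:X ratio.
print(f"male:female = 5 -> Y:X = {y_to_x_ratio(5):.2f}")
```

A Y:X ratio of 1.36 gives about 1.7, matching the quoted figure, and a fivefold male excess would correspond to a Y:X ratio above 2. Evaluating the formula at the rounded slope of 1.57 gives a value slightly above the quoted 2.1, which presumably reflects rounding of the slope.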
Active transposons. We were interested in identifying the youngest
retrotransposons in the draft genome sequence. This set should
contain the currently active retrotransposons, as well as the insertion
sites that are still polymorphic in the human population.
The youngest branch in the phylogenetic tree of human LINE1
elements is called L1Hs (ref. 158); it differs in its 3′ untranslated
region (UTR) by 12 diagnostic substitutions from the next oldest
subfamily (L1PA2). Within the L1Hs family, there are two
subsets referred to as Ta and pre-Ta, defined by a diagnostic
trinucleotide (refs 193, 194). All active L1 elements are thought to belong to
these two subsets, because they account for all 14 known cases of
human disease arising from new L1 transposition (with 13 belonging
to the Ta subset and one to the pre-Ta subset) (refs 195, 196). These
subsets are also of great interest for population genetics because at
least 50% are still segregating as polymorphisms in the human
population (refs 194, 197); they provide powerful markers for tracing
population history because they represent unique (non-recurrent
and non-revertible) genetic events that can be used (along with
similarly polymorphic Alus) for reconstructing human migrations.
LINE1 elements that are retrotransposition-competent should
consist of a full-length sequence and should have both ORFs intact.
Eleven such elements from the Ta subset have been identified,
including the likely progenitors of mutagenic insertions into the
factor VIII and dystrophin genes (refs 198–202). A cultured cell retrotransposition
assay has revealed that eight of these elements remain
retrotransposition-competent (refs 200, 202, 203).
We searched the draft genome sequence and identified 535 LINEs
belonging to the Ta subset and 415 belonging to the pre-Ta subset.
These elements provide a large collection of tools for probing
human population history. We also identified those consisting of
full-length elements with intact ORFs, which are candidate active
LINEs. We found 39 such elements belonging to the Ta subset and
22 belonging to the pre-Ta subset; this substantially increases the
number in the first category and provides the first known examples
in the second category. These elements can now be tested for
retrotransposition competence in the cell culture assay. Preliminary
analysis resulted in the identification of two of these elements as the
likely progenitors of mutagenic insertions into the β-globin and
RP2 genes (R. Badge and J. V. Moran, unpublished data). Similar
analyses should allow the identification of the progenitors of most,
if not all, other known mutagenic L1 insertions.
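The screening criterion used here (full length, both ORFs intact) can be expressed as a simple filter. The sketch below is a hypothetical illustration of that criterion, not the search actually run on the draft sequence: the function names, the way ORF coordinates are supplied, the definition of "intact" as ATG through to a stop codon with no premature stop, and the default length threshold of 6,000 bp are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_is_intact(seq, start, end):
    """True if seq[start:end] runs ATG ... stop with no premature stop codon."""
    orf = seq[start:end].upper()
    if len(orf) < 6 or len(orf) % 3 != 0:
        return False  # too short, or a frameshift relative to the annotation
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the last one truncates the protein.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def is_candidate_active(seq, orf_coords, min_length=6000):
    """Full-length element (assumed threshold) with every listed ORF intact."""
    return len(seq) >= min_length and all(
        orf_is_intact(seq, start, end) for start, end in orf_coords
    )


# Toy element with two short ORFs; min_length is lowered for the demo.
toy = "ATGAAATAA" + "CC" + "ATGCCCTGA"
print(is_candidate_active(toy, [(0, 9), (11, 20)], min_length=10))
```

A filter of this kind only nominates candidates; as the text notes, retrotransposition competence still has to be tested in the cell culture assay.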
L1 elements can carry extra DNA if transcription extends through
the native transcriptional termination site into flanking genomic
DNA. This process, termed L1-mediated transduction, provides a
means for the mobilization of DNA sequences around the genome
and may be a mechanism for 'exon shuffling' (ref. 204). Twenty-one per
cent of the 71 full-length L1s analysed contained non-L1-derived
sequences before the 3′ target-site duplication site, in cases in which
the site was unambiguously recognizable. The length of the transduced
sequence was 30–970 bp, supporting the suggestion that 0.5–1.0%
of the human genome may have arisen by LINE-based
transduction of 3′ flanking sequences (refs 205, 206).
Our analysis also turned up two instances of 5′ transduction
(145 bp and 215 bp). Although this possibility had been suggested
on the basis of cell culture models (refs 195, 203), these are the first documented
examples. Such events may arise from transcription initiating
in a cellular promoter upstream of the L1 elements. L1
transcription is generally confined to the germline (refs 207, 208), but
transcription from other promoters could explain a somatic L1
retrotransposition event that resulted in colon cancer (ref. 206).
Transposons as a creative force. The primary force for the origin
and expansion of most transposons has been selection for their
ability to create progeny, and not a selective advantage for the host.
However, these selfish pieces of DNA have been responsible for
important innovations in many genomes, for example by contributing
regulatory elements and even new genes.
Twenty human genes have been recognized as probably derived
[Figure 29 plot: x axis, median substitution level of repeat subfamily on X (%), 0–10; y axis, median substitution level of repeat subfamily on Y (%), 0–10.]
Figure 29 Higher substitution rate on chromosome Y than on chromosome X. We
calculated the median substitution level (excluding CpG sites) for copies of the most recent
L1 subfamilies (L1Hs–L1PA8) on the X and Y chromosomes. Only the 3′ UTR of the L1
element was considered because its consensus sequence is best established.
from transposons (refs 142, 209). These include the RAG1 and RAG2 recombinases
and the major centromere-binding protein CENPB. We
scanned the draft genome sequence and identified another 27 cases,
bringing the total to 47 (Table 13; refs 142, 209). All but four are
derived from DNA transposons, which give rise to only a small
proportion of the interspersed repeats in the genome. Why there are
so many DNA transposase-like genes, many of which still contain
the critical residues for transposase activity, is a mystery.
To illustrate this concept, we describe the discovery of one of the
new examples. We searched the draft genome sequence to identify
the autonomous DNA transposon responsible for the distribution
of the non-autonomous MER85 element, one of the most recently
(40–50 Myr ago) active DNA transposons. Most non-autonomous
elements are internal deletion products of a DNA transposon. We
identified one instance of a large (1,782 bp) ORF flanked by the 5′
and 3′ halves of a MER85 element. The ORF encodes a novel protein
(partially published as pID 6453533) whose closest homologue is
the transposase of the piggyBac DNA transposon, which is found in
insects and has the same characteristic TTAA target-site
duplications (ref. 210) as MER85. The ORF is actively transcribed in fetal
brain and in cancer cells. That it has not been lost to mutation in
40–50 Myr of evolution (whereas the flanking, noncoding, MER85-like
termini show the typical divergence level of such elements) and
is actively transcribed provides strong evidence that it has been
adopted by the human genome as a gene. Its function is unknown.
LINE1 activity clearly has also had fringe benefits. We mentioned
above the possibility of exon reshuffling by cotranscription of
neighbouring DNA. The LINE1 machinery can also cause reverse
transcription of genic mRNAs, which typically results in nonfunctional
processed pseudogenes but can, occasionally, give rise to
functional processed genes. There are at least eight human and
eight mouse genes for which evidence strongly supports such an
origin (ref. 211; see http://www-ifi.uni-muenster.de/exapted-retrogenes/tables.html).
Many other intronless genes may have been created
in the same way.
Transposons have made other creative contributions to the
genome. A few hundred genes, for example, use transcriptional
terminators donated by LTR retroposons (data not shown). Other
genes employ regulatory elements derived from repeat elements
211
.
Table 13 Human genes derived from transposable elements
GenBank ID* | Gene name | Related transposon family† | Possible fusion gene§ | Newly recognized derivation∥
nID 3150436 BC200 FLAM Alu‡
pID 2330017 Telomerase non-LTR retrotransposon
pID 1196425 HERV-3 env Retroviridae/HERV-R‡
pID 4773880 Syncytin Retroviridae/HERV-W‡
pID 131827 RAG1 and 2 Tc1-like
pID 29863 CENP-B Tc1/Pogo
EST 2529718 Tc1/Pogo +
PID 10047247 Tc1/Pogo/Pogo +
EST 4524463 Tc1/Pogo/Pogo +
pID 4504807 Jerky Tc1/Pogo/Tigger
pID 7513096 JRKL Tc1/Pogo/Tigger
EST 5112721 Tc1/Pogo/Tigger +
EST 11097233 Tc1/Pogo/Tigger +
EST 6986275 Sancho Tc1/Pogo/Tigger
EST 8616450 Tc1/Pogo/Tigger +
EST 8750408 Tc1/Pogo/Tigger +
EST 5177004 Tc1/Pogo/Tigger +
PID 3413884 KIAA0461 Tc1/Pogo/Tc2 +
PID 7959287 KIAA1513 Tc1/Pogo/Tc2 +
PID 2231380 Tc1/Mariner/Hsmar1‡ +
EST 10219887 hAT/Hobo + +
PID 6581095 Buster1 hAT/Charlie +
PID 7243087 Buster2 hAT/Charlie +
PID 6581097 Buster3 hAT/Charlie
PID 7662294 KIAA0766 hAT/Charlie +
PID 10439678 hAT/Charlie +
PID 7243087 KIAA1353 hAT/Charlie +
PID 7021900 hAT/Charlie/Charlie3‡ +
PID 4263748 hAT/Charlie/Charlie8‡ +
EST 8161741 hAT/Charlie/Charlie9‡ +
pID 4758872 DAP4, pP52rIPK hAT/Tip100/Zaphod
EST 10990063 hAT/Tip100/Zaphod +
EST 10101591 hAT/Tip100/Zaphod +
pID 7513011 KIAA0543 hAT/Tip100/Tip100 +
pID 10439744 hAT/Tip100/Tip100 +
pID 10047247 KIAA1586 hAT/Tip100/Tip100 +
pID 10439762 hAT/Tip100 + +
EST 10459804 hAT/Tip100 +
pID 4160548 Tramp hAT/Tam3 +
BAC 3522927 hAT/Tam3 +
pID 3327088 KIAA0637 hAT/Tam3 +
EST 1928552 hAT/Tam3 +
pID 6453533 piggyBac/MER85‡ +
EST 3594004 piggyBac/MER85‡ +
BAC 4309921 piggyBac/MER85‡ +
EST 4073914 piggyBac/MER75‡ +
EST 1963278 piggyBac +
The Table lists 47 human genes, with a likely origin in up to 38 different transposon copies.
* Where available, the GenBank ID numbers are given for proteins, otherwise a representative EST or a clone name is shown. Six groups (two or three genes each) have similarity at the DNA level well beyond
that observed between different DNA transposon families in the genome; they are indicated in italics, with all but the initial member of each group indented. This could be explained if the genes were
paralogous (derived from a single inserted transposon and subsequently duplicated).
† Classification of the transposon.
‡ Indicates that the transposon from which the gene is derived is precisely known.
§ Proteins probably formed by fusion of a cellular and transposon gene; many have acquired zinc-finger domains.
∥ Not previously reported as being derived from transposable element genes. The remaining genes can be found in refs 142, 209.
888 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
© 2001 Macmillan Magazines Ltd
Simple sequence repeats
Simple sequence repeats (SSRs) are a rather different type of
repetitive structure that is common in the human genome: perfect
or slightly imperfect tandem repeats of a particular k-mer. SSRs with
a short repeat unit (n = 1–13 bases) are often termed microsatellites,
whereas those with longer repeat units (n = 14–500 bases)
are often termed minisatellites. With the exception of poly(A) tails
from reverse transcribed messages, SSRs are thought to arise by
slippage during DNA replication (refs 212, 213).
We compiled a catalogue of all SSRs over a given length in the
human draft genome sequence, and studied their properties
(Table 14). SSRs comprise about 3% of the human genome, with
the greatest single contribution coming from dinucleotide repeats
(0.5%). (The precise criteria for the number of repeat units and the
extent of divergence allowed in an SSR affect the exact census, but
not the qualitative conclusions.)
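As an illustration of what such a catalogue records, the following minimal sketch finds perfect tandem repeats with a regular expression. It is not the method used for Table 14 (which relied on Tandem Repeat Finder and tolerates slightly imperfect repeats); the function name `find_perfect_ssrs` and the thresholds `max_unit` and `min_copies` are assumptions chosen for the example.

```python
import re


def find_perfect_ssrs(seq, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for each perfect tandem repeat.

    Hypothetical helper: only exact repeats are reported, and the
    thresholds are illustrative rather than taken from the paper.
    """
    seq = seq.upper()
    # Lazy quantifier on the unit so the shortest repeating unit is tried first;
    # the backreference then requires at least (min_copies - 1) further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # A (CA)n-type dinucleotide repeat embedded in flanking sequence.
    print(find_perfect_ssrs("TTG" + "AC" * 5 + "TT"))
```

Changing `min_copies` or allowing mismatches would change the census, which is the sensitivity to criteria noted in the parenthetical remark above.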
There is approximately one SSR per 2 kb (the number of non-overlapping
tandem repeats is 437 per Mb). The catalogue confirms
various properties of SSRs that have been inferred from sampling
approaches (Table 15). The most frequent dinucleotide repeats are
AC and AT (50 and 35% of dinucleotide repeats, respectively),
whereas AG repeats (15%) are less frequent and GC repeats (0.1%)
are greatly under-represented. The most frequent trinucleotides are
AAT and AAC (33% and 21%, respectively), whereas ACC (4.0%),
AGC (2.2%), ACT (1.4%) and ACG (0.1%) are relatively rare.
Overall, trinucleotide SSRs are much less frequent than dinucleotide
SSRs (ref. 214).
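The percentages quoted above can be approximately recovered from the per-Mb densities in Table 15. The sketch below does the arithmetic; the dictionary and function names are assumptions for illustration. Because the table entries are rounded to one decimal place, the computed shares match the quoted figures only roughly, especially for the rare trinucleotides.

```python
# Densities (SSRs per Mb) transcribed from Table 15.
DINUCLEOTIDES = {"AC": 27.7, "AT": 19.4, "AG": 8.2, "GC": 0.1}
TRINUCLEOTIDES = {
    "AAT": 4.1, "AAC": 2.6, "AGG": 1.5, "AAG": 1.4, "ATG": 0.7,
    "CGG": 0.6, "ACC": 0.4, "AGC": 0.3, "ACT": 0.2, "ACG": 0.0,
}


def shares(density_per_mb):
    """Convert per-Mb densities into percentages of the class total."""
    total = sum(density_per_mb.values())
    return {unit: 100.0 * d / total for unit, d in density_per_mb.items()}


if __name__ == "__main__":
    for name, table in (("dinucleotide", DINUCLEOTIDES),
                        ("trinucleotide", TRINUCLEOTIDES)):
        for unit, pct in shares(table).items():
            print(f"{name:13s} {unit:3s} {pct:5.1f}%")
```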
SSRs have been extremely important in human genetic studies,
because they show a high degree of length polymorphism in the
human population owing to frequent slippage by DNA polymerase
during replication. Genetic markers based on SSRs, particularly
(CA)n repeats, have been the workhorse of most human disease-mapping
studies (refs 101, 102). The availability of a comprehensive catalogue
of SSRs is thus a boon for human genetic studies.
The SSR catalogue also allowed us to resolve a mystery regarding
mammalian genetic maps. Such genetic maps in rat, mouse and
human have a deficit of polymorphic (CA)n repeats on chromosome
X (refs 30, 101). There are two possible explanations for this deficit. There
may simply be fewer (CA)n repeats on chromosome X; or (CA)n
repeats may be as dense on chromosome X but less polymorphic in
the population. In fact, analysis of the draft genome sequence shows
that chromosome X has the same density of (CA)n repeats per Mb as
the autosomes (data not shown). Thus, the deficit of polymorphic
markers relative to autosomes results from population genetic
forces. Possible explanations include that chromosome X has a
smaller effective population size, experiences more frequent selective
sweeps reducing diversity (owing to its hemizygosity in males),
or has a lower mutation rate (owing to its more frequent passage
through the less mutagenic female germline). The availability of the
draft genome sequence should provide ways to test these alternative
explanations.
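The first and third of these explanations can be made concrete with a standard back-of-envelope argument. It assumes equal numbers of breeding males and females and neutral evolution, neither of which is stated in the text, and the symbols N_e, μ_A, μ_X and θ are introduced here only for illustration. Each breeding pair carries three copies of chromosome X for every four copies of an autosome, and expected neutral diversity increases with the population-scaled mutation parameter θ:

```latex
\frac{N_e^{X}}{N_e^{A}} = \frac{3}{4},
\qquad
\theta_A = 4 N_e \mu_A,
\qquad
\theta_X = 3 N_e \mu_X,
\qquad
\frac{\theta_X}{\theta_A} = \frac{3}{4}\cdot\frac{\mu_X}{\mu_A}.
```

Under these assumptions a smaller effective population size and a lower mutation rate on chromosome X act multiplicatively, so either alone, or both together, would reduce polymorphism without changing repeat density.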
Segmental duplications
A remarkable feature of the human genome is the segmental
duplication of portions of genomic sequence (refs 215–217). Such duplications
involve the transfer of 1–200-kb blocks of genomic sequence
to one or more locations in the genome. The locations of both
donor and recipient regions of the genome are often not tandemly
arranged, suggesting mechanisms other than unequal crossing-over
for their origin. They are relatively recent, inasmuch as strong
sequence identity is seen in both exons and introns (in contrast to
regions that are considered to show evidence of ancient duplications,
characterized by similarities only in coding regions). Indeed,
many such duplications appear to have arisen in very recent
evolutionary time, as judged by high sequence identity and by
their absence in closely related species.
Segmental duplications can be divided into two categories. First,
interchromosomal duplications are defined as segments that are
duplicated among nonhomologous chromosomes. For example, a
9.5-kb genomic segment of the adrenoleukodystrophy locus from
Xq28 has been duplicated to regions near the centromeres of
chromosomes 2, 10, 16 and 22 (refs 218, 219). Anecdotal observations
suggest that many interchromosomal duplications map near the
centromeric and telomeric regions of human chromosomes (refs 218–233).
The second category is intrachromosomal duplications, which
occur within a particular chromosome or chromosomal arm. This
category includes several duplicated segments, also known as low
copy repeat sequences, that mediate recurrent chromosomal structural
rearrangements associated with genetic disease (refs 215, 217). Examples
on chromosome 17 include three copies of a roughly 200-kb repeat
separated by around 5 Mb and two copies of a roughly 24-kb repeat
separated by 1.5 Mb. The copies are so similar (99% identity) that
paralogous recombination events can occur, giving rise to contiguous
gene syndromes: Smith–Magenis syndrome and Charcot–Marie–Tooth
syndrome 1A, respectively (refs 34, 234). Several other examples
are known and are also suspected to be responsible for recurrent
microdeletion syndromes (for example, Prader–Willi/Angelman,
Table 14 SSR content of the human genome
Length of repeat unit   Average bases per Mb   Average number of SSR elements per Mb
1                       1,660                  36.7
2                       5,046                  43.1
3                       1,013                  11.8
4                       3,383                  32.5
5                       2,686                  17.6
6                       1,376                  15.2
7                       906                    8.4
8                       1,139                  11.1
9                       900                    8.6
10                      1,576                  8.6
11                      770                    8.7
SSRs were identified by using the computer program Tandem Repeat Finder with the following
parameters: match score 2, mismatch score 3, indel 5, minimum alignment 50, maximum repeat
length 500, minimum repeat length 1.
Table 15 SSRs by repeat unit
Repeat unit   Number of SSRs per Mb
AC            27.7
AT            19.4
AG            8.2
GC            0.1
AAT           4.1
AAC           2.6
AGG           1.5
AAG           1.4
ATG           0.7
CGG           0.6
ACC           0.4
AGC           0.3
ACT           0.2
ACG           0.0
SSRs were identified as in Table 14.
Figure 30 Duplication landscape of chromosome 22. The size and location of
intrachromosomal (blue) and interchromosomal (red) duplications are depicted for
chromosome 22q, using the PARASIGHT computer program (Bailey and Eichler,
unpublished). Each horizontal line represents 1 Mb (ticks, 100-kb intervals). The
chromosome sequence is oriented from centromere (top left) to telomere (bottom right).
Pairwise alignments with >90% nucleotide identity and >1 kb long are shown. Gaps
within the chromosomal sequence are of known size and shown as empty space.
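The two categories defined in the text, together with the thresholds in the Figure 30 caption (more than 90% identity and more than 1 kb), can be expressed as a small filter. This is only a sketch: the `Alignment` record, its field names and the example values are hypothetical, and it is not the PARASIGHT program.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    """One pairwise alignment between two genomic segments (hypothetical record)."""
    chrom_a: str
    chrom_b: str
    length_bp: int
    identity: float  # fraction of identical bases, 0.0 to 1.0


def classify(aln: Alignment,
             min_identity: float = 0.90,
             min_length_bp: int = 1000) -> Optional[str]:
    """Label an alignment as an intra- or interchromosomal duplication.

    Returns None when the alignment does not exceed both thresholds.
    """
    if aln.identity <= min_identity or aln.length_bp <= min_length_bp:
        return None
    if aln.chrom_a == aln.chrom_b:
        return "intrachromosomal"
    return "interchromosomal"


if __name__ == "__main__":
    # Illustrative values only; the identities are not taken from the paper.
    print(classify(Alignment("X", "16", 9500, 0.93)))
    print(classify(Alignment("17", "17", 24000, 0.99)))
```

A caller that also wants to exclude possible assembly artefacts could add an upper identity bound.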
Page 33
duplications may well be underestimated by the current analysis. An
understanding of the biology, pathology and evolution of these
duplications will require specialized efforts within these exceptional
regions of the human genome. The presence and distribution of
such segments may provide evolutionary fodder for processes of
exon shuffling and a general increase in protein diversity associated
with domain accretion. It will be important to consider both
genome-wide duplication events and more restricted punctuated
events of genome duplication as forces in the evolution of vertebrate
genomes.
Gene content of the human genome
Genes (or at least their coding regions) comprise only a tiny fraction
of human DNA, but they represent the major biological function of
the genome and the main focus of interest by biologists. They are
also the most challenging feature to identify in the human genome
sequence.
The ultimate goal is to compile a complete list of all human genes
and their encoded proteins, to serve as a `periodic table' for
biomedical research (ref. 243). But this is a difficult task. In organisms
with small genomes, it is straightforward to identify most genes by
the presence of long ORFs. In contrast, human genes tend to have
small exons (encoding an average of only 50 codons) separated by
long introns (some exceeding 10 kb). This creates a signal-to-noise
problem, with the result that computer programs for direct gene
prediction have only limited accuracy. Instead, computational
prediction of human genes must rely largely on the availability of
cDNA sequences or on sequence conservation with genes and
proteins from other organisms. This approach is adequate for
strongly conserved genes (such as histones or ubiquitin), but may
be less sensitive to rapidly evolving genes (including many crucial to
speciation, sex determination and fertilization).
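The long-ORF approach that works for small genomes can be sketched in a few lines. The function below scans the three forward reading frames for the longest stretch that begins with ATG and ends at the first in-frame stop codon; the name `longest_orf` and the forward-strand-only simplification are assumptions for illustration, not a description of any gene-prediction program used here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, n_codons) of the longest forward-strand ORF.

    An ORF here runs from an ATG to the first in-frame stop codon;
    n_codons excludes the stop. Returns (0, 0, 0) if none is found.
    """
    seq = seq.upper()
    best = (0, 0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons > best[2]:
                    best = (start, i + 3, n_codons)
                start = None
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTGGGTAACC"))
```

With exons averaging only about 50 codons and separated by long introns, a scan like this returns many short, uninformative hits on human genomic DNA, which is the signal-to-noise problem described above.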
Here we describe our efforts to recognize both the RNA genes and
protein-coding genes in the human genome. We also study the
properties of the predicted human protein set, attempting to discern
how the human proteome differs from those of invertebrates such as
worm and fly.
Noncoding RNAs
Although biologists often speak of a tight coupling between `genes
Table 18 Cross-species comparison for large, highly homologous segmental duplications
          Percentage of genome (%)
Length    Fly     Worm    Human (finished)*
>1 kb     1.2     4.25    3.25
>5 kb     0.37    1.50    2.86
>10 kb    0.08    0.66    2.52
* This is an underestimate of the total amount of segmental duplication in the human genome
because it only reflects duplication detectable with available finished sequence. The proportion of
segmental duplications of >1 kb is probably about 5% (see text).
Figure 33 a–d, Sequence properties of segmental duplications. Distributions of length
and per cent nucleotide identity for segmental duplications are shown as a function of the
number of aligned bp, for the subset of finished genome sequence. Intrachromosomal,
red; interchromosomal, blue. Panel columns: Intrachromosomal, Interchromosomal. Axes:
sum of aligned bases (kb) plotted against similarity (%), in bins of 0.5 from 90 to 99, and
against length of alignment (kb), in bins 1 to 9, 10–19, 20–29, 30–39, 40–49 and 50+.
Table 17 Fraction of the draft genome sequence in inter- and intrachromosomal duplications
Chromosome   Intrachromosomal (%)   Interchromosomal (%)   All (%)
1            2.1                    1.7                    3.4
2            1.6                    1.6                    2.6
3            1.8                    1.4                    2.7
4            1.5                    2.2                    3.0
5            1.0                    0.9                    1.8
6            1.5                    1.4                    2.7
7            3.6                    1.8                    4.5
8            1.2                    1.5                    2.1
9            2.1                    2.3                    3.8
10           3.3                    2.0                    4.7
11           2.7                    1.4                    3.7
12           2.1                    1.2                    2.8
13           1.7                    1.6                    3.0
14           0.6                    0.6                    1.2
15           4.1                    4.4                    6.7
16           3.4                    3.4                    5.5
17           4.4                    1.7                    5.7
18           0.9                    1.0                    1.9
19           5.4                    1.6                    6.3
20           0.8                    1.4                    2.0
21           1.9                    4.0                    4.8
22           6.8                    7.7                    11.9
X            1.2                    1.1                    2.2
Y            10.9                   13.1                   20.8
NA           2.3                    7.8                    8.3
UL           11.6                   20.8                   22.2
Total        2.3                    2.0                    3.6
Excludes duplications with identities >98% to avoid artefactual duplication due to incomplete
merger in the assembly process. Calculation was performed on an earlier version of the draft
genome sequence based on data available in July 2000 and reflects the duplications found within
the total amount of finished sequence then. Note that there is some overlap between the
interchromosomal and intrachromosomal sets.
understanding of the biology, pathology and evolution of these
duplications will require specialized efforts within these exceptional
regions of the human genome. The presence and distribution of
such segments may provide evolutionary fodder for processes of
exon shuf¯ing and a general increase in protein diversity associated
with domain accretion. It will be important to consider both
genome-wide duplication events and more restricted punctuated
events of genome duplication as forces in the evolution of vertebrate
genomes.
Gene content of the human genome
Genes (or at least their coding regions) comprise only a tiny fraction
of human DNA, but they represent the major biological function of
the genome and the main focus of interest by biologists. They are
also the most challenging feature to identify in the human genome
sequence.
The ultimate goal is to compile a complete list of all human genes
and their encoded proteins, to serve as a `periodic table' for
biomedical research
243
. But this is a dif®cult task. In organisms
with small genomes, it is straightforward to identify most genes by
the presence of long ORFs. In contrast, human genes tend to have
small exons (encoding an average of only 50 codons) separated by
long introns (some exceeding 10 kb). This creates a signal-to-noise
problem, with the result that computer programs for direct gene
prediction have only limited accuracy. Instead, computational
prediction of human genes must rely largely on the availability of
cDNA sequences or on sequence conservation with genes and
proteins from other organisms. This approach is adequate for
strongly conserved genes (such as histones or ubiquitin), but may
be less sensitive to rapidly evolving genes (including many crucial to
speciation, sex determination and fertilization).
Here we describe our efforts to recognize both the RNA genes and
protein-coding genes in the human genome. We also study the
properties of the predicted human protein set, attempting to discern
how the human proteome differs from those of invertebrates such as
worm and ¯y.
Noncoding RNAs
Although biologists often speak of a tight coupling between `genes
articles
892 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
Table 18 Cross-species comparison for large, highly homologous segmen-
tal duplications
Percentage of genome (%)
Fly Worm Human (®nished)*
. 1 kb 1.2 4.25 3.25
. 5 kb 0.37 1.50 2.86
. 10 kb 0.08 0.66 2.52
.............................................................................................................................................................................
* This is an underestimate of the total amount of segmental duplication in the human genome
because it only re¯ects duplication detectable with available ®nished sequence. The proportion of
segmental duplications of . 1 kb is probably about 5% (see text).
a b
c d
Intrachromosomal Interchromosomal
0
200,000
400,000
600,000
800,000
1,000,000
1,200,000
1,400,000
1,600,000
S
um
o
f a
lig
ne
d
ba
se
s
(k
b)
0
100,000
200,000
300,000
400,000
500,000
600,000
700,000
800,000
900,000
Similarity (%)
S
um
o
f a
lig
ne
d
ba
se
s
(k
b)
S
um
o
f a
lig
ne
d
ba
se
s
(k
b)
S
um
o
f a
lig
ne
d
ba
se
s
(k
b)
0
200,000
400,000
600,000
800,000
1,000,000
1,200,000
Length of alignment (kb)
0
500,000
1,000,000
1,500,000
2,000,000
2,500,000
3,000,000
3,500,000
4,000,000
90 91 92 93 94 95 96 97 98 9990.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
Similarity (%)
90 91 92 93 94 95 96 97 98 9990.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
1 3 4 5 6 7 8 9 10–19 20–29 30–39 40–492
Length of alignment (kb)
1 3 4 5 6 7 8 9 10–19 20–29 30–39 40–49 50+50+ 2
Figure 33 a±d, Sequence properties of segmental duplications. Distributions of length
and per cent nucleotide identity for segmental duplications are shown as a function of the
number of aligned bp, for the subset of ®nished genome sequence. Intrachromosomal,
red; interchromosomal, blue.
Table 17 Fraction of the draft genome sequence in inter- and intrachromosomal duplications
Chromosome   Intrachromosomal (%)   Interchromosomal (%)   All (%)
1            2.1                    1.7                    3.4
2            1.6                    1.6                    2.6
3            1.8                    1.4                    2.7
4            1.5                    2.2                    3.0
5            1.0                    0.9                    1.8
6            1.5                    1.4                    2.7
7            3.6                    1.8                    4.5
8            1.2                    1.5                    2.1
9            2.1                    2.3                    3.8
10           3.3                    2.0                    4.7
11           2.7                    1.4                    3.7
12           2.1                    1.2                    2.8
13           1.7                    1.6                    3.0
14           0.6                    0.6                    1.2
15           4.1                    4.4                    6.7
16           3.4                    3.4                    5.5
17           4.4                    1.7                    5.7
18           0.9                    1.0                    1.9
19           5.4                    1.6                    6.3
20           0.8                    1.4                    2.0
21           1.9                    4.0                    4.8
22           6.8                    7.7                    11.9
X            1.2                    1.1                    2.2
Y            10.9                   13.1                   20.8
NA           2.3                    7.8                    8.3
UL           11.6                   20.8                   22.2
Total        2.3                    2.0                    3.6
Excludes duplications with identities > 98% to avoid artefactual duplication due to incomplete
merger in the assembly process. Calculation was performed on an earlier version of the draft
genome sequence based on data available in July 2000 and reflects the duplications found within
the total amount of finished sequence then. Note that there is some overlap between the
interchromosomal and intrachromosomal sets.
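The overlap mentioned in the note to Table 17 can be estimated by inclusion–exclusion. The sketch below assumes that the 'All' column is the union of the two sets, which is our reading of the table rather than something the note states; the function name is illustrative and only a few rows are transcribed.

```python
# Rows transcribed from Table 17:
# (intrachromosomal %, interchromosomal %, all %).
TABLE_17 = {
    "1": (2.1, 1.7, 3.4),
    "22": (6.8, 7.7, 11.9),
    "Y": (10.9, 13.1, 20.8),
    "Total": (2.3, 2.0, 3.6),
}


def overlap_percent(intra, inter, combined):
    """Inclusion-exclusion estimate of the shared fraction.

    Assumes 'combined' is the union of the two sets (an assumption,
    not stated in the table note).
    """
    return round(intra + inter - combined, 1)


for name, row in TABLE_17.items():
    print(name, overlap_percent(*row))
```

Under that assumption, roughly 0.7% of the draft sequence would fall in both categories genome-wide.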
© 2001 Macmillan Magazines Ltd
and their encoded protein products', it is important to remember
that thousands of human genes produce noncoding RNAs (ncRNAs) as
their ultimate product (ref. 244). There are several major classes of
ncRNA: (1) transfer RNAs (tRNAs) are the adapters that translate the
triplet nucleic acid code of RNA into the amino-acid sequence of
proteins; (2) ribosomal RNAs (rRNAs) are also central to the
translational machinery, and recent X-ray crystallography results
strongly indicate that peptide bond formation is catalysed by rRNA,
not protein (refs 245, 246); (3) small nucleolar RNAs (snoRNAs) are
required for rRNA processing and base modification in the nucleolus
(refs 247, 248); and (4) small nuclear RNAs (snRNAs) are critical
components of spliceosomes, the large ribonucleoprotein (RNP)
complexes that splice introns out of pre-mRNAs in the nucleus.
Humans have both a major, U2 snRNA-dependent spliceosome that
splices most introns, and a minor, U12 snRNA-dependent spliceosome
that splices a rare class of introns that often have AT/AC
dinucleotides at the splice sites instead of the canonical GT/AG
splice site consensus (ref. 249).
Other ncRNAs include both RNAs of known biochemical function (such
as telomerase RNA and the 7SL signal recognition particle RNA) and
ncRNAs of enigmatic function (such as the large Xist transcript
implicated in X dosage compensation (ref. 250), or the small vault
RNAs found in the bizarre vault ribonucleoprotein complex (ref. 251),
which is three times the mass of the ribosome but has unknown
function).
ncRNAs do not have translated ORFs, are often small and are not
polyadenylated. Accordingly, novel ncRNAs cannot readily be found
by computational gene-finding techniques (which search for features
such as ORFs) or by experimental sequencing of cDNA or EST libraries
(most of which are prepared by reverse transcription using a primer
complementary to a poly(A) tail). Even if the complete finished
sequence of the human genome were available, discovering novel
ncRNAs would still be challenging. We can, however, identify genomic
sequences that are homologous to known ncRNA genes, using BLASTN
or, in some cases, more specialized methods.
It is sometimes difficult to tell whether such homologous genes
are orthologues, paralogues or closely related pseudogenes (because
inactivating mutations are much less obvious than for
protein-coding genes). For tRNA, there is sufficiently detailed
information about the cloverleaf secondary structure to allow true
genes and pseudogenes to be distinguished with high sensitivity. For
many other ncRNAs, there is much less structural information and so
we employ an operational criterion of high sequence similarity
(> 95% sequence identity and > 95% full length) to distinguish true
genes from pseudogenes. These assignments will eventually need to be
reconciled with experimental data.
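The operational criterion can be written as a simple filter over homology hits. This is a minimal sketch, not the pipeline used for the analysis: the function name, its parameters and the example hit values are hypothetical, and 'full length' is taken to mean the aligned fraction of the query sequence.

```python
def classify_ncrna_hit(percent_identity, aligned_length, query_length,
                       min_identity=95.0, min_coverage=95.0):
    """Label a homology hit using the operational rule from the text.

    A hit counts as a putative true gene only if it shows more than 95%
    sequence identity over more than 95% of the full query length.
    Coverage is computed as aligned_length / query_length (an assumption
    about how 'full length' is measured).
    """
    coverage = 100.0 * aligned_length / query_length
    if percent_identity > min_identity and coverage > min_coverage:
        return "putative true gene"
    return "probable pseudogene"


# Hypothetical hits against a 107-nt query.
print(classify_ncrna_hit(98.0, 106, 107))  # high identity, near full length
print(classify_ncrna_hit(99.0, 80, 107))   # high identity, truncated
print(classify_ncrna_hit(91.0, 107, 107))  # full length, diverged
```

Both thresholds must be met, so a truncated but otherwise identical copy is still labelled a probable pseudogene.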
Transfer RNA genes. The classical experimental estimate of the
number of human tRNA genes is 1,310 (ref. 252). In the draft
genome sequence, we find only 497 human tRNA genes (Tables 19,
20). How do we account for this discrepancy? We believe that the
original estimate is likely to have been inflated in two respects.
First, it came from a hybridization experiment that probably counted
closely related pseudogenes; by analysis of the draft genome
sequence, there are in fact 324 tRNA-derived putative pseudogenes
(Table 20). Second, the earlier estimate assumed too high a value for
the size of the human genome; repeating the calculation using the
correct value yields an estimate of about 890 tRNA-related loci,
which is in reasonable accord with our count of 821 tRNA genes and
pseudogenes in the draft genome sequence.
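The arithmetic behind this reconciliation can be checked directly. The sketch below uses only the numbers quoted in the text; the two genome sizes themselves are not given here, so only their implied ratio is computed, on the assumption that the hybridization estimate scales linearly with genome size.

```python
TRNA_GENES = 497           # tRNA genes found in the draft genome sequence
TRNA_PSEUDOGENES = 324     # tRNA-derived putative pseudogenes (Table 20)
CLASSICAL_ESTIMATE = 1310  # hybridization-based estimate (ref. 252)
RESCALED_ESTIMATE = 890    # same estimate after correcting the genome size

total_loci = TRNA_GENES + TRNA_PSEUDOGENES

# Ratio of corrected to originally assumed genome size, assuming the
# estimate is proportional to genome size (our assumption).
implied_size_ratio = RESCALED_ESTIMATE / CLASSICAL_ESTIMATE

# Fraction of the rescaled estimate not accounted for in the draft.
shortfall = 1 - total_loci / RESCALED_ESTIMATE

print(total_loci)                    # 821
print(round(implied_size_ratio, 2))  # 0.68
print(round(100 * shortfall, 1))     # 7.8
```

On these figures the draft sequence accounts for all but about 8% of the rescaled estimate.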
The human tRNA gene set predicted from the draft genome
sequence appears to include most of the known human tRNA
species. The draft genome sequence contains 37 of 38 human
tRNA species listed in a tRNA database (ref. 253), allowing for up to
one mismatch. This includes one copy of the known gene for a
specialized selenocysteine tRNA, one of several components of a
baroque translational mechanism that reads UGA as a selenocysteine
codon in certain rare mRNAs that carry a specific cis-acting
RNA regulatory site (a so-called SECIS element) in their 3′ UTRs.
The one tRNA gene in the database not found in the draft genome
sequence is DE9990, a tRNA-Glu species, which differs in two
positions from the most closely related tRNA gene in the human
genome. Possible explanations are that the database version of
this tRNA contains two errors, that the gene is polymorphic or that
this is a genuine functional tRNA that is missing from the draft
genome sequence. (The database also lists one additional tRNA gene
(DS9994), but this is apparently a contaminant, most similar to
bacterial tRNAs; the parent entry (Z13399) was withdrawn from the
DNA database, but the tRNA entry has not yet been removed from
the tRNA database.) Although the human set appears substantially
complete by this test, the tRNA gene numbers in Table 19 should be
considered tentative and used with caution. The human and fly (but
not the worm) genome sequences are known to be missing significant
amounts of heterochromatic DNA, and additional tRNA genes could be
located there.
With this caveat, the results indicate that the human has fewer
tRNA genes than the worm, but more than the fly. This may seem
surprising, but tRNA gene number in metazoans is thought to be
related not to organismal complexity, but more to idiosyncrasies of
the demand for tRNA abundance in certain tissues or stages of
embryonic development. For example, the frog Xenopus laevis,
which must load each oocyte with a remarkable 40 ng of tRNA,
has thousands of tRNA genes (ref. 254).
The degeneracy of the genetic code has allowed an inspired
economy of tRNA anticodon usage. Although 61 sense codons
need to be decoded, not all 61 different anticodons are present in
tRNAs. Rather, tRNAs generally follow stereotyped and conserved
wobble rules (refs 255–257). Wobble reduces the number of required
anticodons substantially, and provides a connection between the
genetic code and the hybridization stability of modified and
unmodified RNA bases. In eukaryotes, the rules proposed by Guthrie
and Abelson (ref. 256) predict that about 46 tRNA species will be
sufficient to read the 61 sense codons (counting the initiator and
elongator methionine tRNAs as two species). According to these
rules, in the codon's third (wobble) position, U and C are generally
decoded by a single tRNA species, whereas A and G are decoded by two
separate tRNA species.
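One way to arrive at the figure of 46 is to apply the rule as stated to each of the 16 codon families that share their first two bases. The sketch below is a simplified reading of the rule, not a reproduction of the original derivation in ref. 256; the function name is illustrative.

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def predicted_species_count():
    """Count tRNA species under the simplified wobble rule in the text."""
    species = 0
    for first, second in product(BASES, repeat=2):
        prefix = first + second
        # One species reads both NNU and NNC.
        species += 1
        # NNA and NNG each need their own species, unless they are stops.
        for third in "AG":
            if prefix + third not in STOP_CODONS:
                species += 1
    # The initiator and elongator methionine tRNAs count as two species.
    return species + 1


print(predicted_species_count())  # 46
```

Sixteen families give 48 anticodons, three of which would pair with stop codons; the remaining 45, plus the separate initiator methionine tRNA, give 46.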
In 'two-codon boxes' of the genetic code (where codons ending
with U/C encode a different amino acid from those ending with
A/G), the U/C wobble position should be decoded by a G at position
34 in the tRNA anticodon. Thus, in the top left of Fig. 34, there is no
tRNA with an AAA anticodon for Phe, but the GAA anticodon can
recognize both UUU and UUC codons in the mRNA. In 'four-codon
boxes' of the genetic code (where U, C, A and G in the wobble
position all encode the same amino acid), the U/C wobble position
is almost always decoded by I34 (inosine) in the tRNA, where the
inosine is produced by post-transcriptional modification of an
adenine (A). In the bottom left of Fig. 34, for example, the GUU
and GUC codons of the four-codon Val box are decoded by a tRNA
with an anticodon of AAC, which is no doubt modified to IAC.
Presumably this pattern, which is strikingly conserved in
eukaryotes, has to do with the fact that IA base pairs are also
possible; thus
Table 19 Number of tRNA genes in various organisms
Organism                    Number of canonical tRNAs   SeCys tRNA
Human                       497                         1
Worm                        584                         1
Fly                         284                         1
Yeast                       273                         0
Methanococcus jannaschii    36                          1
Escherichia coli            86                          1
Number of tRNA genes in each of six genome sequences, according to analysis by the computer
program tRNAscan-SE. Canonical tRNAs read one of the standard 61 sense codons; this category
excludes pseudogenes, undetermined anticodons, putative suppressors and selenocysteine tRNAs.
Most organisms have a selenocysteine (SeCys) tRNA species, but some unicellular eukaryotes do
not (such as the yeast S. cerevisiae).
the IAC anticodon for a Val tRNA could recognize GUU, GUC and
even GUA codons. Were this same I34 to be utilized in two-codon
boxes, however, misreading of the NNA codon would occur,
resulting in translational havoc. Eukaryotic glycine tRNAs represent
a conserved exception to this last rule; they use a GCC anticodon to
decode GGU and GGC, rather than the expected ICC anticodon.
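The rules for the U/C wobble position can be sketched as a small function that returns the unmodified anticodon expected to read both NNU and NNC. It covers only the two-codon and four-codon cases and the glycine exception described above; the function name and arguments are illustrative, and anticodons are written 5′ to 3′ so that the first base is position 34.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def pyrimidine_anticodon(prefix, four_codon_box, amino_acid=""):
    """Unmodified anticodon expected to read prefix+U and prefix+C.

    Two-codon boxes use G at position 34. Four-codon boxes use A at
    position 34 (modified to inosine in the mature tRNA), except for
    glycine, which uses G34.
    """
    if four_codon_box and amino_acid != "Gly":
        wobble_base = "A"
    else:
        wobble_base = "G"
    # The anticodon pairs antiparallel to the codon, so the remaining two
    # bases are the complements of codon positions 2 and 1, in that order.
    return wobble_base + COMPLEMENT[prefix[1]] + COMPLEMENT[prefix[0]]


print(pyrimidine_anticodon("UU", four_codon_box=False))                   # GAA (Phe)
print(pyrimidine_anticodon("GU", four_codon_box=True))                    # AAC (Val)
print(pyrimidine_anticodon("GG", four_codon_box=True, amino_acid="Gly"))  # GCC (Gly)
```

The three calls reproduce the anticodons named in the text for Phe, Val and Gly.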
Satisfyingly, the human tRNA set follows these wobble rules
almost perfectly (Fig. 34). Only three unexpected tRNA species
are found: single genes for tRNA-Tyr-AUA, tRNA-Ile-GAU and
tRNA-Asn-AUU. Perhaps these are pseudogenes, but they appear to
be plausible tRNAs. We also checked the possibility of sequencing
errors in their anticodons, but each of these three genes is in a region
of high sequence accuracy, with PHRAP quality scores higher
than 70 for every base in their anticodons.
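To give a sense of what a quality score above 70 implies, the sketch below assumes that PHRAP scores follow the phred convention, Q = -10 log10(P_error). That convention is not defined in this section, so it is stated here as an assumption, as is the independence of errors between bases.

```python
def error_probability(quality_score):
    """Per-base error probability for a phred-scaled quality score."""
    return 10 ** (-quality_score / 10)


per_base = error_probability(70)

# Chance that at least one of the three anticodon bases is miscalled,
# assuming errors at the three positions are independent.
any_of_three = 1 - (1 - per_base) ** 3

print(per_base)
print(any_of_three)
```

On that reading, a score of 70 corresponds to about one expected error in ten million bases, which makes a sequencing error an unlikely explanation for the three unexpected anticodons.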
As in all other organisms, human protein-coding genes show
codon bias, the preferential use of one synonymous codon over
another (ref. 258) (Fig. 34). In less complex organisms, such as yeast
or bacteria, highly expressed genes show the strongest codon bias.
Cytoplasmic abundance of tRNA species is correlated with both
codon bias and overall amino-acid frequency (for example, tRNAs
for preferred codons and for more common amino acids are more
abundant). This is presumably driven by selective pressure for
efficient or accurate translation (ref. 259). In many organisms, tRNA
abundance in turn appears to be roughly correlated with tRNA
gene copy number, so tRNA gene copy number has been used as a
proxy for tRNA abundance (ref. 260). In vertebrates, however, codon
bias is not so obviously correlated with gene expression level.
Differences in codon bias between human genes are more a function of
their location in regions of different GC composition (ref. 261). In
agreement with the literature, we see only a very rough correlation
of human tRNA gene number with either amino-acid frequency or codon
bias (Fig. 34). The most obvious outliers in these weak correlations
are the strongly preferred CUG leucine codon, with a mere six
tRNA-Leu-CAG genes producing a tRNA to decode it, and the relatively
rare cysteine UGU and UGC codons, with 30 tRNA genes to decode
them.
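The two outliers can be made concrete with the values shown in Fig. 34. The sketch below is an illustration of the mismatch, not a statistical test; the codon frequencies and gene counts are transcribed from the figure, and the helper name is illustrative.

```python
# Transcribed from Fig. 34: codon -> occurrences per 10,000 codons.
LEU_CODONS = {"UUA": 73, "UUG": 125, "CUU": 127,
              "CUC": 187, "CUA": 69, "CUG": 392}
# Unmodified anticodon -> number of tRNA-Leu genes.
LEU_TRNA_GENES = {"UAA": 8, "CAA": 6, "AAG": 13,
                  "GAG": 0, "UAG": 2, "CAG": 6}
CYS_CODONS = {"UGU": 99, "UGC": 119}
CYS_TRNA_GENES = 30
TOTAL_TRNA_GENES = 497


def share(part, whole):
    """Percentage of 'whole' made up by 'part', to one decimal place."""
    return round(100 * part / whole, 1)


cug_codon_share = share(LEU_CODONS["CUG"], sum(LEU_CODONS.values()))
cag_gene_share = share(LEU_TRNA_GENES["CAG"], sum(LEU_TRNA_GENES.values()))
cys_codon_share = share(sum(CYS_CODONS.values()), 10_000)
cys_gene_share = share(CYS_TRNA_GENES, TOTAL_TRNA_GENES)

print(cug_codon_share, cag_gene_share)  # 40.3 17.1
print(cys_codon_share, cys_gene_share)  # 2.2 6.0
```

CUG accounts for about 40% of leucine codons but is served by about 17% of the leucine tRNA genes, whereas cysteine codons make up about 2% of all codons and are served by about 6% of all tRNA genes.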
The tRNA genes are dispersed throughout the human genome.
However, this dispersal is nonrandom. tRNA genes have sometimes
been seen in clusters at small scales (refs 262, 263), but we can now
see striking clustering on a genome-wide scale. More than 25% of the
tRNA genes (140) are found in a region of only about 4 Mb on
chromosome 6. This small region, only about 0.1% of the genome,
contains an almost sufficient set of tRNA genes all by itself. The 140
tRNA genes contain a representative for 36 of the 49 anticodons found
in the complete set; and of the 21 isoacceptor types, only tRNAs to
decode Asn, Cys, Glu and selenocysteine are missing. Many of the
tRNA genes missing from this region are, meanwhile, clustered
elsewhere; 18 of the 30 Cys tRNAs are found in a 0.5-Mb stretch of
chromosome 7 and many of the Asn and Glu tRNA genes are loosely
clustered on chromosome 1. More than half of the tRNA genes (280 out
of 497) reside on either chromosome 1 or chromosome 6. Chromosomes
3, 4, 8, 9, 10, 12, 18, 20, 21 and X appear to have fewer than 10 tRNA
genes each; and chromosomes 22 and Y have none at all (each has a
single pseudogene).
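The degree of clustering can be expressed as a fold enrichment over a uniform distribution. The sketch below uses only the rounded figures quoted in the text, so the result is approximate; the variable names are illustrative.

```python
TOTAL_TRNA_GENES = 497
CHR6_CLUSTER_GENES = 140
CHR6_CLUSTER_SIZE_MB = 4
CHR6_CLUSTER_GENOME_FRACTION = 0.001  # "about 0.1% of the genome"
CHR1_OR_CHR6_GENES = 280

cluster_gene_fraction = CHR6_CLUSTER_GENES / TOTAL_TRNA_GENES

# Enrichment relative to genes spread evenly across the genome.
fold_enrichment = cluster_gene_fraction / CHR6_CLUSTER_GENOME_FRACTION

genes_per_mb = CHR6_CLUSTER_GENES / CHR6_CLUSTER_SIZE_MB

print(round(100 * cluster_gene_fraction, 1))                  # 28.2
print(round(fold_enrichment))                                 # 282
print(genes_per_mb)                                           # 35.0
print(round(100 * CHR1_OR_CHR6_GENES / TOTAL_TRNA_GENES, 1))  # 56.3
```

With these rounded inputs the chromosome 6 region holds tRNA genes at a few hundred times the density expected from an even spread.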
Ribosomal RNA genes. The ribosome, the protein synthetic
machine of the cell, is made up of two subunits and contains four
rRNA species and many proteins. The large ribosomal subunit
contains 28S and 5.8S rRNAs (collectively called 'large subunit'
(LSU) rRNA) and also a 5S rRNA. The small ribosomal subunit
contains 18S rRNA ('small subunit' (SSU) rRNA). The genes for
LSU and SSU rRNA occur in the human genome as a 44-kb tandem
repeat unit (ref. 264). There are thought to be about 150–200 copies
of this repeat unit arrayed on the short arms of acrocentric
chromosomes 13, 14, 15, 21 and 22 (refs 254, 264). There are no true
complete
Figure 34 The human genetic code and associated tRNA genes. For each of the 64
codons, we show: the corresponding amino acid; the observed frequency of the codon per
10,000 codons; the codon; predicted wobble pairing to a tRNA anticodon (black lines); an
unmodified tRNA anticodon sequence; and the number of tRNA genes found with this
anticodon. For example, phenylalanine is encoded by UUU or UUC; UUC is seen more
frequently, 203 to 171 occurrences per 10,000 total codons; both codons are expected to
be decoded by a single tRNA anticodon type, GAA, using a G/U wobble; and there are 14
tRNA genes found with this anticodon. The modified anticodon sequence in the mature
tRNA is not shown, even where post-transcriptional modifications can be confidently
predicted (for example, when an A is used to decode a U/C third position, the A is almost
certainly an inosine in the mature tRNA). The Figure also does not show the number of
distinct tRNA species (such as distinct sequence families) for each anticodon; often there
is more than one species for each anticodon.
[Figure values tabulated. Each row pairs a codon with its exactly complementary
unmodified anticodon; the black lines marking predicted wobble pairing are not reproduced.]
Codon   Amino acid   Frequency per 10,000 codons   Anticodon   tRNA genes
UUU     Phe          171                           AAA         0
UUC     Phe          203                           GAA         14
UUA     Leu          73                            UAA         8
UUG     Leu          125                           CAA         6
UCU     Ser          147                           AGA         10
UCC     Ser          172                           GGA         0
UCA     Ser          118                           UGA         5
UCG     Ser          45                            CGA         4
UAU     Tyr          124                           AUA         1
UAC     Tyr          158                           GUA         11
UAA     stop         0                             UUA         0
UAG     stop         0                             CUA         0
UGU     Cys          99                            ACA         0
UGC     Cys          119                           GCA         30
UGA     stop         0                             UCA         0
UGG     Trp          122                           CCA         7
CUU     Leu          127                           AAG         13
CUC     Leu          187                           GAG         0
CUA     Leu          69                            UAG         2
CUG     Leu          392                           CAG         6
CCU     Pro          175                           AGG         11
CCC     Pro          197                           GGG         0
CCA     Pro          170                           UGG         10
CCG     Pro          69                            CGG         4
CAU     His          104                           AUG         0
CAC     His          147                           GUG         12
CAA     Gln          121                           UUG         11
CAG     Gln          343                           CUG         21
CGU     Arg          47                            ACG         9
CGC     Arg          107                           GCG         0
CGA     Arg          63                            UCG         7
CGG     Arg          115                           CCG         5
AUU     Ile          165                           AAU         13
AUC     Ile          218                           GAU         1
AUA     Ile          71                            UAU         5
AUG     Met          221                           CAU         17
ACU     Thr          131                           AGU         8
ACC     Thr          192                           GGU         0
ACA     Thr          150                           UGU         10
ACG     Thr          63                            CGU         7
AAU     Asn          174                           AUU         1
AAC     Asn          199                           GUU         33
AAA     Lys          248                           UUU         16
AAG     Lys          331                           CUU         22
AGU     Ser          121                           ACU         0
AGC     Ser          191                           GCU         7
AGA     Arg          113                           UCU         5
AGG     Arg          110                           CCU         4
GUU     Val          111                           AAC         20
GUC     Val          146                           GAC         0
GUA     Val          72                            UAC         5
GUG     Val          288                           CAC         19
GCU     Ala          185                           AGC         25
GCC     Ala          282                           GGC         0
GCA     Ala          160                           UGC         10
GCG     Ala          74                            CGC         5
GAU     Asp          230                           AUC         0
GAC     Asp          262                           GUC         10
GAA     Glu          301                           UUC         14
GAG     Glu          404                           CUC         8
GGU     Gly          112                           ACC         0
GGC     Gly          230                           GCC         11
GGA     Gly          168                           UCC         5
GGG     Gly          160                           CCC         8
copies of the rDNA tandem repeats in the draft genome sequence,
owing to the deliberate bias in the initial phase of the sequencing
effort against sequencing BAC clones whose restriction fragment
fingerprints showed them to contain primarily tandemly repeated
sequence. Sequence similarity analysis with the BLASTN computer
program does, however, detect hundreds of rDNA-derived sequence
fragments dispersed throughout the complete genome, including
one 'full-length' copy of an individual 5.8S rRNA gene not
associated with a true tandem repeat unit (Table 20).
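The amount of sequence held in these tandem arrays follows from the repeat unit size and the estimated copy number. The sketch below is a back-of-envelope calculation from the two figures quoted in the text (a 44-kb unit and about 150–200 copies); the function name is illustrative.

```python
REPEAT_UNIT_KB = 44
COPY_RANGE = (150, 200)


def array_size_mb(copies, unit_kb=REPEAT_UNIT_KB):
    """Total length of a tandem array in Mb (1 Mb = 1,000 kb)."""
    return copies * unit_kb / 1000


low, high = (array_size_mb(copies) for copies in COPY_RANGE)
print(low, high)  # 6.6 8.8
```

On these figures the rDNA arrays would together span roughly 6.6 to 8.8 Mb, none of which is represented as complete repeat units in the draft genome sequence.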
The 5S rDNA genes also occur in tandem arrays, the largest of
which is on chromosome 1 between 1q41.11 and 1q42.13, close to
the telomere (refs 265, 266). There are 200–300 true 5S genes in these
arrays (refs 265, 267). The number of 5S-related sequences in the
genome, including numerous dispersed pseudogenes, is classically
cited as 2,000 (refs 252, 254). The long tandem array on chromosome 1
is not yet present in the draft genome sequence because there are no
EcoRI or HindIII sites present, and thus it was not cloned in the
most heavily utilized BAC libraries (Table 1). We expect to recover it
during the finishing stage. We do detect four individual copies of 5S
rDNA by our search criteria (≥ 95% identity and ≥ 95% full
length). We also find many more distantly related dispersed
sequences (520 at P ≤ 0.001), which we interpret as probable
pseudogenes (Table 20).
Small nucleolar RNA genes. Eukaryotic rRNA is extensively
processed and modified in the nucleolus. Much of this activity is
directed by numerous snoRNAs. These come in two families: C/D
box snoRNAs (mostly involved in guiding site-specific 2′-O-ribose
methylations of other RNAs) and H/ACA snoRNAs (mostly
involved in guiding site-specific pseudouridylations) (refs 247, 248).
We compiled a set of 97 known human snoRNA gene sequences; 84
of these (87%) have at least one copy in the draft genome sequence
(Table 20), almost all as single-copy genes.
It is thought that all 2′-O-ribose methylations and
pseudouridylations in eukaryotic rRNA are guided by snoRNAs. There
are 105–107 methylations and around 95 pseudouridylations in human
rRNA (ref. 268). Only about half of these have been tentatively
assigned to known guide snoRNAs. There are also snoRNA-directed
modifications on other stable RNAs, such as U6 (ref. 269), and
the extent of this is just beginning to be explored. Sequence
similarity has so far proven insufficient to recognize all snoRNA
genes. We therefore expect that there are many unrecognized
snoRNA genes that are not detected by BLAST queries.
Spliceosomal RNAs and other ncRNA genes. We also looked for
copies of other known ncRNA genes. We found at least one copy of
21 (95%) of 22 known ncRNAs, including the spliceosomal
snRNAs. There were multiple copies for several ncRNAs, as
expected; for example, we find 44 dispersed genes for U6 snRNA,
and 16 for U1 snRNA (Table 20).
For some of these RNA genes, homogeneous multigene families
that occur in tandem arrays are again under-represented owing to
the restriction enzymes used in constructing the BAC libraries and,
in some instances, the decision to delay the sequencing of BAC
clones with low complexity fingerprints indicative of tandemly
repeated DNA. The U2 RNA genes are located at the RNU2 locus,
a tandem array of 10–20 copies of nearly identical 6.1-kb units at
17q21–q22 (refs 270–272). Similarly, the U3 snoRNA genes
(included in the aggregate count of C/D snoRNAs in Table 20) are
clustered at the RNU3 locus at 17p11.2, not in a tandem array, but in
a complex inverted repeat structure of about 5–10 copies per
haploid genome (ref. 273). The U1 RNA genes are clustered with about
30 copies at the RNU1 locus at 1p36.1, but this cluster is thought to
be loose and irregularly organized; no two U1 genes have been
cloned on the same cosmid (ref. 271). In the draft genome sequence,
we see six copies of U2 RNA that meet our criteria for true genes,
three of which appear to be in the expected position on chromosome 17.
For U3, so far we see one true copy at the correct place on chromosome
17p11.2. For U1, we see 16 true genes, 6 of which are loosely
clustered within 0.6 Mb at 1p36.1 and another 6 are elsewhere on
chromosome 1. Again, these and other clusters will be a matter for
the finishing process.
articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 895
Table 20 Known non-coding RNA genes in the draft genome sequence

RNA gene*                Number expected†  Number found‡  Number of related genes§  Function
tRNA                     1,310             497            324                       Protein synthesis
SSU (18S) rRNA           150–200           0              40                        Protein synthesis
5.8S rRNA                150–200           1              11                        Protein synthesis
LSU (28S) rRNA           150–200           0              181                       Protein synthesis
5S rRNA                  200–300           4              520                       Protein synthesis
U1                       ~30               16             134                       Spliceosome component
U2                       10–20             6              94                        Spliceosome component
U4                       ??                4              87                        Spliceosome component
U4atac                   ??                1              20                        Component of minor (U11/U12) spliceosome
U5                       ??                1              31                        Spliceosome component
U6                       ??                44             1,135                     Spliceosome component
U6atac                   ??                4              32                        Component of minor (U11/U12) spliceosome
U7                       1                 1              3                         Histone mRNA 3′ processing
U11                      1                 0              6                         Component of minor (U11/U12) spliceosome
U12                      1                 1              0                         Component of minor (U11/U12) spliceosome
SRP (7SL) RNA            4                 3              773                       Component of signal recognition particle (protein secretion)
RNAse P                  1                 1              2                         tRNA 5′ end processing
RNAse MRP                1                 1              6                         rRNA processing
Telomerase RNA           1                 1              4                         Template for addition of telomeres
hY1                      1                 1              353                       Component of Ro RNP, function unknown
hY3                      1                 25             414                       Component of Ro RNP, function unknown
hY4                      1                 3              115                       Component of Ro RNP, function unknown
hY5 (4.5S RNA)           1                 1              9                         Component of Ro RNP, function unknown
Vault RNAs               3                 3              1                         Component of 13-MDa vault RNP, function unknown
7SK                      1                 1              330                       Unknown
H19                      1                 1              2                         Unknown
Xist                     1                 1              0                         Initiation of X chromosome inactivation (dosage compensation)
Known C/D snoRNAs        81                69             558                       Pre-rRNA processing or site-specific ribose methylation of rRNA
Known H/ACA snoRNAs      16                15             87                        Pre-rRNA processing or site-specific pseudouridylation of rRNA

* Known ncRNA genes (or gene families, such as the C/D and H/ACA snoRNA families); reference sequences were extracted from GenBank and used to probe the draft genome sequence.
† Number of genes that were expected in the human genome, based on previous literature (note that earlier experimental techniques probably tend to overestimate copy number, by counting closely related pseudogenes).
‡ The copy number of 'true' full-length genes identified in the draft genome sequence.
§ The copy number of other significantly related copies (pseudogenes, fragments, paralogues) found. Except for the 497 true tRNA genes, all sequence similarities were identified by WashU BLASTN 2.0MP
(W. Gish, unpublished; http://blast.wustl.edu), with parameters '-kap wordmask=seg B=50000 W=8' and the default +5/-4 DNA scoring matrix. True genes were operationally defined as BLAST hits
with ≥95% identity over ≥95% of the length of the query. Related sequences were operationally defined as all other BLAST hits with P-values ≤ 0.001.
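
The operational definitions in the Table 20 footnote can be sketched as a small classifier. This is a minimal illustration, not the pipeline used for the paper: the `BlastHit` record, its field names and the `classify_hit` function are hypothetical, and query coverage is assumed to be the aligned query length divided by the full query length.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One BLAST hit of a known ncRNA query (hypothetical record layout)."""
    query_length: int    # length of the query ncRNA sequence, in bases
    aligned_length: int  # number of query bases covered by the hit
    identity: float      # fraction of identical bases in the alignment (0..1)
    p_value: float       # BLAST P-value reported for the hit


def classify_hit(hit: BlastHit) -> str:
    """Apply the thresholds stated in the Table 20 footnote."""
    coverage = hit.aligned_length / hit.query_length
    # 'True' gene: at least 95% identity over at least 95% of the query.
    if hit.identity >= 0.95 and coverage >= 0.95:
        return "true"
    # 'Related' sequence: any other hit with P-value of 0.001 or less.
    if hit.p_value <= 0.001:
        return "related"
    # Everything else is not counted in either column.
    return "ignored"
```

Counting the "true" and "related" labels per query family would give the third and fourth columns of a table laid out like Table 20.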
© 2001 Macmillan Magazines Ltd
Our observations also confirm the striking proliferation of
ncRNA-derived pseudogenes (Table 20). There are hundreds or
thousands of sequences in the draft genome sequence related to
some of the ncRNA genes. The most prolific pseudogene counts
generally come from RNA genes transcribed by RNA polymerase III
promoters, including U6, the hY RNAs and SRP-RNA. These
ncRNA pseudogenes presumably arise through reverse
transcription. The frequency of such events gives insight into how
ncRNA genes can evolve into SINE retroposons, such as the tRNA-derived
SINEs found in many vertebrates and the SRP-RNA-derived Alu
elements found in humans.
Protein-coding genes
Identifying the protein-coding genes in the human genome is one of
the most important applications of the sequence data, but also one
of the most difficult challenges. We describe below our efforts to
create an initial human gene and protein index.
Exploring properties of known genes. Before attempting to
identify new genes, we explored what could be learned by aligning
the cDNA sequences of known genes to the draft genome sequence.
Genomic alignments allow one to study exon–intron structure and
local GC content, and are valuable for biomedical studies because
they connect genes with the genetic and cytogenetic map, link them
with regulatory sequences and facilitate the development of
polymerase chain reaction (PCR) primers to amplify exons. Until now,
genomic alignment was available for only about a quarter of known
genes.
The 'known' genes studied were those in the RefSeq database
(ref. 110), a manually curated collection designed to contain
nonredundant representatives of most full-length human mRNA sequences
in GenBank (RefSeq intentionally contains some alternative splice
forms of the same genes). The version of RefSeq used contained
10,272 mRNAs.
The RefSeq genes were aligned with the draft genome sequence,
using both the Spidey (S. Wheelan, personal communication) and
Acembly (D. Thierry-Mieg and J. Thierry-Mieg, unpublished;
http://www.acedb.org) computer programs. Because this sequence
is incomplete and contains errors, not all genes could be fully
aligned and some may have been incorrectly aligned. More than
92% of the RefSeq entries could be aligned at high stringency over at
least part of their length, and 85% could be aligned over more than
half of their length. Some genes (16%) had high stringency
alignments to more than one location in the draft genome sequence
owing, for example, to paralogues or pseudogenes. In such cases, we
considered only the best match. In a few of these cases, the
assignment may not be correct because the true matching region has not
yet been sequenced. Three per cent of entries appeared to be
alternative splice products of the same gene, on the basis of their
alignment to the same location in the draft genome sequence. In all,
we obtained at least partial genomic alignments for 9,212 distinct
known genes and essentially complete alignment for 5,364 of
them.
Previous efforts to study human gene structure (refs 116, 274, 275)
have been hampered by limited sample sizes and strong biases in
favour of compact genes. Table 21 gives the mean and median values of
some basic characteristics of gene structures. Some of the values may
be underestimates. In particular, the UTRs given in the RefSeq
database are likely to be incomplete; they are considerably shorter,
for example, than those derived from careful reconstructions on
chromosome 22. Intron sizes were measured only for genes in finished
genomic sequence, to mitigate the bias arising from the fact that
Table 21 Characteristics of human genes

                         Median     Mean       Sample (size)
Internal exon            122 bp     145 bp     RefSeq alignments to draft genome sequence, with confirmed intron boundaries (43,317 exons)
Exon number              7          8.8        RefSeq alignments to finished sequence (3,501 genes)
Introns                  1,023 bp   3,365 bp   RefSeq alignments to finished sequence (27,238 introns)
3′ UTR                   400 bp     770 bp     Confirmed by mRNA or EST on chromosome 22 (689)
5′ UTR                   240 bp     300 bp     Confirmed by mRNA or EST on chromosome 22 (463)
Coding sequence (CDS)    1,100 bp   1,340 bp   Selected RefSeq entries (1,804)
                         367 aa     447 aa
Genomic extent           14 kb      27 kb      Selected RefSeq entries (1,804)

Median and mean values for a number of properties of human protein-coding genes. The 1,804 selected RefSeq entries were those that could be unambiguously aligned to finished sequence over their
entire length.
[Figure 35 image not reproduced. Each panel compares human, worm and fly. Panel a: percentage of exons against exon length (bp), 0 to 1,000 bp. Panel b: percentage of introns against intron length, binned as <100 bp, 101 bp–2 kb, 2 kb–5 kb, 5–30 kb and >30 kb. Panel c: percentage of introns against intron length (bp), 0 to 160 bp.]
Figure 35 Size distributions of exons, introns and short introns, in sequenced genomes.
a, Exons; b, introns; c, short introns (enlarged from b). Confirmed exons and introns for
the human were taken from RefSeq alignments and for worm and fly from Acembly
alignments of ESTs (J. and D. Thierry-Mieg and, for worm, Y. Kohara, unpublished).
30% to 50%. The correlation appears to be due primarily to intron
size, which drops markedly with increasing GC content (Fig. 36c).
In contrast, coding properties such as exon length (Fig. 36c) or exon
number (data not shown) vary little. Intergenic distance is also
probably lower in high-GC areas, although this is hard to prove
directly until all genes have been identified.
The large number of confirmed human introns allows us to
analyse variant splice sites, confirming and extending recent
reports (ref. 281). Intron positions were confirmed by applying a
stringent criterion that EST or mRNA sequence show an exact match of
8 bp in the flanking exonic sequence on each side. Of 53,295 confirmed
introns, 98.12% use the canonical dinucleotides GT at the 5′ splice
site and AG at the 3′ site (GT–AG pattern). Another 0.76% use the
related GC–AG. About 0.10% use AT–AC, which is a rare
alternative pattern primarily recognized by the variant U12 splicing
machinery (ref. 282). The remaining 1% belong to 177 types, some of
which undoubtedly reflect sequencing or alignment errors.
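
The tally of splice-site types described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, each intron is assumed to be given as its full sequence on the transcribed strand, and the 8-bp flanking-exon confirmation step is assumed to have been applied beforehand.

```python
from collections import Counter


def splice_site_type(intron: str) -> str:
    """Return the donor-acceptor dinucleotide pair of an intron, e.g. 'GT-AG'."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to contain both splice sites")
    # First two bases are the 5' (donor) site, last two the 3' (acceptor) site.
    return f"{seq[:2]}-{seq[-2:]}"


def splice_site_percentages(introns):
    """Percentage of introns falling into each donor-acceptor type."""
    counts = Counter(splice_site_type(i) for i in introns)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}
```

Applied to a set of confirmed introns, the resulting dictionary would have one entry per observed type, with GT-AG expected to dominate and rare types collecting both genuine variants and sequencing or alignment errors.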
Finally, we looked at alternative splicing of human genes.
Alternative splicing can allow many proteins to be produced from a
single gene and can be used for complex gene regulation. It appears
to be prevalent in humans, with lower estimates of about 35% of
human genes being subject to alternative splicing (refs 283–285).
These studies may have underestimated the prevalence of alternative
splicing, because they examined only EST alignments covering
only a portion of a gene.
To investigate the prevalence of alternative splicing, we analysed
reconstructed mRNA transcripts covering the entire coding regions
of genes on chromosome 22 (omitting small genes with coding
regions of less than 240 bp). Potential transcripts identified by
alignments of ESTs and cDNAs to genomic sequence were verified
by human inspection. We found 642 transcripts, covering 245 genes
(average of 2.6 distinct transcripts per gene). Two or more
alternatively spliced transcripts were found for 145 (59%) of these
genes. A similar analysis for the gene-rich chromosome 19 gave 1,859
transcripts, corresponding to 544 genes (average 3.2 distinct
transcripts per gene). Because we are sampling only a subset of all
transcripts, the true extent of alternative splicing is likely to be
greater. These figures are considerably higher than those for worm,
in which analysis reveals alternative splicing for 22% of genes for
which ESTs have been found, with an average of 1.34 (12,816/9,516)
splice variants per gene. (The apparently higher extent of alternative
splicing seen in human than in worm was not an artefact resulting
from much deeper coverage of human genes by ESTs and mRNAs.
Although there are many times more ESTs available for human than
worm, these ESTs tend to have shorter average length (because many
were the product of early sequencing efforts) and many match no
human genes. We calculated the actual coverage per bp used in the
analysis of the human and worm genes; the coverage is only
modestly higher (about 50%) for the human, with a strong bias
towards 3′ UTRs which tend to show much less alternative splicing.
We also repeated the analysis using equal coverage for the two
organisms and confirmed that higher levels of alternative splicing
were still seen in human.)
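
The two summary statistics quoted for chromosome 22 (mean transcripts per gene and the fraction of genes with two or more transcripts) can be computed as sketched below. The function name and the input layout, a mapping from a gene identifier to its number of distinct verified transcripts, are assumptions made for illustration.

```python
def splicing_summary(transcripts_per_gene):
    """Summarize alternative splicing from a {gene_id: transcript_count} mapping."""
    n_genes = len(transcripts_per_gene)
    if n_genes == 0:
        raise ValueError("no genes supplied")
    n_transcripts = sum(transcripts_per_gene.values())
    # A gene counts as alternatively spliced if it has two or more transcripts.
    n_alternative = sum(1 for n in transcripts_per_gene.values() if n >= 2)
    return {
        "genes": n_genes,
        "transcripts": n_transcripts,
        "mean_transcripts_per_gene": n_transcripts / n_genes,
        "fraction_alternatively_spliced": n_alternative / n_genes,
    }
```

With the chromosome 22 totals given in the text, 642 transcripts over 245 genes is about 2.6 transcripts per gene, and 145 of 245 genes is about 59%.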
Seventy per cent of alternative splice forms found in the genes on
chromosomes 19 and 22 affect the coding sequence, rather than
merely changing the 3′ or 5′ UTR. (This estimate may be affected by
the incomplete representation of UTRs in the RefSeq database and
in the transcripts studied.) Alternative splicing of the terminal exon
was seen for 20% of 6,105 mRNAs that were aligned to the draft
genome sequence and correspond to confirmed 3′ EST clusters. In
addition to alternative splicing, we found evidence of the terminal
exon employing alternative polyadenylation sites (separated by
>100 bp) in 24% of cases.
Towards a complete index of human genes. We next focused on
creating an initial index of human genes and proteins. This index is
quite incomplete, owing to the difficulty of gene identification in
human DNA and the imperfect state of the draft genome sequence.
Nonetheless, it is valuable for experimental studies and provides
important insights into the nature of human genes and proteins.
The challenge of identifying genes from genomic sequence varies
greatly among organisms. Gene identification is almost trivial in
bacteria and yeast, because the absence of introns in bacteria and
their paucity in yeast means that most genes can be readily
recognized by ab initio analysis as unusually long ORFs. It is not
as simple, but still relatively straightforward, to identify genes in
animals with small genomes and small introns, such as worm and
fly. A major factor is the high signal-to-noise ratio: coding
sequences comprise a large proportion of the genome and a large
proportion of each gene (about 50% for worm and fly), and exons
are relatively large.
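
To illustrate why intron-poor genomes are comparatively easy to annotate, the following minimal sketch scans a sequence for long open reading frames. It is not a method from the paper: the function name and the 100-codon default threshold are hypothetical, and only the three forward-strand frames are scanned (a real scan would also cover the reverse complement).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_long_orfs(dna: str, min_codons: int = 100):
    """Return (start, end, frame) for forward-strand ORFs of at least min_codons.

    start is the index of the ATG; end is the index just past the stop codon.
    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    seq = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None:
                if codon == "ATG":
                    start = pos
            elif codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs
```

In a genome where exons are short and separated by long introns, as described for human DNA, a scan of this kind would find few ORFs long enough to stand out from chance, which is one way to see the lower signal-to-noise ratio.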
Gene identification is more difficult in human DNA. The
signal-to-noise ratio is lower: coding sequences comprise only a few per
cent of the genome and an average of about 5% of each gene;
internal exons are smaller than in worms; and genes appear to have
more alternative splicing. The challenge is underscored by the work
on human chromosomes 21 and 22. Even with the availability of
finished sequence and intensive experimental work, the gene
content remains uncertain, with upper and lower estimates differing by
as much as 30%. The initial report of the finished sequence of
chromosome 22 (ref. 94) identified 247 previously known genes,
298 predicted genes confirmed by sequence homology or ESTs and
325 ab initio predictions without additional support. Many of the
confirmed predictions represented partial genes. In the past year,
440 additional exons (10%) have been added to existing gene
annotations by the chromosome 22 annotation group, although
the number of confirmed genes has increased by only 17 and some
previously identified gene predictions have been merged (ref. 286).
Before discussing the gene predictions for the human genome, it
is useful to consider background issues, including previous
estimates of the number of human genes, lessons learned from worms
and flies and the representativeness of currently 'known' human
genes.
Previous estimates of human gene number. Although direct
enumeration of human genes is only now becoming possible with the advent
of the draft genome sequence, there have been many attempts in the
past quarter of a century to estimate the number of genes indirectly.
Early estimates based on reassociation kinetics estimated the mRNA
complexity of typical vertebrate tissues to be 10,000–20,000, and
were extrapolated to suggest around 40,000 for the entire genome
(ref. 287). In the mid-1980s, Gilbert suggested that there might be
about 100,000 genes, based on the approximate ratio of the size of a
typical gene (~3 × 10⁴ bp) to the size of the genome (3 × 10⁹ bp).
Although this was intended only as a back-of-the-envelope estimate,
the pleasing roundness of the figure seems to have led to it being
widely quoted and adopted in many textbooks (W. Gilbert,
personal communication; ref. 288). An estimate of 70,000–80,000
genes was made by extrapolating from the number of CpG islands
and the frequency of their association with known genes (ref. 129).
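
Written out, the back-of-the-envelope estimate attributed to Gilbert is a single division. The symbols N (gene count), G (genome size) and g (typical gene size) are introduced here for illustration only and do not appear in the original text.

```latex
N \approx \frac{G}{g}
  = \frac{3 \times 10^{9}\ \text{bp}}{3 \times 10^{4}\ \text{bp}}
  = 10^{5}\ \text{genes}
```

The calculation implicitly treats the whole genome as tiled by genes of typical size, which is one reason it is best read as a rough upper-range figure rather than a measurement.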
As human sequence information has accumulated, it has been
possible to derive estimates on the basis of sampling techniques
(ref. 289). Such studies have sought to extrapolate from various
types of data, including ESTs, mRNAs from known genes, cross-species
genome comparisons and analysis of finished chromosomes. Estimates
based on ESTs (ref. 290) have varied widely, from 35,000 (ref. 130) to
120,000 genes (ref. 291). Some of the discrepancy lies in differing
estimates of the amount of contaminating genomic sequence in the EST
collection and the extent to which multiple distinct ESTs
correspond to a single gene. The most rigorous analyses (ref. 130)
exclude as spurious any ESTs that appear only once in the data set and
carefully calibrate sensitivity and specificity. Such calculations
consistently produce low estimates, in the region of 35,000.
Comparison of whole-genome shotgun sequence from the
pufferfish T. nigroviridis with the human genome (ref. 292) can be
used to estimate the density of exons (detected as conserved sequences
articles
898 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com© 2001 Macmillan Magazines Ltd
size, which drops markedly with increasing GC content (Fig. 36c).
In contrast, coding properties such as exon length (Fig. 36c) or exon
number (data not shown) vary little. Intergenic distance is also
probably lower in high-GC areas, although this is hard to prove
directly until all genes have been identified.
The large number of confirmed human introns allows us to
analyse variant splice sites, confirming and extending recent
reports (ref. 281). Intron positions were confirmed by applying a
stringent criterion that EST or mRNA sequence show an exact match
of 8 bp in the flanking exonic sequence on each side. Of 53,295
confirmed introns, 98.12% use the canonical dinucleotides GT at the
5′ splice site and AG at the 3′ site (GT–AG pattern). Another 0.76%
use the related GC–AG. About 0.10% use AT–AC, which is a rare
alternative pattern primarily recognized by the variant U12 splicing
machinery (ref. 282). The remaining 1% belong to 177 types, some of
which undoubtedly reflect sequencing or alignment errors.
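A tally of this kind can be illustrated with a minimal Python sketch. It is not the consortium's pipeline: it assumes, hypothetically, that each confirmed intron is available as a plain DNA string running from the first base of the 5′ splice site to the last base of the 3′ splice site, and the function names and toy sequences are invented for illustration.

```python
from collections import Counter

# Splice-site classes named in the text; anything else is pooled as "other".
KNOWN_PATTERNS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def splice_pattern(intron: str) -> str:
    """Return the donor-acceptor dinucleotide pattern of an intron sequence."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to carry both splice sites")
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) in KNOWN_PATTERNS:
        return f"{donor}-{acceptor}"
    return "other"

def pattern_fractions(introns):
    """Fraction of introns falling in each splice-site class."""
    counts = Counter(splice_pattern(s) for s in introns)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

# Toy input (made-up sequences, for illustration only).
toy_introns = ["GTAAGTCCTTAG", "GTGAGTTTTCAG", "GCAAGTCCACAG", "ATATCCTTTAAC"]
fractions = pattern_fractions(toy_introns)
```

Pooling every unlisted dinucleotide pair into a single "other" class mirrors the way the text treats the remaining 177 types as a residual group that probably includes sequencing or alignment errors.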
Finally, we looked at alternative splicing of human genes.
Alternative splicing can allow many proteins to be produced from a
single gene and can be used for complex gene regulation. It appears
to be prevalent in humans, with lower estimates of about 35% of
human genes being subject to alternative splicing (refs 283–285). These
studies may have underestimated the prevalence of alternative
splicing, because they examined only EST alignments covering
only a portion of a gene.
To investigate the prevalence of alternative splicing, we analysed
reconstructed mRNA transcripts covering the entire coding regions
of genes on chromosome 22 (omitting small genes with coding
regions of less than 240 bp). Potential transcripts identified by
alignments of ESTs and cDNAs to genomic sequence were verified
by human inspection. We found 642 transcripts, covering 245 genes
(average of 2.6 distinct transcripts per gene). Two or more
alternatively spliced transcripts were found for 145 (59%) of these genes.
A similar analysis for the gene-rich chromosome 19 gave 1,859
transcripts, corresponding to 544 genes (average 3.2 distinct
transcripts per gene). Because we are sampling only a subset of all
transcripts, the true extent of alternative splicing is likely to be
greater. These figures are considerably higher than those for worm,
in which analysis reveals alternative splicing for 22% of genes for
which ESTs have been found, with an average of 1.34 (12,816/9,516)
splice variants per gene. (The apparently higher extent of alternative
splicing seen in human than in worm was not an artefact resulting
from much deeper coverage of human genes by ESTs and mRNAs.
Although there are many times more ESTs available for human than
worm, these ESTs tend to have shorter average length (because many
were the product of early sequencing efforts) and many match no
human genes. We calculated the actual coverage per bp used in the
analysis of the human and worm genes; the coverage is only
modestly higher (about 50%) for the human, with a strong bias
towards 3′ UTRs which tend to show much less alternative splicing.
We also repeated the analysis using equal coverage for the two
organisms and confirmed that higher levels of alternative splicing
were still seen in human.)
Seventy per cent of alternative splice forms found in the genes on
chromosomes 19 and 22 affect the coding sequence, rather than
merely changing the 3′ or 5′ UTR. (This estimate may be affected by
the incomplete representation of UTRs in the RefSeq database and
in the transcripts studied.) Alternative splicing of the terminal exon
was seen for 20% of 6,105 mRNAs that were aligned to the draft
genome sequence and correspond to confirmed 3′ EST clusters. In
addition to alternative splicing, we found evidence of the terminal
exon employing alternative polyadenylation sites (separated by
>100 bp) in 24% of cases.
Towards a complete index of human genes. We next focused on
creating an initial index of human genes and proteins. This index is
quite incomplete, owing to the difficulty of gene identification in
human DNA and the imperfect state of the draft genome sequence.
Nonetheless, it is valuable for experimental studies and provides
important insights into the nature of human genes and proteins.
The challenge of identifying genes from genomic sequence varies
greatly among organisms. Gene identification is almost trivial in
bacteria and yeast, because the absence of introns in bacteria and
their paucity in yeast means that most genes can be readily
recognized by ab initio analysis as unusually long ORFs. It is not
as simple, but still relatively straightforward, to identify genes in
animals with small genomes and small introns, such as worm and
fly. A major factor is the high signal-to-noise ratio: coding
sequences comprise a large proportion of the genome and a large
proportion of each gene (about 50% for worm and fly), and exons
are relatively large.
Gene identification is more difficult in human DNA. The
signal-to-noise ratio is lower: coding sequences comprise only a few per
cent of the genome and an average of about 5% of each gene;
internal exons are smaller than in worms; and genes appear to have
more alternative splicing. The challenge is underscored by the work
on human chromosomes 21 and 22. Even with the availability of
finished sequence and intensive experimental work, the gene
content remains uncertain, with upper and lower estimates differing by
as much as 30%. The initial report of the finished sequence of
chromosome 22 (ref. 94) identified 247 previously known genes,
298 predicted genes confirmed by sequence homology or ESTs and
325 ab initio predictions without additional support. Many of the
confirmed predictions represented partial genes. In the past year,
440 additional exons (10%) have been added to existing gene
annotations by the chromosome 22 annotation group, although
the number of confirmed genes has increased by only 17 and some
previously identified gene predictions have been merged (ref. 286).
Before discussing the gene predictions for the human genome, it
is useful to consider background issues, including previous
estimates of the number of human genes, lessons learned from worms
and flies and the representativeness of currently 'known' human
genes.
Previous estimates of human gene number. Although direct
enumeration of human genes is only now becoming possible with the advent
of the draft genome sequence, there have been many attempts in the
past quarter of a century to estimate the number of genes indirectly.
Early estimates based on reassociation kinetics estimated the mRNA
complexity of typical vertebrate tissues to be 10,000–20,000, and
were extrapolated to suggest around 40,000 for the entire genome
(ref. 287). In the mid-1980s, Gilbert suggested that there might be about
100,000 genes, based on the approximate ratio of the size of a typical
gene (~3 × 10^4 bp) to the size of the genome (3 × 10^9 bp). Although
this was intended only as a back-of-the-envelope estimate, the
pleasing roundness of the figure seems to have led to it being
widely quoted and adopted in many textbooks (W. Gilbert,
personal communication; ref. 288). An estimate of 70,000–80,000
genes was made by extrapolating from the number of CpG islands
and the frequency of their association with known genes (ref. 129).
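Gilbert's back-of-the-envelope figure can be written out as a one-line calculation. The reading that the genome is treated as if it were tiled end to end by typical-sized genes is our interpretation of the ratio, not a statement from the source; the symbol N_genes is introduced here only for illustration.

```latex
\[
N_{\text{genes}} \approx \frac{\text{genome size}}{\text{typical gene size}}
= \frac{3 \times 10^{9}\ \text{bp}}{3 \times 10^{4}\ \text{bp}}
= 10^{5}
\]
```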
As human sequence information has accumulated, it has been
possible to derive estimates on the basis of sampling techniques
(ref. 289). Such studies have sought to extrapolate from various types of data,
including ESTs, mRNAs from known genes, cross-species genome
comparisons and analysis of finished chromosomes. Estimates
based on ESTs (ref. 290) have varied widely, from 35,000 (ref. 130) to
120,000 genes (ref. 291). Some of the discrepancy lies in differing estimates
of the amount of contaminating genomic sequence in the EST
collection and the extent to which multiple distinct ESTs
correspond to a single gene. The most rigorous analyses (ref. 130) exclude as
spurious any ESTs that appear only once in the data set and carefully
calibrate sensitivity and specificity. Such calculations consistently
produce low estimates, in the region of 35,000.
Comparison of whole-genome shotgun sequence from the
pufferfish T. nigroviridis with the human genome (ref. 292) can be used to
estimate the density of exons (detected as conserved sequences
between fish and human). These analyses also suggest around
30,000 human genes.
Extrapolations have also been made from the gene counts for
chromosomes 21 and 22 (refs 93, 94), adjusted for differences
in gene densities on these chromosomes, as inferred from EST
mapping. These estimates are between 30,500 and 35,500,
depending on the precise assumptions used (ref. 286).
Insights from invertebrates. The worm and fly genomes contain a
large proportion of novel genes (around 50% of worm genes and
30% of fly genes), in the sense of showing no significant similarity to
organisms outside their phylum (refs 293–295). Such genes may have been
present in the original eukaryotic ancestor, but were subsequently
lost from the lineages of the other eukaryotes for which sequence is
available; they may be rapidly diverging genes, so that it is difficult to
recognize homologues solely on the basis of sequence; they may
represent true innovations developed within the lineage; or they
may represent acquisitions by horizontal transfer. Whatever their
origin, these genes tend to have different biological properties from
highly conserved genes. In particular, they tend to have low
expression levels as assayed both by direct studies and by a paucity of
corresponding ESTs, and are less likely to produce a visible
phenotype in loss-of-function genetic experiments (refs 294, 296).
Gene prediction. Current gene prediction methods employ
combinations of three basic approaches: direct evidence of transcription
provided by ESTs or mRNAs (refs 297–299); indirect evidence based on
sequence similarity to previously identified genes and proteins
(refs 300, 301); and ab initio recognition of groups of exons on the
basis of hidden Markov models (HMMs) that combine statistical
information about splice sites, coding bias and exon and intron
lengths (for example, Genscan (ref. 275), Genie (refs 302, 303) and
FGENES (ref. 304)).
The first approach relies on direct experimental data, but is
subject to artefacts arising from contaminating ESTs derived from
unspliced mRNAs, genomic DNA contamination and nongenic
transcription (for example, from the promoter of a transposable
element). The first two problems can be mitigated by comparing
transcripts with the genomic sequence and using only those that
show clear evidence of splicing. This solution, however, tends to
discard evidence from genes with long terminal exons or single
exons. The second approach tends correctly to identify gene-derived
sequences, although some of these may be pseudogenes. However, it
obviously cannot identify truly novel genes that have no sequence
similarity to known genes. The third approach would suffice alone if
one could accurately define the features used by cells for gene
recognition, but our current understanding is insufficient to do
so. The sensitivity and specificity of ab initio predictions are greatly
affected by the signal-to-noise ratio. Such methods are more
accurate in the fly and worm than in human. In fly, ab initio
methods can correctly predict around 90% of individual exons and
can correctly predict all coding exons of a gene in about 40% of
cases (ref. 303). For human, the comparable figures are only about 70% and
20%, respectively (refs 94, 305). These estimates may be optimistic, owing to
the design of the tests used.
In any collection of gene predictions, we can expect to see various
errors. Some gene predictions may represent partial genes, because
of inability to detect some portions of a gene (incomplete
sensitivity) or to connect all the components of a gene (fragmentation);
some may be gene fusions; and others may be spurious predictions
(incomplete specificity) resulting from chance matches or
pseudogenes.
Creating an initial gene index. We set out to create an initial
integrated gene index (IGI) and an associated integrated protein
index (IPI) for the human genome. We describe the results obtained
from a version of the draft genome sequence based on the sequence
data available in July 2000, to allow time for detailed analysis of the
gene and protein content. The additional sequence data that have
since become available will affect the results quantitatively, but are
unlikely to change the conclusions qualitatively.
We began with predictions produced by the Ensembl system (ref. 306).
Ensembl starts with ab initio predictions produced by Genscan (ref. 275)
and then attempts to confirm them by virtue of similarity to
proteins, mRNAs, ESTs and protein motifs (contained in the
Pfam database; ref. 307) from any organism. In particular, it confirms
introns if they are bridged by matches and exons if they are flanked
by confirmed introns. It then attempts to extend protein matches
using the GeneWise computer program (ref. 308). Because it requires
confirmatory evidence to support each gene component, it
frequently produces partial gene predictions. In addition, when there
is evidence of alternative splicing, it reports multiple overlapping
transcripts. In total, Ensembl produced 35,500 gene predictions
with 44,860 transcripts.
To reduce fragmentation, we next merged Ensembl-based gene
predictions with overlapping gene predictions from another
program, Genie (ref. 302). Genie starts with mRNA or EST matches and
employs an HMM to extend these matches by using ab initio
statistical approaches. To avoid fragmentation, it attempts to link
information from 5′ and 3′ ESTs from the same cDNA clone and
thereby to produce a complete coding sequence from an initial ATG
to a stop codon. As a result, it may generate complete genes more
accurately than Ensembl in cases where there is extensive EST
support. (Genie also generates potential alternative transcripts,
but we used only the longest transcript in each group.) We
merged 15,437 Ensembl predictions into 9,526 clusters, and the
longest transcript in each cluster (from either Genie or Ensembl)
was taken as the representative.
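The merge-and-pick-longest step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually used: the source does not specify the overlap criterion, so the sketch assumes that predictions lie on one strand of one sequence, that any overlap of genomic spans places two predictions in the same cluster, and that span length stands in for transcript length. The `Prediction` type and the toy coordinates are hypothetical.

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    source: str   # e.g. "Ensembl" or "Genie"
    start: int    # 0-based, inclusive
    end: int      # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start

def cluster_overlapping(predictions):
    """Group predictions whose genomic spans overlap (single linkage)."""
    clusters = []
    cluster_end = None
    for pred in sorted(predictions, key=lambda p: (p.start, p.end)):
        if clusters and pred.start < cluster_end:
            # Overlaps the running cluster: extend it.
            clusters[-1].append(pred)
            cluster_end = max(cluster_end, pred.end)
        else:
            # No overlap: start a new cluster.
            clusters.append([pred])
            cluster_end = pred.end
    return clusters

def representatives(predictions):
    """Pick the longest prediction from each cluster, whatever its source."""
    return [max(c, key=lambda p: p.length) for c in cluster_overlapping(predictions)]

# Toy coordinates (invented): two Ensembl fragments bridged by one Genie model.
toy = [
    Prediction("Ensembl", 100, 400),
    Prediction("Ensembl", 450, 900),
    Prediction("Genie", 120, 950),
    Prediction("Ensembl", 2000, 2600),
]
reps = representatives(toy)
```

In the toy data the two fragmentary Ensembl predictions collapse into one cluster represented by the longer Genie model, which is the kind of fragmentation reduction the merge is meant to achieve.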
Next, we merged these results with known genes contained in the
RefSeq (version of 29 September 2000), SWISSPROT (release 39.6
of 30 August 2000) and TrEMBL databases (TrEMBL release 14.17
of 1 October 2000, TrEMBL_new of 1 October 2000). Incorporating
these sequences gave rise to overlapping sequences because of
alternative splice forms and partial sequences. To construct a
nonredundant set, we selected the longest sequence from each
overlapping set by using direct protein comparison and by mapping
the gene predictions back onto the genome to construct the
overlapping sets. This may occasionally remove some close paralogues in
the event that the correct genomic location has not yet been
sequenced, but this number is expected to be small.
Finally, we searched the set to eliminate any genes derived from
contaminating bacterial sequences, recognized by virtue of near
identity to known bacterial plasmids, transposons and
chromosomal genes. Although most instances of such contamination had
been removed in the assembly process, a few cases had slipped
through and were removed at this stage.
The process resulted in version 1 of the IGI (IGI.1). The
composition of the corresponding IPI.1 protein set, obtained by
translating IGI.1, is given in Table 22. There are 31,778 protein
predictions, with 14,882 from known genes, 4,057 predictions from
Ensembl merged with Genie and 12,839 predictions from Ensembl
alone. The average lengths are 469 amino acids for the known
proteins, 443 amino acids for protein predictions from the
Ensembl–Genie merge, and 187 amino acids for those from
Ensembl alone. (The smaller average size for the predictions from
Ensembl alone reflects its tendency to predict partial genes where
there is supporting evidence for only part of the gene; the remainder
of the gene will often not be predicted at all, rather than included as
part of another prediction. Accordingly, the smaller size cannot be
used to estimate the rate of fragmentation in such predictions.)
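As a consistency check on the figures quoted above, the short Python sketch below sums the three categories and derives a count-weighted mean length. The pooled mean and the category shares are derived quantities that do not appear in the source, and the variable names are ours.

```python
# Counts and mean lengths (amino acids) as quoted in the text for IPI.1.
ipi1 = {
    "known genes": (14_882, 469),
    "Ensembl merged with Genie": (4_057, 443),
    "Ensembl alone": (12_839, 187),
}

# Total number of protein predictions across the three categories.
total = sum(count for count, _ in ipi1.values())

# Count-weighted mean length across the three categories.
pooled_mean = sum(count * mean_len for count, mean_len in ipi1.values()) / total

# Share of the set contributed by each category.
shares = {name: count / total for name, (count, _) in ipi1.items()}
```

The three counts sum to the 31,778 predictions stated in the text. The pooled mean comes out at roughly 350 amino acids, pulled well below the average for known proteins by the short Ensembl-only predictions, which is consistent with the caveat that those predictions are often partial.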
The set corresponds to fewer than 31,000 actual genes, because
some genes are fragmented into more than one partial prediction
and some predictions may be spurious or correspond to
pseudogenes. As discussed below, our best estimate is that IGI.1 includes
about 24,500 true genes.
Evaluation of IGI/IPI. We used several approaches to evaluate the
sensitivity, specificity and fragmentation of the IGI/IPI set.
Comparison with 'new' known genes. One approach was to examine
overprediction rate of 30% for gene predictions in this expanded set,
the analysis above suggests that the IGI+ set contains about 28,000 true
genes and yields an estimate of about 32,000 human genes. We are
investigating ways to filter the expanded set, to produce an IGI with
the advantage of the increased sensitivity resulting from combining
multiple gene prediction programs without the corresponding loss
of speci®city. Meanwhile, the IGI+ set can be used by researchers
searching for genes that cannot be found in the IGI.
Some classes of genes may have been missed by all of the
gene-finding methods. Genes could be missed if they are expressed at low
levels or in rare tissues (being absent or very under-represented in
EST and mRNA databases) and have sequences that evolve rapidly
(being hard to detect by protein homology and genome
comparison). Both the worm and fly gene sets contain a substantial number
of such genes (refs 293, 294). Single-exon genes encoding small proteins may
also have been missed, because EST evidence that supports them
cannot be distinguished from genomic contamination in the EST
dataset and because homology may be hard to detect for small
proteins (ref. 310).
The human thus appears to have only about twice as many genes
as worm or fly. However, human genes differ in important respects
from those in worm and fly. They are spread out over much larger
regions of genomic DNA, and they are used to construct more
alternative transcripts. This may result in perhaps five times as many
primary protein products in the human as in the worm or fly.
The predicted gene and protein sets described here are clearly far
from final. Nonetheless, they provide a valuable starting point for
experimental and computational research. The predictions will
improve progressively as the sequence is finished, as further
confirmatory evidence becomes available (particularly from
other vertebrate genome sequences, such as those of mouse and
T. nigroviridis), and as computational methods improve. We intend
to create and release updated versions of the IGI and IPI regularly,
until they converge to a final accurate list of every human gene. The
gene predictions will be linked to RefSeq, HUGO and SWISSPROT
identifiers where available, and tracking identifiers between versions
will be included, so that individual genes under study can be traced
forwards as the human sequence is completed.
Comparative proteome analysis
Knowledge of the human proteome will provide unprecedented
opportunities for studies of human gene function. Often clues will
be provided by sequence similarity with proteins of known function
in model organisms. Such initial observations must then be
followed up by detailed studies to establish the actual function of these
molecules in humans.
For example, 35 proteins are known to be involved in the vacuolar
protein-sorting machinery in yeast. Human genes encoding
homologues can be found in the draft human sequence for 34 of these
yeast proteins, but precise relationships are not always clear. In nine
cases there appears to be a single clear human orthologue (a gene
that arose as a consequence of speciation); in 12 cases there are
matches to a family of human paralogues (genes that arose owing to
intra-genome duplication); and in 13 cases there are matches
to specific protein domains (refs 311–314). Hundreds of similar stories
emerge from the draft sequence, but each merits a detailed
interpretation in context. To treat these subjects properly, there will be
many following studies, the first of which appear in accompanying
papers (refs 315–323).
Here, we aim to take a more global perspective on the content of
the human proteome by comparing it with the proteomes of yeast,
worm, fly and mustard weed. Such comparisons shed useful light on
the commonalities and differences among these eukaryotes
(refs 294, 324, 325).
The analysis is necessarily preliminary, because of the imperfect
nature of the human sequence, uncertainties in the gene and protein
sets for all of the multicellular organisms considered and our
incomplete knowledge of protein structures. Nonetheless, some
general patterns emerge. These include insights into fundamental
mechanisms that create functional diversity, including invention of
protein domains, expansion of protein and domain families,
evolution of new protein architectures and horizontal transfer of genes.
Other mechanisms, such as alternative splicing, post-translational
modification and complex regulatory networks, are also crucial in
generating diversity but are much harder to discern from the
primary sequence. We will not attempt to consider the effects of
alternative splicing on proteins; we will consider only a single splice
form from each gene in the various organisms, even when multiple
splice forms are known.
Functional and evolutionary classification. We began by
classifying the human proteome on the basis of functional categories and
evolutionary conservation. We used the InterPro annotation
protocol to identify conserved biochemical and cellular processes.
InterPro is a tool for combining sequence-pattern information
from four databases. The first two databases (PRINTS and Prosite;
refs 326, 327) primarily contain information about motifs
corresponding to specific family subtypes, such as type II receptor tyrosine
kinases (RTK-II) in particular or tyrosine kinases in general. The
second two databases (Pfam and Prosite Profile; refs 307, 327) contain
information (in the form of profiles or HMMs) about families of
structural domains, for example, protein kinase domains.
InterPro integrates the motif and domain assignments into a hierarchical
classification system; so a protein might be classified at the most
detailed level as being an RTK-II, at a more general level as being a
kinase specific for tyrosine, and at a still more general level as
being a protein kinase. The complete hierarchy of InterPro entries
is described at http://www.ebi.ac.uk/interpro/. We collapsed the
InterPro entries into 12 broad categories, each reflecting a set of
cellular functions.
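The hierarchical classification can be pictured as a chain of parent links, as in the minimal Python sketch below. The three entries mirror the RTK-II example in the text; the `PARENT` table and the `lineage` function are hypothetical illustrations and do not reflect InterPro's actual data model or identifiers.

```python
# Hypothetical parent links mirroring the example in the text
# (most specific entry -> next more general entry).
PARENT = {
    "RTK-II": "tyrosine kinase",
    "tyrosine kinase": "protein kinase",
    "protein kinase": None,  # top of this illustrative hierarchy
}

def lineage(entry: str) -> list:
    """Walk from the given entry up to the most general one."""
    path = []
    current = entry
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path

rtk_levels = lineage("RTK-II")
```

Collapsing entries into broad functional categories, as described above, would then amount to mapping the most general entry of each lineage to one of the 12 categories; that mapping is not given in this section and is not reproduced here.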
The InterPro families are partly the product of human judgement
and reflect the current state of biological and evolutionary
knowledge. The system is a valuable way to gain insight into large
collections of proteins, but not all proteins can be classified at
present. The proportion of the yeast, worm, fly and mustard weed
protein sets that is assigned to at least one InterPro family is, for
each organism, about 50% (Table 23; refs 307, 326, 327).
About 40% of the predicted human proteins in the IPI could be
assigned to InterPro entries and functional categories. On the basis
of these assignments, we could compare organisms according to the
number of proteins in each category (Fig. 37). Compared with the
two invertebrates, humans appear to have many proteins involved
in cytoskeleton, defence and immunity, and transcription and
translation. These expansions are clearly related to aspects of
vertebrate physiology. Humans also have many more proteins that
are classi®ed as falling into more than one functional category (426
in human versus 80 in worm and 57 in ¯y, data not shown).
Interestingly, 32% of these are transmembrane receptors.
We obtained further insight into the evolutionary conservation of
proteins by comparing each sequence to the complete nonredun-
dant database of protein sequences maintained at NCBI, using the
BLASTP computer program
328
and then breaking down the matches
according to organismal taxonomy (Fig. 38). Overall, 74% of the
proteins had signi®cant matches to known proteins.
Such classi®cations are based on the presence of clearly detectable
homologues in existing databases. Many of these genes have surely
evolved from genes that were present in common ancestors but have
since diverged substantially. Indeed, one can detect more distant
relationships by using sensitive computer programs that can recog-
nize weakly conserved features. Using PSI-BLAST, we can recognize
probable nonvertebrate homologues for about 45% of the `verte-
brate-speci®c' set. Nonetheless, the classi®cation is useful for gain-
ing insights into the commonalities and differences among the
proteomes of different organisms.
Probable horizontal transfer. An interesting category is a set of 223
proteins that have signi®cant similarity to proteins from bacteria,
but no comparable similarity to proteins from yeast, worm, ¯y and
articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 901© 2001 Macmillan Magazines Ltd
the analysis above suggests that the IGI+ set contains about 28,000 true
genes and yields an estimate of about 32,000 human genes. We are
investigating ways to filter the expanded set, to produce an IGI with
the advantage of the increased sensitivity resulting from combining
multiple gene prediction programs without the corresponding loss of
specificity. Meanwhile, the IGI+ set can be used by researchers
searching for genes that cannot be found in the IGI.
Some classes of genes may have been missed by all of the gene-finding
methods. Genes could be missed if they are expressed at low levels or in
rare tissues (being absent or very under-represented in EST and mRNA
databases) and have sequences that evolve rapidly (being hard to detect
by protein homology and genome comparison). Both the worm and fly gene
sets contain a substantial number of such genes (refs 293, 294).
Single-exon genes encoding small proteins may also have been missed,
because EST evidence that supports them cannot be distinguished from
genomic contamination in the EST dataset and because homology may be
hard to detect for small proteins (ref. 310).
The human thus appears to have only about twice as many genes as worm or
fly. However, human genes differ in important respects from those in
worm and fly. They are spread out over much larger regions of genomic
DNA, and they are used to construct more alternative transcripts. This
may result in perhaps five times as many primary protein products in the
human as in the worm or fly.
The predicted gene and protein sets described here are clearly far from
final. Nonetheless, they provide a valuable starting point for
experimental and computational research. The predictions will improve
progressively as the sequence is finished, as further confirmatory
evidence becomes available (particularly from other vertebrate genome
sequences, such as those of mouse and T. nigroviridis), and as
computational methods improve. We intend to create and release updated
versions of the IGI and IPI regularly, until they converge to a final
accurate list of every human gene. The gene predictions will be linked
to RefSeq, HUGO and SWISSPROT identifiers where available, and tracking
identifiers between versions will be included, so that individual genes
under study can be traced forwards as the human sequence is completed.
Comparative proteome analysis
Knowledge of the human proteome will provide unprecedented opportunities
for studies of human gene function. Often clues will be provided by
sequence similarity with proteins of known function in model organisms.
Such initial observations must then be followed up by detailed studies
to establish the actual function of these molecules in humans.
For example, 35 proteins are known to be involved in the vacuolar
protein-sorting machinery in yeast. Human genes encoding homologues can
be found in the draft human sequence for 34 of these yeast proteins, but
precise relationships are not always clear. In nine cases there appears
to be a single clear human orthologue (a gene that arose as a
consequence of speciation); in 12 cases there are matches to a family of
human paralogues (genes that arose owing to intra-genome duplication);
and in 13 cases there are matches to specific protein domains
(refs 311–314). Hundreds of similar stories emerge from the draft
sequence, but each merits a detailed interpretation in context. To treat
these subjects properly, there will be many following studies, the first
of which appear in accompanying papers (refs 315–323).
Here, we aim to take a more global perspective on the content of the
human proteome by comparing it with the proteomes of yeast, worm, fly
and mustard weed. Such comparisons shed useful light on the
commonalities and differences among these eukaryotes
(refs 294, 324, 325). The analysis is necessarily preliminary, because
of the imperfect nature of the human sequence, uncertainties in the gene
and protein sets for all of the multicellular organisms considered and
our incomplete knowledge of protein structures. Nonetheless, some
general patterns emerge. These include insights into fundamental
mechanisms that create functional diversity, including invention of
protein domains, expansion of protein and domain families, evolution of
new protein architectures and horizontal transfer of genes. Other
mechanisms, such as alternative splicing, post-translational
modification and complex regulatory networks, are also crucial in
generating diversity but are much harder to discern from the primary
sequence. We will not attempt to consider the effects of alternative
splicing on proteins; we will consider only a single splice form from
each gene in the various organisms, even when multiple splice forms are
known.
Functional and evolutionary classification. We began by classifying the
human proteome on the basis of functional categories and evolutionary
conservation. We used the InterPro annotation protocol to identify
conserved biochemical and cellular processes. InterPro is a tool for
combining sequence-pattern information from four databases. The first
two databases (PRINTS, ref. 326, and Prosite, ref. 327) primarily
contain information about motifs corresponding to specific family
subtypes, such as type II receptor tyrosine kinases (RTK-II) in
particular or tyrosine kinases in general. The second two databases
(Pfam, ref. 307, and Prosite Profile, ref. 327) contain information (in
the form of profiles or HMMs) about families of structural domains, for
example protein kinase domains. InterPro integrates the motif and domain
assignments into a hierarchical classification system; so a protein
might be classified at the most detailed level as being an RTK-II, at a
more general level as being a kinase specific for tyrosine, and at a
still more general level as being a protein kinase. The complete
hierarchy of InterPro entries is described at
http://www.ebi.ac.uk/interpro/. We collapsed the InterPro entries into
12 broad categories, each reflecting a set of cellular functions.
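The hierarchical assignment described above can be illustrated with a minimal Python sketch. The parent links, the mapping of "protein kinase" to a broad category and the function names are hypothetical and chosen only for illustration; they are not taken from InterPro or from the 12 categories actually used in the analysis.

```python
from typing import Dict, List, Optional

# Hypothetical parent links: each entry points to its more general parent.
PARENT: Dict[str, Optional[str]] = {
    "RTK-II": "tyrosine kinase",
    "tyrosine kinase": "protein kinase",
    "protein kinase": None,
}

# Hypothetical mapping from a top-level entry to a broad functional category.
BROAD_CATEGORY: Dict[str, str] = {
    "protein kinase": "Intracellular signalling",
}


def lineage(entry: str) -> List[str]:
    """Return the entry followed by its ancestors, most specific first."""
    path: List[str] = []
    current: Optional[str] = entry
    while current is not None:
        path.append(current)
        current = PARENT.get(current)
    return path


def broad_category(entry: str) -> str:
    """Collapse an entry to the broad category of its most general ancestor."""
    top = lineage(entry)[-1]
    # Entries with no known category fall back to a catch-all label (assumption).
    return BROAD_CATEGORY.get(top, "Miscellaneous function")
```

Under these assumptions, a protein annotated as RTK-II is reported at three levels of detail by `lineage` and counted once in a single broad category by `broad_category`.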
The InterPro families are partly the product of human judgement and
reflect the current state of biological and evolutionary knowledge. The
system is a valuable way to gain insight into large collections of
proteins, but not all proteins can be classified at present. The
proportion of the yeast, worm, fly and mustard weed protein sets that is
assigned to at least one InterPro family is, for each organism, about
50% (Table 23; refs 307, 326, 327).
About 40% of the predicted human proteins in the IPI could be assigned
to InterPro entries and functional categories. On the basis of these
assignments, we could compare organisms according to the number of
proteins in each category (Fig. 37). Compared with the two
invertebrates, humans appear to have many proteins involved in
cytoskeleton, defence and immunity, and transcription and translation.
These expansions are clearly related to aspects of vertebrate
physiology. Humans also have many more proteins that are classified as
falling into more than one functional category (426 in human versus 80
in worm and 57 in fly, data not shown). Interestingly, 32% of these are
transmembrane receptors.
We obtained further insight into the evolutionary conservation of
proteins by comparing each sequence to the complete nonredundant
database of protein sequences maintained at NCBI, using the BLASTP
computer program (ref. 328) and then breaking down the matches according
to organismal taxonomy (Fig. 38). Overall, 74% of the proteins had
significant matches to known proteins.
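As a rough illustration of breaking matches down by taxonomy, the following Python sketch assigns each protein to a category from the lineages of its significant hits. The E-value cut-off of 0.001 is the one quoted in the Fig. 38 legend; the lineage labels, the decision rules and the function names are assumptions made for this sketch and may differ from the rules actually used to draw Fig. 38.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

E_CUTOFF = 1e-3  # threshold quoted in the Fig. 38 legend

# A hit is (lineage label, E-value). The labels are assumed to be one of:
# "vertebrate", "other_animal", "other_eukaryote", "prokaryote".
Hit = Tuple[str, float]


def classify(hits: Iterable[Hit], cutoff: float = E_CUTOFF) -> str:
    """Assign one protein to a category from the lineages of its significant hits."""
    lineages = {lineage for lineage, evalue in hits if evalue <= cutoff}
    if not lineages:
        return "no significant match"
    has_animal = bool(lineages & {"vertebrate", "other_animal"})
    has_eukaryote = has_animal or "other_eukaryote" in lineages
    if "prokaryote" in lineages:
        return "eukaryote and prokaryote" if has_eukaryote else "prokaryotes only"
    if "other_eukaryote" in lineages:
        return "animals and other eukaryotes" if has_animal else "no animal homology"
    if "other_animal" in lineages:
        return "vertebrates and other animals"
    return "vertebrate only"


def breakdown(proteome: Dict[str, List[Hit]]) -> Counter:
    """Count proteins per category; keys of `proteome` are protein identifiers."""
    return Counter(classify(hits) for hits in proteome.values())
```

Dividing each count returned by `breakdown` by the number of proteins with at least one significant match would give percentages comparable in form to those of a pie chart.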
Such classifications are based on the presence of clearly detectable
homologues in existing databases. Many of these genes have surely
evolved from genes that were present in common ancestors but have since
diverged substantially. Indeed, one can detect more distant
relationships by using sensitive computer programs that can recognize
weakly conserved features. Using PSI-BLAST, we can recognize probable
nonvertebrate homologues for about 45% of the 'vertebrate-specific' set.
Nonetheless, the classification is useful for gaining insights into the
commonalities and differences among the proteomes of different
organisms.
Probable horizontal transfer. An interesting category is a set of 223
proteins that have significant similarity to proteins from bacteria, but
no comparable similarity to proteins from yeast, worm, fly and
mustard weed, or indeed from any other (nonvertebrate) eukaryote. These
sequences should not represent bacterial contamination in the draft
human sequence, because we filtered the sequence to eliminate sequences
that were essentially identical to known bacterial plasmid, transposon
or chromosomal DNA (such as the host strains for the large-insert
clones). To investigate whether these were genuine human sequences, we
designed PCR primers for 35 of these genes and confirmed that most could
be readily detected directly in human genomic DNA (Table 24).
Orthologues of many of these genes have also been detected in other
vertebrates (Table 24).
A more detailed computational analysis indicated that at least 113 of
these genes are widespread among bacteria, but, among eukaryotes, appear
to be present only in vertebrates. It is possible that the genes
encoding these proteins were present in both early prokaryotes and
eukaryotes, but were lost in each of the lineages of yeast, worm, fly,
mustard weed and, possibly, from other nonvertebrate eukaryote lineages.
A more parsimonious explanation is that these genes entered the
vertebrate (or prevertebrate) lineage by horizontal transfer from
bacteria. Many of these genes contain introns, which presumably were
acquired after the putative horizontal transfer event. Similar
observations indicating probable lineage-specific horizontal gene
transfers, as well as intron insertion in the acquired genes, have been
made in the worm genome (ref. 329).
We cannot formally exclude the possibility that gene transfer occurred
in the opposite direction (that is, that the genes were invented in the
vertebrate lineage and then transferred to bacteria). However, we
consider this less likely. Under this scenario, the broad distribution
of these genes among bacteria would require extensive horizontal
dissemination after their initial acquisition. In addition, the
functional repertoire of these genes, which largely encode intracellular
enzymes (Table 24), is uncharacteristic of vertebrate-specific
evolutionary innovations (which appear to be primarily extracellular
proteins; see below).
We did not identify a strongly preferred bacterial source for the
putative horizontally transferred genes, indicating the likelihood of
multiple independent gene transfers from different bacteria (Table 24).
Notably, several of the probable recent acquisitions have established
(or likely) roles in metabolism of xenobiotics or stress response. These
include several hydrolases of different specificities, including epoxide
hydrolase, and several dehydrogenases (Table 24). Of particular interest
is the presence of two paralogues of monoamine oxidase (MAO), an enzyme
of the mitochondrial outer membrane that is central in the metabolism of
neuromediators and is a target of important psychiatric drugs
(refs 330–333). This example shows that at least some of the genes
thought to be horizontally transferred into the vertebrate lineage
appear to be involved in important physiological functions and so
probably have been fixed and maintained during evolution because
[Figure 37: bar chart of the number of proteins (y axis, 0 to 5,000) in each functional category for yeast, mustard weed, worm, fly and human. Categories (x axis): cellular processes; metabolism; DNA replication/modification; transcription/translation; intracellular signalling; cell-cell communication; protein folding and degradation; transport; multifunctional proteins; cytoskeletal/structural; defence and immunity; miscellaneous function.]
Figure 37 Functional categories in eukaryotic proteomes. The classification categories
were derived from functional classification systems, including the top-level biological
function category of the Gene Ontology project (GO; see http://www.geneontology.org).
[Figure 38: pie chart. Animals and other eukaryotes, 32%; vertebrates and other animals, 24%; vertebrate only, 22%; eukaryote and prokaryote, 21%; no animal homology, 1%; prokaryotes only, <1%.]
Figure 38 Distribution of the homologues of the predicted human proteins. For each
protein, a homologue to a phylogenetic lineage was considered present if a search of the
NCBI nonredundant protein sequence database, using the gapped BLASTP program, gave
a random expectation (E) value of ≤ 0.001. Additional searches for probable homologues
with lower sequence conservation were performed using the PSI-BLAST program, run for
three iterations using the same cut-off for inclusion of sequences into the profile
(ref. 328).
Table 23 Properties of genome and proteome in essentially completed eukaryotic proteomes
                                          Human      Fly      Worm     Yeast    Mustard weed
Number of identified genes                ~32,000*   13,338   18,266   6,144    25,706
% with InterPro matches                   51         56       50       50       52
Number of annotated domain families       1,262      1,035    1,014    851      1,010
Number of InterPro entries per gene       0.53       0.84     0.63     0.6      0.62
Number of distinct domain architectures   1,695      1,036    1,018    310      –
Percentage of 1-1-1-1                     1.40       4.20     3.10     9.20     –
% Signal sequences                        20         20       24       11       –
% Transmembrane proteins                  20         25       28       15       –
% Repeat-containing                       10         11       9        5        –
% Coiled-coil                             11         13       10       9        –
The numbers of distinct architectures were calculated using SMART (ref. 339) and the percentages of repeat-containing proteins were estimated using Prospero (ref. 452) and a P-value threshold of 10^-5. The protein sets
used in the analysis were taken from http://www.ebi.ac.uk/proteome/ for yeast, worm and fly. The proteins from mustard weed were taken from the TAIR website (http://www.arabidopsis.org/) on 5
September 2000. The protein set was searched against the InterPro database (http://www.ebi.ac.uk/interpro/) using the InterProscan software. Comparison of protein sequences with the InterPro
database allows prediction of protein families, domain and repeat families and sequence motifs. The searches used Pfam release 5.2 (ref. 307), Prints release 26.1 (ref. 326), Prosite release 16 (ref. 327) and Prosite preliminary
profiles. InterPro analysis results are available as Supplementary Information. The fraction of 1-1-1-1 is the percentage of the genome that falls into orthologous groups composed of only one member each
in human, fly, worm and yeast.
* The gene number for the human is still uncertain (see text). Table is based on 31,778 known genes and gene predictions.
of the increased selective advantage(s) they provide.
Genes shared with fly, worm and yeast. IPI.1 contains apparent
homologues of 61% of the fly proteome, 43% of the worm proteome and 46%
of the yeast proteome. We next considered the groups of proteins
containing likely orthologues and paralogues (genes that arose from
intragenome duplication) in human, fly, worm and yeast.
Briefly, we performed all-against-all sequence comparison (ref. 334) for
the combined protein sets of human, yeast, fly and worm. Pairs of
sequences that were one another's best matches in their respective
genomes were considered to be potential orthologues. These were then
used to identify orthologous groups across three organisms (ref. 335).
Recent species-specific paralogues were defined by using the
all-against-all sequence comparison to cluster the protein set for each
organism. For each sequence found in an orthologous group, the recent
paralogues were defined to be the largest species-specific cluster
including it. The set of paralogues may be inflated by unrecognized
splice variants and by fragmentation.
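The criterion that two sequences be one another's best matches in their respective genomes can be sketched as follows. This is a minimal Python illustration, not the pipeline used for the analysis: the score table, the identifiers and the function names are hypothetical, ties are broken by first occurrence, and the later steps (building multi-species groups and clustering paralogues) are not shown.

```python
from typing import Dict, Set, Tuple

# scores[(query, subject)] holds a similarity score; higher means a better match.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores, species_of: Dict[str, str]) -> Dict[Tuple[str, str], str]:
    """For each (query, target species) pair, return the best-scoring subject."""
    best: Dict[Tuple[str, str], Tuple[float, str]] = {}
    for (query, subject), score in scores.items():
        target = species_of[subject]
        if target == species_of[query]:
            continue  # within-species matches are not orthologue candidates
        key = (query, target)
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)
    return {key: subject for key, (_, subject) in best.items()}


def reciprocal_best_hits(scores: Scores, species_of: Dict[str, str]) -> Set[frozenset]:
    """Return unordered pairs that are each other's best match across two species."""
    best = best_hits(scores, species_of)
    pairs: Set[frozenset] = set()
    for (query, _target), subject in best.items():
        if best.get((subject, species_of[query])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs
```

In this sketch a human protein whose best fly match prefers a different human protein is not reported as a potential orthologue; it would instead be a candidate member of a paralogue cluster.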
We identified 1,308 groups of proteins, each containing at least one
predicted orthologue in each species and many containing additional
paralogues. The 1,308 groups contained 3,129 human proteins, 1,445 fly
proteins, 1,503 worm proteins and 1,441 yeast proteins. These 1,308
groups represent a conserved core of proteins that are mostly
responsible for the basic 'housekeeping' functions of the cell,
including metabolism, DNA replication and repair, and translation.
In 564 of the 1,308 groups, one orthologue (and no additional
paralogues) could be unambiguously assigned for each of human, fly, worm
and yeast. These groups will be referred to as 1-1-1-1 groups. More than
half (305) of these groups could be assigned to the functional
categories shown in Fig. 37. Within these functional categories, the
numbers of groups containing single orthologues in each of the four
proteomes were: 19 for cellular processes, 66 for metabolism, 31 for DNA
replication and modification, 106 for transcription/translation, 13 for
intracellular signalling, 24 for protein folding and degradation, 38 for
transport, 5 for
Table 24 Probable vertebrate-specific acquisitions of bacterial genes
Human protein (accession) | Predicted function | Known orthologues in other vertebrates | Bacterial homologues: range | Bacterial homologues: best hit | Human origin confirmed by PCR
AAG01853.1 | Formiminotransferase cyclodeaminase | Pig, rat, chicken | Thermotoga, Thermoplasma, Methylobacter | Thermotoga maritima | Yes
CAB81772.1, AAB59448.1, AAA36608.1, AAC41747.1 | Na/glucose cotransporter | Rodents, ungulates | Most bacteria | Vibrio parahaemolyticus | Yes (CAB81772, AAC41747.1); NT* (AAB59448.1, AAA36608.1)
BAA1143.21 | Epoxide hydrolase (α/β-hydrolase) | Mouse, Danio, fugu fish | Most bacteria | Pseudomonas aeruginosa | Yes
CAB59628.1 | Protein-methionine-S-oxide reductase | Cow | Most bacteria | Synechocystis sp. | Yes
BAA91273.1 | Hypertension-associated protein SA/acetate-CoA ligase | Mouse, rat, cow | Most bacteria | Bacillus halodurans | NT*
CAA75608.1 | Glucose-6-phosphate transporter/glycogen storage disease type 1b protein | Mouse, rat | Most bacteria | Chlamydophila pneumoniae | Yes
AAA59548.1, AAB27229.1 | Monoamine oxidase | Cow, rat, salmon | Most bacteria | Mycobacterium tuberculosis | Yes
AAF12736.1, AAA51565.1 | Acyl-CoA dehydrogenase, mitochondrial protein | Mouse, rat, pig | Most bacteria | P. aeruginosa | Yes
IGI_M1_ctg19153_147 | Aldose-1-epimerase | Pig (also found in plants) | Streptomyces, Bacillus | Streptomyces coelicolor | Yes
BAA92632.1 | Predicted carboxylase (C-terminal domain, N-terminal domain unique) | None | Streptomyces, Rhizobium, Bacillus | S. coelicolor | Yes
BAA34458.1 | Uncharacterized protein | None | Gamma-proteobacteria | Escherichia coli | Yes
AAF24044.1 | Uncharacterized protein | None | Most bacteria | T. maritima | Yes
BAA34458.1 | β-Lactamase superfamily hydrolase | None | Most bacteria | Synechocystis sp. | Yes
BAA91839.1 | Oxidoreductase (Rossmann fold) fused to a six-transmembrane protein | None (several human paralogues of both parts) | Actinomycetes, Leptospira; more distant homologues in other bacteria | S. coelicolor | Yes
BAA92073.1 | Oxidoreductase (Rossmann fold) | None | Synechocystis, Pseudomonas | Synechocystis sp. | Yes
BAA92133.1 | α/β-hydrolase | None | Rickettsia; more distant homologues in other bacteria | Rickettsia prowazekii | Yes
BAA91174.1 | ADP-ribosylglycohydrolase | None | Streptomyces, Aquifex, Archaeoglobus (archaeon), E. coli | S. coelicolor | Yes
AAA60043.1 | Thymidine phosphorylase/endothelial cell growth factor | None | Most bacteria | Bacillus stearothermophilus | Yes
BAA86552.1 | Ribosomal protein S6-glutamic acid ligase | None | Most bacteria and archaea | Haemophilus influenzae | Yes
IGI_M1_ctg12741_7 | Ribosomal protein S6-glutamic acid ligase (paralogue of the above) | None | Most bacteria and archaea | H. influenzae | Yes
IGI_M1_ctg13238_61 | Hydratase | None | Synechocystis, Sphingomonas | Synechocystis sp. | Yes
IGI_M1_ctg13305_116 | Homologue of histone macro-2A C-terminal domain, predicted phosphatase | None (several human paralogues, RNA viruses) | Thermotoga, Alcaligenes, E. coli, more distant homologues in other bacteria | T. maritima | Yes
IGI_M1_ctg14420_10 | Sugar transporter | None | Most bacteria | Synechocystis sp. | Yes
IGI_M1_ctg16010_18 | Predicted metal-binding protein | None | Most bacteria | Borrelia burgdorferi | Yes
IGI_M1_ctg16227_58 | Pseudouridine synthase | None | Most bacteria | Zymomonas mobilis | Yes
IGI_M1_ctg25107_24 | Surfactin synthetase domain | None | Gram-positive bacteria, Actinomycetes, Cyanobacteria | Bacillus subtilis | Yes
* NT, not tested.
Representative genes confirmed by PCR to be present in the human genome. The similarity to a bacterial homologue was considered to be 'significantly' greater than that to eukaryotic homologues if the
difference in alignment scores returned by BLASTP was greater than 30 bits (~9 orders of magnitude in terms of E-value). A complete, classified and annotated list of probable vertebrate-specific horizontal
gene transfers detected in this analysis is available as Supplementary Information. cDNA sequences for each protein were searched, using the SSAHA algorithm, against the draft genome sequence.
Primers were designed and PCR was performed using three human genomic samples and a random BAC clone. The predicted genes were considered to be present in the human genome if a band of the
expected size was found in all three human samples but not in the control clone.
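The equivalence between a 30-bit margin and roughly nine orders of magnitude in E-value follows from the standard relation between a BLAST bit score S' and its E-value, E = m·n·2^(-S'), for a fixed search space m·n: a difference of d bits changes E by a factor of 2^d. The short Python sketch below works through that arithmetic and applies the 30-bit margin as a filter. The function names are hypothetical, and treating a protein with no eukaryotic hit as passing the filter is an assumption made here, not a rule stated in the table legend.

```python
import math
from typing import Optional

BIT_MARGIN = 30.0  # margin quoted in the Table 24 legend


def evalue_orders_of_magnitude(bit_difference: float) -> float:
    """Orders of magnitude in E-value implied by a bit-score difference.

    Assumes both alignments are scored against the same search space,
    so that E = m * n * 2 ** (-bits) and the ratio is 2 ** bit_difference.
    """
    return bit_difference * math.log10(2)


def bacterial_like(bacterial_bits: float,
                   best_eukaryote_bits: Optional[float],
                   margin: float = BIT_MARGIN) -> bool:
    """True if the bacterial match beats the best eukaryotic match by more than `margin` bits."""
    if best_eukaryote_bits is None:
        return True  # assumption: no eukaryotic hit at all counts as passing
    return bacterial_bits - best_eukaryote_bits > margin
```

For a 30-bit difference, `evalue_orders_of_magnitude` gives 30 × log10(2), which is about 9.03 and agrees with the figure quoted in the legend.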
© 2001 Macmillan Magazines Ltd
Genes shared with ¯y, worm and yeast. IPI.1 contains apparent
homologues of 61% of the ¯y proteome, 43% of the worm
proteome and 46% of the yeast proteome. We next considered the
groups of proteins containing likely orthologues and paralogues
(genes that arose from intragenome duplication) in human, ¯y,
worm and yeast.
Brie¯y, we performed all-against-all sequence comparison
334
for
the combined protein sets of human, yeast, ¯y and worm. Pairs of
sequences that were one another's best matches in their respective
genomes were considered to be potential orthologues. These were
then used to identify orthologous groups across three organisms
335
.
Recent species-speci®c paralogues were de®ned by using the all-
against-all sequence comparison to cluster the protein set for each
organism. For each sequence found in an orthologous group, the
recent paralogues were de®ned to be the largest species-speci®c
cluster including it. The set of paralogues may be in¯ated by
unrecognized splice variants and by fragmentation.
We identi®ed 1,308 groups of proteins, each containing at least
one predicted orthologue in each species and many containing
additional paralogues. The 1,308 groups contained 3,129 human
proteins, 1,445 ¯y proteins, 1,503 worm proteins and 1,441 yeast
proteins. These 1,308 groups represent a conserved core of proteins
that are mostly responsible for the basic `housekeeping' functions of
the cell, including metabolism, DNA replication and repair, and
translation.
In 564 of the 1,308 groups, one orthologue (and no additional
paralogues) could be unambiguously assigned for each of human,
fly, worm and yeast. These groups will be referred to as 1-1-1-1
groups. More than half (305) of these groups could be assigned to
the functional categories shown in Fig. 37. Within these functional
categories, the numbers of groups containing single orthologues
in each of the four proteomes were: 19 for cellular processes, 66
for metabolism, 31 for DNA replication and modification, 106
for transcription/translation, 13 for intracellular signalling, 24
for protein folding and degradation, 38 for transport, 5 for
multifunctional proteins and 3 for cytoskeletal/structural. No such
groups were found for defence and immunity or cell–cell
communication.

Table 24 Probable vertebrate-specific acquisitions of bacterial genes

| Human protein (accession) | Predicted function | Known orthologues in other vertebrates | Bacterial homologues (range) | Bacterial homologues (best hit) | Human origin confirmed by PCR |
|---|---|---|---|---|---|
| AAG01853.1 | Formiminotransferase cyclodeaminase | Pig, rat, chicken | Thermotoga, Thermoplasma, Methylobacter | Thermotoga maritima | Yes |
| CAB81772.1, AAB59448.1, AAA36608.1, AAC41747.1 | Na/glucose cotransporter | Rodents, ungulates | Most bacteria | Vibrio parahaemolyticus | Yes (CAB81772, AAC41747.1); NT* (AAB59448.1, AAA36608.1) |
| BAA1143.21 | Epoxide hydrolase (α/β-hydrolase) | Mouse, Danio, fugu fish | Most bacteria | Pseudomonas aeruginosa | Yes |
| CAB59628.1 | Protein-methionine-S-oxide reductase | Cow | Most bacteria | Synechocystis sp. | Yes |
| BAA91273.1 | Hypertension-associated protein SA/acetate-CoA ligase | Mouse, rat, cow | Most bacteria | Bacillus halodurans | NT* |
| CAA75608.1 | Glucose-6-phosphate transporter/glycogen storage disease type 1b protein | Mouse, rat | Most bacteria | Chlamydophila pneumoniae | Yes |
| AAA59548.1, AAB27229.1 | Monoamine oxidase | Cow, rat, salmon | Most bacteria | Mycobacterium tuberculosis | Yes |
| AAF12736.1, AAA51565.1 | Acyl-CoA dehydrogenase, mitochondrial protein | Mouse, rat, pig | Most bacteria | P. aeruginosa | Yes |
| IGI_M1_ctg19153_147 | Aldose-1-epimerase | Pig (also found in plants) | Streptomyces, Bacillus | Streptomyces coelicolor | Yes |
| BAA92632.1 | Predicted carboxylase (C-terminal domain, N-terminal domain unique) | None | Streptomyces, Rhizobium, Bacillus | S. coelicolor | Yes |
| BAA34458.1 | Uncharacterized protein | None | Gamma-proteobacteria | Escherichia coli | Yes |
| AAF24044.1 | Uncharacterized protein | None | Most bacteria | T. maritima | Yes |
| BAA34458.1 | β-Lactamase superfamily hydrolase | None | Most bacteria | Synechocystis sp. | Yes |
| BAA91839.1 | Oxidoreductase (Rossmann fold) fused to a six-transmembrane protein | None (several human paralogues of both parts) | Actinomycetes, Leptospira; more distant homologues in other bacteria | S. coelicolor | Yes |
| BAA92073.1 | Oxidoreductase (Rossmann fold) | None | Synechocystis, Pseudomonas | Synechocystis sp. | Yes |
| BAA92133.1 | α/β-hydrolase | None | Rickettsia; more distant homologues in other bacteria | Rickettsia prowazekii | Yes |
| BAA91174.1 | ADP-ribosylglycohydrolase | None | Streptomyces, Aquifex, Archaeoglobus (archaeon), E. coli | S. coelicolor | Yes |
| AAA60043.1 | Thymidine phosphorylase/endothelial cell growth factor | None | Most bacteria | Bacillus stearothermophilus | Yes |
| BAA86552.1 | Ribosomal protein S6-glutamic acid ligase | None | Most bacteria and archaea | Haemophilus influenzae | Yes |
| IGI_M1_ctg12741_7 | Ribosomal protein S6-glutamic acid ligase (paralogue of the above) | None | Most bacteria and archaea | H. influenzae | Yes |
| IGI_M1_ctg13238_61 | Hydratase | None | Synechocystis, Sphingomonas | Synechocystis sp. | Yes |
| IGI_M1_ctg13305_116 | Homologue of histone macro-2A C-terminal domain, predicted phosphatase | None (several human paralogues, RNA viruses) | Thermotoga, Alcaligenes, E. coli, more distant homologues in other bacteria | T. maritima | Yes |
| IGI_M1_ctg14420_10 | Sugar transporter | None | Most bacteria | Synechocystis sp. | Yes |
| IGI_M1_ctg16010_18 | Predicted metal-binding protein | None | Most bacteria | Borrelia burgdorferi | Yes |
| IGI_M1_ctg16227_58 | Pseudouridine synthase | None | Most bacteria | Zymomonas mobilis | Yes |
| IGI_M1_ctg25107_24 | Surfactin synthetase domain | None | Gram-positive bacteria, Actinomycetes, Cyanobacteria | Bacillus subtilis | Yes |

* NT, not tested.
Representative genes confirmed by PCR to be present in the human genome.
The similarity to a bacterial homologue was considered to be
`significantly' greater than that to eukaryotic homologues if the
difference in alignment scores returned by BLASTP was greater than
30 bits (~9 orders of magnitude in terms of E-value). A complete,
classified and annotated list of probable vertebrate-specific horizontal
gene transfers detected in this analysis is available as Supplementary
Information. cDNA sequences for each protein were searched, using the
SSAHA algorithm, against the draft genome sequence. Primers were
designed and PCR was performed using three human genomic samples and a
random BAC clone. The predicted genes were considered to be present in
the human genome if a band of the expected size was found in all three
human samples but not in the control clone.
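The 30-bit threshold quoted in the legend to Table 24 can be related to E-values with a short calculation. Under the standard bit-score relation E = m·n·2^(−S'), where m·n is the size of the search space and S' the bit score, each additional bit halves the expected number of chance hits for a fixed search. The sketch below assumes that relation and a fixed search space; the function name is hypothetical.

```python
import math


def evalue_fold_change(bit_score_difference):
    """Fold change in E-value implied by a bit-score difference.

    Assumes both alignments were scored against the same search space,
    so that E is proportional to 2 ** (-bit_score).
    """
    return 2.0 ** bit_score_difference


fold = evalue_fold_change(30)
orders_of_magnitude = math.log10(fold)
print(f"{fold:.3e} fold, about {orders_of_magnitude:.1f} orders of magnitude")
# prints: 1.074e+09 fold, about 9.0 orders of magnitude
```

A 30-bit difference therefore corresponds to roughly a 10^9-fold difference in E-value, consistent with the figure of about nine orders of magnitude given in the legend.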
The 1-1-1-1 groups probably represent key functions that have
not undergone duplication and elaboration in the various lineages.
They include many anabolic enzymes responsible for such functions
as respiratory chain and nucleotide biosynthesis. In contrast, there
are few catabolic enzymes. As anabolic pathways branch less
frequently than catabolic pathways, this indicates that alternative
routes and displacements are more frequent in catabolic reactions.
If proteins from the single-celled yeast are excluded from the
analysis, there are 1,195 1-1-1 groups. The additional groups
include many examples of more complex signalling proteins, such
as receptor-type and src-like tyrosine kinases, likely to have arisen
early in the metazoan lineage. The fact that this set comprises only a
small proportion of the proteome of each of the animals indicates
that, apart from a modest conserved core, there has been extensive
elaboration and innovation within the protein complement.
Most proteins do not show simple 1-1-1 orthologous relationships
across the three animals. To illustrate this, we investigated the
nuclear hormone receptor family. In the human proteome, this
family consists of 60 different `classical' members, each with a zinc
finger and a ligand-binding domain. In comparison, the fly proteome
has 19 and the worm proteome has 220. As shown in Fig. 39,
few simple orthologous relationships can be derived among these
homologues. And, where potential subgroups of orthologues and
paralogues could be identified, it was apparent that the functions of
the subgroup members could differ significantly. For example, the
fly receptor for the fly-specific hormone ecdysone and the human
retinoic acid receptors cluster together on the basis of sequence
similarity. Such examples underscore that the assignment of
functional similarity on the basis of sequence similarities among these
three organisms is not trivial in most cases.
New vertebrate domains and proteins. We then explored how the
proteome of vertebrates (as represented by the human) differs from
those of the other species considered. The 1,262 InterPro families
were scanned to identify those that contain only vertebrate proteins.
Only 94 (7%) of the families were `vertebrate-specific'. These
represent 70 protein families and 24 domain families. Only one of
the 94 families represents enzymes, which is consistent with the
ancient origins of most enzymes[336]. The single vertebrate-specific
enzyme family identified was the pancreatic or eosinophil-associated
ribonucleases. These enzymes evolved rapidly, possibly to
combat vertebrate pathogens[337].
The relatively small proportion of vertebrate-specific multicopy
families suggests that few new protein domains have been invented
in the vertebrate lineage, and that most protein domains trace at
least as far back as a common animal ancestor. This conclusion must
be tempered by the fact that the InterPro classification system is
incomplete; additional vertebrate-specific families undoubtedly
exist that have not yet been recognized in the InterPro system.
The 94 vertebrate-specific families appear to reflect important
physiological differences between vertebrates and other eukaryotes.
Defence and immunity proteins (23 families) and proteins that
function in the nervous system (17 families) are particularly
enriched in this set. These data indicate the recent emergence or
rapid divergence of these proteins.
Representative human proteins were previously known for nearly
all of the vertebrate-specific families. This was not surprising, given
the anthropocentrism of biological research. However, the analysis
did identify the first mammalian proteins belonging to two of these
families. Both of these families were originally defined in fish. The
first is the family of polar fish antifreeze III proteins. We found a
human sialic acid synthase containing a domain homologous to
polar fish antifreeze III protein (BAA91818.1). This finding suggests
that fish created the antifreeze function by adaptation of this
domain. We also found a human protein (CAB60269.1) homologous
to the ependymin found in teleost fish. Ependymins are
major glycoproteins of fish brains that have been claimed to be
involved in long-term memory formation[338]. The function of the
mammalian ependymin homologue will need to be elucidated.
New architectures from old domains. Whereas there appears to be
only modest invention at the level of new vertebrate protein
domains, there appears to be substantial innovation in the creation
of new vertebrate proteins. This innovation is evident at the level of
domain architecture, defined as the linear arrangement of domains
within a polypeptide. New architectures can be created by shuffling,
adding or deleting domains, resulting in new proteins from old
parts.
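Treating an architecture as the ordered tuple of domains along a polypeptide makes the idea of counting distinct architectures concrete. The sketch below is illustrative only: the protein identifiers and the input format are invented, and it does not reproduce the annotation procedure used for the analysis.

```python
def distinct_architectures(proteome):
    """Return the set of distinct domain architectures in a proteome.

    `proteome` maps a protein identifier to the list of its domains in
    N- to C-terminal order (a hypothetical input format).
    """
    return {tuple(domains) for domains in proteome.values() if domains}


# Invented toy proteome used only to exercise the function.
toy_proteome = {
    "protA": ["kinase"],
    "protB": ["SH2", "kinase"],
    "protC": ["kinase", "SH2"],  # same domains as protB, different order
    "protD": ["SH2", "kinase"],  # same architecture as protB
    "protE": [],                 # no annotated domain, ignored
}

architectures = distinct_architectures(toy_proteome)
print(len(architectures))  # 3
```

Because the tuple preserves order, shuffling, adding or deleting a domain each yields a new architecture, whereas a second protein with an identical arrangement does not increase the count.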
We quantified the number of distinct protein architectures found
in yeast, worm, fly and human by using the SMART annotation
resource[339] (Fig. 40). The human proteome set contained 1.8 times
as many protein architectures as worm or fly and 5.8 times as many
as yeast. This difference is most prominent in the recent evolution of
novel extracellular and transmembrane architectures in the human
lineage. Human extracellular proteins show the greatest innovation:
the human has 2.3 times as many extracellular architectures as fly
and 2.0 times as many as worm. The larger number of human
architectures does not simply reflect differences in the number of
domains known in these organisms; the result remains qualitatively
the same even if the number of architectures in each organism is
normalized by dividing by the total number of domains (not
shown). (We also checked that the larger number of human
architectures could not be an artefact resulting from erroneous gene
predictions. Three-quarters of the architectures can be found in
known genes, which already yields an increase of about 50% over
worm and fly. We expect the final number of human architectures to
grow as the complete gene set is identified.)

[Figure 39: cladogram. Lineages shown: worm, fly, human. Labelled receptor
groups: retinoic acids, steroid hormone, vitamin D3, hepatocyte nuclear
factors, thyroid hormone, peroxisome proliferator activated,
apolipoprotein regulatory protein, ecdysone.]
Figure 39 Simplified cladogram (relationship tree) of the `many-to-many' relationships of
classical nuclear receptors. Triangles indicate expansion within one lineage; bars
represent single members. Numbers in parentheses indicate the number of paralogues in
each group.

A related measure of proteome complexity can be obtained by
considering an individual domain and counting the number of
different domain types with which it co-occurs. For example,
the trypsin-like serine protease domain (number 12 in Fig. 41)
co-occurs with 18 domain types in human (including proteins
involved in the mammalian complement system, blood coagulation,
and fibrinolytic and related systems). By contrast, the trypsin-like
serine protease domain occurs with only eight other domains in fly,
five in worm and one in yeast. Similar results for 27 common domains
are shown in Fig. 41. In general, there are more different co-occurring
domains in the human proteome than in the other proteomes.
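The co-occurrence measure described above amounts to collecting, for each domain type, the set of other domain types seen in the same protein. The following sketch assumes the same hypothetical input format of protein identifiers mapped to domain lists; the toy proteins are invented and the counts bear no relation to the values in Fig. 41.

```python
from collections import defaultdict


def cooccurring_domain_counts(proteome):
    """Count, for each domain type, the distinct other types it co-occurs with."""
    partners = defaultdict(set)
    for domains in proteome.values():
        kinds = set(domains)  # repeated copies of a domain count once
        for domain in kinds:
            partners[domain] |= kinds - {domain}
    return {domain: len(others) for domain, others in partners.items()}


# Invented toy proteins; domain names are used as labels only.
toy_proteins = {
    "p1": ["trypsin", "kringle"],
    "p2": ["trypsin", "EGF", "EGF"],
    "p3": ["trypsin"],
    "p4": ["EGF", "sushi"],
}

counts = cooccurring_domain_counts(toy_proteins)
print(counts["trypsin"])  # 2
```

Here the trypsin label co-occurs with two other domain types (kringle and EGF), and the tandem EGF copies in p2 are counted once.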
One mechanism by which architectures evolve is through the
fusion of additional domains, often at one or both ends of the
proteins. Such `domain accretion'[340] is seen in many human proteins
when compared with proteins from other eukaryotes. The effect is
illustrated by several chromatin-associated proteins (Fig. 42). In
these examples, the domain architectures of human proteins differ
from those found in yeast, worm and fly proteins only by the
addition of domains at their termini.
Among chromatin-associated proteins and transcription factors,
a significant proportion of domain architectures is shared between
the vertebrate and fly, but not with worm (Fig. 43a). The trend was
even more prominent in architectures of proteins involved in
another key cellular process, programmed cell death (Fig. 43b).
These examples might seem to bear upon the unresolved issue of the
evolutionary branching order of worms, flies and humans, suggesting
that worms branched off first. However, there were other cases in
which worms and humans shared architectures not present in fly. A
global analysis of shared architectures could not conclusively
distinguish between the two models, given the possibility of
lineage-specific loss of architectures. Comparison of protein
architectures may help to resolve the evolutionary issue, but it will
require more detailed analyses of many protein families.
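The comparison of shared architectures can be pictured as a partition of three sets into the categories used in Fig. 43. The sketch below is a toy illustration under the assumption that each species is represented by a set of architectures (tuples of domain names); the sets are invented and the function name is hypothetical.

```python
def classify_shared_architectures(human, fly, worm):
    """Partition architectures present in at least two of three species."""
    return {
        "human and fly": (human & fly) - worm,
        "human and worm": (human & worm) - fly,
        "worm and fly": (worm & fly) - human,
        "all three": human & fly & worm,
    }


# Invented toy sets; the single-letter domain names are placeholders.
human = {("A", "B"), ("A", "C"), ("D",), ("E", "F")}
fly = {("A", "B"), ("A", "C"), ("G",)}
worm = {("A", "B"), ("D",), ("G",)}

shared = classify_shared_architectures(human, fly, worm)
for label, architectures in shared.items():
    print(label, len(architectures))
```

Such counts alone cannot settle a branching order: an architecture missing from one lineage may have been lost there rather than gained in the other two, which is the caveat raised in the text.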
[Figure 40: stacked bar chart. x-axis: human, fly, worm, yeast; y-axis:
number of distinct domain architectures (0 to 2,000); series:
transmembrane, extracellular, intracellular.]
Figure 40 Number of distinct domain architectures in the four eukaryotic genomes,
predicted using SMART[339]. The number of architectures is split into three cellular
environments: intracellular, extracellular and membrane-associated. The increase in
architectures for the human, relative to the other lineages, is seen when these numbers
are normalized with respect to the numbers of domains predicted in each phylum. To
avoid artefactual results from the relatively low detection rate for some repeat types,
tandem occurrences of tetratricopeptide, armadillo, EF-hand, leucine-rich, WD40 or
ankyrin repeats or C2H2-type zinc fingers were treated as single occurrences.

[Figure 41: bar chart. x-axis: domain family (1 to 27); y-axis: number of
co-occurring domains (0 to 70); series: mustard weed, yeast, worm, fly,
human.]
Figure 41 Number of different Pfam domain types that co-occur in the same protein, for
each of the 10 most common domain families in each of the five eukaryotic proteomes.
Because some common domain families are shared, there are 27 families rather than 50.
The data are ranked according to decreasing numbers of human co-occurring Pfam
domains. The domain families are: (1) eukaryotic protein kinase [IPR000719];
(2) immunoglobulin domain [IPR003006]; (3) ankyrin repeat [IPR002110]; (4) RING finger
[IPR001841]; (5) C2H2-type zinc finger [IPR000822]; (6) ATP/GTP-binding P-loop
[IPR001687]; (7) reverse transcriptase (RNA-dependent DNA polymerase) [IPR000477];
(8) leucine-rich repeat [IPR001611]; (9) G-protein β WD-40 repeats [IPR001680];
(10) RNA-binding region RNP-1 (RNA recognition motif) [IPR000504]; (11) C-type lectin
domain [IPR001304]; (12) serine proteases, trypsin family [IPR001254]; (13) helicase
C-terminal domain [IPR001650]; (14) collagen triple helix repeat [IPR000087];
(15) rhodopsin-like GPCR superfamily [IPR000276]; (16) esterase/lipase/thioesterase
[IPR000379]; (17) Myb DNA-binding domain [IPR001005]; (18) F-box domain
[IPR001810]; (19) ATP-binding transport protein, 2nd P-loop motif [IPR001051];
(20) homeobox domain [IPR001356]; (21) C4-type steroid receptor zinc finger
[IPR001628]; (22) sugar transporter [IPR001066]; (23) PPR repeats [IPR002885];
(24) seven-helix G-protein-coupled receptor, worm (probably olfactory) family [IPR000168];
(25) cytochrome P450 enzyme [IPR001128]; (26) fungal transcriptional regulatory protein,
N terminus [IPR001138]; (27) domain of unknown function DUF38 [IPR002900].

New physiology from old proteins. An important aspect of
vertebrate innovation lies in the expansion of protein families. Table
25 shows the most prevalent protein domains and protein families
in humans, together with their relative ranks in the other species.
About 60% of families are more numerous in the human than in any
of the other four organisms. This shows that gene duplication has
been a major evolutionary force during vertebrate evolution. A
comparison of relative expansions in human versus fly is shown in
Fig. 44.
Many of the families that are expanded in human relative to fly
and worm are involved in distinctive aspects of vertebrate physiology.
An example is the family of immunoglobulin (IG) domains,
first identified in antibodies thirty years ago. Classic (as opposed to
divergent) IG domains are completely absent from the yeast and
mustard weed proteomes and, although prokaryotic homologues
exist, they have probably been transferred horizontally from
metazoans[341]. Most IG superfamily proteins in invertebrates are
cell-surface proteins. In vertebrates, the IG repertoire includes
immune functions such as those of antibodies, MHC proteins,
antibody receptors and many lymphocyte cell-surface proteins. The
large expansion of IG domains in vertebrates shows the versatility of
a single family in evoking rapid and effective response to infection.
Two prominent families are involved in the control of development.
The human genome contains 30 fibroblast growth factors
(FGFs), as opposed to two FGFs each in the fly and worm. It
contains 42 transforming growth factor-βs (TGFβs) compared with
nine and six in the fly and worm, respectively. These growth factors
are involved in organogenesis, such as that of the liver and the lung.
A fly FGF protein, branchless, is involved in developing respiratory
organs (tracheae) in embryos[342]. Thus, developmental triggers of
morphogenesis in vertebrates have evolved from related but simpler
systems in invertebrates[343].
Another example is the family of intermediate filament proteins,
with 127 family members. This expansion is almost entirely due to
111 keratins, which are chordate-specific intermediate filament
proteins that form filaments in epithelia. The large number of
human keratins suggests multiple cellular structural support roles
for the many specialized epithelia of vertebrates.
Finally, the olfactory receptor genes comprise a huge gene family
of about 1,000 genes and pseudogenes
344,345
. The number of olfac-
tory receptors testi®es to the importance of the sense of smell in
vertebrates. A total of 906 olfactory receptor genes and pseudogenes
could be identi®ed in the draft genome sequence, two-thirds of
which were not previously annotated. About 80% are found in
about two dozen clusters ranging from 6 to 138 genes and encom-
passing about 30 Mb (,1%) of the human genome. Despite the
importance of smell among our vertebrate ancestors, hominids
articles
906 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
Figure 42 Examples of domain accretion in chromatin proteins. Domain accretion in
various lineages before the animal divergence, in the apparent coelomate lineage and the
vertebrate lineage are shown using schematic representations of domain architectures
(not to scale). Asterisks, mobile domains that have participated in the accretion. Species
in which a domain architecture has been identified are indicated above the diagram
(Y, yeast; W, worm; F, fly; V, vertebrate). Protein names are below the diagrams. The
domains are SET, a chromatin protein methyltransferase domain; SWI2, a superfamily II
helicase/ATPase domain; Sa, sant domain; Br, bromo domain; Ch, chromodomain; C, a
cysteine triad motif associated with the Msl-2 and SET domains; A, AT hook motif;
EP1/EP2, enhancer of polycomb domains 1 and 2; Znf, zinc finger; sja, SET-JOR-associated
domain (L. Aravind, unpublished); Me, DNA methylase/Hrx-associated DNA binding zinc
finger; Ba, bromo-associated homology motif. a–c, Different examples of accretion.
[Schematic not reproduced. Proteins named in the diagram: RSC1/2, CG11375, YPR031w,
E(Pc)-like, Lin-49, peregrin, CHD-3/T14G8.1, Mi-2, Trx, ALR, Hrx.]
Figure 43 Conservation of architectures between animal species. The pie charts illustrate
the shared domain architectures of apparent orthologues that are conserved in at least
two of the three sequenced animal genomes. If an architecture was detected in fungi or
plants, as well as two of the animal lineages, it was omitted as ancient and its absence in
the third animal lineage attributed to gene loss. a, Chromatin-associated proteins.
b, Components of the programmed cell death system.
[Pie charts not reproduced. Panel titles and segment values as extracted: 'Conserved
domain architectures in chromatin proteins' (31, 8, 3, 60); 'Conserved domain
architectures in apoptotic proteins' (16, 3, 2, 10). Legend: human and fly; human and
worm; worm and fly; all three.]
appear to have considerably less interest in this sense. About 60% of
the olfactory receptors in the draft genome sequence have disrupted
ORFs and appear to be pseudogenes, consistent with recent
reports (refs 344, 346) suggesting massive functional gene loss in the
last 10 Myr (refs 347, 348). Interestingly, there appears to be a much
higher proportion of intact genes among class I than class II olfactory
receptors, suggesting functional importance.
Vertebrates are not unique in employing gene family expansion.
For many domain types, expansions appear to have occurred
independently in each of the major eukaryotic lineages. A good
example is the classical C2H2 family of zinc finger domains, which
have expanded independently in the yeast, worm, fly and human
lineages (Fig. 45). These independent expansions have resulted in
numerous C2H2 zinc finger domain-containing proteins that are
specific to each lineage. In flies, the important components of the
C2H2 zinc finger expansion are architectures in which it is
combined with the POZ domain and the C4DM domain (a
metal-binding domain found only in fly). In humans, the most prevalent
expansions are combinations of the C2H2 zinc finger with POZ
(independent of the one in insects) and the vertebrate-specific
KRAB and SCAN domains.
The homeodomain is similarly expanded in all animals and is
present both in conserved architectures and in lineage-specific
architectures (Fig. 45). This indicates that the ancestral animal
probably encoded a significant number of homeodomain proteins,
but subsequent evolution involved multiple, independent
expansions and domain shuffling after lineages diverged. Thus, the most
prevalent transcription factor families are different in worm, fly and
human (Fig. 45). This has major biological implications because
transcription factors are critical in animal development and
differentiation. The emergence of major variations in the developmental
body plans that accompanied the early radiation of the animals (ref. 349)
could have been driven by lineage-specific proliferation of such
transcription factors. Beyond these large expansions of protein
families, protein components of particular functional systems
such as the cell death signalling system show a general increase in
diversity and numbers in the vertebrates relative to other animals.
For example, there are greater numbers of and more novel
architectures in cell death regulatory proteins such as BCL-2, TNFR and
NFκB from vertebrates.
Conclusion. Five lines of evidence point to an increase in the
complexity of the proteome from the single-celled yeast to the
multicellular invertebrates and to vertebrates such as the human.
Specifically, the human contains greater numbers of genes, domain
and protein families, paralogues, multidomain proteins with
multiple functions, and domain architectures. According to these
measures, the relatively greater complexity of the human proteome
is a consequence not simply of its larger size, but also of large-scale
protein innovation.
An important question is the extent to which the greater
phenotypic complexity of vertebrates can be explained simply by
two- or threefold increases in proteome complexity. The real
explanation may lie in combinatorial amplification of these
modest differences, by mechanisms that include alternative splicing,
post-translational modification and cellular regulatory networks.
The potential numbers of different proteins and protein–protein
interactions are vast, and their actual numbers cannot readily be
discerned from the genome sequence. Elucidating such system-
level properties presents one of the great challenges for modern
biology.
Table 25 The most populous InterPro families in the human proteome and other species
Each species column gives the number of genes, with the rank in parentheses.
InterPro ID  Human      Fly        Worm       Yeast      Mustard weed  Family
IPR003006    765 (1)    140 (9)    64 (34)    0 (na)     0 (na)        Immunoglobulin domain
IPR000822    706 (2)    357 (1)    151 (10)   48 (7)     115 (20)      C2H2 zinc finger
IPR000719    575 (3)    319 (2)    437 (2)    121 (1)    1049 (1)      Eukaryotic protein kinase
IPR000276    569 (4)    97 (14)    358 (3)    0 (na)     16 (84)       Rhodopsin-like GPCR superfamily
IPR001687    433 (5)    198 (4)    183 (7)    97 (2)     331 (5)       P-loop motif
IPR000477    350 (6)    10 (65)    50 (41)    6 (36)     80 (35)       Reverse transcriptase (RNA-dependent DNA polymerase)
IPR000504    300 (7)    157 (6)    96 (21)    54 (6)     255 (8)       rrm domain
IPR001680    277 (8)    162 (5)    102 (19)   91 (3)     210 (10)      G-protein β WD-40 repeats
IPR002110    276 (9)    105 (13)   107 (17)   19 (23)    120 (18)      Ankyrin repeat
IPR001356    267 (10)   148 (7)    109 (15)   9 (33)     118 (19)      Homeobox domain
IPR001849    252 (11)   77 (22)    71 (31)    27 (17)    27 (73)       PH domain
IPR002048    242 (12)   111 (12)   81 (25)    15 (27)    167 (12)      EF-hand family
IPR000561    222 (13)   81 (20)    113 (14)   0 (na)     17 (83)       EGF-like domain
IPR001452    215 (14)   72 (23)    62 (35)    25 (18)    3 (97)        SH3 domain
IPR001841    210 (15)   114 (11)   126 (12)   35 (12)    379 (4)       RING finger
IPR001611    188 (16)   115 (10)   54 (38)    7 (35)     392 (2)       Leucine-rich repeat
IPR001909    171 (17)   0 (na)     0 (na)     0 (na)     0 (na)        KRAB box
IPR001777    165 (18)   63 (27)    51 (40)    2 (40)     4 (96)        Fibronectin type III domain
IPR001478    162 (19)   70 (24)    66 (33)    2 (40)     15 (85)       PDZ domain
IPR001650    155 (20)   87 (17)    78 (27)    79 (4)     148 (13)      Helicase C-terminal domain
IPR001440    150 (21)   86 (18)    46 (43)    36 (11)    125 (17)      TPR repeat
IPR002216    133 (22)   65 (26)    99 (20)    2 (40)     31 (69)       Ion transport protein
IPR001092    131 (23)   84 (19)    41 (46)    7 (35)     106 (24)      Helix–loop–helix DNA-binding domain
IPR000008    123 (24)   43 (34)    36 (49)    9 (33)     82 (34)       C2 domain
IPR001664    119 (25)   4 (71)     22 (63)    1 (41)     2 (98)        SH2 domain
IPR001254    118 (26)   210 (3)    12 (73)    1 (41)     15 (85)       Serine protease, trypsin family
IPR002126    114 (27)   19 (56)    16 (69)    0 (na)     0 (na)        Cadherin domain
IPR000210    113 (28)   78 (21)    117 (13)   1 (41)     54 (50)       BTB/POZ domain
IPR000387    112 (29)   35 (40)    108 (16)   12 (30)    21 (79)       Tyrosine-specific protein phosphatase and dual specificity protein phosphatase family
IPR000087    106 (30)   18 (57)    169 (9)    0 (na)     5 (95)        Collagen triple helix repeat
IPR000379    94 (31)    141 (8)    134 (11)   40 (10)    194 (11)      Esterase/lipase/thioesterase
IPR000910    89 (32)    38 (38)    18 (67)    8 (34)     18 (82)       HMG1/2 (high mobility group) box
IPR000130    87 (33)    56 (29)    92 (22)    8 (34)     12 (88)       Neutral zinc metallopeptidase
IPR001965    84 (34)    37 (39)    24 (61)    16 (26)    71 (39)       PHD-finger
IPR000636    83 (35)    32 (43)    24 (61)    1 (41)     14 (86)       Cation channels (non-ligand gated)
IPR001781    81 (36)    38 (38)    36 (49)    4 (38)     8 (92)        LIM domain
IPR002035    81 (36)    8 (67)     45 (44)    3 (39)     17 (83)       VWA domain
IPR001715    80 (37)    33 (42)    30 (55)    3 (39)     18 (82)       Calponin homology domain
IPR000198    77 (38)    20 (55)    20 (65)    10 (32)    9 (91)        RhoGAP domain
Forty most populous InterPro families found in the human proteome compared with equivalent numbers from other species. na, not applicable (used when there are no proteins in an organism in that family).
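
As a rough illustration of how Table 25 can be read, the following sketch computes unnormalized human-to-fly and human-to-worm ratios for a handful of rows copied from the table. The choice of rows and the helper names `fold_expansion` and `summarize` are illustrative assumptions, not part of the original analysis, and the ratios are not corrected for differences in proteome size.

```python
# Gene counts copied from Table 25 for a few families: (human, fly, worm).
TABLE_25_SAMPLE = {
    "Immunoglobulin domain": (765, 140, 64),
    "C2H2 zinc finger": (706, 357, 151),
    "Eukaryotic protein kinase": (575, 319, 437),
    "KRAB box": (171, 0, 0),
    "Serine protease, trypsin family": (118, 210, 12),
}


def fold_expansion(human, other):
    """Return human/other, or None when the family is absent from the other species."""
    return human / other if other else None


def summarize(table):
    """Map each family name to (ratio versus fly, ratio versus worm)."""
    return {
        name: (fold_expansion(human, fly), fold_expansion(human, worm))
        for name, (human, fly, worm) in table.items()
    }


if __name__ == "__main__":
    for name, (vs_fly, vs_worm) in summarize(TABLE_25_SAMPLE).items():
        fly_txt = "absent in fly" if vs_fly is None else f"{vs_fly:.1f}x fly"
        worm_txt = "absent in worm" if vs_worm is None else f"{vs_worm:.1f}x worm"
        print(f"{name}: {fly_txt}, {worm_txt}")
```

A ratio below 1 (as for the trypsin-family serine proteases versus fly) marks a family that is larger in the other species, and `None` marks a family with no members there, such as the KRAB box.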
Segmental history of the human genome
In bacteria, genomic segments often convey important information
about function: genes located close to one another often encode
proteins in a common pathway and are regulated in a common
operon. In mammals, genes found close to each other only rarely
have common functions, but they are still interesting because they
have a common history. In fact, the study of genomic segments can
shed light on biological events as long as 500 Myr ago and as recently
as 20,000 years ago.
Conserved segments between human and mouse
Humans and mice shared a common ancestor about 100 Myr ago.
Despite the 200 Myr of evolutionary distance between the species, a
significant fraction of genes show synteny between the two, being
preserved within conserved segments. Genes tightly linked in one
mammalian species tend to be linked in others. In fact, conserved
segments have been observed in even more distant species: humans
show conserved segments with fish (refs 350, 351) and even with
invertebrates such as fly and worm (ref. 352). In general, the
likelihood that a syntenic relationship will be disrupted correlates
with the physical distance between the loci and the evolutionary
distance between the species.
Studying conserved segments between human and mouse has
several uses. First, conservation of gene order has been used to
identify likely orthologues between the species, particularly when
investigating disease phenotypes. Second, the study of conserved
segments among genomes helps us to deduce evolutionary ancestry.
And third, detailed comparative maps may assist in the assembly of
the mouse sequence, using the human sequence as a scaffold.
Two types of linkage conservation are commonly described (ref. 353).
'Conserved synteny' indicates that at least two genes that reside on a
common chromosome in one species are also located on a common
chromosome in the other species. Syntenic loci are said to lie in a
'conserved segment' when not only the chromosomal position but also
the linear order of the loci has been preserved, without interruption
by other chromosomal rearrangements.
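
The distinction between the two definitions can be made concrete with a small sketch. The `Locus` record, the function names and the example coordinates used with them are hypothetical; the check only considers the loci it is given, so it cannot detect interruption by genes missing from the input.

```python
from collections import defaultdict
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    human_chr: str
    human_pos: float  # position on the human chromosome (for example, Mb)
    mouse_chr: str
    mouse_pos: float  # position on the mouse chromosome (for example, cM)


def conserved_synteny(loci):
    """Group loci by (human chromosome, mouse chromosome); keep groups of >= 2 genes."""
    groups = defaultdict(list)
    for locus in loci:
        groups[(locus.human_chr, locus.mouse_chr)].append(locus.name)
    return {pair: names for pair, names in groups.items() if len(names) >= 2}


def is_conserved_segment(loci):
    """True if the loci share one chromosome per species and keep their linear order.

    Either orientation counts, because an inverted block still preserves order.
    """
    if len(loci) < 2:
        return False
    if len({l.human_chr for l in loci}) != 1 or len({l.mouse_chr for l in loci}) != 1:
        return False
    ordered = sorted(loci, key=lambda l: l.human_pos)
    mouse = [l.mouse_pos for l in ordered]
    ascending = all(a <= b for a, b in zip(mouse, mouse[1:]))
    descending = all(a >= b for a, b in zip(mouse, mouse[1:]))
    return ascending or descending
```

Three genes on human chromosome 7 that all map to mouse chromosome 5 show conserved synteny, but they form a conserved segment only if their mouse positions run monotonically when the genes are sorted by human position.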
An initial survey of homologous loci in human and mouse (ref. 354)
suggested that the total number of conserved segments would be
about 180. Subsequent estimates based on increasingly detailed
comparative maps have remained close to this projection
(refs 353, 355, 356; http://www.informatics.jax.org). The distribution
of segment lengths has corresponded reasonably well to the truncated
negative exponential curve predicted by the random breakage model
(ref. 357).
The availability of a draft human genome sequence allows the first
global human–mouse comparison in which human physical
distances can be measured in Mb, rather than cM or orthologous gene
counts. We identified likely orthologues by reciprocal comparison
of the human and mouse mRNAs in the LocusLink database, using
megaBLAST. For each orthologous pair, we mapped the location of
the human gene in the draft genome sequence and then checked the
location of the mouse gene in the Mouse Genome Informatics
database (http://www.informatics.jax.org). Using a conservative
threshold, we identified 3,920 orthologous pairs in which the
human gene could be mapped on the draft genome sequence with
high confidence. Of these, 2,998 corresponding mouse genes had a
known position in the mouse genome. We then searched for
definitive conserved segments, defined as human regions containing
orthologues of at least two genes from the same mouse chromosome
region (<15 cM) without interruption by segments from other
chromosomes.
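
The procedure described above can be sketched in two steps: pairing genes by reciprocal best hit, then scanning each human chromosome for uninterrupted runs of orthologues. This is a minimal sketch under stated assumptions, not the consortium's pipeline. The function names, the input layout and the reading of the 15 cM criterion (the whole run must span less than 15 cM on one mouse chromosome) are all assumptions, and the best-hit tables are taken as already computed rather than produced by megaBLAST here.

```python
def reciprocal_best_hits(human_to_mouse, mouse_to_human):
    """Pair genes whose best hits point at each other.

    Each argument maps a query ID to the ID of its best hit in the other species.
    """
    return sorted(
        (h, m) for h, m in human_to_mouse.items() if mouse_to_human.get(m) == h
    )


def scan_segments(orthologues, max_span_cm=15.0):
    """Split one human chromosome into runs of orthologues.

    orthologues: list of (human_pos_mb, mouse_chr, mouse_cm) tuples.
    A run is extended while the next gene lies on the same mouse chromosome
    and the run's mouse positions span less than max_span_cm (assumed rule).
    Returns (segments, singletons): runs of >= 2 genes and runs of 1 gene.
    """
    runs = []
    current = []
    for gene in sorted(orthologues):  # walk along the human chromosome
        if current:
            cms = [g[2] for g in current] + [gene[2]]
            same_chr = gene[1] == current[-1][1]
            if same_chr and max(cms) - min(cms) < max_span_cm:
                current.append(gene)
                continue
            runs.append(current)  # run interrupted: close it
        current = [gene]
    if current:
        runs.append(current)
    segments = [run for run in runs if len(run) >= 2]
    singletons = [run for run in runs if len(run) == 1]
    return segments, singletons
```

In this reading, a lone gene from another mouse chromosome both ends the current run and becomes a singleton, which is why spurious orthologue assignments would inflate the singleton count.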
We identified 183 definitive conserved segments (Fig. 46). The
average segment length was 15.4 Mb, with the largest segment being
90.5 Mb and the smallest 24 kb. There were also 141 'singletons',
segments that contained only a single locus; these are not counted in
the statistics. Although some of these could be short conserved
segments, they could also reflect incorrect choices of orthologues or
problems with the human or mouse maps. Because of this
conservative approach, the observed number of definitive segments is
likely to be lower than the correct total. One piece of evidence for this
conclusion comes from a more detailed analysis on human
chromosome 7 (ref. 358), which identified 20 conserved segments, of
which three were singletons. Our analysis revealed only 13 definitive
segments on this chromosome, with nine singletons.
The frequency of observing a particular gene count in a conserved
segment is plotted on a logarithmic scale in Fig. 47. If chromosomal
breaks occur in a random fashion (as has been proposed) and
differences in gene density are ignored, a roughly straight line
should result. There is a clear excess for n = 1, suggesting that 50%
or more of the singletons are indeed artefactual. Thus, we estimate
that the true number of conserved segments is around 190–230, in good
agreement with the original Nadeau–Taylor prediction (ref. 354).
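
The expectation of a roughly straight line can be illustrated with a small simulation. The sketch below is not the analysis behind Fig. 47: it assumes that genes and breakpoints both fall uniformly at random on a genome of unit length, and the number of breaks used in the example (250) is a hypothetical value, whereas 2,998 is the number of mapped orthologues given in the text. Under these assumptions the number of segments holding n genes falls off approximately geometrically with n, which is linear on a logarithmic scale.

```python
import math
import random
from bisect import bisect_right
from collections import Counter


def simulate_gene_counts(n_genes, n_breaks, seed=0):
    """Drop breakpoints and genes uniformly on [0, 1); return genes per segment."""
    rng = random.Random(seed)
    breaks = sorted(rng.random() for _ in range(n_breaks))
    counts = [0] * (n_breaks + 1)  # n_breaks breakpoints give n_breaks + 1 segments
    for _ in range(n_genes):
        counts[bisect_right(breaks, rng.random())] += 1
    return counts


def count_histogram(counts):
    """Map gene count n (n >= 1) to the number of segments with exactly n genes."""
    return Counter(c for c in counts if c > 0)


if __name__ == "__main__":
    hist = count_histogram(simulate_gene_counts(n_genes=2998, n_breaks=250, seed=1))
    for n in sorted(hist):
        print(n, hist[n], round(math.log10(hist[n]), 2))
```

Segments with no genes are dropped from the histogram because they would be invisible in an orthologue-based map. An observed excess at n = 1 relative to the fitted line is then the signal that the text attributes to artefactual singletons.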
Figure 48 shows a plot of the frequency of lengths of conserved
segments, where the x-axis scale is shown in Mb. As before, there is a
fair amount of scatter in the data for the larger segments (where the
numbers are small), but the trend appears to be consistent with a
random breakage model.
We attempted to ascertain whether the breakpoint regions have
any special characteristics. This analysis was complicated by
imprecision in the positioning of these breaks, which will tend to blur any
relationships. With 2,998 orthologues, the average interval within
which a break is known to have occurred is about 1.1 Mb. We
compared the aggregate features of these breakpoint intervals with
the genome as a whole. The mean gene density was lower in
breakpoint regions than in the conserved segments (13.8 versus
Figure 44 Relative expansions of protein families between human and fly. These data
have not been normalized for proteomic size differences. Blue line, equality between
normalized family sizes in the two organisms. Green line, equality between unnormalized
family sizes. Numbered InterPro entries: (1) immunoglobulin domain [IPR003006]; (2) zinc
finger, C2H2 type [IPR000822]; (3) eukaryotic protein kinase [IPR000719]; (4) rhodopsin-
like GPCR superfamily [IPR000276]; (5) ATP/GTP-binding site motif A (P-loop)
[IPR001687]; (6) reverse transcriptase (RNA-dependent DNA polymerase) [IPR000477];
(7) RNA-binding region RNP-1 (RNA recognition motif) [IPR000504]; (8) G-protein β
WD-40 repeats [IPR001680]; (9) ankyrin repeat [IPR002110]; (10) homeobox domain
[IPR001356]; (11) PH domain [IPR001849]; (12) EF-hand family [IPR002048]; (13) EGF-
like domain [IPR000561]; (14) Src homology 3 (SH3) domain [IPR001452]; (15) RING
finger [IPR001841]; (16) KRAB box [IPR001909]; (17) leucine-rich repeat [IPR001611];
(18) fibronectin type III domain [IPR001777]; (19) PDZ domain (also known as DHR or
GLGF) [IPR001478]; (20) TPR repeat [IPR001440]; (21) helicase C-terminal domain
[IPR001650]; (22) ion transport protein [IPR002216]; (23) helix–loop–helix DNA-binding
domain [IPR001092]; (24) cadherin domain [IPR002126]; (25) intermediate filament
proteins [IPR001664]; (26) C2 domain [IPR000008]; (27) Src homology 2 (SH2) domain
[IPR000980]; (28) serine proteases, trypsin family [IPR001254]; (29) BTB/POZ domain
[IPR000210]; (30) tyrosine-specific protein phosphatase and dual specificity protein
phosphatase family [IPR000387]; (31) collagen triple helix repeat [IPR000087]; (32)
esterase/lipase/thioesterase [IPR000379]; (33) neutral zinc metallopeptidases, zinc-
binding region [IPR000130]; (34) ATP-binding transport protein, 2nd P-loop motif
[IPR001051]; (35) ABC transporters family [IPR001617]; (36) cytochrome P450 enzyme
[IPR001128]; (37) insect cuticle protein [IPR000618].
[Scatter plot not reproduced. x-axis, Human (number of genes), tick labels 0–900;
y-axis, Fly (number of genes), tick labels 50–400; points labelled 1–37.]
© 2001 Macmillan Magazines Ltd
In bacteria, genomic segments often convey important information
about function: genes located close to one another often encode
proteins in a common pathway and are regulated in a common
operon. In mammals, genes found close to each other only rarely
have common functions, but they are still interesting because they
have a common history. In fact, the study of genomic segments can
shed light on biological events as long as 500 Myr ago and as recently
as 20,000 years ago.
Conserved segments between human and mouse
Humans and mice shared a common ancestor about 100 Myr ago.
Despite the 200 Myr of evolutionary distance between the species, a
signi®cant fraction of genes show synteny between the two, being
preserved within conserved segments. Genes tightly linked in one
mammalian species tend to be linked in others. In fact, conserved
segments have been observed in even more distant species: humans
show conserved segments with ®sh
350,351
and even with invertebrates
such as ¯y and worm
352
. In general, the likelihood that a syntenic
relationship will be disrupted correlates with the physical distance
between the loci and the evolutionary distance between the species.
Studying conserved segments between human and mouse has
several uses. First, conservation of gene order has been used to
identify likely orthologues between the species, particularly when
investigating disease phenotypes. Second, the study of conserved
segments among genomes helps us to deduce evolutionary ancestry.
And third, detailed comparative maps may assist in the assembly of
the mouse sequence, using the human sequence as a scaffold.
Two types of linkage conservation are commonly described
353
.
`Conserved synteny' indicates that at least two genes that reside on a
common chromosome in one species are also located on a common
chromosome in the other species. Syntenic loci are said to lie in a
`conserved segment' when not only the chromosomal position but
the linear order of the loci has been preserved, without interruption
by other chromosomal rearrangements.
An initial survey of homologous loci in human and mouse
354
suggested that the total number of conserved segments would be
about 180. Subsequent estimates based on increasingly detailed
comparative maps have remained close to this projection
353,355,356
(http://www.informatics.jax.org). The distribution of segment
lengths has corresponded reasonably well to the truncated negative
exponential curve predicted by the random breakage model
357
.
The availability of a draft human genome sequence allows the first
global human–mouse comparison in which human physical distances
can be measured in Mb, rather than cM or orthologous gene
counts. We identified likely orthologues by reciprocal comparison
of the human and mouse mRNAs in the LocusLink database, using
megaBLAST. For each orthologous pair, we mapped the location of
the human gene in the draft genome sequence and then checked the
location of the mouse gene in the Mouse Genome Informatics
database (http://www.informatics.jax.org). Using a conservative
threshold, we identified 3,920 orthologous pairs in which the
human gene could be mapped on the draft genome sequence with
high confidence. Of these, 2,998 corresponding mouse genes had a
known position in the mouse genome. We then searched for
definitive conserved segments, defined as human regions containing
orthologues of at least two genes from the same mouse chromosome
region (< 15 cM) without interruption by segments from other
chromosomes.
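As an illustration only, the segment definition above can be sketched in Python. The record type `OrthologuePair`, the function names and the toy coordinates are hypothetical, and treating the 15-cM limit as a bound on the distance between consecutive mouse loci is one assumed reading of the criterion, not necessarily the procedure used for the analysis.

```python
from dataclasses import dataclass


@dataclass
class OrthologuePair:
    """One human gene and its mouse orthologue (hypothetical record type)."""
    human_chrom: str
    human_mb: float   # position in the human draft sequence, in Mb
    mouse_chrom: str
    mouse_cm: float   # position on the mouse genetic map, in cM


def _extends(run, pair, max_cm):
    """True if `pair` continues the current run of syntenic loci."""
    last = run[-1]
    return (pair.human_chrom == last.human_chrom
            and pair.mouse_chrom == last.mouse_chrom
            and abs(pair.mouse_cm - last.mouse_cm) < max_cm)


def conserved_segments(pairs, max_cm=15.0):
    """Walk along each human chromosome and split the orthologues into runs.

    A run ends when the next locus maps to another mouse chromosome or lies
    max_cm or more from the previous mouse locus (assumed reading of the
    criterion). Runs of two or more loci are returned as definitive
    segments; runs of one locus are returned as singletons.
    """
    ordered = sorted(pairs, key=lambda p: (p.human_chrom, p.human_mb))
    runs, current = [], []
    for pair in ordered:
        if current and _extends(current, pair, max_cm):
            current.append(pair)
        else:
            if current:
                runs.append(current)
            current = [pair]
    if current:
        runs.append(current)
    definitive = [r for r in runs if len(r) >= 2]
    singletons = [r for r in runs if len(r) == 1]
    return definitive, singletons


# Toy data (invented): five loci on human chromosome 7.
toy_pairs = [
    OrthologuePair("7", 1.0, "5", 10.0),
    OrthologuePair("7", 2.5, "5", 12.0),
    OrthologuePair("7", 4.0, "6", 3.0),   # interrupts the mouse chr 5 run
    OrthologuePair("7", 6.0, "5", 14.0),
    OrthologuePair("7", 7.0, "5", 20.0),
]
definitive, singletons = conserved_segments(toy_pairs)
```

In this toy case the single mouse chromosome 6 locus splits what would otherwise be one run into two definitive segments and one singleton, which is the kind of interruption that a wrongly chosen orthologue could also produce.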
We identified 183 definitive conserved segments (Fig. 46). The
average segment length was 15.4 Mb, with the largest segment being
90.5 Mb and the smallest 24 kb. There were also 141 `singletons',
segments that contained only a single locus; these are not counted in
the statistics. Although some of these could be short conserved
segments, they could also reflect incorrect choices of orthologues or
problems with the human or mouse maps. Because of this conservative
approach, the observed number of definitive segments is
likely to be lower than the correct total. One piece of evidence for this
conclusion comes from a more detailed analysis on human chromosome
7 (ref. 358), which identified 20 conserved segments, of
which three were singletons. Our analysis revealed only 13 definitive
segments on this chromosome, with nine singletons.
The frequency of observing a particular gene count in a conserved
segment is plotted on a logarithmic scale in Fig. 47. If chromosomal
breaks occur in a random fashion (as has been proposed) and
differences in gene density are ignored, a roughly straight line
should result. There is a clear excess for n = 1, suggesting that 50%
or more of the singletons are indeed artefactual. Thus, we estimate
that the true number of conserved segments is around 190–230, in good
agreement with the original Nadeau–Taylor prediction (ref. 354).
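The extrapolation argument can be illustrated with a short Python sketch. It fits a straight line to log10(occurrences) for segments with two or more genes and extends that line to n = 1. The histogram values and the observed singleton count below are invented for illustration and are not the data behind Fig. 47; the function names are likewise hypothetical.

```python
import math


def fit_log_linear(counts_by_n, min_n=2):
    """Least-squares line through (n, log10(count)) for n >= min_n."""
    xs = [n for n, c in sorted(counts_by_n.items()) if n >= min_n and c > 0]
    ys = [math.log10(counts_by_n[n]) for n in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_count(n, slope, intercept):
    """Occurrences predicted by the fitted line for a gene count n."""
    return 10 ** (intercept + slope * n)


# Invented histogram: number of segments containing n genes (n >= 2).
toy_counts = {2: 40, 3: 20, 4: 10, 5: 5}
toy_observed_singletons = 200  # invented

slope, intercept = fit_log_linear(toy_counts)
expected_singletons = expected_count(1, slope, intercept)
excess = toy_observed_singletons - expected_singletons
fraction_artefactual = excess / toy_observed_singletons
```

With these toy numbers the line predicts about 80 genuine single-gene segments, so roughly 60% of the 200 observed singletons would be attributed to artefacts. The same reasoning, applied to the real histogram, underlies the statement that half or more of the singletons are artefactual.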
Figure 48 shows a plot of the frequency of lengths of conserved
segments, where the x-axis scale is shown in Mb. As before, there is a
fair amount of scatter in the data for the larger segments (where the
numbers are small), but the trend appears to be consistent with a
random breakage model.
We attempted to ascertain whether the breakpoint regions have
any special characteristics. This analysis was complicated by impre-
cision in the positioning of these breaks, which will tend to blur any
relationships. With 2,998 orthologues, the average interval within
which a break is known to have occurred is about 1.1 Mb. We
compared the aggregate features of these breakpoint intervals with
the genome as a whole. The mean gene density was lower in
breakpoint regions than in the conserved segments (13.8 versus
articles
908 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
[Figure 44 plot: x-axis, Human (number of genes), 0 to 900; y-axis, Fly (number
of genes), 0 to 400; points labelled 0 to 37.]
Figure 44 Relative expansions of protein families between human and fly. These data
have not been normalized for proteomic size differences. Blue line, equality between
normalized family sizes in the two organisms. Green line, equality between unnormalized
family sizes. Numbered InterPro entries: (1) immunoglobulin domain [IPR003006]; (2) zinc
finger, C2H2 type [IPR000822]; (3) eukaryotic protein kinase [IPR000719]; (4) rhodopsin-like
GPCR superfamily [IPR000276]; (5) ATP/GTP-binding site motif A (P-loop)
[IPR001687]; (6) reverse transcriptase (RNA-dependent DNA polymerase) [IPR000477];
(7) RNA-binding region RNP-1 (RNA recognition motif) [IPR000504]; (8) G-protein β WD-40
repeats [IPR001680]; (9) ankyrin repeat [IPR002110]; (10) homeobox domain
[IPR001356]; (11) PH domain [IPR001849]; (12) EF-hand family [IPR002048]; (13) EGF-like
domain [IPR000561]; (14) Src homology 3 (SH3) domain [IPR001452]; (15) RING
finger [IPR001841]; (16) KRAB box [IPR001909]; (17) leucine-rich repeat [IPR001611];
(18) fibronectin type III domain [IPR001777]; (19) PDZ domain (also known as DHR or
GLGF) [IPR001478]; (20) TPR repeat [IPR001440]; (21) helicase C-terminal domain
[IPR001650]; (22) ion transport protein [IPR002216]; (23) helix-loop-helix DNA-binding
domain [IPR001092]; (24) cadherin domain [IPR002126]; (25) intermediate filament
proteins [IPR001664]; (26) C2 domain [IPR000008]; (27) Src homology 2 (SH2) domain
[IPR000980]; (28) serine proteases, trypsin family [IPR001254]; (29) BTB/POZ domain
[IPR000210]; (30) tyrosine-specific protein phosphatase and dual specificity protein
phosphatase family [IPR000387]; (31) collagen triple helix repeat [IPR000087]; (32)
esterase/lipase/thioesterase [IPR000379]; (33) neutral zinc metallopeptidases, zinc-binding
region [IPR000130]; (34) ATP-binding transport protein, 2nd P-loop motif
[IPR001051]; (35) ABC transporters family [IPR001617]; (36) cytochrome P450 enzyme
[IPR001128]; (37) insect cuticle protein [IPR000618].
© 2001 Macmillan Magazines Ltd
Page 51
The most extreme mechanism is whole-genome duplication
(WGD), through a polyploidization event in which a diploid
organism becomes tetraploid. Such events are classified as autopolyploidy
or allopolyploidy, depending on whether they involve
hybridization between members of the same species or different
species. Polyploidization is common in the plant kingdom, with
many known examples among wild and domesticated crop species.
Alfalfa (Medicago sativa) is a naturally occurring autotetraploid (ref. 364),
and Nicotiana tabacum, some species of cotton (Gossypium) and
several of the common brassicas are allotetraploids containing pairs
of `homeologous' chromosome pairs.
In principle, WGD provides the raw material for great bursts of
innovation by allowing the duplication and divergence of entire
pathways. Ohno (ref. 365) suggested that WGD has played a key role in
evolution. There is evidence for an ancient WGD event in the
ancestry of yeast and several independent such events in the ancestry
of mustard weed (refs 366–369). Such ancient WGD events can be hard to
detect because only a minority of the duplicated loci may be
retained, with the result that the genes in duplicated segments
cannot be aligned in a one-to-one correspondence but rather
require many gaps. In addition, duplicated segments may be
subsequently rearranged. For example, the ancient duplication in
the yeast genome appears to have been followed by loss of more than
90% of the newly duplicated genes (ref. 366).
One of the most controversial hypotheses about vertebrate
evolution is the proposal that two WGD events occurred early in
the vertebrate lineage, around the time of jawed fishes some 500 Myr
ago. Some authors (refs 370–373) have seen support for this theory in the fact
that many human genes occur in sets of four homologues, most
notably the four extensive HOX gene clusters on chromosomes 2, 7,
12 and 17, whose duplication dates to around the correct time.
However, other authors have disputed this interpretation (ref. 374),
suggesting that these cases may reflect unrelated duplications of
specific regions rather than successive WGD.
We analysed the draft genome sequence for evidence that might
bear on this question. The analysis provides many interesting
observations, but no convincing evidence of ancient WGD. We
looked for evidence of pairs of chromosomal regions containing
many homologous genes. Although we found many pairs contain-
ing a few homologous genes, the human genome does not appear to
contain any pairs of regions where the density of duplicated genes
approaches the densities seen in yeast or mustard weed (refs 366–369).
We also examined human proteins in the IPI for which the
orthologues among fly or worm proteins occur in the ratios 2:1:1,
3:1:1, 4:1:1 and so on (Fig. 49). The number of such families falls
smoothly, with no peak at four and some instances of five or more
homologues. Although this does not rule out two rounds of WGD
followed by extensive gene loss and some unrelated gene duplication,
it provides no support for the theory. More probatively, if two
successive rounds of genome duplication occurred, phylogenetic
analysis of the proteins having 4:1:1 ratios between human, fly and
worm would be expected to show more trees with the topology
(A,B)(C,D) for the human sequences than (A,(B,(C,D))) (ref. 375).
However, of 57 sets studied carefully, only 24% of the trees constructed
from the 4:1:1 set have the former topology; this is not significantly
different from what would be expected under the hypothesis of
random sequential duplication of individual loci.
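The comparison with random sequential duplication can be made concrete under a simple assumed null model in which each new duplication copies one existing gene chosen uniformly at random. This choice of model is an assumption made here for illustration; ref. 375 should be consulted for the test actually applied. The Python sketch below enumerates all duplication histories leading to four genes, computes the probability of the balanced topology (A,B)(C,D), and evaluates a one-sided binomial tail for 14 balanced trees out of 57, where 14 is an assumed rounding of the reported 24%.

```python
import math
from fractions import Fraction

LEAF = "gene"


def leaf_paths(tree, path=()):
    """Yield the path (sequence of 0/1 branch choices) to every leaf."""
    if tree == LEAF:
        yield path
    else:
        yield from leaf_paths(tree[0], path + (0,))
        yield from leaf_paths(tree[1], path + (1,))


def duplicate(tree, path):
    """Return a copy of `tree` in which the leaf at `path` is duplicated."""
    if not path:
        return (LEAF, LEAF)
    left, right = tree
    if path[0] == 0:
        return (duplicate(left, path[1:]), right)
    return (left, duplicate(right, path[1:]))


def is_balanced(tree):
    """For a four-leaf tree: True for (A,B)(C,D), False for (A,(B,(C,D)))."""
    return all(isinstance(child, tuple) for child in tree)


def balanced_probability(n_genes=4):
    """Exact probability of the balanced shape under the assumed model."""
    dist = {LEAF: Fraction(1)}
    for _ in range(n_genes - 1):
        nxt = {}
        for tree, prob in dist.items():
            paths = list(leaf_paths(tree))
            for path in paths:
                new_tree = duplicate(tree, path)
                nxt[new_tree] = nxt.get(new_tree, Fraction(0)) + prob / len(paths)
        dist = nxt
    return sum(p for t, p in dist.items() if is_balanced(t))


def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_balanced = balanced_probability(4)
lower_tail = binomial_lower_tail(14, 57, float(p_balanced))
```

Under this model one-third of four-gene trees are expected to be balanced, so an observed share of about one-quarter in 57 trees is unremarkable, whereas two rounds of WGD would predict a clear majority of balanced trees.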
[Figure 46 image: human chromosomes 1 to 22, X and Y, with a colour key for mouse chromosomes 1 to 19, X and Y.]
Figure 46 Conserved segments in the human and mouse genome. Human
chromosomes, with segments containing at least two genes whose order is conserved in
the mouse genome as colour blocks. Each colour corresponds to a particular mouse
chromosome. Centromeres, subcentromeric heterochromatin of chromosomes 1, 9 and
16, and the repetitive short arms of 13, 14, 15, 21 and 22 are in black.
[Figure 48 plot: x-axis, Conserved segment length (Mb), 5 to 90; y-axis, Occurrences (logarithmic scale).]
Figure 48 Distribution of lengths (in 5-Mb bins) of conserved segments between human
and mouse genomes, omitting singletons.
[Figure 47 plot: x-axis, Genes per conserved segment, 0 to 100; y-axis, Occurrences (logarithmic scale).]
Figure 47 Distribution of number of genes per conserved segment between human and
mouse genomes.
Page 53
include the DiGeorge/velocardiofacial syndrome region on chromosome
22 (ref. 238) and the Williams–Beuren syndrome recurrent
deletion on chromosome 7 (ref. 239).
The availability of the genome sequence also allows rapid identification
of paralogues of disease genes, which is valuable for two
reasons. First, mutations in a paralogous gene may give rise to a
related genetic disease. A good example, discovered through use of
the genome sequence, is achromatopsia (complete colour blindness).
The CNGA3 gene, encoding the α-subunit of the cone
photoreceptor cyclic GMP-gated channel, had been shown to
harbour mutations in some families with achromatopsia. Computational
searching of the genome sequences revealed the paralogous
gene encoding the corresponding β-subunit, CNGB3 (which had
not been apparent from EST databases). The CNGB3 gene was
rapidly shown to be the cause of achromatopsia in other
families (refs 407, 408). Another example is provided by the presenilin-1
and presenilin-2 genes, in which mutations can cause early-onset
Alzheimer's disease (refs 423, 424). Second, the paralogue may provide an
opportunity for therapeutic intervention, as exemplified by
attempts to reactivate the fetally expressed haemoglobin genes in
individuals with sickle cell disease or β-thalassaemia, caused by
mutations in the β-globin gene (ref. 425).
We undertook a systematic search for paralogues of 971 known
human disease genes with entries in both the Online Mendelian
Inheritance in Man (OMIM) database (http://www.ncbi.nlm.nih.gov/Omim/)
and either the SwissProt or TrEMBL protein databases.
We identified 286 potential paralogues (with the requirement of a
match of at least 50 amino acids with identity greater than 70% but
less than 90% if on the same chromosome, and less than 95% if on a
different chromosome). Although this analysis may have identified
some pseudogenes, 89% of the matches showed homology over
more than one exon in the new target sequence, suggesting that
many are functional. This analysis shows the potential for rapid
identification of disease gene paralogues in silico.
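The match criteria stated above translate directly into a small filter. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and it assumes that match length and percentage identity have already been computed by an alignment tool.

```python
def is_candidate_paralogue(aligned_aa, percent_identity, same_chromosome):
    """Apply the stated thresholds to one protein-to-genome match.

    aligned_aa        length of the match in amino acids
    percent_identity  identity over the match, in percent
    same_chromosome   True if the match lies on the same chromosome
                      as the known disease gene
    """
    if aligned_aa < 50:
        return False
    # Upper bound on identity: 90% on the same chromosome, 95% elsewhere.
    upper = 90.0 if same_chromosome else 95.0
    return 70.0 < percent_identity < upper


# Invented example matches: (length, identity, same chromosome?)
examples = [
    (120, 82.0, True),    # kept
    (120, 92.0, True),    # rejected: above the 90% same-chromosome bound
    (120, 92.0, False),   # kept: below the 95% bound for other chromosomes
    (40, 85.0, False),    # rejected: shorter than 50 amino acids
    (200, 70.0, False),   # rejected: identity must exceed 70%
]
kept = [is_candidate_paralogue(*m) for m in examples]
```

The upper bounds presumably serve to exclude matches to the query gene itself or to near-identical recent copies, although the text does not state the rationale.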
Drug targets
Over the past century, the pharmaceutical industry has largely
depended upon a limited set of drug targets to develop new
therapies. A recent compendium (refs 426, 427) lists 483 drug targets as
accounting for virtually all drugs on the market. Knowing the
complete set of human genes and proteins will greatly expand the
search for suitable drug targets. Although only a minority of human
genes may be drug targets, it has been predicted that the number will
exceed several thousand, and this prospect has led to a massive
expansion of genomic research in pharmaceutical research and
development. A few examples will illustrate the point.
(1) The neurotransmitter serotonin (5-HT) mediates rapid excitatory
responses through ligand-gated channels. The previously
identified 5-HT3A receptor gene produces functional receptors,
but with a much smaller conductance than observed in vivo.
Cross-hybridization experiments and analysis of ESTs failed to
reveal any other homologues of the known receptor. Recently,
however, by searching the human draft genome sequence at low
stringency, a putative homologue was identified within a PAC clone
from the long arm of chromosome 11 (ref. 428). The homologue
was shown to be expressed in the amygdala, caudate and hippocampus,
and a full-length cDNA was subsequently obtained. The
gene, which codes for a serotonin receptor, was named 5-HT3B.
When assembled in a heterodimer with 5-HT3A, it was shown to
account for the large-conductance neuronal serotonin channel.
Given the central role of the serotonin pathway in mood disorders
and schizophrenia, the discovery of a major new therapeutic target
is of considerable interest.
(2) The contractile and inflammatory actions of the cysteinyl
leukotrienes, formerly known as the slow reacting substance of
anaphylaxis (SRS-A), are mediated through specific receptors. The
second such receptor, CysLT2, was identified using the combination
of a rat EST and the human genome sequence. This led to the
cloning of a gene with 38% amino-acid identity to the only other
receptor that had previously been identified (ref. 429). This new receptor,
which shows high-affinity binding to several leukotrienes, maps to a
region of chromosome 13 that is linked to atopic asthma. The gene
is expressed in airway smooth muscles and in the heart. As the
leukotriene pathway has been a significant target for the development
of drugs against asthma, the discovery of a new receptor has
obvious and important consequences.
(3) Abundant deposition of β-amyloid in senile plaques is the
hallmark of Alzheimer's disease. β-Amyloid is generated by proteolytic
processing of the amyloid precursor protein (APP). One of
the enzymes involved is the β-site APP-cleaving enzyme (BACE),
which is a transmembrane aspartyl protease. Computational
searching of the public human draft genome sequence recently
identified a new sequence homologous to BACE, encoding a protein
now named BACE2 (refs 430, 431). BACE2, which has 52% amino-acid
sequence identity to BACE, contains two active protease sites and
maps to the obligatory Down's syndrome region of chromosome 21,
as does APP. This raises the question of whether the extra copies of
both BACE2 and APP may contribute to accelerated deposition of
β-amyloid in the brains of Down's syndrome patients. The development
of antagonists to BACE and BACE2 represents a promising
approach to preventing Alzheimer's disease.
Given these examples, we undertook a systematic effort to
identify paralogues of the classic drug target proteins in the draft
genome sequence. The target list (ref. 427) was used to identify 603 entries
in the SwissProt database with unique accession numbers. These
were then searched against the current genome sequence database,
using the requirement that a match should have 70–100% identity
to at least 50 amino acids. Matches to named proteins were ignored,
as we assumed that these represented known homologues.
We found 18 putative novel paralogues (Table 27), including
apparent dopamine receptors, purinergic receptors and insulin-like
growth factor receptors. In six cases, the novel paralogue matches at
least one EST, adding confidence that this search process can
identify novel functional genes. For the remaining 12 putative
paralogues without an EST match, all have long ORFs and all but
Table 26 Disease genes positionally cloned using the draft genome sequence

Locus           Disorder                                                   Reference(s)
BRCA2           Breast cancer susceptibility                               55
AIRE            Autoimmune polyglandular syndrome type 1 (APS1 or APECED)  389
PEX1            Peroxisome biogenesis disorder                             390, 391
PDS             Pendred syndrome                                           392
XLP             X-linked lymphoproliferative disease                       393
DFNA5           Nonsyndromic deafness                                      394
ATP2A2          Darier's disease                                           395
SEDL            X-linked spondyloepiphyseal dysplasia tarda                396
WISP3           Progressive pseudorheumatoid dysplasia                     397
CCM1            Cerebral cavernous malformations                           398, 399
COL11A2/DFNA13  Nonsyndromic deafness                                      400
LGMD 2G         Limb-girdle muscular dystrophy                             401
EVC             Ellis-Van Creveld syndrome, Weyer's acrodental dysostosis  402
ACTN4           Familial focal segmental glomerulosclerosis                403
SCN1A           Generalized epilepsy with febrile seizures plus type 2     404
AASS            Familial hyperlysinaemia                                   405
NDRG1           Hereditary motor and sensory neuropathy-Lom                406
CNGB3           Total colour-blindness                                     407, 408
MUL             Mulibrey nanism                                            409
USH1C           Usher type 1C                                              410, 411
MYH9            May-Hegglin anomaly                                        412, 413
PRKAR1A         Carney's complex                                           414
MYH9            Nonsyndromic hereditary deafness DFNA17                    415
SCA10           Spinocerebellar ataxia type 10                             416
OPA1            Optic atrophy                                              417
XLCSNB          X-linked congenital stationary night blindness             418
FGF23           Hypophosphataemic rickets                                  419
GAN             Giant axonal neuropathy                                    420
AAAS            Triple-A syndrome                                          421
HSPG2           Schwartz-Jampel syndrome                                   422
Page 55
because of the availability of tissues from all developmental time
points. A challenge will be to define the gene-specific patterns of
alternative splicing, which may affect half of human genes. Existing
collections of ESTs and cDNAs may allow identification of the most
abundant of these isoforms, but systematic exploration of this
problem may require exhaustive analysis of cDNA libraries from
multiple tissues or perhaps high-throughput reverse transcription–PCR
studies. Deep understanding of gene function will probably
require knowledge of the structure, tissue distribution and abundance
of these alternative forms.
Large-scale identification of regulatory regions
The one-dimensional script of the human genome, shared by
essentially all cells in all tissues, contains sufficient information to
provide for differentiation of hundreds of different cell types, and
the ability to respond to a vast array of internal and external
influences. Much of this plasticity results from the carefully orchestrated
symphony of transcriptional regulation. Although much has
been learned about the cis-acting regulatory motifs of some specific
genes, the regulatory signals for most genes remain uncharacterized.
Comparative genomics of multiple vertebrates offers the best hope
for large-scale identification of such regulatory sites [440]. Previous
studies of sequence alignment of regulatory domains of orthologous
genes in multiple species have shown a remarkable
correlation between sequence conservation, dubbed 'phylogenetic
footprints' [441], and the presence of binding motifs for transcription
factors. This approach could be particularly powerful if combined
with expression array technologies that identify cohorts of genes
that are coordinately regulated, implicating a common set of cis-acting
regulatory sequences [442–445]. It will also be of considerable
interest to study epigenetic modifications such as cytosine methylation
on a genome-wide scale, and to determine their biological
consequences [446,447]. Towards this end, a pilot Human Epigenome
Project has been launched [448,449].
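As a rough illustration of the phylogenetic-footprinting idea described above, the following Python sketch scans a gap-free alignment of orthologous regulatory sequences and reports stretches in which every column is identical across species. The toy sequences, the window length and the function name `conserved_windows` are assumptions made for illustration only; real analyses work on gapped alignments and use statistical models of conservation rather than strict identity.

```python
def conserved_windows(aligned, window=6):
    """Return (start, end) spans where all sequences agree at every column.

    `aligned` is a list of equal-length, pre-aligned sequences (no gaps).
    Runs of identical columns shorter than `window` are ignored.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # True where every species carries the same base at that column.
    identical = [len({seq[i] for seq in aligned}) == 1 for i in range(length)]
    spans, start = [], None
    for i, same in enumerate(identical + [False]):  # sentinel closes the last run
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= window:
                spans.append((start, i))
            start = None
    return spans


# Hypothetical upstream regions from three species (toy data, not real loci).
toy = [
    "ACGTTGACGTCATTGCA",
    "TCGATGACGTCAGTGAA",
    "GCGCTGACGTCACTGTA",
]
for start, end in conserved_windows(toy, window=6):
    print(start, end, toy[0][start:end])  # prints: 4 12 TGACGTCA
```

A conserved span found this way is only a candidate footprint; whether it corresponds to a transcription-factor binding motif would still need to be checked against experimental or motif data.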
Sequencing of additional large genomes
More generally, comparative genomics allows biologists to peruse
evolution's laboratory notebook: to identify conserved functional
features and recognize new innovations in specific lineages. Determination
of the genome sequence of many organisms is very
desirable. Already, projects are underway to sequence the genomes
of the mouse, rat, zebrafish and the pufferfishes T. nigroviridis and
Takifugu rubripes. Plans are also under consideration for sequencing
additional primates and other organisms that will help define key
developments along the vertebrate and nonvertebrate lineages.
To realize the full promise of comparative genomics, however, it
needs to become simple and inexpensive to sequence the genome of
any organism. Sequencing costs have dropped 100-fold over the last
10 years, corresponding to a roughly twofold decrease every 18
months. This rate is similar to 'Moore's law' concerning improvements
in semiconductor manufacture. In both sequencing and
semiconductors, such improvement does not happen automatically,
but requires aggressive technological innovation fuelled by major
investment. Improvements are needed to move current dideoxy
sequencing to smaller volumes and more rapid sequencing
times, based upon advances such as microchannel technology.
More revolutionary methods, such as mass spectrometry, single-molecule
sequencing and nanopore approaches [76], have not yet
been fully developed, but hold great promise and deserve strong
encouragement.
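The arithmetic behind the quoted rate can be checked with a short Python sketch. It assumes a steady exponential decline in cost, which is a simplification introduced here; the function name `halving_time_months` is likewise chosen only for illustration.

```python
import math


def halving_time_months(fold_drop, years):
    """Months per twofold cost decrease, assuming a steady exponential decline."""
    halvings = math.log2(fold_drop)  # number of twofold decreases in the period
    return years * 12 / halvings


months = halving_time_months(fold_drop=100, years=10)
print(f"{months:.1f} months per twofold decrease")  # prints: 18.1 months per twofold decrease
```

A 100-fold drop is about 6.6 successive halvings, so spreading them over 120 months gives one halving roughly every 18 months, in line with the figure in the text.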
Completing the catalogue of human variation
The human draft genome sequence has already allowed the identification
of more than 1.4 million SNPs, comprising a substantial
proportion of all common human variation. This program should
be extended to obtain a nearly complete catalogue of common
variants and to identify the common ancestral haplotypes present in
the population. In principle, these genetic tools should make it
possible to perform association studies and linkage disequilibrium
studies [376] to identify the genes that confer even relatively modest risk
for common diseases. Launching such an intense era of human
molecular epidemiology will also require major advances in the cost
efficiency of genotyping technology, in the collection of carefully
phenotyped patient cohorts and in statistical methods for relating
large-scale SNP data to disease phenotype.
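To make the notion of an association study concrete, the sketch below applies a textbook Pearson chi-square test to allele counts at a single SNP in cases and controls. This is one simple statistic among many and is not presented as the method the paper has in mind; the counts are invented for illustration, and the function name `allele_chi_square` is an assumption.

```python
import math


def allele_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of allele counts."""
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    total = case_a + case_b + ctrl_a + ctrl_b
    row_sums = [sum(row) for row in table]
    col_sums = [case_a + ctrl_a, case_b + ctrl_b]
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_sums[r] * col_sums[c] / total
            stat += (table[r][c] - expected) ** 2 / expected
    return stat


# Invented counts: allele A seen 60 times in cases and 40 times in controls.
stat = allele_chi_square(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60)
# For one degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

Testing very many SNPs at once would require correcting such p-values for multiple comparisons, which is part of the statistical challenge the text refers to.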
From sequence to function
The scientific program outlined above focuses on how the genome
sequence can be mined for biological information. In addition, the
sequence will serve as a foundation for a broad range of functional
genomic tools to help biologists to probe function in a more
systematic manner. These will need to include improved techniques
and databases for the global analysis of: RNA and protein expression,
protein localization, protein–protein interactions and chemical
inhibition of pathways. New computational techniques will be
needed to use such information to model cellular circuitry. A full
discussion of these important directions is beyond the scope of this
paper.
Concluding thoughts
The Human Genome Project is but the latest increment in a
remarkable scientific program whose origins stretch back a hundred
years to the rediscovery of Mendel's laws and whose end is nowhere
in sight. In a sense, it provides a capstone for efforts in the past
century to discover genetic information and a foundation for efforts
in the coming century to understand it.
We find it humbling to gaze upon the human sequence now
coming into focus. In principle, the string of genetic bits holds long-sought
secrets of human development, physiology and medicine. In
practice, our ability to transform such information into understanding
remains woefully inadequate. This paper simply records
some initial observations and attempts to frame issues for future
study. Fulfilling the true promise of the Human Genome Project will
be the work of tens of thousands of scientists around the world, in
both academia and industry. It is for this reason that our highest
priority has been to ensure that genome data are available rapidly,
freely and without restriction.
The scientific work will have profound long-term consequences
for medicine, leading to the elucidation of the underlying molecular
mechanisms of disease and thereby facilitating the design in many
cases of rational diagnostics and therapeutics targeted at those
mechanisms. But the science is only part of the challenge. We
must also involve society at large in the work ahead. We must set
realistic expectations that the most important benefits will not be
reaped overnight. Moreover, understanding and wisdom will be
required to ensure that these benefits are implemented broadly and
equitably. To that end, serious attention must be paid to the many
ethical, legal and social implications (ELSI) raised by the accelerated
pace of genetic discovery. This paper has focused on the scientific
achievements of the human genome sequencing efforts. This is not
the place to engage in a lengthy discussion of the ELSI issues, which
have also been a major research focus of the Human Genome
Project, but these issues are of comparable importance and could
appropriately fill a paper of equal length.
Finally, it has not escaped our notice that the more we learn
about the human genome, the more there is to explore.
"We shall not cease from exploration. And the end of all our
exploring will be to arrive where we started, and know the place for
the first time." – T. S. Eliot [450]
Received 7 December 2000; accepted 9 January 2001.
1. Correns, C. Untersuchungen über die Xenien bei Zea mays. Berichte der Deutsche Botanische Gesellschaft 17, 410–418 (1899).
2. De Vries, H. Sur la loi de disjonction des hybrides. Comptes Rendus Hebdomadaires, Acad. Sci. Paris 130, 845–847 (1900).
3. von Tschermack, E. Über künstliche Kreuzung bei Pisum sativum. Berichte der Deutsche Botanische Gesellschaft 18, 232–239 (1900).
4. Sanger, F. et al. Nucleotide sequence of bacteriophage ΦX174 DNA. Nature 265, 687–695 (1977).
5. Sanger, F. et al. The nucleotide sequence of bacteriophage ΦX174. J Mol Biol 125, 225–246 (1978).
6. Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F. & Petersen, G. B. Nucleotide-sequence of bacteriophage Lambda DNA. J. Mol. Biol. 162, 729–773 (1982).
7. Fiers, W. et al. Complete nucleotide sequence of SV40 DNA. Nature 273, 113–120 (1978).
8. Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457–465 (1981).
9. Botstein, D., White, R. L., Skolnick, M. & Davis, R. W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32, 314–331 (1980).
10. Olson, M. V. et al. Random-clone strategy for genomic restriction mapping in yeast. Proc. Natl Acad. Sci. USA 83, 7826–7830 (1986).
11. Coulson, A., Sulston, J., Brenner, S. & Karn, J. Toward a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl Acad. Sci. USA 83, 7821–7825 (1986).
12. Putney, S. D., Herlihy, W. C. & Schimmel, P. A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing. Nature 302, 718–721 (1983).
13. Milner, R. J. & Sutcliffe, J. G. Gene expression in rat brain. Nucleic Acids Res. 11, 5497–5520 (1983).
14. Adams, M. D. et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651–1656 (1991).
15. Adams, M. D. et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 377, 3–174 (1995).
16. Okubo, K. et al. Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genet. 2, 173–179 (1992).
17. Hillier, L. D. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807–828 (1996).
18. Strausberg, R. L., Feingold, E. A., Klausner, R. D. & Collins, F. S. The mammalian gene collection. Science 286, 455–457 (1999).
19. Berry, R. et al. Gene-based sequence-tagged-sites (STSs) as the basis for a human gene map. Nature Genet. 10, 415–423 (1995).
20. Houlgatte, R. et al. The Genexpress Index: a resource for gene discovery and the genic map of the human genome. Genome Res. 5, 272–304 (1995).
21. Sinsheimer, R. L. The Santa Cruz Workshop – May 1985. Genomics 5, 954–956 (1989).
22. Palca, J. Human genome – Department of Energy on the map. Nature 321, 371 (1986).
23. National Research Council Mapping and Sequencing the Human Genome (National Academy Press, Washington DC, 1988).
24. Bishop, J. E. & Waldholz, M. Genome (Simon and Schuster, New York, 1990).
25. Kevles, D. J. & Hood, L. (eds) The Code of Codes: Scientific and Social Issues in the Human Genome Project (Harvard Univ. Press, Cambridge, Massachusetts, 1992).
26. Cook-Deegan, R. The Gene Wars: Science, Politics, and the Human Genome (W. W. Norton & Co., New York, London, 1994).
27. Donis-Keller, H. et al. A genetic linkage map of the human genome. Cell 51, 319–337 (1987).
28. Gyapay, G. et al. The 1993–94 Genethon human genetic linkage map. Nature Genet. 7, 246–339 (1994).
29. Hudson, T. J. et al. An STS-based map of the human genome. Science 270, 1945–1954 (1995).
30. Dietrich, W. F. et al. A comprehensive genetic map of the mouse genome. Nature 380, 149–152 (1996).
31. Nusbaum, C. et al. A YAC-based physical map of the mouse genome. Nature Genet. 22, 388–393 (1999).
32. Oliver, S. G. et al. The complete DNA sequence of yeast chromosome III. Nature 357, 38–46 (1992).
33. Wilson, R. et al. 2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32–38 (1994).
34. Chen, E. Y. et al. The human growth hormone locus: nucleotide sequence, biology, and evolution. Genomics 4, 479–497 (1989).
35. McCombie, W. R. et al. Expressed genes, Alu repeats and polymorphisms in cosmids sequenced from chromosome 4p16.3. Nature Genet. 1, 348–353 (1992).
36. Martin-Gallardo, A. et al. Automated DNA sequencing and analysis of 106 kilobases from human chromosome 19q13.3. Nature Genet. 1, 34–39 (1992).
37. Edwards, A. et al. Automated DNA sequencing of the human HPRT locus. Genomics 6, 593–608 (1990).
38. Marshall, E. A strategy for sequencing the genome 5 years early. Science 267, 783–784 (1995).
39. Project to sequence human genome moves on to the starting blocks. Nature 375, 93–94 (1995).
40. Shizuya, H. et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl Acad. Sci. USA 89, 8794–8797 (1992).
41. Burke, D. T., Carle, G. F. & Olson, M. V. Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806–812 (1987).
42. Marshall, E. A second private genome project. Science 281, 1121 (1998).
43. Marshall, E. NIH to produce a 'working draft' of the genome by 2001. Science 281, 1774–1775 (1998).
44. Pennisi, E. Academic sequencers challenge Celera in a sprint to the finish. Science 283, 1822–1823 (1999).
45. Bouck, J., Miller, W., Gorrell, J. H., Muzny, D. & Gibbs, R. A. Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res. 8, 1074–1084 (1998).
46. Collins, F. S. et al. New goals for the U. S. Human Genome Project: 1998–2003. Science 282, 682–689 (1998).
47. Sanger, F. & Coulson, A. R. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94, 441–448 (1975).
48. Maxam, A. M. & Gilbert, W. A new method for sequencing DNA. Proc. Natl Acad. Sci. USA 74, 560–564 (1977).
49. Anderson, S. Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Res. 9, 3015–3027 (1981).
50. Gardner, R. C. et al. The complete nucleotide sequence of an infectious clone of cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic Acids Res. 9, 2871–2888 (1981).
51. Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA sequence analysis. Anal. Biochem. 129, 216–223 (1983).
52. Chissoe, S. L. et al. Sequence and analysis of the human ABL gene, the BCR gene, and regions involved in the Philadelphia chromosomal translocation. Genomics 27, 67–82 (1995).
53. Rowen, L., Koop, B. F. & Hood, L. The complete 685-kilobase DNA sequence of the human beta T cell receptor locus. Science 272, 1755–1762 (1996).
54. Koop, B. F. et al. Organization, structure, and function of 95 kb of DNA spanning the murine T-cell receptor C alpha/C delta region. Genomics 13, 1209–1230 (1992).
55. Wooster, R. et al. Identification of the breast cancer susceptibility gene BRCA2. Nature 378, 789–792 (1995).
56. Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995).
57. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
58. Weber, J. L. & Myers, E. W. Human whole-genome shotgun sequencing. Genome Res. 7, 401–409 (1997).
59. Green, P. Against a whole-genome shotgun. Genome Res. 7, 410–417 (1997).
60. Venter, J. C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
61. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
62. Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–679 (1986).
63. Ju, J. Y., Ruan, C. C., Fuller, C. W., Glazer, A. N. & Mathies, R. A. Fluorescence energy-transfer dye-labeled primers for DNA sequencing and analysis. Proc. Natl Acad. Sci. USA 92, 4347–4351 (1995).
64. Lee, L. G. et al. New energy transfer dyes for DNA sequencing. Nucleic Acids Res. 25, 2816–2822 (1997).
65. Rosenblum, B. B. et al. New dye-labeled terminators for improved DNA sequencing patterns. Nucleic Acids Res. 25, 4500–4504 (1997).
66. Metzker, M. L., Lu, J. & Gibbs, R. A. Electrophoretically uniform fluorescent dyes for automated DNA sequencing. Science 271, 1420–1422 (1996).
67. Prober, J. M. et al. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238, 336–341 (1987).
68. Reeve, M. A. & Fuller, C. W. A novel thermostable polymerase for DNA sequencing. Nature 376, 796–797 (1995).
69. Tabor, S. & Richardson, C. C. Selective inactivation of the exonuclease activity of bacteriophage T7 DNA polymerase by in vitro mutagenesis. J. Biol. Chem. 264, 6447–6458 (1989).
70. Tabor, S. & Richardson, C. C. DNA sequence analysis with a modified bacteriophage T7 DNA polymerase – effect of pyrophosphorolysis and metal ions. J. Biol. Chem. 265, 8322–8328 (1990).
71. Murray, V. Improved double-stranded DNA sequencing using the linear polymerase chain reaction. Nucleic Acids Res. 17, 8889 (1989).
72. Guttman, A., Cohen, A. S., Heiger, D. N. & Karger, B. L. Analytical and micropreparative ultrahigh resolution of oligonucleotides by polyacrylamide-gel high-performance capillary electrophoresis. Anal. Chem. 62, 137–141 (1990).
73. Luckey, J. A. et al. High-speed DNA sequencing by capillary electrophoresis. Nucleic Acids Res. 18, 4417–4421 (1990).
74. Swerdlow, H., Wu, S., Harke, H. & Dovichi, N. J. Capillary gel-electrophoresis for DNA sequencing – laser-induced fluorescence detection with the sheath flow cuvette. J. Chromatogr. 516, 61–67 (1990).
75. Meldrum, D. Automation for genomics, part one: preparation for sequencing. Genome Res. 10, 1081–1092 (2000).
76. Meldrum, D. Automation for genomics, part two: sequencers, microarrays, and future trends. Genome Res. 10, 1288–1303 (2000).
77. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
78. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
79. Bentley, D. R. Genomic sequence information should be released immediately and freely in the public domain. Science 274, 533–534 (1996).
80. Guyer, M. Statement on the rapid release of genomic DNA sequence. Genome Res. 8, 413 (1998).
81. Dietrich, W. et al. A genetic map of the mouse suitable for typing intraspecific crosses. Genetics 131, 423–447 (1992).
82. Kim, U. J. et al. Construction and characterization of a human bacterial artificial chromosome library. Genomics 34, 213–218 (1996).
83. Osoegawa, K. et al. Bacterial artificial chromosome libraries for mouse sequencing and functional analysis. Genome Res. 10, 116–128 (2000).
84. Marra, M. A. et al. High throughput fingerprint analysis of large-insert clones. Genome Res. 7, 1072–1084 (1997).
85. Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22, 265–270 (1999).
86. The International Human Genome Mapping Consortium. A physical map of the human genome. Nature 409, 934–941 (2001).
87. Zhao, S. et al. Human BAC ends quality assessment and sequence analyses. Genomics 63, 321–332 (2000).
88. Mahairas, G. G. et al. Sequence-tagged connectors: A sequence approach to mapping and scanning the human genome. Proc. Natl Acad. Sci. USA 96, 9739–9744 (1999).
89. Tilford, C. A. et al. A physical map of the human Y chromosome. Nature 409, 943–945 (2001).
90. Bentley, D. R. et al. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X. Nature 409, 942–943 (2001).
91. Montgomery, K. T. et al. A high-resolution map of human chromosome 12. Nature 409, 945–946 (2001).
92. Brüls, T. et al. A physical map of human chromosome 14. Nature 409, 947–948 (2001).
93. Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–319 (2000).
94. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
95. Cox, D. et al. Radiation hybrid map of the human genome. Science (in the press).
96. Osoegawa, K. et al. An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52, 1–8 (1998).
97. The International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
98. Collins, F. S., Brooks, L. D. & Chakravarti, A. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 8, 1229–1231 (1998).
99. Stewart, E. A. et al. An STS-based radiation hybrid map of the human genome. Genome Res. 7, 422–433 (1997).
100. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744–746 (1998).
101. Dib, C. et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites.
articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 915© 2001 Macmillan Magazines Ltd
bacteriophage Lambda DNA. J. Mol. Biol. 162, 729±773 (1982).
7. Fiers, W. et al. Complete nucleotide sequence of SV40 DNA. Nature 273, 113±120 (1978).
8. Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290,
457±465 (1981).
9. Botstein, D., White, R. L., Skolnick, M. & Davis, R. W. Construction of a genetic linkage map in man
using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32, 314±331 (1980).
10. Olson, M. V. et al. Random-clone strategy for genomic restriction mapping in yeast. Proc. Natl Acad.
Sci. USA 83, 7826±7830 (1986).
11. Coulson, A., Sulston, J., Brenner, S. & Karn, J. Toward a physical map of the genome of the nematode
Caenorhabditis elegans. Proc. Natl Acad. Sci. USA 83, 7821±7825 (1986).
12. Putney, S. D., Herlihy, W. C. & Schimmel, P. A new troponin T and cDNA clones for 13 different
muscle proteins, found by shotgun sequencing. Nature 302, 718±721 (1983).
13. Milner, R. J. & Sutcliffe, J. G. Gene expression in rat brain. Nucleic Acids Res. 11, 5497±5520 (1983).
14. Adams, M. D. et al. Complementary DNA sequencing: expressed sequence tags and human genome
project. Science 252, 1651±1656 (1991).
15. Adams, M. D. et al. Initial assessment of human gene diversity and expression patterns based upon
83 million nucleotides of cDNA sequence. Nature 377, 3±174 (1995).
16. Okubo, K. et al. Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of
gene expression. Nature Genet. 2, 173±179 (1992).
17. Hillier, L. D. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res.
6, 807±828 (1996).
18. Strausberg, R. L., Feingold, E. A., Klausner, R. D. & Collins, F. S. The mammalian gene collection.
Science 286, 455±457 (1999).
19. Berry, R. et al. Gene-based sequence-tagged-sites (STSs) as the basis for a human gene map. Nature
Genet. 10, 415±423 (1995).
20. Houlgatte, R. et al. The Genexpress Index: a resource for gene discovery and the genic map of the
human genome. Genome Res. 5, 272±304 (1995).
21. Sinsheimer, R. L. The Santa Cruz WorkshopÐMay 1985. Genomics 5, 954±956 (1989).
22. Palca, J. Human genomeÐDepartment of Energy on the map. Nature 321, 371 (1986).
23. National Research Council Mapping and Sequencing the Human Genome (National Academy Press,
Washington DC, 1988).
24. Bishop, J. E. & Waldholz, M. Genome (Simon and Schuster, New York, 1990).
25. Kevles, D. J. & Hood, L. (eds) The Code of Codes: Scienti®c and Social Issues in the Human Genome
Project (Harvard Univ. Press, Cambridge, Massachusetts, 1992).
26. Cook-Deegan, R. The Gene Wars: Science, Politics, and the Human Genome (W. W. Norton & Co.,
New York, London, 1994).
27. Donis-Keller, H. et al. A genetic linkage map of the human genome. Cell 51, 319±337 (1987).
28. Gyapay, G. et al. The 1993±94 Genethon human genetic linkage map. Nature Genet. 7, 246±339
(1994).
29. Hudson, T. J. et al. An STS-based map of the human genome. Science 270, 1945±1954 (1995).
30. Dietrich, W. F. et al. A comprehensive genetic map of the mouse genome. Nature 380, 149±152
(1996).
31. Nusbaum, C. et al. A YAC-based physical map of the mouse genome. Nature Genet. 22, 388±393
(1999).
32. Oliver, S. G. et al. The complete DNA sequence of yeast chromosome III. Nature 357, 38±46 (1992).
33. Wilson, R. et al. 2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans.
Nature 368, 32±38 (1994).
34. Chen, E. Y. et al. The human growth hormone locus: nucleotide sequence, biology, and evolution.
Genomics 4, 479±497 (1989).
35. McCombie, W. R. et al. Expressed genes, Alu repeats and polymorphisms in cosmids sequenced
from chromosome 4p16.3. Nature Genet. 1, 348±353 (1992).
36. Martin-Gallardo, A. et al. Automated DNA sequencing and analysis of 106 kilobases from human
chromosome 19q13.3. Nature Genet. 1, 34±39 (1992).
37. Edwards, A. et al. Automated DNA sequencing of the human HPRT locus. Genomics 6, 593±608
(1990).
38. Marshall, E. A strategy for sequencing the genome 5 years early. Science 267, 783±784 (1995).
39. Project to sequence human genome moves on to the starting blocks. Nature 375, 93±94 (1995).
40. Shizuya, H. et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in
Escherichia coli using an F-factor-based vector. Proc. Natl Acad. Sci. USA 89, 8794±8797 (1992).
41. Burke, D. T., Carle, G. F. & Olson, M. V. Cloning of large segments of exogenous DNA into yeast by
means of arti®cial chromosome vectors. Science 236, 806±812 (1987).
42. Marshall, E. A second private genome project. Science 281, 1121 (1998).
43. Marshall, E. NIH to produce a 'working draft' of the genome by 2001. Science 281, 1774–1775
(1998).
44. Pennisi, E. Academic sequencers challenge Celera in a sprint to the finish. Science 283, 1822–1823
(1999).
45. Bouck, J., Miller, W., Gorrell, J. H., Muzny, D. & Gibbs, R. A. Analysis of the quality and utility of
random shotgun sequencing at low redundancies. Genome Res. 8, 1074–1084 (1998).
46. Collins, F. S. et al. New goals for the U. S. Human Genome Project: 1998–2003. Science 282, 682–689
(1998).
47. Sanger, F. & Coulson, A. R. A rapid method for determining sequences in DNA by primed synthesis
with DNA polymerase. J. Mol. Biol. 94, 441–448 (1975).
48. Maxam, A. M. & Gilbert, W. A new method for sequencing DNA. Proc. Natl Acad. Sci. USA 74, 560–
564 (1977).
49. Anderson, S. Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids
Res. 9, 3015–3027 (1981).
50. Gardner, R. C. et al. The complete nucleotide sequence of an infectious clone of cauliflower mosaic
virus by M13mp7 shotgun sequencing. Nucleic Acids Res. 9, 2871–2888 (1981).
51. Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA sequence
analysis. Anal. Biochem. 129, 216–223 (1983).
52. Chissoe, S. L. et al. Sequence and analysis of the human ABL gene, the BCR gene, and regions
involved in the Philadelphia chromosomal translocation. Genomics 27, 67–82 (1995).
53. Rowen, L., Koop, B. F. & Hood, L. The complete 685-kilobase DNA sequence of the human beta T
cell receptor locus. Science 272, 1755–1762 (1996).
54. Koop, B. F. et al. Organization, structure, and function of 95 kb of DNA spanning the murine T-cell
receptor C alpha/C delta region. Genomics 13, 1209–1230 (1992).
55. Wooster, R. et al. Identification of the breast cancer susceptibility gene BRCA2. Nature 378, 789–792
(1995).
56. Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of Haemophilus
influenzae Rd. Science 269, 496–512 (1995).
57. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a
mathematical analysis. Genomics 2, 231–239 (1988).
58. Weber, J. L. & Myers, E. W. Human whole-genome shotgun sequencing. Genome Res. 7, 401–409
(1997).
59. Green, P. Against a whole-genome shotgun. Genome Res. 7, 410–417 (1997).
60. Venter, J. C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
61. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
62. Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–
679 (1986).
63. Ju, J. Y., Ruan, C. C., Fuller, C. W., Glazer, A. N. & Mathies, R. A. Fluorescence energy-transfer dye-
labeled primers for DNA sequencing and analysis. Proc. Natl Acad. Sci. USA 92, 4347–4351 (1995).
64. Lee, L. G. et al. New energy transfer dyes for DNA sequencing. Nucleic Acids Res. 25, 2816–2822 (1997).
65. Rosenblum, B. B. et al. New dye-labeled terminators for improved DNA sequencing patterns.
Nucleic Acids Res. 25, 4500–4504 (1997).
66. Metzker, M. L., Lu, J. & Gibbs, R. A. Electrophoretically uniform fluorescent dyes for automated
DNA sequencing. Science 271, 1420–1422 (1996).
67. Prober, J. M. et al. A system for rapid DNA sequencing with fluorescent chain-terminating
dideoxynucleotides. Science 238, 336–341 (1987).
68. Reeve, M. A. & Fuller, C. W. A novel thermostable polymerase for DNA sequencing. Nature 376,
796–797 (1995).
69. Tabor, S. & Richardson, C. C. Selective inactivation of the exonuclease activity of bacteriophage T7
DNA polymerase by in vitro mutagenesis. J. Biol. Chem. 264, 6447–6458 (1989).
70. Tabor, S. & Richardson, C. C. DNA sequence analysis with a modified bacteriophage T7 DNA
polymerase: effect of pyrophosphorolysis and metal ions. J. Biol. Chem. 265, 8322–8328 (1990).
71. Murray, V. Improved double-stranded DNA sequencing using the linear polymerase chain reaction.
Nucleic Acids Res. 17, 8889 (1989).
72. Guttman, A., Cohen, A. S., Heiger, D. N. & Karger, B. L. Analytical and micropreparative ultrahigh
resolution of oligonucleotides by polyacrylamide-gel high-performance capillary electrophoresis.
Anal. Chem. 62, 137–141 (1990).
73. Luckey, J. A. et al. High-speed DNA sequencing by capillary electrophoresis. Nucleic Acids Res. 18,
4417–4421 (1990).
74. Swerdlow, H., Wu, S., Harke, H. & Dovichi, N. J. Capillary gel-electrophoresis for DNA sequencing:
laser-induced fluorescence detection with the sheath flow cuvette. J. Chromatogr. 516, 61–67 (1990).
75. Meldrum, D. Automation for genomics, part one: preparation for sequencing. Genome Res. 10,
1081–1092 (2000).
76. Meldrum, D. Automation for genomics, part two: sequencers, microarrays, and future trends.
Genome Res. 10, 1288–1303 (2000).
77. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities.
Genome Res. 8, 186–194 (1998).
78. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using
phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
79. Bentley, D. R. Genomic sequence information should be released immediately and freely in the
public domain. Science 274, 533–534 (1996).
80. Guyer, M. Statement on the rapid release of genomic DNA sequence. Genome Res. 8, 413 (1998).
81. Dietrich, W. et al. A genetic map of the mouse suitable for typing intraspecific crosses. Genetics 131,
423–447 (1992).
82. Kim, U. J. et al. Construction and characterization of a human bacterial artificial chromosome
library. Genomics 34, 213–218 (1996).
83. Osoegawa, K. et al. Bacterial artificial chromosome libraries for mouse sequencing and functional
analysis. Genome Res. 10, 116–128 (2000).
84. Marra, M. A. et al. High throughput fingerprint analysis of large-insert clones. Genome Res. 7, 1072–
1084 (1997).
85. Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22,
265–270 (1999).
86. The International Human Genome Mapping Consortium. A physical map of the human genome.
Nature 409, 934–941 (2001).
87. Zhao, S. et al. Human BAC ends quality assessment and sequence analyses. Genomics 63, 321–332
(2000).
88. Mahairas, G. G. et al. Sequence-tagged connectors: A sequence approach to mapping and scanning
the human genome. Proc. Natl Acad. Sci. USA 96, 9739–9744 (1999).
89. Tilford, C. A. et al. A physical map of the human Y chromosome. Nature 409, 943–945 (2001).
90. Bentley, D. R. et al. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X.
Nature 409, 942–943 (2001).
91. Montgomery, K. T. et al. A high-resolution map of human chromosome 12. Nature 409, 945–946
(2001).
92. Brüls, T. et al. A physical map of human chromosome 14. Nature 409, 947–948 (2001).
93. Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–319 (2000).
94. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
95. Cox, D. et al. Radiation hybrid map of the human genome. Science (in the press).
96. Osoegawa, K. et al. An improved approach for construction of bacterial artificial chromosome
libraries. Genomics 52, 1–8 (1998).
97. The International SNP Map Working Group. A map of human genome sequence variation
containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
98. Collins, F. S., Brooks, L. D. & Chakravarti, A. A DNA polymorphism discovery resource for research
on human genetic variation. Genome Res. 8, 1229–1231 (1998).
99. Stewart, E. A. et al. An STS-based radiation hybrid map of the human genome. Genome Res. 7, 422–
433 (1997).
100. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744–746 (1998).
101. Dib, C. et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites.
Nature 380, 152–154 (1996).
102. Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L. & Weber, J. L. Comprehensive human
genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet. 63, 861–
869 (1998).
103. The BAC Resource Consortium. Integration of cytogenetic landmarks into the draft sequence of the
human genome. Nature 409, 953–958 (2001).
104. Kent, W. J. & Haussler, D. GigAssembler: an algorithm for the initial assembly of the human working
draft. Technical Report UCSC-CRL-00-17 (Univ. California at Santa Cruz, Santa Cruz, California,
2001).
105. Morton, N. E. Parameters of the human genome. Proc. Natl Acad. Sci. USA 88, 7474–7476 (1991).
106. Podugolnikova, O. A. & Blumina, M. G. Heterochromatic regions on chromosomes 1, 9, 16, and Y in
children with some disturbances occurring during embryo development. Hum. Genet. 63, 183–188
(1983).
107. Lundgren, R., Berger, R. & Kristoffersson, U. Constitutive heterochromatin C-band polymorphism
in prostatic cancer. Cancer Genet. Cytogenet. 51, 57–62 (1991).
108. Lee, C., Wevrick, R., Fisher, R. B., Ferguson-Smith, M. A. & Lin, C. C. Human centromeric DNAs.
Hum. Genet. 100, 291–304 (1997).
109. Riethman, H. C. et al. Integration of telomere sequences with the draft human genome sequence.
Nature 409, 953–958 (2001).
110. Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids
Res. 29, 137–140 (2001).
111. Wolfsberg, T. G., McEntyre, J. & Schuler, G. D. Guide to the draft human genome. Nature 409, 824–
826 (2001).
112. Hurst, L. D. & Eyre-Walker, A. Evolutionary genomics: reading the bands. Bioessays 22, 105–107
(2000).
113. Saccone, S. et al. Correlations between isochores and chromosomal bands in the human genome.
Proc. Natl Acad. Sci. USA 90, 11929–11933 (1993).
114. Zoubak, S., Clay, O. & Bernardi, G. The gene distribution of the human genome. Gene 174, 95–102
(1996).
115. Gardiner, K. Base composition and gene distribution: critical patterns in mammalian genome
organization. Trends Genet. 12, 519–524 (1996).
116. Duret, L., Mouchiroud, D. & Gautier, C. Statistical analysis of vertebrate sequences reveals that long
genes are scarce in GC-rich isochores. J. Mol. Evol. 40, 308–317 (1995).
117. Saccone, S., De Sario, A., Della Valle, G. & Bernardi, G. The highest gene concentrations in the
human genome are in telomeric bands of metaphase chromosomes. Proc. Natl Acad. Sci. USA 89,
4913–4917 (1992).
118. Bernardi, G. et al. The mosaic genome of warm-blooded vertebrates. Science 228, 953–958
(1985).
119. Bernardi, G. Isochores and the evolutionary genomics of vertebrates. Gene 241, 3–17 (2000).
120. Fickett, J. W., Torney, D. C. & Wolf, D. R. Base compositional structure of genomes. Genomics 13,
1056–1064 (1992).
121. Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51, 79–94
(1989).
122. Bird, A., Taggart, M., Frommer, M., Miller, O. J. & Macleod, D. A fraction of the mouse genome that
is derived from islands of nonmethylated, CpG-rich DNA. Cell 40, 91–99 (1985).
123. Bird, A. P. CpG islands as gene markers in the vertebrate nucleus. Trends Genet. 3, 342–347 (1987).
124. Chan, M. F., Liang, G. & Jones, P. A. Relationship between transcription and DNA methylation. Curr.
Top. Microbiol. Immunol. 249, 75–86 (2000).
125. Holliday, R. & Pugh, J. E. DNA modification mechanisms and gene activity during development.
Science 187, 226–232 (1975).
126. Larsen, F., Gundersen, G., Lopez, R. & Prydz, H. CpG islands as gene markers in the human genome.
Genomics 13, 1095–1107 (1992).
127. Tazi, J. & Bird, A. Alternative chromatin structure at CpG islands. Cell 60, 909–920 (1990).
128. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol. 196, 261–282
(1987).
129. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl Acad.
Sci. USA 90, 11995–11999 (1993).
130. Ewing, B. & Green, P. Analysis of expressed sequence tags indicates 35,000 human genes. Nature
Genet. 25, 232–234 (2000).
131. Yu, A. Comparison of human genetic and sequence-based physical maps. Nature 409, 951–953 (2001).
132. Kaback, D. B., Guacci, V., Barber, D. & Mahon, J. W. Chromosome size-dependent control of meiotic
recombination. Science 256, 228–232 (1992).
133. Riles, L. et al. Physical maps of the 6 smallest chromosomes of Saccharomyces cerevisiae at a
resolution of 2.6-kilobase pairs. Genetics 134, 81–150 (1993).
134. Lynn, A. et al. Patterns of meiotic recombination on the long arm of human chromosome 21.
Genome Res. 10, 1319–1332 (2000).
135. Laurie, D. A. & Hulten, M. A. Further studies on bivalent chiasma frequency in human males with
normal karyotypes. Ann. Hum. Genet. 49, 189–201 (1985).
136. Roeder, G. S. Meiotic chromosomes: it takes two to tango. Genes Dev. 11, 2600–2621 (1997).
137. Wu, T.-C. & Lichten, M. Meiosis-induced double-strand break sites determined by yeast chromatin
structure. Science 263, 515–518 (1994).
138. Gerton, J. L. et al. Global mapping of meiotic recombination hotspots and coldspots in the yeast
Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 97, 11383–11390 (2000).
139. Li, W.-H. Molecular Evolution (Sinauer, Sunderland, Massachusetts, 1997).
140. Gregory, T. R. & Hebert, P. D. The modulation of DNA content: proximate causes and ultimate
consequences. Genome Res. 9, 317–324 (1999).
141. Hartl, D. L. Molecular melodies in high and low C. Nature Rev. Genet. 1, 145–149 (2000).
142. Smit, A. F. Interspersed repeats and other mementos of transposable elements in mammalian
genomes. Curr. Opin. Genet. Dev. 9, 657–663 (1999).
143. Prak, E. T. L. & Kazazian, H. H. Jr Mobile elements and the human genome. Nature Rev. Genet. 1, 134–144
(2000).
144. Okada, N., Hamada, M., Ogiwara, I. & Ohshima, K. SINEs and LINEs share common 3′ sequences: a
review. Gene 205, 229–243 (1997).
145. Esnault, C., Maestre, J. & Heidmann, T. Human LINE retrotransposons generate processed
pseudogenes. Nature Genet. 24, 363–367 (2000).
146. Wei, W. et al. Human L1 retrotransposition: cis-preference vs. trans-complementation. Mol. Cell.
Biol. 21, 1429–1439 (2001).
147. Malik, H. S., Henikoff, S. & Eickbush, T. H. Poised for contagion: evolutionary origins of the
infectious abilities of invertebrate retroviruses. Genome Res. 10, 1307–1318 (2000).
148. Smit, A. F. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. 6, 743–
748 (1996).
149. Clark, J. B. & Tidwell, M. G. A phylogenetic perspective on P transposable element evolution in
Drosophila. Proc. Natl Acad. Sci. USA 94, 11428–11433 (1997).
150. Haring, E., Hagemann, S. & Pinsker, W. Ancient and recent horizontal invasions of Drosophilids by
P elements. J. Mol. Evol. 51, 577–586 (2000).
151. Koga, A. et al. Evidence for recent invasion of the medaka fish genome by the Tol2 transposable
element. Genetics 155, 273–281 (2000).
152. Robertson, H. M. & Lampe, D. J. Recent horizontal transfer of a mariner transposable element
among and between Diptera and Neuroptera. Mol. Biol. Evol. 12, 850–862 (1995).
153. Simmons, G. M. Horizontal transfer of hobo transposable elements within the Drosophila
melanogaster species complex: evidence from DNA sequencing. Mol. Biol. Evol. 9, 1050–1060
(1992).
154. Malik, H. S., Burke, W. D. & Eickbush, T. H. The age and evolution of non-LTR retrotransposable
elements. Mol. Biol. Evol. 16, 793–805 (1999).
155. Kordis, D. & Gubensek, F. Bov-B long interspersed repeated DNA (LINE) sequences are present in
Vipera ammodytes phospholipase A2 genes and in genomes of Viperidae snakes. Eur. J. Biochem. 246,
772–779 (1997).
156. Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet.
16, 418–420 (2000).
157. Sarich, V. M. & Wilson, A. C. Generation time and genome evolution in primates. Science 179,
1144–1147 (1973).
158. Smit, A. F., Toth, G., Riggs, A. D. & Jurka, J. Ancestral, mammalian-wide subfamilies of LINE-1
repetitive sequences. J. Mol. Biol. 246, 401–417 (1995).
159. Lim, J. K. & Simmons, M. J. Gross chromosome rearrangements mediated by transposable elements
in Drosophila melanogaster. Bioessays 16, 269–275 (1994).
160. Caceres, M., Ranz, J. M., Barbadilla, A., Long, M. & Ruiz, A. Generation of a widespread Drosophila
inversion by a transposable element. Science 285, 415–418 (1999).
161. Gray, Y. H. It takes two transposons to tango: transposable-element-mediated chromosomal
rearrangements. Trends Genet. 16, 461–468 (2000).
162. Zhang, J. & Peterson, T. Genome rearrangements by nonlinear transposons in maize. Genetics 153,
1403–1410 (1999).
163. Smit, A. F. Identification of a new, abundant superfamily of mammalian LTR-transposons. Nucleic
Acids Res. 21, 1863–1872 (1993).
164. Cordonnier, A., Casella, J. F. & Heidmann, T. Isolation of novel human endogenous retrovirus-like
elements with foamy virus-related pol sequence. J. Virol. 69, 5890–5897 (1995).
165. Medstrand, P. & Mager, D. L. Human-specific integrations of the HERV-K endogenous retrovirus
family. J. Virol. 72, 9782–9787 (1998).
166. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000).
167. Petrov, D. A., Lozovskaya, E. R. & Hartl, D. L. High intrinsic rate of DNA loss in Drosophila. Nature
384, 346–349 (1996).
168. Li, W. H., Ellsworth, D. L., Krushkal, J., Chang, B. H. & Hewett-Emmett, D. Rates of nucleotide
substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet.
Evol. 5, 182–187 (1996).
169. Goodman, M. et al. Toward a phylogenetic classification of primates based on DNA evidence
complemented by fossil evidence. Mol. Phylogenet. Evol. 9, 585–598 (1998).
170. Kazazian, H. H. Jr & Moran, J. V. The impact of L1 retrotransposons on the human genome. Nature
Genet. 19, 19–24 (1998).
171. Malik, H. S. & Eickbush, T. H. NeSL-1, an ancient lineage of site-specific non-LTR retrotransposons
from Caenorhabditis elegans. Genetics 154, 193–203 (2000).
172. Casavant, N. C. et al. The end of the LINE?: lack of recent L1 activity in a group of South American
rodents. Genetics 154, 1809–1817 (2000).
173. Meunier-Rotival, M., Soriano, P., Cuny, G., Strauss, F. & Bernardi, G. Sequence organization and
genomic distribution of the major family of interspersed repeats of mouse DNA. Proc. Natl Acad. Sci.
USA 79, 355–359 (1982).
174. Soriano, P., Meunier-Rotival, M. & Bernardi, G. The distribution of interspersed repeats is
nonuniform and conserved in the mouse and human genomes. Proc. Natl Acad. Sci. USA 80, 1816–
1820 (1983).
175. Goldman, M. A., Holmquist, G. P., Gray, M. C., Caston, L. A. & Nag, A. Replication timing of genes
and middle repetitive sequences. Science 224, 686–692 (1984).
176. Manuelidis, L. & Ward, D. C. Chromosomal and nuclear distribution of the HindIII 1.9-kb human
DNA repeat segment. Chromosoma 91, 28–38 (1984).
177. Feng, Q., Moran, J. V., Kazazian, H. H. Jr & Boeke, J. D. Human L1 retrotransposon encodes a
conserved endonuclease required for retrotransposition. Cell 87, 905–916 (1996).
178. Jurka, J. Sequence patterns indicate an enzymatic involvement in integration of mammalian
retroposons. Proc. Natl Acad. Sci. USA 94, 1872–1877 (1997).
179. Arcot, S. S. et al. High-resolution cartography of recently integrated human chromosome 19-specific
Alu fossils. J. Mol. Biol. 281, 843–856 (1998).
180. Schmid, C. W. Does SINE evolution preclude Alu function? Nucleic Acids Res. 26, 4541–4550 (1998).
181. Chu, W. M., Ballard, R., Carpick, B. W., Williams, B. R. & Schmid, C. W. Potential Alu function:
regulation of the activity of double-stranded RNA-activated kinase PKR. Mol. Cell. Biol. 18, 58–68
(1998).
182. Li, T., Spearow, J., Rubin, C. M. & Schmid, C. W. Physiological stresses increase mouse short
interspersed element (SINE) RNA expression in vivo. Gene 239, 367–372 (1999).
183. Liu, W. M., Chu, W. M., Choudary, P. V. & Schmid, C. W. Cell stress and translational inhibitors
transiently increase the abundance of mammalian SINE transcripts. Nucleic Acids Res. 23, 1758–
1765 (1995).
184. Filipski, J. Correlation between molecular clock ticking, codon usage fidelity of DNA repair,
chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217, 184–186
(1987).
185. Sueoka, N. Directional mutation pressure and neutral molecular evolution. Proc. Natl Acad. Sci.
102. Broman, K. W., Murray, J. C., Shef®eld, V. C., White, R. L. & Weber, J. L. Comprehensive human
genetic maps: individual and sex-speci®c variation in recombination. Am. J. Hum. Genet. 63, 861±
869 (1998).
103. The BAC Resource Consortium. Integration of cytogenetic landmarks into the draft sequence of the
human genome. Nature 409, 953±958 (2001).
104. Kent, W. J. & Haussler, D. GigAssembler: an algorithm for the initial assembly of the human working
draft . Technical Report UCSC-CRL-00-17 (Univ. California at Santa Cruz, Santa Cruz, California,
2001).
105. Morton, N. E. Parameters of the human genome. Proc. Natl Acad. Sci. USA 88, 7474±7476 (1991).
106. Podugolnikova, O. A. & Blumina, M. G. Heterochromatic regions on chromosomes 1, 9, 16, and Y in
children with some disturbances occurring during embryo development. Hum. Genet. 63, 183±188
(1983).
107. Lundgren, R., Berger, R. & Kristoffersson, U. Constitutive heterochromatin C-band polymorphism
in prostatic cancer. Cancer Genet. Cytogenet. 51, 57±62 (1991).
108. Lee, C., Wevrick, R., Fisher, R. B., Ferguson-Smith, M. A. & Lin, C. C. Human centromeric DNAs.
Hum. Genet. 100, 291±304 (1997).
109. Riethman, H. C. et al. Integration of telomere sequences with the draft human genome sequence.
Nature 409, 953±958 (2001).
110. Pruit, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids
Res. 29, 137±140 (2001).
111. Wolfsberg, T. G., McEntyre, J. & Schuler, G. D. Guide to the draft human genome. Nature 409, 824±
826 (2001).
112. Hurst, L. D. & Eyre-Walker, A. Evolutionary genomics: reading the bands. Bioessays 22, 105±107
(2000).
113. Saccone, S. et al. Correlations between isochores and chromosomal bands in the human genome.
Proc. Natl Acad. Sci. USA 90, 11929±11933 (1993).
114. Zoubak, S., Clay, O. & Bernardi, G. The gene distribution of the human genome. Gene 174, 95±102
(1996).
115. Gardiner, K. Base composition and gene distribution: critical patterns in mammalian genome
organization. Trends Genet. 12, 519±524 (1996).
116. Duret, L., Mouchiroud, D. & Gautier, C. Statistical analysis of vertebrate sequences reveals that long
genes are scarce in GC-rich isochores. J. Mol. Evol. 40, 308±317 (1995).
117. Saccone, S., De Sario, A., Della Valle, G. & Bernardi, G. The highest gene concentrations in the
human genome are in telomeric bands of metaphase chromosomes. Proc. Natl Acad. Sci. USA 89,
4913±4917 (1992).
118. Bernardi, G. et al. The mosaic genome of warm-blooded vertebrates. Science 228, 953±958
(1985).
119. Bernardi, G. Isochores and the evolutionary genomics of vertebrates. Gene 241, 3±17 (2000).
120. Fickett, J. W., Torney, D. C. & Wolf, D. R. Base compositional structure of genomes. Genomics 13,
1056±1064 (1992).
121. Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51, 79±94
(1989).
122. Bird, A., Taggart, M., Frommer, M., Miller, O. J. & Macleod, D. A fraction of the mouse genome that
is derived from islands of nonmethylated, CpG-rich DNA. Cell 40, 91±99 (1985).
123. Bird, A. P. CpG islands as gene markers in the vertebrate nucleus. Trends Genet. 3, 342±347 (1987).
124. Chan, M. F., Liang, G. & Jones, P. A. Relationship between transcription and DNA methylation. Curr.
Top. Microbiol. Immunol. 249, 75±86 (2000).
125. Holliday, R. & Pugh, J. E. DNA modi®cation mechanisms and gene activity during development.
Science 187, 226±232 (1975).
126. Larsen, F., Gundersen, G., Lopez, R. & Prydz, H. CpG islands as gene markers in the human genome.
Genomics 13, 1095±1107 (1992).
127. Tazi, J. & Bird, A. Alternative chromatin structure at CpG islands. Cell 60, 909±920 (1990).
128. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol. 196, 261±282
(1987).
129. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl Acad.
Sci. USA 90, 11995±11999 (1993).
130. Ewing, B. & Green, P. Analysis of expressed sequence tags indicates 35,000 human genes. Nature
Genet. 25, 232±234 (2000).
131. Yu, A. Comparison of human genetic and sequence-based physical maps. Nature 409, 951±953 (2001).
132. Kaback, D. B., Guacci, V., Barber, D. & Mahon, J. W. Chromosome size-dependent control of meiotic
recombination. Science 256, 228±232 (1992).
133. Riles, L. et al. Physical maps of the 6 smallest chromosomes of Saccharomyces cerevisiae at a
resolution of 2.6-kilobase pairs. Genetics 134, 81±150 (1993).
134. Lynn, A. et al. Patterns of meiotic recombination on the long arm of human chromosome 21.
Genome Res. 10, 1319±1332 (2000).
135. Laurie, D. A. & Hulten, M. A. Further studies on bivalent chiasma frequency in human males with
normal karyotypes. Ann. Hum. Genet. 49, 189±201 (1985).
136. Roeder, G. S. Meiotic chromosomes: it takes two to tango. Genes Dev. 11, 2600±2621 (1997).
137. Wu, T.-C. & Lichten, M. Meiosis-induced double-strand break sites determined by yeast chromatin
structure. Science 263, 515±518 (1994).
138. Gerton, J. L. et al. Global mapping of meiotic recombination hotspots and coldspots in the yeast
Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 97, 11383±11390 (2000).
139. Li, W. -H. Molecular Evolution (Sinauer, Sunderland, Massachusetts, 1997).
140. Gregory, T. R. & Hebert, P. D. The modulation of DNA content: proximate causes and ultimate
consequences. Genome Res. 9, 317±324 (1999).
141. Hartl, D. L. Molecular melodies in high and low C. Nature Rev. Genet. 1, 145±149 (2000).
142. Smit, A. F. Interspersed repeats and other mementos of transposable elements in mammalian
genomes. Curr. Opin. Genet. Dev. 9, 657±663 (1999).
143. Prak, E. L. & Haig, H. K. Jr Mobile elements and the human genome. Nature Rev. Genet. 1, 134±144
(2000).
144. Okada, N., Hamada, M., Ogiwara, I. & Ohshima, K. SINEs and LINEs share common 39 sequences: a
review. Gene 205, 229±243 (1997).
145. Esnault, C., Maestre, J. & Heidmann, T. Human LINE retrotransposons generate processed
pseudogenes. Nature Genet. 24, 363±367 (2000).
146. Wei, W. et al. Human L1 retrotransposition: cis-preference vs. trans-complementation. Mol. Cell.
Biol. 21, 1429±1439 (2001)
147. Malik, H. S., Henikoff, S. & Eickbush, T. H. Poised for contagion: evolutionary origins of the
infectious abilities of invertebrate retroviruses. Genome Res. 10, 1307±1318 (2000).
148. Smit, A. F. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. 6, 743±
748 (1996).
149. Clark, J. B. & Tidwell, M. G. A phylogenetic perspective on P transposable element evolution in
Drosophila. Proc. Natl Acad. Sci. USA 94, 11428±11433 (1997).
150. Haring, E., Hagemann, S. & Pinsker, W. Ancient and recent horizontal invasions of Drosophilids by
P elements. J. Mol. Evol. 51, 577±586 (2000).
151. Koga, A. et al. Evidence for recent invasion of the medaka ®sh genome by the Tol2 transposable
element. Genetics 155, 273±281 (2000).
152. Robertson, H. M. & Lampe, D. J. Recent horizontal transfer of a mariner transposable element
among and between Diptera and Neuroptera. Mol. Biol. Evol. 12, 850±862 (1995).
153. Simmons, G. M. Horizontal transfer of hobo transposable elements within the Drosophila
melanogaster species complex: evidence from DNA sequencing. Mol. Biol. Evol. 9, 1050±1060
(1992).
154. Malik, H. S., Burke, W. D. & Eickbush, T. H. The age and evolution of non-LTR retrotransposable
elements. Mol. Biol. Evol. 16, 793±805 (1999).
155. Kordis, D. & Gubensek, F. Bov-B long interspersed repeated DNA (LINE) sequences are present in
Vipera ammodytes phospholipase A2 genes and in genomes of Viperidae snakes. Eur. J. Biochem. 246,
772±779 (1997).
156. Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet.
16, 418±420 (2000).
157. Sarich, V. M. & Wilson, A. C. Generation time and genome evolution in primates. Science 179,
1144±1147 (1973).
158. Smit, A. F., Toth, G., Riggs, A. D., & Jurka, J. Ancestral, mammalian-wide subfamilies of LINE-1
repetitive sequences. J. Mol. Biol. 246, 401±417 (1995).
159. Lim, J. K. & Simmons, M. J. Gross chromosome rearrangements mediated by transposable elements
in Drosophila melanogaster. Bioessays 16, 269±275 (1994).
160. Caceres, M., Ranz, J. M., Barbadilla, A., Long, M. & Ruiz, A. Generation of a widespread Drosophila
inversion by a transposable element. Science 285, 415±418 (1999).
161. Gray, Y. H. It takes two transposons to tango: transposable-element-mediated chromosomal
rearrangements. Trends Genet. 16, 461±468 (2000).
162. Zhang, J. & Peterson, T. Genome rearrangements by nonlinear transposons in maize. Genetics 153,
1403±1410 (1999).
163. Smit, A. F. Identi®cation of a new, abundant superfamily of mammalian LTR-transposons. Nucleic
Acids Res. 21, 1863±1872 (1993).
164. Cordonnier, A., Casella, J. F. & Heidmann, T. Isolation of novel human endogenous retrovirus-like
elements with foamy virus-related pol sequence. J. Virol. 69, 5890±5897 (1995).
165. Medstrand, P. & Mager, D. L. Human-speci®c integrations of the HERV-K endogenous retrovirus
family. J. Virol. 72, 9782±9787 (1998).
166. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196±2204 (2000).
167. Petrov, D. A., Lozovskaya, E. R. & Hartl, D. L. High intrinsic rate of DNA loss in Drosophila. Nature
384, 346±349 (1996).
168. Li, W. H., Ellsworth, D. L., Krushkal, J., Chang, B. H. & Hewett-Emmett, D. Rates of nucleotide
substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet.
Evol. 5, 182±187 (1996).
169. Goodman, M. et al. Toward a phylogenetic classi®cation of primates based on DNA evidence
complemented by fossil evidence. Mol. Phylogenet. Evol. 9, 585±598 (1998).
170. Kazazian, H. H. Jr & Moran, J. V. The impact of L1 retrotransposons on the human genome. Nature
Genet. 19, 19±24 (1998).
171. Malik, H. S. & Eickbush, T. H. NeSL-1, an ancient lineage of site-speci®c non-LTR retrotransposons
from Caenorhabditis elegans. Genetics 154, 193±203 (2000).
172. Casavant, N. C. et al. The end of the LINE?: lack of recent L1 activity in a group of South American
rodents. Genetics 154, 1809±1817 (2000).
173. Meunier-Rotival, M., Soriano, P., Cuny, G., Strauss, F. & Bernardi, G. Sequence organization and
genomic distribution of the major family of interspersed repeats of mouse DNA. Proc. Natl Acad. Sci.
USA 79, 355±359 (1982).
174. Soriano, P., Meunier-Rotival, M. & Bernardi, G. The distribution of interspersed repeats is
nonuniform and conserved in the mouse and human genomes. Proc. Natl Acad. Sci. USA 80, 1816±
1820 (1983).
175. Goldman, M. A., Holmquist, G. P., Gray, M. C., Caston, L. A. & Nag, A. Replication timing of genes
and middle repetitive sequences. Science 224, 686±692 (1984).
176. Manuelidis, L. & Ward, D. C. Chromosomal and nuclear distribution of the HindIII 1.9-kb human
DNA repeat segment. Chromosoma 91, 28±38 (1984).
177. Feng, Q., Moran, J. V., Kazazian, H. H. Jr & Boeke, J. D. Human L1 retrotransposon encodes a
conserved endonuclease required for retrotransposition. Cell 87, 905–916 (1996).
178. Jurka, J. Sequence patterns indicate an enzymatic involvement in integration of mammalian
retroposons. Proc. Natl Acad. Sci. USA 94, 1872–1877 (1997).
179. Arcot, S. S. et al. High-resolution cartography of recently integrated human chromosome 19-specific
Alu fossils. J. Mol. Biol. 281, 843–856 (1998).
180. Schmid, C. W. Does SINE evolution preclude Alu function? Nucleic Acids Res. 26, 4541–4550 (1998).
181. Chu, W. M., Ballard, R., Carpick, B. W., Williams, B. R. & Schmid, C. W. Potential Alu function:
regulation of the activity of double-stranded RNA-activated kinase PKR. Mol. Cell. Biol. 18, 58–68
(1998).
182. Li, T., Spearow, J., Rubin, C. M. & Schmid, C. W. Physiological stresses increase mouse short
interspersed element (SINE) RNA expression in vivo. Gene 239, 367–372 (1999).
183. Liu, W. M., Chu, W. M., Choudary, P. V. & Schmid, C. W. Cell stress and translational inhibitors
transiently increase the abundance of mammalian SINE transcripts. Nucleic Acids Res. 23,
1758–1765 (1995).
184. Filipski, J. Correlation between molecular clock ticking, codon usage, fidelity of DNA repair,
chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217, 184–186
(1987).
185. Sueoka, N. Directional mutation pressure and neutral molecular evolution. Proc. Natl Acad. Sci.
USA 85, 2653–2657 (1988).
186. Wolfe, K. H., Sharp, P. M. & Li, W. H. Mutation rates differ among regions of the mammalian
genome. Nature 337, 283–285 (1989).
187. Bains, W. Local sequence dependence of rate of base replacement in mammals. Mutat. Res. 267,
43–54 (1992).
188. Mathews, C. K. & Ji, J. DNA precursor asymmetries, replication fidelity, and variable genome
evolution. Bioessays 14, 295–301 (1992).
189. Holmquist, G. P. & Filipski, J. Organization of mutations along the genome: a prime determinant of
genome evolution. Trends Ecol. Evol. 9, 65–68 (1994).
190. Eyre-Walker, A. Evidence of selection on silent site base composition in mammals: potential
implications for the evolution of isochores and junk DNA. Genetics 152, 675–683 (1999).
191. The International SNP Map Working Group. An SNP map of the human genome generated by
reduced representation shotgun sequencing. Nature 407, 513–516 (2000).
192. Bohossian, H. B., Skaletsky, H. & Page, D. C. Unexpectedly similar rates of nucleotide substitution
found in male and female hominids. Nature 406, 622–625 (2000).
193. Skowronski, J., Fanning, T. G. & Singer, M. F. Unit-length LINE-1 transcripts in human
teratocarcinoma cells. Mol. Cell. Biol. 8, 1385–1397 (1988).
194. Boissinot, S., Chevret, P. & Furano, A. V. L1 (LINE-1) retrotransposon evolution and amplification
in recent human history. Mol. Biol. Evol. 17, 915–928 (2000).
195. Moran, J. V. Human L1 retrotransposition: insights and peculiarities learned from a cultured cell
retrotransposition assay. Genetica 107, 39–51 (1999).
196. Kazazian, H. H. Jr et al. Haemophilia A resulting from de novo insertion of L1 sequences represents a
novel mechanism for mutation in man. Nature 332, 164–166 (1988).
197. Sheen, F.-m. et al. Reading between the LINEs: Human genomic variation introduced by LINE-1
retrotransposition. Genome Res. 10, 1496–1508 (2000).
198. Dombroski, B. A., Mathias, S. L., Nanthakumar, E., Scott, A. F. & Kazazian, H. H. Jr Isolation of an
active human transposable element. Science 254, 1805–1808 (1991).
199. Holmes, S. E., Dombroski, B. A., Krebs, C. M., Boehm, C. D. & Kazazian, H. H. Jr A new
retrotransposable human L1 element from the LRE2 locus on chromosome 1q produces a chimaeric
insertion. Nature Genet. 7, 143–148 (1994).
200. Sassaman, D. M. et al. Many human L1 elements are capable of retrotransposition. Nature Genet. 16,
37–43 (1997).
201. Dombroski, B. A., Scott, A. F. & Kazazian, H. H. Jr Two additional potential retrotransposons
isolated from a human L1 subfamily that contains an active retrotransposable element. Proc. Natl
Acad. Sci. USA 90, 6513–6517 (1993).
202. Kimberland, M. L. et al. Full-length human L1 insertions retain the capacity for high frequency
retrotransposition in cultured cells. Hum. Mol. Genet. 8, 1557–1560 (1999).
203. Moran, J. V. et al. High frequency retrotransposition in cultured mammalian cells. Cell 87, 917–927
(1996).
204. Moran, J. V., DeBerardinis, R. J. & Kazazian, H. H. Jr Exon shuffling by L1 retrotransposition. Science
283, 1530–1534 (1999).
205. Pickeral, O. K., Makalowski, W., Boguski, M. S. & Boeke, J. D. Frequent human genomic DNA
transduction driven by LINE-1 retrotransposition. Genome Res. 10, 411–415 (2000).
206. Miki, Y. et al. Disruption of the APC gene by a retrotransposal insertion of L1 sequence in a colon
cancer. Cancer Res. 52, 643–645 (1992).
207. Branciforte, D. & Martin, S. L. Developmental and cell type specificity of LINE-1 expression in
mouse testis: implications for transposition. Mol. Cell. Biol. 14, 2584–2592 (1994).
208. Trelogan, S. A. & Martin, S. L. Tightly regulated, developmentally specific expression of the first
open reading frame from LINE-1 during mouse embryogenesis. Proc. Natl Acad. Sci. USA 92,
1520–1524 (1995).
209. Jurka, J. & Kapitonov, V. V. Sectorial mutagenesis by transposable elements. Genetica 107, 239–248
(1999).
210. Fraser, M. J., Ciszczon, T., Elick, T. & Bauser, C. Precise excision of TTAA-specific lepidopteran
transposons piggyBac (IFP2) and tagalong (TFP3) from the baculovirus genome in cell lines from
two species of Lepidoptera. Insect Mol. Biol. 5, 141–151 (1996).
211. Brosius, J. Genomes were forged by massive bombardments with retroelements and retrosequences.
Genetica 107, 209–238 (1999).
212. Kruglyak, S., Durrett, R. T., Schug, M. D. & Aquadro, C. F. Equilibrium distribution of microsatellite
repeat length resulting from a balance between slippage events and point mutations. Proc. Natl Acad.
Sci. USA 95, 10774–10778 (1998).
213. Toth, G., Gaspari, Z. & Jurka, J. Microsatellites in different eukaryotic genomes: survey and analysis.
Genome Res. 10, 967–981 (2000).
214. Ellegren, H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nature
Genet. 24, 400–402 (2000).
215. Ji, Y., Eichler, E. E., Schwartz, S. & Nicholls, R. D. Structure of chromosomal duplicons and their role
in mediating human genomic disorders. Genome Res. 10, 597–610 (2000).
216. Eichler, E. E. Masquerading repeats: paralogous pitfalls of the human genome. Genome Res. 8,
758–762 (1998).
217. Mazzarella, R. & Schlessinger, D. Pathological consequences of sequence duplications in the
human genome. Genome Res. 8, 1007–1021 (1998).
218. Eichler, E. E. et al. Interchromosomal duplications of the adrenoleukodystrophy locus: a
phenomenon of pericentromeric plasticity. Hum. Mol. Genet. 6, 991–1002 (1997).
219. Horvath, J. E., Schwartz, S. & Eichler, E. E. The mosaic structure of human pericentromeric DNA: a
strategy for characterizing complex regions of the human genome. Genome Res. 10, 839–852 (2000).
220. Brand-Arpon, V. et al. A genomic region encompassing a cluster of olfactory receptor genes and a
myosin light chain kinase (MYLK) gene is duplicated on human chromosome regions 3q13-q21 and
3p13. Genomics 56, 98–110 (1999).
221. Arnold, N., Wienberg, J., Ermert, K. & Zachau, H. G. Comparative mapping of DNA probes derived
from the V kappa immunoglobulin gene regions on human and great ape chromosomes by
fluorescence in situ hybridization. Genomics 26, 147–150 (1995).
222. Eichler, E. E. et al. Duplication of a gene-rich cluster between 16p11.1 and Xq28: a novel
pericentromeric-directed mechanism for paralogous genome evolution. Hum. Mol. Genet. 5,
899–912 (1996).
223. Potier, M. et al. Two sequence-ready contigs spanning the two copies of a 200-kb duplication on
human 21q: partial sequence and polymorphisms. Genomics 51, 417–426 (1998).
224. Regnier, V. et al. Emergence and scattering of multiple neurofibromatosis (NF1)-related sequences
during hominoid evolution suggest a process of pericentromeric interchromosomal transposition.
Hum. Mol. Genet. 6, 9–16 (1997).
225. Ritchie, R. J., Mattei, M. G. & Lalande, M. A large polymorphic repeat in the pericentromeric region
of human chromosome 15q contains three partial gene duplications. Hum. Mol. Genet. 7,
1253–1260 (1998).
226. Trask, B. J. et al. Members of the olfactory receptor gene family are contained in large blocks of DNA
duplicated polymorphically near the ends of human chromosomes. Hum. Mol. Genet. 7, 13–26
(1998).
227. Trask, B. J. et al. Large multi-chromosomal duplications encompass many members of the olfactory
receptor gene family in the human genome. Hum. Mol. Genet. 7, 2007–2020 (1998).
228. van Deutekom, J. C. et al. Identification of the first gene (FRG1) from the FSHD region on human
chromosome 4q35. Hum. Mol. Genet. 5, 581–590 (1996).
229. Zachau, H. G. The immunoglobulin kappa locus – or – what has been learned from looking closely
at one-tenth of a percent of the human genome. Gene 135, 167–173 (1993).
230. Zimonjic, D. B., Kelley, M. J., Rubin, J. S., Aaronson, S. A. & Popescu, N. C. Fluorescence in situ
hybridization analysis of keratinocyte growth factor gene amplification and dispersion in evolution
of great apes and humans. Proc. Natl Acad. Sci. USA 94, 11461–11465 (1997).
231. van Geel, M. et al. The FSHD region on human chromosome 4q35 contains potential coding regions
among pseudogenes and a high density of repeat elements. Genomics 61, 55–65 (1999).
232. Horvath, J. E. et al. Molecular structure and evolution of an alpha satellite/non-alpha satellite
junction at 16p11. Hum. Mol. Genet. 9, 113–123 (2000).
233. Guy, J. et al. Genomic sequence and transcriptional profile of the boundary between pericentromeric
satellites and genes on human chromosome arm 10q. Hum. Mol. Genet. 9, 2029–2042 (2000).
234. Reiter, L. T., Murakami, T., Koeuth, T., Gibbs, R. A. & Lupski, J. R. The human COX10 gene is
disrupted during homologous recombination between the 24 kb proximal and distal CMT1A-REPs.
Hum. Mol. Genet. 6, 1595–1603 (1997).
235. Amos-Landgraf, J. M. et al. Chromosome breakage in the Prader-Willi and Angelman syndromes
involves recombination between large, transcribed repeats at proximal and distal breakpoints. Am. J.
Hum. Genet. 65, 370–386 (1999).
236. Christian, S. L., Fantes, J. A., Mewborn, S. K., Huang, B. & Ledbetter, D. H. Large genomic duplicons
map to sites of instability in the Prader-Willi/Angelman syndrome chromosome region (15q11-
q13). Hum. Mol. Genet. 8, 1025–1037 (1999).
237. Edelmann, L., Pandita, R. K. & Morrow, B. E. Low-copy repeats mediate the common 3-Mb deletion
in patients with velo-cardio-facial syndrome. Am. J. Hum. Genet. 64, 1076–1086 (1999).
238. Shaikh, T. H. et al. Chromosome 22-specific low copy repeats and the 22q11.2 deletion syndrome:
genomic organization and deletion endpoint analysis. Hum. Mol. Genet. 9, 489–501 (2000).
239. Francke, U. Williams-Beuren syndrome: genes and mechanisms. Hum. Mol. Genet. 8, 1947–1954
(1999).
240. Peoples, R. et al. A physical map, including a BAC/PAC clone contig, of the Williams-Beuren
syndrome-deletion region at 7q11.23. Am. J. Hum. Genet. 66, 47–68 (2000).
241. Eichler, E. E., Archidiacono, N. & Rocchi, M. CAGGG repeats and the pericentromeric duplication
of the hominoid genome. Genome Res. 9, 1048–1058 (1999).
242. O'Keefe, C. & Eichler, E. in Comparative Genomics: Empirical and Analytical Approaches to Gene
Order Dynamics, Map Alignment and the Evolution of Gene Families (eds Sankoff, D. & Nadeau, J.)
29–46 (Kluwer Academic, Dordrecht, 2000).
243. Lander, E. S. The new genomics: Global views of biology. Science 274, 536–539 (1996).
244. Eddy, S. R. Noncoding RNA genes. Curr. Opin. Genet. Dev. 9, 695–699 (1999).
245. Ban, N., Nissen, P., Hansen, J., Moore, P. B. & Steitz, T. A. The complete atomic structure of the large
ribosomal subunit at 2.4 angstrom resolution. Science 289, 905–920 (2000).
246. Nissen, P., Hansen, J., Ban, N., Moore, P. B. & Steitz, T. A. The structural basis of ribosome activity in
peptide bond synthesis. Science 289, 920–930 (2000).
247. Weinstein, L. B. & Steitz, J. A. Guided tours: from precursor snoRNA to functional snoRNP. Curr.
Opin. Cell Biol. 11, 378–384 (1999).
248. Bachellerie, J.-P. & Cavaille, J. in Modification and Editing of RNA (eds Grosjean, H. & Benne, R.) 255–272
(ASM, Washington DC, 1998).
249. Burge, C. & Sharp, P. A. Classification of introns: U2-type or U12-type. Cell 91, 875–879 (1997).
250. Brown, C. J. et al. The Human Xist gene – analysis of a 17 kb inactive X-specific RNA that contains
conserved repeats and is highly localized within the nucleus. Cell 71, 527–542 (1992).
251. Kickhoefer, V. A., Vasu, S. K. & Rome, L. H. Vaults are the answer, what is the question? Trends Cell
Biol. 6, 174–178 (1996).
252. Hatlen, L. & Attardi, G. Proportion of the HeLa cell genome complementary to the transfer RNA and
5S RNA. J. Mol. Biol. 56, 535–553 (1971).
253. Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A. & Steinberg, S. Compilation of tRNA sequences
and sequences of tRNA genes. Nucleic Acids Res. 26, 148–153 (1998).
254. Long, E. O. & Dawid, I. B. Repeated genes in eukaryotes. Annu. Rev. Biochem. 49, 727–764 (1980).
255. Crick, F. H. Codon–anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19, 548–555 (1966).
256. Guthrie, C. & Abelson, J. in The Molecular Biology of the Yeast Saccharomyces: Metabolism and Gene
Expression (eds Strathern, J. & Broach, J.) 487–528 (Cold Spring Harbor Laboratory Press,
Cold Spring Harbor, New York, 1982).
257. Soll, D. & RajBhandary, U. (eds) tRNA: Structure, Biosynthesis, and Function (ASM, Washington DC,
1995).
258. Ikemura, T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol.
Evol. 2, 13–34 (1985).
259. Bulmer, M. Coevolution of codon usage and transfer-RNA abundance. Nature 325, 728–730 (1987).
260. Duret, L. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal
translation of highly expressed genes. Trends Genet. 16, 287–289 (2000).
261. Sharp, P. M. & Matassi, G. Codon usage and genome evolution. Curr. Opin. Genet. Dev. 4, 851–860
(1994).
262. Buckland, R. A. A primate transfer-RNA gene cluster and the evolution of human chromosome 1.
Cytogenet. Cell Genet. 61, 1–4 (1992).
263. Gonos, E. S. & Goddard, J. P. Human tRNA-Glu genes: their copy number and organization. FEBS
Lett. 276, 138–142 (1990).
264. Sylvester, J. E. et al. The human ribosomal RNA genes: structure and organization of the complete
repeating unit. Hum. Genet. 73, 193–198 (1986).
265. Sorensen, P. D. & Frederiksen, S. Characterization of human 5S ribosomal RNA genes. Nucleic Acids
Res. 19, 4147–4151 (1991).
266. Timofeeva, M. et al. [Organization of a 5S ribosomal RNA gene cluster in the human genome].
Mol. Biol. (Mosk.) 27, 861–868 (1993).
267. Little, R. D. & Braaten, D. C. Genomic organization of human 5S rDNA and sequence of one tandem
repeat. Genomics 4, 376–383 (1989).
268. Maden, B. E. H. The numerous modified nucleotides in eukaryotic ribosomal RNA. Prog. Nucleic
Acid Res. Mol. Biol. 39, 241–303 (1990).
269. Tycowski, K. T., You, Z. H., Graham, P. J. & Steitz, J. A. Modification of U6 spliceosomal RNA is
guided by other small RNAs. Mol. Cell 2, 629–638 (1998).
270. Pavelitz, T., Liao, D. Q. & Weiner, A. M. Concerted evolution of the tandem array encoding primate
U2 snRNA (the RNU2 locus) is accompanied by dramatic remodeling of the junctions with flanking
chromosomal sequences. EMBO J. 18, 3783–3792 (1999).
271. Lindgren, V., Ares, A., Weiner, A. M. & Francke, U. Human genes for U2 small nuclear RNA map to a
major adenovirus 12 modification site on chromosome 17. Nature 314, 115–116 (1985).
272. Van Arsdell, S. W. & Weiner, A. M. Human genes for U2 small nuclear RNA are tandemly repeated.
Mol. Cell. Biol. 4, 492–499 (1984).
273. Gao, L. I., Frey, M. R. & Matera, A. G. Human genes encoding U3 snRNA associate with coiled
bodies in interphase cells and are clustered on chromosome 17p11.2 in a complex inverted repeat
structure. Nucleic Acids Res. 25, 4740–4747 (1997).
274. Hawkins, J. D. A survey on intron and exon lengths. Nucleic Acids Res. 16, 9893–9908 (1988).
275. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol.
268, 78–94 (1997).
276. Labeit, S. & Kolmerer, B. Titins: giant proteins in charge of muscle ultrastructure and elasticity.
Science 270, 293–296 (1995).
277. Sterner, D. A., Carlo, T. & Berget, S. M. Architectural limits on split genes. Proc. Natl Acad. Sci. USA
93, 15081–15085 (1996).
278. Sun, Q., Mayeda, A., Hampson, R. K., Krainer, A. R. & Rottman, F. M. General splicing factor SF2/
ASF promotes alternative splicing by binding to an exonic splicing enhancer. Genes Dev. 7,
2598–2608 (1993).
279. Tanaka, K., Watakabe, A. & Shimura, Y. Polypurine sequences within a downstream exon function as
a splicing enhancer. Mol. Cell. Biol. 14, 1347–1354 (1994).
280. Carlo, T., Sterner, D. A. & Berget, S. M. An intron splicing enhancer containing a G-rich repeat
facilitates inclusion of a vertebrate micro-exon. RNA 2, 342–353 (1996).
281. Burset, M., Seledtsov, I. A. & Solovyev, V. V. Analysis of canonical and non-canonical splice sites in
mammalian genomes. Nucleic Acids Res. 28, 4364–4375 (2000).
282. Burge, C. B., Padgett, R. A. & Sharp, P. A. Evolutionary fates and origins of U12-type introns.
Mol. Cell 2, 773–785 (1998).
283. Mironov, A. A., Fickett, J. W. & Gelfand, M. S. Frequent alternative splicing of human genes. Genome
Res. 9, 1288–1293 (1999).
284. Hanke, J. et al. Alternative splicing of human genes: more the rule than the exception? Trends Genet.
15, 389–390 (1999).
285. Brett, D. et al. EST comparison indicates 38% of human mRNAs contain possible alternative splice
forms. FEBS Lett. 474, 83–86 (2000).
286. Dunham, I. The gene guessing game. Yeast 17, 218–224 (2000).
287. Lewin, B. Gene Expression (Wiley, New York, 1980).
288. Lewin, B. Genes IV 466–481 (Oxford Univ. Press, Oxford, 1990).
289. Smaglik, P. Researchers take a gamble on the human genome. Nature 405, 264 (2000).
290. Fields, C., Adams, M. D., White, O. & Venter, J. C. How many genes in the human genome? Nature
Genet. 7, 345–346 (1994).
291. Liang, F. et al. Gene index analysis of the human genome estimates approximately 120,000 genes.
Nature Genet. 25, 239–240 (2000).
292. Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide analysis using
Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235–238 (2000).
293. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: A platform
for investigating biology. Science 282, 2012–2018 (1998).
294. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
295. Green, P. et al. Ancient conserved regions in new gene sequences and the protein databases. Science
259, 1711–1716 (1993).
296. Fraser, A. G. et al. Functional genomic analysis of C. elegans chromosome I by systematic RNA
interference. Nature 408, 325–330 (2000).
297. Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA.
Comput. Appl. Biosci. 13, 477–478 (1997).
298. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. & Miller, W. A computer program for aligning a
cDNA sequence with a genomic DNA sequence. Genome Res. 8, 967±974 (1998).
299. Bailey, L. C. Jr, Searls, D. B. & Overton, G. C. Analysis of EST-driven gene annotation in human
genomic sequence. Genome Res. 8, 362±376 (1998).
300. Birney, E., Thompson, J. D. & Gibson, T. J. PairWise and SearchWise: ®nding the optimal alignment
in a simultaneous comparison of a protein pro®le against all DNA translation frames. Nucleic Acids
Res. 24, 2730±2739 (1996).
301. Gelfand, M. S., Mironov, A. A. & Pevzner, P. A. Gene recognition via spliced sequence alignment.
Proc. Natl Acad. Sci. USA 93, 9061±9066 (1996).
302. Kulp, D., Haussler, D., Reese, M. G. & Eeckman, F. H. A generalized hidden Markov model for the
recognition of human genes in DNA. ISMB 4, 134±142 (1996).
303. Reese, M. G., Kulp, D., Tammana, H. & Haussler, D. GenieÐgene ®nding in Drosophila
melanogaster. Genome Res. 10, 529±538 (2000).
304. Solovyev, V. & Salamov, A. The Gene-Finder computer tools for analysis of human and model
organisms genome sequences. ISMB 5, 294–302 (1997).
305. Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of gene prediction
accuracy in large DNA sequences. Genome Res. 10, 1631–1642 (2000).
306. Hubbard, T. & Birney, E. Open annotation offers a democratic solution to genome sequencing.
Nature 403, 825 (2000).
307. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 28, 263–266 (2000).
308. Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10,
547–548 (2000).
309. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium.
Functional annotation of a full-length mouse cDNA collection. Nature 409, 685–690 (2001).
310. Basrai, M. A., Hieter, P. & Boeke, J. D. Small open reading frames: beautiful needles in the haystack.
Genome Res. 7, 768–771 (1997).
311. Janin, J. & Chothia, C. Domains in proteins: definitions, location, and structural principles. Methods
Enzymol. 115, 420–430 (1985).
312. Ponting, C. P., Schultz, J., Copley, R. R., Andrade, M. A. & Bork, P. Evolution of domain families.
Adv. Protein Chem. 54, 185–244 (2000).
313. Doolittle, R. F. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64, 287–314 (1995).
314. Bateman, A. & Birney, E. Searching databases to find protein domain organization. Adv. Protein
Chem. 54, 137–157 (2000).
315. Futreal, P. A. et al. Cancer and genomics. Nature 409, 850–852 (2001).
316. Nestler, E. J. & Landsman, D. Learning about addiction from the human draft genome. Nature 409,
834–835 (2001).
317. Tupler, R., Perini, G. & Green, M. R. Expressing the human genome. Nature 409, 832–835 (2001).
318. Fahrer, A. M., Bazan, J. F., Papathanasiou, P., Nelms, K. A. & Goodnow, C. C. A genomic view of
immunology. Nature 409, 836–838 (2001).
319. Li, W.-H., Gu, Z., Wang, H. & Nekrutenko, A. Evolutionary analyses of the human genome. Nature
409, 847–849 (2001).
320. Bock, J. B., Matern, H. T., Peden, A. A. & Scheller, R. H. A genomic perspective on membrane
compartment organization. Nature 409, 839–841 (2001).
321. Pollard, T. D. Genomics, the cytoskeleton and motility. Nature 409, 842–843 (2001).
322. Murray, A. W. & Marks, D. Can sequencing shed light on cell cycling? Nature 409, 844–846 (2001).
323. Clayton, J. D., Kyriacou, C. P. & Reppert, S. M. Keeping time with the human genome. Nature 409,
829–831 (2001).
324. Chervitz, S. A. et al. Comparison of the complete protein sets of worm and yeast: orthology and
divergence. Science 282, 2022–2028 (1998).
325. Aravind, L. & Subramanian, G. Origin of multicellular eukaryotes – insights from proteome
comparisons. Curr. Opin. Genet. Dev. 9, 688–694 (1999).
326. Attwood, T. K. et al. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28,
225–227 (2000).
327. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status in 1999. Nucleic
Acids Res. 27, 215–219 (1999).
328. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389–3402 (1997).
329. Wolf, Y. I., Kondrashov, F. A. & Koonin, E. V. No footprints of primordial introns in a eukaryotic
genome. Trends Genet. 16, 333–334 (2000).
330. Brunner, H. G., Nelen, M., Breakefield, X. O., Ropers, H. H. & van Oost, B. B. A. Abnormal behavior
associated with a point mutation in the structural gene for monoamine oxidase A. Science 262,
578–580 (1993).
331. Cases, O. et al. Aggressive behavior and altered amounts of brain serotonin and norepinephrine in
mice lacking MAOA. Science 268, 1763–1766 (1995).
332. Brunner, H. G. et al. X-linked borderline mental retardation with prominent behavioral disturbance:
phenotype, genetic localization, and evidence for disturbed monoamine metabolism. Am. J. Hum.
Genet. 52, 1032–1039 (1993).
333. Deckert, J. et al. Excess of high activity monoamine oxidase A gene promoter alleles in female
patients with panic disorder. Hum. Mol. Genet. 8, 621–624 (1999).
334. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147,
195–197 (1981).
335. Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science 278,
631–637 (1997).
336. Ponting, C. P., Aravind, L., Schultz, J., Bork, P. & Koonin, E. V. Eukaryotic signalling domain
homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer. J. Mol. Biol. 289,
729–745 (1999).
337. Zhang, J., Dyer, K. D. & Rosenberg, H. F. Evolution of the rodent eosinophil-associated RNase gene
family by rapid gene sorting and positive selection. Proc. Natl Acad. Sci. USA 97, 4701–4706 (2000).
338. Shashoua, V. E. Ependymin, a brain extracellular glycoprotein, and CNS plasticity. Ann. NY Acad.
Sci. 627, 94–114 (1991).
339. Schultz, J., Copley, R. R., Doerks, T., Ponting, C. P. & Bork, P. SMART: a web-based tool for the study
of genetically mobile domains. Nucleic Acids Res. 28, 231–234 (2000).
340. Koonin, E. V., Aravind, L. & Kondrashov, A. S. The impact of comparative genomics on our
understanding of evolution. Cell 101, 573–576 (2000).
341. Bateman, A., Eddy, S. R. & Chothia, C. Members of the immunoglobulin superfamily in bacteria.
Protein Sci. 5, 1939–1941 (1996).
342. Sutherland, D., Samakovlis, C. & Krasnow, M. A. Branchless encodes a Drosophila FGF homolog
that controls tracheal cell migration and the pattern of branching. Cell 87, 1091–1101 (1996).
343. Warburton, D. et al. The molecular basis of lung morphogenesis. Mech. Dev. 92, 55–81 (2000).
344. Fuchs, T., Glusman, G., Horn-Saban, S., Lancet, D. & Pilpel, Y. The human olfactory subgenome:
from sequence to structure to evolution. Hum. Genet. 108, 1–13 (2001).
345. Glusman, G. et al. The olfactory receptor gene family: data mining, classification and nomenclature.
Mamm. Genome 11, 1016–1023 (2000).
346. Rouquier, S. et al. Distribution of olfactory receptor genes in the human genome. Nature Genet. 18,
243–250 (1998).
347. Sharon, D. et al. Primate evolution of an olfactory receptor cluster: Diversification by gene
conversion and recent emergence of a pseudogene. Genomics 61, 24–36 (1999).
348. Gilad, Y. et al. Dichotomy of single-nucleotide polymorphism haplotypes in olfactory receptor genes
and pseudogenes. Nature Genet. 26, 221–224 (2000).
349. Gearhart, J. & Kirschner, M. Cells, Embryos, and Evolution (Blackwell Science, Malden,
Massachusetts, 1997).
350. Barbazuk, W. B. et al. The syntenic relationship of the zebrafish and human genomes. Genome Res.
10, 1351–1358 (2000).
351. McLysaght, A., Enright, A. J., Skrabanek, L. & Wolfe, K. H. Estimation of synteny conservation and
genome compaction between pufferfish (Fugu) and human. Yeast 17, 22–36 (2000).
352. Trachtulec, Z. et al. Linkage of TATA-binding protein and proteasome subunit C5 genes in mice and
humans reveals synteny conserved between mammals and invertebrates. Genomics 44, 1–7 (1997).
articles
918 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com© 2001 Macmillan Magazines Ltd
353. Nadeau, J. H. Maps of linkage and synteny homologies between mouse and man. Trends Genet. 5,
82–86 (1989).
354. Nadeau, J. H. & Taylor, B. A. Lengths of chromosomal segments conserved since divergence of man
and mouse. Proc. Natl Acad. Sci. USA 81, 814–818 (1984).
355. Copeland, N. G. et al. A genetic linkage map of the mouse: current applications and future prospects.
Science 262, 57–66 (1993).
356. DeBry, R. W. & Seldin, M. F. Human/mouse homology relationships. Genomics 33, 337–351 (1996).
357. Nadeau, J. H. & Sankoff, D. The lengths of undiscovered conserved segments in comparative maps.
Mamm. Genome 9, 491–495 (1998).
358. Thomas, J. W. et al. Comparative genome mapping in the sequence-based era: early experience with
human chromosome 7. Genome Res. 10, 624–633 (2000).
359. Pletcher, M. T. et al. Chromosome evolution: The junction of mammalian chromosomes in the
formation of mouse chromosome 10. Genome Res. 10, 1463–1467 (2000).
360. Novacek, M. J. Mammalian phylogeny: shaking the tree. Nature 356, 121–125 (1992).
361. O'Brien, S. J. et al. Genome maps 10. Comparative genomics. Mammalian radiations. Wall chart.
Science 286, 463–478 (1999).
362. Romer, A. S. Vertebrate Paleontology (Univ. Chicago Press, Chicago and New York, 1966).
363. Paterson, A. H. et al. Toward a unified genetic map of higher plants, transcending the monocot-dicot
divergence. Nature Genet. 14, 380–382 (1996).
364. Jenczewski, E., Prosperi, J. M. & Ronfort, J. Differentiation between natural and cultivated
populations of Medicago sativa (Leguminosae) from Spain: analysis with random amplified
polymorphic DNA (RAPD) markers and comparison to allozymes. Mol. Ecol. 8, 1317–1330 (1999).
365. Ohno, S. Evolution by Gene Duplication (George Allen and Unwin, London, 1970).
366. Wolfe, K. H. & Shields, D. C. Molecular evidence for an ancient duplication of the entire yeast
genome. Nature 387, 708–713 (1997).
367. Blanc, G., Barakat, A., Guyot, R., Cooke, R. & Delseny, M. Extensive duplication and reshuffling in
the Arabidopsis genome. Plant Cell 12, 1093–1102 (2000).
368. Paterson, A. H. et al. Comparative genomics of plant chromosomes. Plant Cell 12, 1523–1540 (2000).
369. Vision, T., Brown, D. & Tanksley, S. The origins of genome duplications in Arabidopsis. Science 290,
2114–2117 (2000).
370. Sidow, A. & Bowman, B. H. Molecular phylogeny. Curr. Opin. Genet. Dev. 1, 451–456 (1991).
371. Sidow, A. & Thomas, W. K. A molecular evolutionary framework for eukaryotic model organisms.
Curr. Biol. 4, 596–603 (1994).
372. Sidow, A. Gen(om)e duplications in the evolution of early vertebrates. Curr. Opin. Genet. Dev. 6,
715–722 (1996).
373. Spring, J. Vertebrate evolution by interspecific hybridisation – are we polyploid? FEBS Lett. 400, 2–8
(1997).
374. Skrabanek, L. & Wolfe, K. H. Eukaryote genome duplication – where's the evidence? Curr. Opin.
Genet. Dev. 8, 694–700 (1998).
375. Hughes, A. L. Phylogenies of developmentally important proteins do not support the hypothesis of
two rounds of genome duplication early in vertebrate history. J. Mol. Evol. 48, 565–576 (1999).
376. Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).
377. Horikawa, Y. et al. Genetic variability in the gene encoding calpain-10 is associated with type 2
diabetes mellitus. Nature Genet. 26, 163–175 (2000).
378. Hastbacka, J. et al. The diastrophic dysplasia gene encodes a novel sulfate transporter: positional
cloning by fine-structure linkage disequilibrium mapping. Cell 78, 1073–1087 (1994).
379. Tischkoff, S. A. et al. Global patterns of linkage disequilibrium at the CD4 locus and modern human
origins. Science 271, 1380–1387 (1996).
380. Kidd, J. R. et al. Haplotypes and linkage disequilibrium at the phenylalanine hydroxylase locus PAH,
in a global representation of populations. Am. J. Hum. Genet. 63, 1882–1899 (2000).
381. Mateu, E. et al. Worldwide genetic analysis of the CFTR region. Am. J. Hum. Genet. 68, 103–117
(2001).
382. Abecasis, G. R. et al. Extent and distribution of linkage disequilibrium in three genomic regions.
Am. J. Hum. Genet. 68, 191–197 (2001).
383. Taillon-Miller, P. et al. Juxtaposed regions of extensive and minimal linkage disequilibrium in Xq25
and Xq28. Nature Genet. 25, 324–328 (2000).
384. Martin, E. R. et al. SNPing away at complex diseases: analysis of single-nucleotide polymorphisms
around APOE in Alzheimer disease. Am. J. Hum. Genet. 67, 383–394 (2000).
385. Collins, A., Lonjou, C. & Morton, N. E. Genetic epidemiology of single-nucleotide polymorphisms.
Proc. Natl Acad. Sci. USA 96, 15173–15177 (1999).
386. Dunning, A. M. et al. The extent of linkage disequilibrium in four populations with distinct
demographic histories. Am. J. Hum. Genet. 67, 1544–1554 (2000).
387. Rieder, M. J., Taylor, S. L., Clark, A. G. & Nickerson, D. A. Sequence variation in the human
angiotensin converting enzyme. Nature Genet. 22, 59–62 (1999).
388. Collins, F. S. Positional cloning moves from perditional to traditional. Nature Genet. 9, 347–350
(1995).
389. Nagamine, K. et al. Positional cloning of the APECED gene. Nature Genet. 17, 393–398 (1997).
390. Reuber, B. E. et al. Mutations in PEX1 are the most common cause of peroxisome biogenesis
disorders. Nature Genet. 17, 445–448 (1997).
391. Portsteffen, H. et al. Human PEX1 is mutated in complementation group 1 of the peroxisome
biogenesis disorders. Nature Genet. 17, 449–452 (1997).
392. Everett, L. A. et al. Pendred syndrome is caused by mutations in a putative sulphate transporter gene
(PDS). Nature Genet. 17, 411–422 (1997).
393. Coffey, A. J. et al. Host response to EBV infection in X-linked lymphoproliferative disease results
from mutations in an SH2-domain encoding gene. Nature Genet. 20, 129–135 (1998).
394. Van Laer, L. et al. Nonsyndromic hearing impairment is associated with a mutation in DFNA5.
Nature Genet. 20, 194–197 (1998).
395. Sakuntabhai, A. et al. Mutations in ATP2A2, encoding a Ca2+ pump, cause Darier disease. Nature
Genet. 21, 271–277 (1999).
396. Gedeon, A. K. et al. Identification of the gene (SEDL) causing X-linked spondyloepiphyseal
dysplasia tarda. Nature Genet. 22, 400–404 (1999).
397. Hurvitz, J. R. et al. Mutations in the CCN gene family member WISP3 cause progressive
pseudorheumatoid dysplasia. Nature Genet. 23, 94–98 (1999).
398. Laberge-le Couteulx, S. et al. Truncating mutations in CCM1, encoding KRIT1, cause hereditary
cavernous angiomas. Nature Genet. 23, 189–193 (1999).
399. Sahoo, T. et al. Mutations in the gene encoding KRIT1, a Krev-1/rap1a binding protein, cause
cerebral cavernous malformations (CCM1). Hum. Mol. Genet. 8, 2325–2333 (1999).
400. McGuirt, W. T. et al. Mutations in COL11A2 cause non-syndromic hearing loss (DFNA13). Nature
Genet. 23, 413–419 (1999).
401. Moreira, E. S. et al. Limb-girdle muscular dystrophy type 2G is caused by mutations in the gene
encoding the sarcomeric protein telethonin. Nature Genet. 24, 163–166 (2000).
402. Ruiz-Perez, V. L. et al. Mutations in a new gene in Ellis-van Creveld syndrome and Weyers acrodental
dysostosis. Nature Genet. 24, 283–286 (2000).
403. Kaplan, J. M. et al. Mutations in ACTN4, encoding alpha-actinin-4, cause familial focal segmental
glomerulosclerosis. Nature Genet. 24, 251–256 (2000).
404. Escayg, A. et al. Mutations of SCN1A, encoding a neuronal sodium channel, in two families with
GEFS+2. Nature Genet. 24, 343–345 (2000).
405. Sacksteder, K. A. et al. Identification of the alpha-aminoadipic semialdehyde synthase gene, which is
defective in familial hyperlysinemia. Am. J. Hum. Genet. 66, 1736–1743 (2000).
406. Kalaydjieva, L. et al. N-myc downstream-regulated gene 1 is mutated in hereditary motor and
sensory neuropathy-Lom. Am. J. Hum. Genet. 67, 47–58 (2000).
407. Sundin, O. H. et al. Genetic basis of total colourblindness among the Pingelapese islanders. Nature
Genet. 25, 289–293 (2000).
408. Kohl, S. et al. Mutations in the CNGB3 gene encoding the beta-subunit of the cone photoreceptor
cGMP-gated channel are responsible for achromatopsia (ACHM3) linked to chromosome 8q21.
Hum. Mol. Genet. 9, 2107–2116 (2000).
409. Avela, K. et al. Gene encoding a new RING-B-box-coiled-coil protein is mutated in mulibrey
nanism. Nature Genet. 25, 298–301 (2000).
410. Verpy, E. et al. A defect in harmonin, a PDZ domain-containing protein expressed in the inner ear
sensory hair cells, underlies Usher syndrome type 1C. Nature Genet. 26, 51–55 (2000).
411. Bitner-Glindzicz, M. et al. A recessive contiguous gene deletion causing infantile hyperinsulinism,
enteropathy and deafness identifies the Usher type 1C gene. Nature Genet. 26, 56–60 (2000).
412. The May-Hegglin/Fechtner Syndrome Consortium. Mutations in MYH9 result in the May-Hegglin
anomaly, and Fechtner and Sebastian syndromes. Nature Genet. 26, 103–105 (2000).
413. Kelley, M. J., Jawien, W., Ortel, T. L. & Korczak, J. F. Mutation of MYH9, encoding non-muscle
myosin heavy chain A, in May-Hegglin anomaly. Nature Genet. 26, 106–108 (2000).
414. Kirschner, L. S. et al. Mutations of the gene encoding the protein kinase A type I-a regulatory
subunit in patients with the Carney complex. Nature Genet. 26, 89–92 (2000).
415. Lalwani, A. K. et al. Human nonsyndromic hereditary deafness DFNA17 is due to a mutation in
non-muscle myosin MYH9. Am. J. Hum. Genet. 67, 1121–1128 (2000).
416. Matsuura, T. et al. Large expansion of the ATTCT pentanucleotide repeat in spinocerebellar ataxia
type 10. Nature Genet. 26, 191–194 (2000).
417. Delettre, C. et al. Nuclear gene OPA1, encoding a mitochondrial dynamin-related protein, is
mutated in dominant optic atrophy. Nature Genet. 26, 207–210 (2000).
418. Pusch, C. M. et al. The complete form of X-linked congenital stationary night blindness is caused by
mutations in a gene encoding a leucine-rich repeat protein. Nature Genet. 26, 324–327 (2000).
419. The ADHR Consortium. Autosomal dominant hypophosphataemic rickets is associated with
mutations in FGF23. Nature Genet. 26, 345–348 (2000).
420. Bomont, P. et al. The gene encoding gigaxonin, a new member of the cytoskeletal BTB/kelch repeat
family, is mutated in giant axonal neuropathy. Nature Genet. 26, 370–374 (2000).
421. Tullio-Pelet, A. et al. Mutant WD-repeat protein in triple-A syndrome. Nature Genet. 26, 332–335
(2000).
422. Nicole, S. et al. Perlecan, the major proteoglycan of basement membranes, is altered in patients with
Schwartz-Jampel syndrome (chondrodystrophic myotonia). Nature Genet. 26, 480–483 (2000).
423. Rogaev, E. I. et al. Familial Alzheimer's disease in kindreds with missense mutations in a gene on
chromosome 1 related to the Alzheimer's disease type 3 gene. Nature 376, 775–778 (1995).
424. Sherrington, R. et al. Cloning of a gene bearing missense mutations in early-onset familial
Alzheimer's disease. Nature 375, 754–760 (1995).
425. Olivieri, N. F. & Weatherall, D. J. The therapeutic reactivation of fetal haemoglobin. Hum. Mol.
Genet. 7, 1655–1658 (1998).
426. Drews, J. Research & development. Basic science and pharmaceutical innovation. Nature Biotechnol.
17, 406 (1999).
427. Drews, J. Drug discovery: a historical perspective. Science 287, 1960–1964 (2000).
428. Davies, P. A. et al. The 5-HT3B subunit is a major determinant of serotonin-receptor function.
Nature 397, 359–363 (1999).
429. Heise, C. E. et al. Characterization of the human cysteinyl leukotriene 2 receptor. J. Biol. Chem. 275,
30531–30536 (2000).
430. Fan, W. et al. BACE maps to chromosome 11 and a BACE homolog, BACE2, reside in the obligate
Down Syndrome region of chromosome 21. Science 286, 1255a (1999).
431. Saunders, A. J., Kim, T.-W. & Tanzi, R. E. BACE maps to chromosome 11 and a BACE homolog,
BACE2, reside in the obligate Down Syndrome region of chromosome 21. Science 286, 1255a (1999).
432. Firestein, S. The good taste of genomics. Nature 404, 552–553 (2000).
433. Matsunami, H., Montmayeur, J. P. & Buck, L. B. A family of candidate taste receptors in human and
mouse. Nature 404, 601–604 (2000).
434. Adler, E. et al. A novel family of mammalian taste receptors. Cell 100, 693–702 (2000).
435. Chandrashekar, J. et al. T2Rs function as bitter taste receptors. Cell 100, 703–711 (2000).
436. Hardison, R. C. Conserved non-coding sequences are reliable guides to regulatory elements. Trends
Genet. 16, 369–372 (2000).
437. Onyango, P. et al. Sequence and comparative analysis of the mouse 1-megabase region orthologous
to the human 11p15 imprinted domain. Genome Res. 10, 1697–1710 (2000).
438. Bouck, J. B., Metzker, M. L. & Gibbs, R. A. Shotgun sample sequence comparisons between mouse
and human genomes. Nature Genet. 25, 31–33 (2000).
439. Marshall, E. Public-private project to deliver mouse genome in 6 months. Science 290, 242–243 (2000).
440. Wasserman, W. W., Palumbo, M., Thompson, W., Fickett, J. W. & Lawrence, C. E. Human-mouse
genome comparisons to locate regulatory sites. Nature Genet. 26, 225–228 (2000).
441. Tagle, D. A. et al. Embryonic epsilon and gamma globin genes of a prosimian primate (Galago
crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic
footprints. J. Mol. Biol. 203, 439–455 (1988).
442. McGuire, A. M., Hughes, J. D. & Church, G. M. Conservation of DNA regulatory motifs and
discovery of new motifs in microbial genomes. Genome Res. 10, 744–757 (2000).
articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 919© 2001 Macmillan Magazines Ltd
443. Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. Finding DNA regulatory motifs within
unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnol.
16, 939–945 (1998).
444. Cheng, Y. & Church, G. M. Biclustering of expression data. ISMB 8, 93–103 (2000).
445. Cohen, B. A., Mitra, R. D., Hughes, J. D. & Church, G. M. A computational analysis of whole-genome
expression data reveals chromosomal domains of gene expression. Nature Genet. 26, 183–186 (2000).
446. Feil, R. & Khosla, S. Genomic imprinting in mammals: an interplay between chromatin and DNA
methylation? Trends Genet. 15, 431–434 (1999).
447. Robertson, K. D. & Wolffe, A. P. DNA methylation in health and disease. Nature Rev. Genet. 1, 11–19
(2000).
448. Beck, S., Olek, A. & Walter, J. From genomics to epigenomics: a loftier view of life. Nature Biotechnol.
17, 1144 (1999).
449. Hagmann, M. Mapping a subtext in our genetic book. Science 288, 945–946 (2000).
450. Eliot, T. S. in T. S. Eliot. Collected Poems 1909–1962 (Harcourt Brace, New York, 1963).
451. Soderlund, C., Longden, I. & Mott, R. FPC: a system for building contigs from restriction
fingerprinted clones. Comput. Appl. Biosci. 13, 523–535 (1997).
452. Mott, R. & Tribe, R. Approximate statistics of gapped alignments. J. Comp. Biol. 6, 91–112 (1999).
Supplementary Information is available on Nature's World-Wide Web site
(http://www.nature.com) or as paper copy from the London editorial office of Nature.
Acknowledgements
Beyond the authors, many people contributed to the success of this work. E. Jordan
provided helpful advice throughout the sequencing effort. We thank D. Leja and
J. Shehadeh for their expert assistance on the artwork in this paper, especially the foldout
figure; K. Jegalian for editorial assistance; J. Schloss, E. Green and M. Seldin for comments
on an earlier version of the manuscript; P. Green and F. Ouelette for critiques of the
submitted version; C. Caulcott, A. Iglesias, S. Renfrey, B. Skene and J. Stewart of the
Wellcome Trust, P. Whittington and T. Dougans of NHGRI and M. Meugnier of
Genoscope for staff support for meetings of the international consortium; and the
University of Pennsylvania for facilities for a meeting of the genome analysis group.
We thank Compaq Computer Corporation's High Performance Technical Computing
Group for providing a Compaq Biocluster (a 27-node configuration of AlphaServer ES40s,
containing 108 CPUs, serving as compute nodes and a file server with one terabyte of
secondary storage) to assist in the annotation and analysis. Compaq provided the systems
and implementation services to set up and manage the cluster for continuous use by
members of the sequencing consortium. Platform Computing Ltd. provided its LSF
scheduling and loadsharing software without license fee.
In addition to the data produced by the members of the International Human Genome
Sequencing Consortium, the draft genome sequence includes published and unpublished
human genomic sequence data from many other groups, all of whom gave permission to
include their unpublished data. Four of the groups that contributed particularly
significant amounts of data were: M. Adams et al. of the Institute for Genomic Research;
E. Chen et al. of the Center for Genetic Medicine and Applied Biosystems; S.-F. Tsai of
National Yang-Ming University, Institute of Genetics, Taipei, Taiwan, Republic of China; and
Y. Nakamura, K. Koyama et al. of the Institute of Medical Science, University of Tokyo,
Human Genome Center, Laboratory of Molecular Medicine, Minato-ku, Tokyo, Japan.
Many other groups provided smaller numbers of database entries. We thank them all; a full
list of the contributors of unpublished sequence is available as Supplementary Information.
This work was supported in part by the National Human Genome Research Institute of
the US NIH; The Wellcome Trust; the US Department of Energy, Office of Biological and
Environmental Research, Human Genome Program; the UK MRC; the Human Genome
Sequencing Project from the Science and Technology Agency (STA) Japan; the Ministry of
Education, Science, Sport and Culture, Japan; the French Ministry of Research; the Federal
German Ministry of Education, Research and Technology (BMBF) through Projektträger
DLR, in the framework of the German Human Genome Project; BEO, Projektträger
Biologie, Energie, Umwelt des BMBF und BMWT; the Max-Planck-Society; DFG,
Deutsche Forschungsgemeinschaft; TMWFK, Thüringer Ministerium für Wissenschaft,
Forschung und Kunst; EC BIOMED2, European Commission, Directorate Science,
Research and Development; Chinese Academy of Sciences (CAS), Ministry of Science and
Technology (MOST), National Natural Science Foundation of China (NSFC); US National
Science Foundation EPSCoR and The SNP Consortium Ltd. Additional support for
members of the Genome Analysis group came, in part, from an ARCS Foundation
Scholarship to T.S.F., a Burroughs Wellcome Foundation grant to C.B.B. and P.A.S., a DFG
grant to P.B., DOE grants to D.H., E.E.E. and T.S.F., an EU grant to P.B., a Marie-Curie
Fellowship to L.C., an NIH-NHGRI grant to S.R.E., an NIH grant to E.E.E., an NIH SBIR to
D.K., an NSF grant to D.H., a Swiss National Science Foundation grant to L.C., the David
and Lucile Packard Foundation, the Howard Hughes Medical Institute, the University of
California at Santa Cruz and the W. M. Keck Foundation.
Correspondence and requests for materials should be addressed to E. S. Lander (e-mail:
lander@genome.wi.mit.edu), R. H. Waterston (e-mail: bwaterst@watson.wustl.edu),
J. Sulston (e-mail: jes@sanger.ac.uk) or F. S. Collins (e-mail: fc23a@nih.gov).
Affiliations for authors: 1, Whitehead Institute for Biomedical Research, Center for
Genome Research, Nine Cambridge Center, Cambridge, Massachusetts 02142,
USA; 2, The Sanger Centre, The Wellcome Trust Genome Campus, Hinxton,
Cambridgeshire CB10 1RQ, United Kingdom; 3, Washington University Genome
Sequencing Center, Box 8501, 4444 Forest Park Avenue, St. Louis, Missouri 63108,
USA; 4, US DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek,
California 94598, USA; 5, Baylor College of Medicine Human Genome
Sequencing Center, Department of Molecular and Human Genetics, One Baylor
Plaza, Houston, Texas 77030, USA; 6, Department of Cellular and Structural
Biology, The University of Texas Health Science Center at San Antonio, 7703 Floyd
Curl Drive, San Antonio, Texas 78229-3900, USA; 7, Department of Molecular
Genetics, Albert Einstein College of Medicine, 1635 Poplar Street, Bronx, New York
10461, USA; 8, Baylor College of Medicine Human Genome Sequencing Center
and the Department of Microbiology & Molecular Genetics, University of Texas
Medical School, PO Box 20708, Houston, Texas 77225, USA; 9, RIKEN Genomic
Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku Yokohama-city, Kanagawa
230-0045, Japan; 10, Genoscope and CNRS UMR-8030, 2 Rue Gaston Cremieux,
CP 5706, 91057 Evry Cedex, France; 11, GTC Sequencing Center, Genome
Therapeutics Corporation, 100 Beaver Street, Waltham, Massachusetts
02453-8443, USA; 12, Department of Genome Analysis, Institute of Molecular
Biotechnology, Beutenbergstrasse 11, D-07745 Jena, Germany; 13, Beijing
Genomics Institute/Human Genome Center, Institute of Genetics, Chinese
Academy of Sciences, Beijing 100101, China; 14, Southern China National
Human Genome Research Center, Shanghai 201203, China; 15, Northern China
National Human Genome Research Center, Beijing 100176, China; 16,
Multimegabase Sequencing Center, The Institute for Systems Biology,
4225 Roosevelt Way, NE Suite 200, Seattle, Washington 98105, USA; 17, Stanford
Genome Technology Center, 855 California Avenue, Palo Alto, California 94304,
USA; 18, Stanford Human Genome Center and Department of Genetics, Stanford
University School of Medicine, Stanford, California 94305-5120, USA;
19, University of Washington Genome Center, 225 Fluke Hall on Mason Road,
Seattle, Washington 98195, USA; 20, Department of Molecular Biology,
Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo
160-8582, Japan; 21, University of Texas Southwestern Medical Center at Dallas,
6000 Harry Hines Blvd., Dallas, Texas 75235-8591, USA; 22, University of
Oklahoma's Advanced Center for Genome Technology, Dept. of Chemistry and
Biochemistry, University of Oklahoma, 620 Parrington Oval, Rm 311, Norman,
Oklahoma 73019, USA; 23, Max Planck Institute for Molecular Genetics,
Ihnestrasse 73, 14195 Berlin, Germany; 24, Cold Spring Harbor Laboratory, Lita
Annenberg Hazen Genome Center, 1 Bungtown Road, Cold Spring Harbor, New
York 11724, USA; 25, GBF - German Research Centre for Biotechnology,
Mascheroder Weg 1, D-38124 Braunschweig, Germany; 26, National Center for
Biotechnology Information, National Library of Medicine, National Institutes of
Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, Maryland 20894, USA; 27,
Department of Genetics, Case Western Reserve School of Medicine and University
Hospitals of Cleveland, BRB 720, 10900 Euclid Ave., Cleveland, Ohio 44106, USA;
28, EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, United Kingdom; 29, Max Delbrück Center for
Molecular Medicine, Robert-Rossle-Strasse 10, 13125 Berlin-Buch, Germany;
30, EMBL, Meyerhofstrasse 1, 69012 Heidelberg, Germany; 31, Dept. of Biology,
Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge,
Massachusetts 02139-4307, USA; 32, Howard Hughes Medical Institute,
Dept. of Genetics, Washington University School of Medicine, Saint Louis,
Missouri 63110, USA; 33, Dept. of Computer Science, University of California at
Santa Cruz, Santa Cruz, California 95064, USA; 34, Affymetrix, Inc., 2612 8th St,
Berkeley, California 94710, USA; 35, Genome Exploration Research Group,
Genomic Sciences Center, RIKEN Yokohama Institute, 1-7-22 Suehiro-cho,
Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan; 36, Howard Hughes
Medical Institute, Department of Computer Science, University of California at
Santa Cruz, California 95064, USA; 37, University of Dublin, Trinity College,
Department of Genetics, Smurfit Institute, Dublin 2, Ireland; 38, Cambridge
Research Laboratory, Compaq Computer Corporation and MIT Genome Center,
1 Cambridge Center, Cambridge, Massachusetts 02142, USA; 39, Dept. of
Mathematics, University of California at Santa Cruz, Santa Cruz, California
95064, USA; 40, Dept. of Biology, University of California at Santa Cruz, Santa
Cruz, California 95064, USA; 41, Crown Human Genetics Center and
Department of Molecular Genetics, The Weizmann Institute of Science, Rehovot
71600, Israel; 42, Dept. of Genetics, Stanford University School of Medicine,
Stanford, California 94305, USA; 43, The University of Michigan Medical School,
Departments of Human Genetics and Internal Medicine, Ann Arbor, Michigan
Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo
160-8582, Japan; 21, University of Texas Southwestern Medical Center at Dallas,
6000 Harry Hines Blvd., Dallas, Texas 75235-8591, USA; 22, University of
Oklahoma's Advanced Center for Genome Technology, Dept. of Chemistry and
Biochemistry, University of Oklahoma, 620 Parrington Oval, Rm 311, Norman,
Oklahoma 73019, USA; 23, Max Planck Institute for Molecular Genetics,
Ihnestrasse 73, 14195 Berlin, Germany; 24, Cold Spring Harbor Laboratory, Lita
Annenberg Hazen Genome Center, 1 Bungtown Road, Cold Spring Harbor, New
York 11724, USA; 25, GBF - German Research Centre for Biotechnology,
Mascheroder Weg 1, D-38124 Braunschweig, Germany; 26, National Center for
Biotechnology Information, National Library of Medicine, National Institutes of
Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, Maryland 20894, USA; 27,
Department of Genetics, Case Western Reserve School of Medicine and University
Hospitals of Cleveland, BRB 720, 10900 Euclid Ave., Cleveland, Ohio 44106, USA;
28, EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, United Kingdom; 29, Max Delbrück Center for
Molecular Medicine, Robert-Rossle-Strasse 10, 13125 Berlin-Buch, Germany;
30, EMBL, Meyerhofstrasse 1, 69012 Heidelberg, Germany; 31, Dept. of Biology,
Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge,
Massachusetts 02139-4307, USA; 32, Howard Hughes Medical Institute,
Dept. of Genetics, Washington University School of Medicine, Saint Louis,
Missouri 63110, USA; 33, Dept. of Computer Science, University of California at
Santa Cruz, Santa Cruz, California 95064, USA; 34, Affymetrix, Inc., 2612 8th St,
Berkeley, California 94710, USA; 35, Genome Exploration Research Group,
Genomic Sciences Center, RIKEN Yokohama Institute, 1-7-22 Suehiro-cho,
Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan; 36, Howard Hughes
Medical Institute, Department of Computer Science, University of California at
Santa Cruz, California 95064, USA; 37, University of Dublin, Trinity College,
Department of Genetics, Smurfit Institute, Dublin 2, Ireland; 38, Cambridge
Research Laboratory, Compaq Computer Corporation and MIT Genome Center,
1 Cambridge Center, Cambridge, Massachusetts 02142, USA; 39, Dept. of
Mathematics, University of California at Santa Cruz, Santa Cruz, California
95064, USA; 40, Dept. of Biology, University of California at Santa Cruz, Santa
Cruz, California 95064, USA; 41, Crown Human Genetics Center and
Department of Molecular Genetics, The Weizmann Institute of Science, Rehovot
71600, Israel; 42, Dept. of Genetics, Stanford University School of Medicine,
Stanford, California 94305, USA; 43, The University of Michigan Medical School,
Departments of Human Genetics and Internal Medicine, Ann Arbor, Michigan
48109, USA; 44, MRC Functional Genetics Unit, Department of Human Anatomy
and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK; 45,
Institute for Systems Biology, 4225 Roosevelt Way NE, Seattle, WA 98105, USA;
46, National Human Genome Research Institute, US National Institutes of
Health, 31 Center Drive, Bethesda, Maryland 20892, USA; 47, Office of Science,
US Department of Energy, 19901 Germantown Road, Germantown, Maryland
20874, USA; 48, The Wellcome Trust, 183 Euston Road, London, NW1 2BE, UK.
† Present addresses: Genome Sequencing Project, Egea Biosciences, Inc., 4178
Sorrento Valley Blvd., Suite F, San Diego, CA 92121, USA (G.A.E.); INRA, Station
d'Amélioration des Plantes, 63039 Clermont-Ferrand Cedex 2, France (L.C.).
DNA sequence databases
GenBank, National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda,
Maryland 20894, USA
EMBL, European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, UK
DNA Data Bank of Japan, Center for Information Biology, National Institute of
Genetics, 1111 Yata, Mishima-shi, Shizuoka-ken 411-8540, Japan
© 2001 Macmillan Magazines Ltd