Initial sequencing and analysis of the human genome

International Human Genome Sequencing Consortium*

* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com | © 2001 Macmillan Magazines Ltd

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century (refs 1–3) sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.

The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.

Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.

The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.

Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.

• The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.
• There appear to be about 30,000–40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.
• The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.
• Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.
• Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.
• The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.
• Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts.
• The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.
• Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes.
• Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.
• More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.

In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other
Genome Sequencing Centres (Listed in order of total genomic sequence contributed, with a partial list of personnel. A full list of contributors at each centre is available as Supplementary Information.)

Whitehead Institute for Biomedical Research, Center for Genome Research: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*, Chad Nusbaum1*, Michael C. Zody1*, Jennifer Baldwin1, Keri Devon1, Ken Dewar1, Michael Doyle1, William FitzHugh1*, Roel Funke1, Diane Gage1, Katrina Harris1, Andrew Heaford1, John Howland1, Lisa Kann1, Jessica Lehoczky1, Rosie LeVine1, Paul McEwan1, Kevin McKernan1, James Meldrim1, Jill P. Mesirov1*, Cher Miranda1, William Morris1, Jerome Naylor1, Christina Raymond1, Mark Rosetti1, Ralph Santos1, Andrew Sheridan1, Carrie Sougnez1, Nicole Stange-Thomann1, Nikola Stojanovic1, Aravind Subramanian1 & Dudley Wyman1

The Sanger Centre: Jane Rogers2, John Sulston2*, Rachael Ainscough2, Stephan Beck2, David Bentley2, John Burton2, Christopher Clee2, Nigel Carter2, Alan Coulson2, Rebecca Deadman2, Panos Deloukas2, Andrew Dunham2, Ian Dunham2, Richard Durbin2*, Lisa French2, Darren Grafham2, Simon Gregory2, Tim Hubbard2*, Sean Humphray2, Adrienne Hunt2, Matthew Jones2, Christine Lloyd2, Amanda McMurray2, Lucy Matthews2, Simon Mercer2, Sarah Milne2, James C. Mullikin2*, Andrew Mungall2, Robert Plumb2, Mark Ross2, Ratna Shownkeen2 & Sarah Sims2

Washington University Genome Sequencing Center: Robert H. Waterston3*, Richard K. Wilson3, LaDeana W. Hillier3*, John D. McPherson3, Marco A. Marra3, Elaine R. Mardis3, Lucinda A. Fulton3, Asif T. Chinwalla3*, Kymberlie H. Pepin3, Warren R. Gish3, Stephanie L. Chissoe3, Michael C. Wendl3, Kim D. Delehaunty3, Tracie L. Miner3, Andrew Delehaunty3, Jason B. Kramer3, Lisa L. Cook3, Robert S. Fulton3, Douglas L. Johnson3, Patrick J. Minx3 & Sandra W. Clifton3

US DOE Joint Genome Institute: Trevor Hawkins4, Elbert Branscomb4, Paul Predki4, Paul Richardson4, Sarah Wenning4, Tom Slezak4, Norman Doggett4, Jan-Fang Cheng4, Anne Olsen4, Susan Lucas4, Christopher Elkin4, Edward Uberbacher4 & Marvin Frazier4

Baylor College of Medicine Human Genome Sequencing Center: Richard A. Gibbs5*, Donna M. Muzny5, Steven E. Scherer5, John B. Bouck5*, Erica J. Sodergren5, Kim C. Worley5*, Catherine M. Rives5, James H. Gorrell5, Michael L. Metzker5, Susan L. Naylor6, Raju S. Kucherlapati7, David L. Nelson & George M. Weinstock8

RIKEN Genomic Sciences Center: Yoshiyuki Sakaki9, Asao Fujiyama9, Masahira Hattori9, Tetsushi Yada9, Atsushi Toyoda9, Takehiko Itoh9, Chiharu Kawagoe9, Hidemi Watanabe9, Yasushi Totoki9 & Todd Taylor9

Genoscope and CNRS UMR-8030: Jean Weissenbach10, Roland Heilig10, William Saurin10, Francois Artiguenave10, Philippe Brottier10, Thomas Bruls10, Eric Pelletier10, Catherine Robert10 & Patrick Wincker10

GTC Sequencing Center: Douglas R. Smith11, Lynn Doucette-Stamm11, Marc Rubenfield11, Keith Weinstock11, Hong Mei Lee11 & JoAnn Dubois11

Department of Genome Analysis, Institute of Molecular Biotechnology: Andre Rosenthal12, Matthias Platzer12, Gerald Nyakatura12, Stefan Taudien12 & Andreas Rump12

Beijing Genomics Institute/Human Genome Center: Huanming Yang13, Jun Yu13, Jian Wang13, Guyang Huang14 & Jun Gu15

Multimegabase Sequencing Center, The Institute for Systems Biology: Leroy Hood16, Lee Rowen16, Anup Madan16 & Shizen Qin16

Stanford Genome Technology Center: Ronald W. Davis17, Nancy A. Federspiel17, A. Pia Abola17 & Michael J. Proctor17

Stanford Human Genome Center: Richard M. Myers18, Jeremy Schmutz18, Mark Dickson18, Jane Grimwood18 & David R. Cox18

University of Washington Genome Center: Maynard V. Olson19, Rajinder Kaul19 & Christopher Raymond19

Department of Molecular Biology, Keio University School of Medicine: Nobuyoshi Shimizu20, Kazuhiko Kawasaki20 & Shinsei Minoshima20

University of Texas Southwestern Medical Center at Dallas: Glen A. Evans21†, Maria Athanasiou21 & Roger Schultz21

University of Oklahoma's Advanced Center for Genome Technology: Bruce A. Roe22, Feng Chen22 & Huaqin Pan22

Max Planck Institute for Molecular Genetics: Juliane Ramser23, Hans Lehrach23 & Richard Reinhardt23

Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center: W. Richard McCombie24, Melissa de la Bastide24 & Neilay Dedhia24

GBF - German Research Centre for Biotechnology: Helmut Blöcker25, Klaus Hornischer25 & Gabriele Nordsiek25

* Genome Analysis Group (listed in alphabetical order, also includes individuals listed under other headings): Richa Agarwala26, L. Aravind26, Jeffrey A. Bailey27, Alex Bateman2, Serafim Batzoglou1, Ewan Birney28, Peer Bork29,30, Daniel G. Brown1, Christopher B. Burge31, Lorenzo Cerutti28, Hsiu-Chuan Chen26, Deanna Church26, Michele Clamp2, Richard R. Copley30, Tobias Doerks29,30, Sean R. Eddy32, Evan E. Eichler27, Terrence S. Furey33, James Galagan1, James G. R. Gilbert2, Cyrus Harmon34, Yoshihide Hayashizaki35, David Haussler36, Henning Hermjakob28, Karsten Hokamp37, Wonhee Jang26, L. Steven Johnson32, Thomas A. Jones32, Simon Kasif38, Arek Kaspryzk28, Scot Kennedy39, W. James Kent40, Paul Kitts26, Eugene V. Koonin26, Ian Korf3, David Kulp34, Doron Lancet41, Todd M. Lowe42, Aoife McLysaght37, Tarjei Mikkelsen38, John V. Moran43, Nicola Mulder28, Victor J. Pollara1, Chris P. Ponting44, Greg Schuler26, Jörg Schultz30, Guy Slater28, Arian F. A. Smit45, Elia Stupka28, Joseph Szustakowki38, Danielle Thierry-Mieg26, Jean Thierry-Mieg26, Lukas Wagner26, John Wallis3, Raymond Wheeler34, Alan Williams34, Yuri I. Wolf26, Kenneth H. Wolfe37, Shiaw-Pyng Yang3 & Ru-Fang Yeh31

Scientific management: National Human Genome Research Institute, US National Institutes of Health: Francis Collins46*, Mark S. Guyer46, Jane Peterson46, Adam Felsenfeld46* & Kris A. Wetterstrand46; Office of Science, US Department of Energy: Aristides Patrinos47; The Wellcome Trust: Michael J. Morgan48
organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruit¯y Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, ¯y and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project. A full description of the methods is provided as Supplementary Information on Nature's web site (http://www. nature.com). We recognize that it is impossible to provide a comprehensive analysis of this vast dataset, and thus our goal is to illustrate the range of insights that can be gleaned from the human genome and thereby to sketch a research agenda for the future. Background to the Human Genome Project The Human Genome Project arose from two key insights that emerged in the early 1980s: that the ability to take global views of genomes could greatly accelerate biomedical research, by allowing researchers to attack problems in a comprehensive and unbiased fashion; and that the creation of such global views would require a communal effort in infrastructure building, unlike anything previously attempted in biomedical research. Several key projects helped to crystallize these insights, including: (1) The sequencing of the bacterial viruses FX1744,5 and lambda6 , the animal virus SV407 and the human mitochondrion8 between 1977 and 1982. These projects proved the feasibility of assembling small sequence fragments into complete genomes, and showed the value of complete catalogues of genes and other functional elements. (2) The programme to create a human genetic map to make it possible to locate disease genes of unknown function based solely on their inheritance patterns, launched by Botstein and colleagues in 1980 (ref. 9). 
(3) The programmes to create physical maps of clones covering the yeast10 and worm11 genomes to allow isolation of genes and regions based solely on their chromosomal position, launched by Olson and Sulston in the mid-1980s. (4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery by Schimmel12 and Schimmel and Sutcliffe13, later dubbed expressed sequence tags (ESTs) and pursued with automated sequencing by Venter and others14±20. The idea of sequencing the entire human genome was ®rst proposed in discussions at scienti®c meetings organized by the US Department of Energy and others from 1984 to 1986 (refs 21, 22). A committee appointed by the US National Research Council endorsed the concept in its 1988 report23, but recommended a broader programme, to include: the creation of genetic, physical and sequence maps of the human genome; parallel efforts in key model organisms such as bacteria, yeast, worms, ¯ies and mice; the development of technology in support of these objectives; and research into the ethical, legal and social issues raised by human genome research. The programme was launched in the US as a joint effort of the Department of Energy and the National Institutes of Health. In other countries, the UK Medical Research Council and the Wellcome Trust supported genomic research in Britain; the Centre d'Etude du Polymorphisme Humain and the French Muscular Dystrophy Association launched mapping efforts in France; government agencies, including the Science and Technology Agency and the Ministry of Education, Science, Sports and Culture supported genomic research efforts in Japan; and the European Community helped to launch several international efforts, notably the programme to sequence the yeast genome. By late 1990, the Human Genome Project had been launched, with the creation of genome centres in these countries. Additional participants subsequently joined the effort, notably in Germany and China. 
In addition, the Human Genome Organization (HUGO) was founded to provide a forum for international coordination of genomic research. Several books24–26 provide a more comprehensive discussion of the genesis of the Human Genome Project.

Figure 1 Timeline of large-scale genomic analyses. Shown are selected components of work on several non-vertebrate model organisms (red), the mouse (blue) and the human (green) from 1990; earlier projects are described in the text. SNPs, single nucleotide polymorphisms; ESTs, expressed sequence tags. (Time axis: 1984 and 1990–2001. Rows: other organisms, mouse, human. Labels recoverable from the figure include the NRC report, the pilot project (15%), chromosomes 22 and 21, the working draft (90%) and finishing (~100%).)

articles 862 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd

Through 1995, work progressed rapidly on two fronts (Fig. 1). The first was construction of genetic and physical maps of the human and mouse genomes27–31, providing key tools for identification of disease genes and anchoring points for genomic sequence. The second was sequencing of the yeast32 and worm33 genomes, as
well as targeted regions of mammalian genomes34–37. These projects showed that large-scale sequencing was feasible and developed the two-phase paradigm for genome sequencing. In the first, 'shotgun', phase, the genome is divided into appropriately sized segments and each segment is covered to a high degree of redundancy (typically, eight- to tenfold) through the sequencing of randomly selected subfragments. The second is a 'finishing' phase, in which sequence gaps are closed and remaining ambiguities are resolved through directed analysis. The results also showed that complete genomic sequence provided information about genes, regulatory regions and chromosome structure that was not readily obtainable from cDNA studies alone.

In 1995, genome scientists considered a proposal38 that would have involved producing a draft genome sequence of the human genome in a first phase and then returning to finish the sequence in a second phase. After vigorous debate, it was decided that such a plan was premature for several reasons. These included the need first to prove that high-quality, long-range finished sequence could be produced from most parts of the complex, repeat-rich human genome; the sense that many aspects of the sequencing process were still rapidly evolving; and the desirability of further decreasing costs.

Instead, pilot projects were launched to demonstrate the feasibility of cost-effective, large-scale sequencing, with a target completion date of March 1999. The projects successfully produced finished sequence with 99.99% accuracy and no gaps39. They also introduced bacterial artificial chromosomes (BACs)40, a new large-insert cloning system that proved to be more stable than the cosmids and yeast artificial chromosomes (YACs)41 that had been used previously. The pilot projects drove the maturation and convergence of sequencing strategies, while producing 15% of the human genome sequence.
With successful completion of this phase, the human genome sequencing effort moved into full-scale production in March 1999.

The idea of first producing a draft genome sequence was revived at this time, both because the ability to finish such a sequence was no longer in doubt and because there was great hunger in the scientific community for human sequence data. In addition, some scientists favoured prioritizing the production of a draft genome sequence over regional finished sequence because of concerns about commercial plans to generate proprietary databases of human sequence that might be subject to undesirable restrictions on use42–44.

The consortium focused on an initial goal of producing, in a first production phase lasting until June 2000, a draft genome sequence covering most of the genome. Such a draft genome sequence, although not completely finished, would rapidly allow investigators to begin to extract most of the information in the human sequence. Experiments showed that sequencing clones covering about 90% of the human genome to a redundancy of about four- to fivefold ('half-shotgun' coverage; see Box 1) would accomplish this45,46. The draft genome sequence goal has been achieved, as described below.

The second sequence production phase is now under way. Its aims are to achieve full-shotgun coverage of the existing clones during 2001, to obtain clones to fill the remaining gaps in the physical map, and to produce a finished sequence (apart from regions that cannot be cloned or sequenced with currently available techniques) no later than 2003.

Strategic issues

Hierarchical shotgun sequencing

Soon after the invention of DNA sequencing methods47,48, the shotgun sequencing strategy was introduced49–51; it has remained the fundamental method for large-scale genome sequencing52–54 for the past 20 years. The approach has been refined and extended to make it more efficient.
For example, improved protocols for fragmenting and cloning DNA allowed construction of shotgun libraries with more uniform representation. The practice of sequencing from both ends of double-stranded clones ('double-barrelled' shotgun sequencing) was introduced by Ansorge and others37 in 1990, allowing the use of 'linking information' between sequence fragments.

The application of shotgun sequencing was also extended by applying it to larger and larger DNA molecules: from plasmids (~4 kilobases (kb)) to cosmid clones37 (40 kb), to artificial chromosomes cloned in bacteria and yeast55 (100–500 kb) and bacterial genomes56 (1–2 megabases (Mb)).

In principle, a genome of arbitrary size may be directly sequenced by the shotgun method, provided that it contains no repeated sequence and can be uniformly sampled at random. The genome can then be assembled using the simple computer science technique of 'hashing' (in which one detects overlaps by consulting an alphabetized look-up table of all k-letter words in the data). Mathematical analysis of the expected number of gaps as a function of coverage is similarly straightforward57.

Practical difficulties arise because of repeated sequences and cloning bias. Small amounts of repeated sequence pose little problem for shotgun sequencing. For example, one can readily assemble typical bacterial genomes (about 1.5% repeat) or the euchromatic portion of the fly genome (about 3% repeat). By contrast, the human genome is filled (>50%) with repeated sequences, including interspersed repeats derived from transposable elements, and long genomic regions that have been duplicated in tandem, palindromic or dispersed fashion (see below). These include large duplicated segments (50–500 kb) with high sequence identity (98–99.9%), at which mispairing during recombination creates deletions responsible for genetic syndromes. Such features complicate the assembly of a correct and finished genome sequence.
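
As a rough illustration of the idealized, repeat-free case described above, the following Python sketch detects an overlap between two reads by looking up k-letter words in a hash table, and evaluates the expected number of gaps at a given coverage. It is a sketch under stated assumptions, not the assembly software used in the project: the function names, the choice k = 8 and the exact-match rule are hypothetical, the two reads are the ones drawn in the idealized example of Fig. 2, and the gap estimate uses the simplest Poisson approximation (N reads at coverage c leave about N·e^(-c) gaps), ignoring the minimum overlap needed to detect a join.

```python
import math
from collections import defaultdict


def build_kmer_index(reads, k):
    """Hash every k-letter word to the (read id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def suffix_prefix_overlap(reads, index, a, b, k):
    """Length of the longest suffix of reads[a] that exactly matches a prefix
    of reads[b] and is at least k letters long; 0 if there is none."""
    left, right = reads[a], reads[b]
    if len(right) < k:
        return 0
    best = 0
    # Any such overlap must contain the first k letters of the right-hand read,
    # so only positions where that word occurs in the left-hand read are tried.
    for read_id, offset in index.get(right[:k], []):
        if read_id != a:
            continue
        length = len(left) - offset
        if length <= len(right) and left[offset:] == right[:length]:
            best = max(best, length)
    return best


def merge_reads(reads, a, b, overlap):
    """Join two reads that share `overlap` letters end to start."""
    return reads[a] + reads[b][overlap:]


def expected_gaps(num_reads, read_length, genome_length):
    """Idealized estimate (assumption): with coverage c = N * L / G and reads
    sampled uniformly at random, about N * exp(-c) gaps are expected."""
    coverage = num_reads * read_length / genome_length
    return num_reads * math.exp(-coverage)


# The two shotgun reads shown in the idealized example of Fig. 2.
reads = ["ACCGTAAATGGGCTGATCATGCTTAAA", "TGATCATGCTTAAACCCTGTGCATCCTACTG"]
index = build_kmer_index(reads, k=8)
overlap = suffix_prefix_overlap(reads, index, 0, 1, k=8)
print(overlap)                            # 14
print(merge_reads(reads, 0, 1, overlap))  # the 44-letter assembled sequence
print(expected_gaps(1000, 500, 100000))   # gaps expected at fivefold coverage
```

The exact-match rule only works for error-free reads from a repeat-free genome. Once the same k-letter word occurs at many unrelated positions, as it does in a repeat-rich genome, the look-up table returns false candidates.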
Figure 2 Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome. (Stages labelled in the figure: genomic DNA; BAC library; organized mapped large clone contigs; BAC to be sequenced; shotgun clones; shotgun sequence; assembly. The example shotgun sequences ...ACCGTAAATGGGCTGATCATGCTTAAA and TGATCATGCTTAAACCCTGTGCATCCTACTG... assemble to ...ACCGTAAATGGGCTGATCATGCTTAAACCCTGTGCATCCTACTG...)

There are two approaches for sequencing large repeat-rich genomes. The first is a whole-genome shotgun sequencing approach, as has been used for the repeat-poor genomes of viruses, bacteria and flies, using linking information and computational
analysis to attempt to avoid misassemblies. The second is the 'hierarchical shotgun sequencing' approach (Fig. 2), also referred to as 'map-based', 'BAC-based' or 'clone-by-clone'. This approach involves generating and organizing a set of large-insert clones (typically 100–200 kb each) covering the genome and separately performing shotgun sequencing on appropriately chosen clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced. One caveat is that some large-insert clones may suffer rearrangement, although this risk can be reduced by appropriate quality-control measures involving clone fingerprints (see below).

The two methods are likely to entail similar costs for producing finished sequence of a mammalian genome. The hierarchical approach has a higher initial cost than the whole-genome approach, owing to the need to create a map of clones (about 1% of the total cost of sequencing) and to sequence overlaps between clones. On the other hand, the whole-genome approach is likely to require much greater work and expense in the final stage of producing a finished sequence, because of the challenge of resolving misassemblies. Both methods must also deal with cloning biases, resulting in under-representation of some regions in either large-insert or small-insert clone libraries.

There was lively scientific debate over whether the human genome sequencing effort should employ whole-genome or hierarchical shotgun sequencing. Weber and Myers58 stimulated these discussions with a specific proposal for a whole-genome shotgun approach, together with an analysis suggesting that the method could work and be more efficient. Green59 challenged these conclusions and argued that the potential benefits did not outweigh the likely risks.

In the end, we concluded that the human genome sequencing effort should employ the hierarchical approach for several reasons.
First, it was prudent to use the approach for the first project to sequence a repeat-rich genome. With the hierarchical approach, the ultimate frequency of misassembly in the finished product would probably be lower than with the whole-genome approach, in which it would be more difficult to identify regions in which the assembly was incorrect.

Second, it was prudent to use the approach in dealing with an outbred organism, such as the human. In the whole-genome shotgun method, sequence would necessarily come from two different copies of the human genome. Accurate sequence assembly could be complicated by sequence variation between these two copies, both SNPs (which occur at a rate of 1 per 1,300 bases) and larger-scale structural heterozygosity (which has been documented in human chromosomes). In the hierarchical shotgun method, each large-insert clone is derived from a single haplotype.

Third, the hierarchical method would be better able to deal with inevitable cloning biases, because it would more readily allow targeting of additional sequencing to under-represented regions. And fourth, it was better suited to a project shared among members of a diverse international consortium, because it allowed work and responsibility to be easily distributed. As the ultimate goal has always been to create a high-quality, finished sequence to serve as a foundation for biomedical research, we reasoned that the advantages of this more conservative approach outweighed the additional cost, if any.

A biotechnology company, Celera Genomics, has chosen to incorporate the whole-genome shotgun approach into its own efforts to sequence the human genome. Their plan60,61 uses a mixed strategy, involving combining some coverage with whole-genome shotgun data generated by the company together with the publicly available hierarchical shotgun data generated by the International Human Genome Sequencing Consortium.
If the raw sequence reads from the whole-genome shotgun component are made available, it may be possible to evaluate the extent to which the sequence of the human genome can be assembled without the need for clone-based information. Such analysis may help to refine sequencing strategies for other large genomes.

Technology for large-scale sequencing

Sequencing the human genome depended on many technological improvements in the production and analysis of sequence data. Key innovations were developed both within and outside the Human Genome Project. Laboratory innovations included four-colour fluorescence-based sequence detection62, improved fluorescent dyes63–66, dye-labelled terminators67, polymerases specifically designed for sequencing68–70, cycle sequencing71 and capillary gel electrophoresis72–74. These studies contributed to substantial improvements in the automation, quality and throughput of collecting raw DNA sequence75,76.

There were also important advances in the development of software packages for the analysis of sequence data. The PHRED software package77,78 introduced the concept of assigning a 'base-quality score' to each base, on the basis of the probability of an erroneous call. These quality scores make it possible to monitor raw data quality and also assist in determining whether two similar sequences truly overlap. The PHRAP computer package (http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) then systematically assembles the sequence data using the base-quality scores. The program assigns 'assembly-quality scores' to each base in the assembled sequence, providing an objective criterion to guide sequence finishing. The quality scores were based on and validated by extensive experimental data.

Another key innovation for scaling up sequencing was the development by several centres of automated methods for sample preparation.
This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems.

Coordination and public data sharing

The Human Genome Project adopted two important principles with regard to human sequencing. The first was that the collaboration would be open to centres from any nation. Although potentially less efficient, in a narrow economic sense, than a centralized approach involving a few large factories, the inclusive approach was strongly favoured because we felt that the human genome sequence is the common heritage of all humanity and the work should transcend national boundaries, and we believed that scientific progress was best assured by a diversity of approaches.

The collaboration was coordinated through periodic international meetings (referred to as 'Bermuda meetings' after the venue of the first three gatherings) and regular telephone conferences. Work was shared flexibly among the centres, with some groups focusing on particular chromosomes and others contributing in a genome-wide fashion.

The second principle was rapid and unrestricted data release. The centres adopted a policy that all genomic sequence data should be made publicly available without restriction within 24 hours of assembly79,80. Pre-publication data releases had been pioneered in mapping projects in the worm11 and mouse genomes30,81 and were prominently adopted in the sequencing of the worm, providing a direct model for the human sequencing efforts. We believed that scientific progress would be most rapidly advanced by immediate and free availability of the human genome sequence. The explosion of scientific work based on the publicly available sequence data in both academia and industry has confirmed this judgement.
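
The base-quality scores described under 'Technology for large-scale sequencing' can be made concrete with a short Python sketch. It assumes the logarithmic convention commonly associated with PHRED, Q = -10·log10(P), where P is the estimated probability that a base call is wrong. That formula is not stated in the text above, and the function names are hypothetical. Under this convention, the 99.99% accuracy quoted for the pilot-project finished sequence corresponds to an error probability of 1 in 10,000, or a score of 40.

```python
import math


def quality_from_error_probability(p_error):
    """Convert an estimated per-base error probability into a quality score,
    assuming the convention Q = -10 * log10(P)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability_from_quality(quality):
    """Invert the same convention: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given its per-base scores."""
    return sum(error_probability_from_quality(q) for q in qualities)


# 99.99% accuracy means one error in 10,000 bases.
print(round(quality_from_error_probability(1e-4)))   # 40
# A hypothetical five-base read with mixed quality scores.
print(expected_errors([40, 40, 30, 20, 10]))
```

Summing per-base error probabilities gives one simple way to monitor raw data quality and to decide which regions still need attention during finishing.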
Generating the draft genome sequence

Generating a draft sequence of the human genome involved three steps: selecting the BAC clones to be sequenced, sequencing them and assembling the individual sequenced clones into an overall draft genome sequence. A glossary of terms related to genome sequencing and assembly is provided in Box 1.

The draft genome sequence is a dynamic product, which is regularly updated as additional data accumulate en route to the