ultimate goal of a completely finished sequence. The results below are based on the map and sequence data available on 7 October 2000, except as otherwise noted. At the end of this section, we provide a brief update of key data.

Clone selection

The hierarchical shotgun method involves the sequencing of overlapping large-insert clones spanning the genome. For the Human Genome Project, clones were largely chosen from eight large-insert libraries containing BAC or P1-derived artificial chromosome (PAC) clones (Table 1; refs 82–88). The libraries were made by partial digestion of genomic DNA with restriction enzymes. Together, they represent around 65-fold coverage (redundant sampling) of the genome. Libraries based on other vectors, such as cosmids, were also used in early stages of the project.

The libraries (Table 1) were prepared from DNA obtained from anonymous human donors in accordance with US Federal Regulations for the Protection of Human Subjects in Research (45CFR46) and following full review by an Institutional Review Board. Briefly, the opportunity to donate DNA for this purpose was broadly advertised near the two laboratories engaged in library

articles NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 865

Box 1 Genome glossary

Sequence

Raw sequence: Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
Paired-end sequence: Raw sequence obtained from both ends of a cloned insert in any vector, such as a plasmid or bacterial artificial chromosome.
Finished sequence: Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Coverage (or depth): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Full shotgun coverage: The coverage in random raw sequence needed from a large-insert clone to ensure that it is ready for finishing; this varies among centres but is typically 8–10-fold. Clones with full shotgun coverage can usually be assembled with only a handful of gaps per 100 kb.
Half shotgun coverage: Half the amount of full shotgun coverage (typically, 4–5-fold random coverage).

Clones

BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Finished clone: A large-insert clone that is entirely represented by finished sequence.
Full shotgun clone: A large-insert clone for which full shotgun sequence has been produced.
Draft clone: A large-insert clone for which roughly half-shotgun sequence has been produced. Operationally, the collection of draft clones produced by each centre was required to have an average coverage of fourfold for the entire set and a minimum coverage of threefold for each clone.
Predraft clone: A large-insert clone for which some shotgun sequence is available, but which does not meet the standards for inclusion in the collection of draft clones.

Contigs and scaffolds

Contig: The result of joining an overlapping collection of sequences or clones.
Scaffold: The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
Fingerprint clone contigs: Contigs produced by joining clones inferred to overlap on the basis of their restriction digest fingerprints.
Sequenced-clone layout: Assignment of sequenced clones to the physical map of fingerprint clone contigs.
Initial sequence contigs: Contigs produced by merging overlapping sequence reads obtained from a single clone, in a process called sequence assembly.
Merged sequence contigs: Contigs produced by taking the initial sequence contigs contained in overlapping clones and merging those found to overlap. These are also referred to simply as 'sequence contigs' where no confusion will result.
Sequence-contig scaffolds: Scaffolds produced by connecting sequence contigs on the basis of linking information.
Sequenced-clone contigs: Contigs produced by merging overlapping sequenced clones.
Sequenced-clone-contig scaffolds: Scaffolds produced by joining sequenced-clone contigs on the basis of linking information.
Draft genome sequence: The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
N50 length: A measure of the contig length (or scaffold length) containing a 'typical' nucleotide. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.

Computer programs and databases

PHRED: A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately $10^{-X/10}$. Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
PHRAP: A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately $10^{-X/10}$. Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
GigAssembler: A computer program developed during this project for merging the information from individual sequenced clones into a draft genome sequence.
Public sequence databases: The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.

Map features

STS: Sequence tagged site, corresponding to a short (typically less than 500 bp) unique genomic locus for which a polymerase chain reaction assay has been developed.
EST: Expressed sequence tag, obtained by performing a single raw sequence read from a random complementary DNA clone.
SSR: Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.
SNP: Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
Genetic map: A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.
Radiation hybrid (RH) map: A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human–hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.

© 2001 Macmillan Magazines Ltd
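Two of the glossary definitions are quantitative and can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical and are not part of PHRED, PHRAP or any project software. It converts a quality score to an error probability using the relation stated in the glossary, and computes an N50 length from a list of contig lengths.

```python
import math


def quality_to_error_probability(score: float) -> float:
    """Error probability for a PHRED or PHRAP quality score X: 10^(-X/10)."""
    return 10 ** (-score / 10)


def error_probability_to_quality(probability: float) -> float:
    """Inverse relation: the quality score implied by an error probability."""
    return -10 * math.log10(probability)


def n50(lengths):
    """Largest length L such that contigs of size >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases until half are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be a non-empty collection")


# Hypothetical contig lengths (arbitrary units), chosen only to illustrate the definition.
example_lengths = [2, 3, 4, 5, 6]
example_n50 = n50(example_lengths)
```

Under these definitions a score of 20 gives an error probability of 0.01 (99% accuracy) and a score of 30 gives 0.001 (99.9% accuracy), in agreement with the glossary. For the hypothetical lengths above, contigs of size at least 5 hold 11 of the 20 units, so the N50 is 5.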
construction. Volunteers of diverse backgrounds were accepted on a first-come, first-taken basis. Samples were obtained after discussion with a genetic counsellor and written informed consent. The samples were made anonymous as follows: the sampling laboratory stripped all identifiers from the samples, applied random numeric labels, and transferred them to the processing laboratory, which then removed all labels and relabelled the samples. All records of the labelling were destroyed. The processing laboratory chose samples at random from which to prepare DNA and immortalized cell lines. Around 5–10 samples were collected for every one that was eventually used. Because no link was retained between donor and DNA sample, the identity of the donors for the libraries is not known, even by the donors themselves. A more complete description can be found at http://www.nhgri.nih.gov/Grant_info/Funding/Statements/RFA/human_subjects.html.

During the pilot phase, centres showed that sequence-tagged sites (STSs) from previously constructed genetic and physical maps could be used to recover BACs from specific regions. As sequencing expanded, some centres continued this approach, augmented with additional probes from flow sorting of chromosomes to obtain long-range coverage of specific chromosomes or chromosomal regions (refs 89–94).

For the large-scale sequence production phase, a genome-wide physical map of overlapping clones was also constructed by systematic analysis of BAC clones representing 20-fold coverage of the human genome (ref. 86). Most clones came from the first three sections of the RPCI-11 library, supplemented with clones from sections of the RPCI-13 and CalTech D libraries (Table 1). DNA from each BAC clone was digested with the restriction enzyme HindIII, and the sizes of the resulting fragments were measured by agarose gel electrophoresis. The pattern of restriction fragments provides a 'fingerprint' for each BAC, which allows different BACs to be distinguished and the degree of overlaps to be assessed. We used these restriction-fragment fingerprints to determine clone overlaps, and thereby assembled the BACs into fingerprint clone contigs.

The fingerprint clone contigs were positioned along the chromosomes by anchoring them with STS markers from existing genetic and physical maps. Fingerprint clone contigs were tied to specific STSs initially by probe hybridization and later by direct search of the sequenced clones. To localize fingerprint clone contigs that did not contain known markers, new STSs were generated and placed onto chromosomes (ref. 95). Representative clones were also positioned by fluorescence in situ hybridization (FISH) (ref. 86 and C. McPherson, unpublished).

We selected clones from the fingerprint clone contigs for sequencing according to various criteria. Fingerprint data were reviewed (refs 86, 90) to evaluate overlaps and to assess clone fidelity (to bias against rearranged clones; refs 83, 96). STS content information and BAC end sequence information were also used (refs 91, 92). Where possible, we tried to select a minimally overlapping set spanning a region. However, because the genome-wide physical map was constructed concurrently with the sequencing, continuity in many regions was low in early stages.
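The idea of inferring clone overlap from shared restriction fragments, described above, can be illustrated with a minimal Python sketch. This is a simplified stand-in and not the scoring method used by the project's mapping software: the function names, the 2% size tolerance, the 0.5 score threshold and the fragment sizes are all hypothetical, chosen only to show the principle of matching fragment sizes and chaining overlapping clones into contigs.

```python
def shared_fragments(a, b, tolerance=0.02):
    """Greedily count fragments of a and b whose sizes agree within a relative tolerance."""
    a, b = sorted(a), sorted(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1  # each fragment is matched at most once
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def overlap_score(a, b, tolerance=0.02):
    """Fraction of the smaller fingerprint's fragments that are shared with the other."""
    if not a or not b:
        return 0.0
    return shared_fragments(a, b, tolerance) / min(len(a), len(b))


def build_contigs(fingerprints, min_score=0.5, tolerance=0.02):
    """Group clones into contigs by linking every pair whose score reaches min_score."""
    parent = {name: name for name in fingerprints}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    names = sorted(fingerprints)
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            score = overlap_score(fingerprints[first], fingerprints[second], tolerance)
            if score >= min_score:
                parent[find(first)] = find(second)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical fragment sizes in base pairs for four clones.
example_fingerprints = {
    "A": [1000, 2000, 3000, 4000],
    "B": [3000, 4000, 5000, 6000],
    "C": [5000, 6000, 7000, 8000],
    "D": [1500, 2500, 9000, 12000],
}
example_contigs = build_contigs(example_fingerprints)
```

In this toy data set, clones A and C share no fragments directly but are placed in the same contig because each overlaps B, while D remains a singleton. A real fingerprint comparison would also need to account for chance coincidence of fragment sizes, which this sketch ignores.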
These small fingerprint clone contigs were nonetheless useful in identifying validated, nonredundant clones

Table 1 Key large-insert genome-wide libraries

The last three columns describe the sequenced clones used in construction of the draft genome sequence.

| Library name* | GenBank abbreviation | Vector type | Source DNA | Library segment or plate numbers | Enzyme digest | Average insert size (kb) | Total number of clones in library | Number of fingerprinted clones† | BAC-end sequence (ends/clones/clones with both ends sequenced)‡ | Number of clones in genome layout§ | Number‖ | Total bases (Mb)¶ | Fraction of total from library |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Caltech B | CTB | BAC | 987SK cells | All | HindIII | 120 | 74,496 | 16 | 2/1/1 | 528 | 518 | 66.7 | 0.016 |
| Caltech C | CTC | BAC | Human sperm | All | HindIII | 125 | 263,040 | 144 | 21,956/14,445/7,255 | 621 | 606 | 88.4 | 0.021 |
| Caltech D1 (CITB-H1) | CTD | BAC | Human sperm | All | HindIII | 129 | 162,432 | 49,833 | 403,589/226,068/156,631 | 1,381 | 1,367 | 185.6 | 0.043 |
| Caltech D2 (CITB-E1) | | BAC | Human sperm | All | | | | | | | | | |
| | | | | 2,501–2,565 | EcoRI | 202 | 24,960 | | | | | | |
| | | | | 2,566–2,671 | EcoRI | 182 | 46,326 | | | | | | |
| | | | | 3,000–3,253 | EcoRI | 142 | 97,536 | | | | | | |
| RPCI-1 | RP1 | PAC | Male, blood | All | MboI | 110 | 115,200 | 3,388 | | 1,070 | 1,053 | 117.7 | 0.028 |
| RPCI-3 | RP3 | PAC | Male, blood | All | MboI | 115 | 75,513 | | | 644 | 638 | 68.5 | 0.016 |
| RPCI-4 | RP4 | PAC | Male, blood | All | MboI | 116 | 105,251 | | | 889 | 881 | 95.5 | 0.022 |
| RPCI-5 | RP5 | PAC | Male, blood | All | MboI | 115 | 142,773 | | | 1,042 | 1,033 | 116.5 | 0.027 |
| RPCI-11 | RP11 | BAC | Male, blood | All | | 178 | 543,797 | 267,931 | 379,773/243,764/134,110 | 19,405 | 19,145 | 3,165.0 | 0.743 |
| | | | | 1 | EcoRI | 164 | 108,499 | | | | | | |
| | | | | 2 | EcoRI | 168 | 109,496 | | | | | | |
| | | | | 3 | EcoRI | 181 | 109,657 | | | | | | |
| | | | | 4 | EcoRI | 183 | 109,382 | | | | | | |
| | | | | 5 | MboI | 196 | 106,763 | | | | | | |
| Total of top eight libraries | | | | | | | 1,482,502 | 321,312 | 805,320/484,278/297,997 | 25,580 | 25,241 | 3,903.9 | 0.916 |
| Total all libraries | | | | | | | | 354,510 | 812,594/488,017/100,775 | 30,445 | 29,298 | 4,260.5 | 1 |

* For the CalTech libraries (ref. 82), see http://www.tree.caltech.edu/lib_status.html; for RPCI libraries (ref. 83), see http://www.chori.org/bacpac/home.htm.
† For the FPC map and fingerprinting (refs 84–86), see http://genome.wustl.edu/gsc/human/human_database.shtml.
‡ The number of raw BAC end sequences (ends/clones/clones with both ends sequenced) available for use in human genome sequencing. Typically, for clones in which sequence was obtained from both ends, more than 95% of both end sequences contained at least 100 bp of nonrepetitive sequence. BAC-end sequencing of RPCI-11 and of the CalTech libraries was done at The Institute for Genomic Research, the California Institute of Technology and the University of Washington High Throughput Sequencing Center. The sources for the Table were http://www.ncbi.nlm.nih.gov/genome/clone/BESstat.shtml and refs 87, 88.
§ These are the clones in the sequenced-clone layout map (http://genome.wustl.edu/gsc/human/Mapping/index.shtml) that were pre-draft, draft or finished.
‖ The number of sequenced clones used in the assembly. This number is less than that in the previous column owing to removal of a small number of obviously contaminated, combined or duplicated projects; in addition, not all of the clones from completed chromosomes 21 and 22 were included here because only the available finished sequence from those chromosomes was used in the assembly.
¶ The number reported is the total sequence from the clones indicated in the previous column. Potential overlap between clones was not removed here, but Ns were excluded.
that were used to `seed' the sequencing of new regions. The small ®ngerprint clone contigs were extended or merged with others as the map matured. The clones that make up the draft genome sequence therefore do not constitute a minimally overlapping setÐthere is overlap and redundancy in places. The cost of using suboptimal overlaps was justi®ed by the bene®t of earlier availability of the draft genome sequence data. Minimizing the overlap between adjacent clones would have required completing the physical map before undertaking large-scale sequencing. In addition, the overlaps between BAC clones provide a rich collection of SNPs. More than 1.4 million SNPs have already been identi®ed from clone overlaps and other sequence comparisons97. Because the sequencing project was shared among twenty centres in six countries, it was important to coordinate selection of clones across the centres. Most centres focused on particular chromosomes or, in some cases, larger regions of the genome. We also maintained a clone registry to track selected clones and their progress. In later phases, the global map provided an integrated view of the data from all centres, facilitating the distribution of effort to maximize coverage of the genome. Before performing extensive sequencing on a clone, several centres routinely examined an initial sample of 96 raw sequence reads from each subclone library to evaluate possible overlap with previously sequenced clones. Sequencing The selected clones were subjected to shotgun sequencing. Although the basic approach of shotgun sequencing is well established, the details of implementation varied among the centres. For example, there were differences in the average insert size of the shotgun libraries, in the use of single-stranded or double-stranded cloning vectors, and in sequencing from one end or both ends of each insert. Centres differed in the ¯uorescent labels employed and in the degree to which they used dye-primers or dye-terminators. 
The sequence detectors included both slab gel- and capillary-based devices. Detailed protocols are available on the web sites of many of the individual centres (URLs can be found at www.nhgri.nih.gov/genome_hub). The extent of automation also varied greatly among the centres, with the most aggressive automation efforts resulting in factory-style systems able to process more than 100,000 sequencing reactions in 12 hours (Fig. 3). In addition, centres differed in the amount of raw sequence data typically obtained for each clone (so-called half-shotgun, full shotgun and finished sequence).

Sequence information from the different centres could be directly integrated despite this diversity, because the data were analysed by a common computational procedure. Raw sequence traces were processed and assembled with the PHRED and PHRAP software packages (refs 77, 78; P. Green, unpublished). All assembled contigs of more than 2 kb were deposited in public databases within 24 hours of assembly.

The overall sequencing output rose sharply during production (Fig. 4). Following installation of new sequence detectors beginning in June 1999, sequencing capacity and output rose approximately eightfold in eight months to nearly 7 million samples processed per month, with little or no drop in success rate (ratio of useable reads to attempted reads). By June 2000, the centres were producing raw sequence at a rate equivalent to onefold coverage of the entire human genome in less than six weeks. This corresponded to a continuous throughput exceeding 1,000 nucleotides per second, 24 hours per day, seven days per week. This scale-up resulted in a concomitant increase in the sequence available in the public databases (Fig. 4).

A version of the draft genome sequence was prepared on the basis of the map and sequence data available on 7 October 2000. For this version, the mapping effort had assembled the fingerprinted BACs into 1,246 fingerprint clone contigs.
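The throughput figures quoted above (onefold coverage in under six weeks, more than 1,000 nucleotides per second) can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope illustration: the genome size of 3.2 Gb is an assumption not stated in this section, and the function names are hypothetical.

```python
# Back-of-envelope check of the quoted production rate.
# ASSUMPTION: a human genome size of about 3.2 Gb (not stated in this section).
GENOME_SIZE_BP = 3.2e9
SECONDS_PER_DAY = 24 * 60 * 60


def days_for_onefold(rate_nt_per_s, genome_bp=GENOME_SIZE_BP):
    """Days of continuous output needed to produce one genome's worth of raw bases."""
    return genome_bp / (rate_nt_per_s * SECONDS_PER_DAY)


def rate_for_onefold(days, genome_bp=GENOME_SIZE_BP):
    """Sustained rate (nt/s) needed to produce onefold raw coverage in `days` days."""
    return genome_bp / (days * SECONDS_PER_DAY)


days_at_1000 = days_for_onefold(1000.0)   # roughly 37 days
rate_at_six_weeks = rate_for_onefold(42)  # roughly 880 nt/s

print(f"At 1,000 nt/s, onefold coverage takes about {days_at_1000:.1f} days")
print(f"Onefold coverage in exactly six weeks needs about {rate_at_six_weeks:.0f} nt/s")
```

Under that assumed genome size, a rate just above 1,000 nucleotides per second does correspond to onefold coverage in somewhat less than six weeks, so the two statements are mutually consistent.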
The sequencing effort had sequenced and assembled 29,298 overlapping BACs and other large-insert clones (Table 2), comprising a total length of 4.26 gigabases (Gb). This resulted from around 23 Gb of underlying raw shotgun sequence data, or about 7.5-fold coverage averaged across the genome (including both draft and finished sequence). The various contributions to the total amount of sequence deposited in the HTGS division of GenBank are given in Table 3.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com | © 2001 Macmillan Magazines Ltd

Figure 3 The automated production line for sample preparation at the Whitehead Institute, Center for Genome Research. The system consists of custom-designed factory-style conveyor belt robots that perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.

Figure 4 Total amount of human sequence in the High Throughput Genome Sequence (HTGS) division of GenBank. The total is the sum of finished sequence (red) and unfinished (draft plus predraft) sequence (yellow). [Plot: sequence (Mb, 0 to 5,000) by month, Jan-96 to Oct-00; series: Finished, Unfinished (draft and pre-draft).]

Table 2 Total genome sequence from the collection of sequenced clones, by sequence status

Sequence status  Number of clones  Total clone length (Mb)  Average number of sequence reads per kb*  Average sequence depth†  Total amount of raw sequence (Mb)
Finished         8,277             897                      20–25                                     8–12                     9,085
Draft            18,969            3,097                    12                                        4.5                      13,395
Predraft         2,052             267                      6                                         2.5                      667
Total                                                                                                                          23,147

* The average number of reads per kb was estimated based on information provided by each sequencing centre. This number differed among sequencing centres, based on the actual protocols used.
† The average depth in high-quality bases (≥99% accuracy) was estimated from information provided by each sequencing centre. The average varies among the centres, and the number may vary considerably for clones with the same sequencing status. For draft clones in the public databases (keyword: HTGS_draft), the number can be computed from the quality scores listed in the database entry.
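As a consistency check on Table 2, the per-status rows can be summed and the raw sequence divided by the clone length to give an implied depth for each status. This is a minimal sketch using only the figures in the table; the dictionary layout and function names are illustrative, and the implied depth is a crude ratio rather than the quality-filtered depth reported by the centres.

```python
# Figures transcribed from Table 2 (clone counts, clone length in Mb, raw sequence in Mb).
TABLE2 = {
    "Finished": {"clones": 8277, "length_mb": 897, "raw_mb": 9085},
    "Draft": {"clones": 18969, "length_mb": 3097, "raw_mb": 13395},
    "Predraft": {"clones": 2052, "length_mb": 267, "raw_mb": 667},
}


def column_total(key):
    """Sum one column of the table across all sequence statuses."""
    return sum(row[key] for row in TABLE2.values())


def implied_depth(status):
    """Raw sequence divided by clone length: a rough depth estimate for one status."""
    row = TABLE2[status]
    return row["raw_mb"] / row["length_mb"]


print("clones:", column_total("clones"))
print("clone length (Mb):", column_total("length_mb"))
print("raw sequence (Mb):", column_total("raw_mb"))
for status in TABLE2:
    print(f"{status}: implied depth {implied_depth(status):.1f}")
```

The column sums reproduce the 29,298 clones, roughly 4.26 Gb of clone length and 23,147 Mb of raw sequence quoted in the text, and the implied depths fall close to the tabulated values (within the 8–12 range for finished clones, and near 4.5 and 2.5 for draft and predraft clones).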
By agreement among the centres, the collection of draft clones produced by each centre was required to have fourfold average sequence coverage, with no clone below threefold. (For this purpose, sequence coverage was defined as the average number of times that each base was independently read with a base-quality score corresponding to at least 99% accuracy.) We attained an overall average of 4.5-fold coverage across the genome for draft clones. A few of the sequenced clones fell below the minimum of threefold sequence coverage or have not been formally designated by centres as meeting draft standards; these are referred to as predraft (Table 2). Some of these are clones that span remaining gaps in the draft genome sequence and were in the process of being sequenced on 7 October 2000; a few are old submissions from centres that are no longer active.

The lengths of the initial sequence contigs in the draft clones vary as a function of coverage, but half of all nucleotides reside in initial sequence contigs of at least 21.7 kb (see below). Various properties of the draft clones can be assessed from instances in which there was substantial overlap between a draft clone and a finished (or nearly finished) clone. By examining the sequence alignments in the overlap regions, we estimated that the initial sequence contigs in a draft sequence clone cover an average of about 96% of the clone and are separated by gaps with an average size of about 500 bp.

Although the main emphasis was on producing a draft genome sequence, the centres also maintained sequence finishing activities during this period, leading to a twofold increase in finished sequence from June 1999 to June 2000 (Fig. 4). The total amount of human sequence in this final form stood at more than 835 Mb on 7 October 2000, or more than 25% of the human genome. This includes the finished sequences of chromosomes 21 and 22 (refs 93, 94).
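The statement that half of all nucleotides reside in initial sequence contigs of at least 21.7 kb is an N50 length, as defined in the glossary box. A minimal sketch of that calculation follows; the function name and the toy contig lengths are hypothetical and are not data from the project.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy example (lengths in kb, purely illustrative).
toy_contigs = [40, 30, 20, 10]
print("N50 of toy contigs:", n50(toy_contigs), "kb")
```

In the toy set the total is 100 kb; the contigs of 40 kb and 30 kb together hold 70 kb, which passes the halfway mark, so the N50 is 30 kb.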
As centres have begun to shift from draft to finished sequencing in the last quarter of 2000, the production of finished sequence has increased to an annualized rate of 1 Gb per year and is continuing to rise.

In addition to sequencing large-insert clones, three centres generated a large collection of random raw sequence reads from whole-genome shotgun libraries (Table 4; ref. 98). These 5.77 million successful sequences contained 2.4 Gb of high-quality bases; this corresponds to about 0.75-fold coverage and would be statistically expected to include about 50% of the nucleotides in the human genome (data available at http://snp.cshl.org/data). The primary objective of this work was to discover SNPs, by comparing these random raw sequences (which came from different individuals) with the draft genome sequence. However, many of these raw sequences were obtained from both ends of plasmid clones and thereby also provided valuable 'linking' information that was used in sequence assembly. In addition, the random raw sequences provide sequence coverage of about half of the nucleotides not yet represented in the sequenced large-insert clones; these can be used as probes for portions of the genome not yet recovered.

Assembly of the draft genome sequence

We then set out to assemble the sequences from the individual large-insert clones into an integrated draft sequence of the human genome. The assembly process had to resolve problems arising from the draft nature of much of the sequence, from the variety of clone sources, and from the high fraction of repeated sequences in the human genome. This process involved three steps: filtering, layout and merging.

The entire data set was filtered uniformly to eliminate contamination from nonhuman sequences and other artefacts that had not already been removed by the individual centres. (Information about contamination was also sent back to the centres, which are updating the individual entries in the public databases.)
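The earlier claim that 0.75-fold random coverage should include about 50% of the nucleotides in the genome can be reproduced with a simple model. The sketch below assumes that reads land uniformly at random, so that the number of reads covering a base is approximately Poisson-distributed; the text says only "statistically expected", so this model is an assumption, and the function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit at least once at the given fold coverage.

    ASSUMPTION: uniform random sampling, so P(base uncovered) = exp(-coverage).
    """
    return 1.0 - math.exp(-coverage)


fraction = expected_fraction_covered(0.75)
print(f"Expected fraction covered at 0.75-fold: {fraction:.3f}")
```

Under this model the expected fraction is a little over one half, in line with the figure of about 50% given in the text. Real libraries are not perfectly random, so the true fraction would be expected to differ somewhat.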
We also identified instances in which the sequence data from one BAC clone was substantially contaminated with sequence data from another (human or nonhuman) clone. The problems were resolved in most instances; 231 clones remained unresolved, and these were eliminated from the assembly reported here. Instances of lower levels of cross-contamination (for example, a single 96-well microplate misassigned to the wrong BAC) are more difficult to detect; some undoubtedly remain and may give rise to small spurious sequence contigs in the draft genome sequence. Such issues are readily resolved as the clones progress towards finished sequence, but they necessitate some caution in certain applications of the current data.

The sequenced clones were then associated with specific clones on the physical map to produce a 'layout'. In principle, sequenced clones that correspond to fingerprinted BACs could be directly assigned by name to fingerprint clone contigs on the fingerprint-based physical map. In practice, however, laboratory mixups occasionally resulted in incorrect assignments.
Table 3 Total human sequence deposited in the HTGS division of GenBank

Sequencing centre                                                     Total human sequence (kb)  Finished human sequence (kb)
Whitehead Institute, Center for Genome Research*                      1,196,888                  46,560
The Sanger Centre*                                                    970,789                    284,353
Washington University Genome Sequencing Center*                       765,898                    175,279
US DOE Joint Genome Institute                                         377,998                    78,486
Baylor College of Medicine Human Genome Sequencing Center             345,125                    53,418
RIKEN Genomic Sciences Center                                         203,166                    16,971
Genoscope                                                             85,995                     48,808
GTC Sequencing Center                                                 71,357                     7,014
Department of Genome Analysis, Institute of Molecular Biotechnology   49,865                     17,788
Beijing Genomics Institute/Human Genome Center                        42,865                     6,297
Multimegabase Sequencing Center; Institute for Systems Biology        31,241                     9,676
Stanford Genome Technology Center                                     29,728                     3,530
The Stanford Human Genome Center and Department of Genetics           28,162                     9,121
University of Washington Genome Center                                24,115                     14,692
Keio University                                                       17,364                     13,058
University of Texas Southwestern Medical Center at Dallas             11,670                     7,028
University of Oklahoma Advanced Center for Genome Technology          10,071                     9,155
Max Planck Institute for Molecular Genetics                           7,650                      2,940
GBF - German Research Centre for Biotechnology                        4,639                      2,338
Cold Spring Harbor Laboratory Lita Annenberg Hazen Genome Center      4,338                      2,104
Other                                                                 59,574                     35,911
Total                                                                 4,338,224                  842,027

Total human sequence deposited in GenBank by members of the International Human Genome Sequencing Consortium, as of 8 October 2000. The amount of total sequence (finished plus draft plus predraft) is shown in the second column and the amount of finished sequence is shown in the third column. Total sequence differs from totals in Tables 1 and 2 because of inclusion of padding characters and of some clones not used in assembly. HTGS, high throughput genome sequence.
* These three centres produced an additional 2.4 Gb of raw plasmid paired-end reads (see Table 4), consisting of 0.99 Gb from Whitehead Institute, 0.66 Gb from The Sanger Centre and 0.75 Gb from Washington University.

Table 4 Plasmid paired-end reads

                  Total reads deposited*  Read pairs†  Size range of inserts (kb)
Random-sheared    3,227,685               1,155,284    1.8–6
Enzyme digest     2,539,222               761,010      0.8–4.7
Total             5,766,907               1,916,294

The plasmid paired-end reads used a mixture of DNA from a set of 24 samples from the DNA Polymorphism Discovery Resource (http://locus.umdnj.edu/nigms/pdr.html). This set of 24 anonymous US residents contains samples from European-Americans, African-Americans, Mexican-Americans, Native Americans and Asian-Americans, although the ethnicities of the individual samples are not identified. Informed consent to contribute samples to the DNA Polymorphism Discovery Resource was obtained from all 450 individuals who contributed samples. Samples from the European-American, African-American and Mexican-American individuals came from NHANES (http://www.cdc.gov/nchs/nhanes.htm); individuals were recontacted to obtain their consent for the Resource project. New samples were obtained from Asian-Americans whose ancestry was from a variety of East and South Asian countries. New samples were also obtained for the Native Americans; tribal permission was obtained first, and then individual consents. See http://www.nhgri.nih.gov/Grant_info/Funding/RFA/discover_polymorphisms.html and ref. 98.
* Reflects data deposited with and released by The SNP Consortium (see http://snp.cshl.org/data).
† Read pairs represents the number of cases in which sequence from both ends of a genomic cloned fragment was determined and used in this study as linking information.

To eliminate such problems, sequenced clones were associated with the fingerprint clone contigs in the physical map by using the sequence data to calculate a
partial list of restriction fragments in silico and comparing that list with the experimental database of BAC fingerprints. The comparison was feasible because the experimental sizing of restriction fragments was highly accurate (to within 0.5–1.5% of the true size, for 95% of fragments from 600 to 12,000 base pairs (bp)) (refs 84, 85). Reliable matching scores could be obtained for 16,193 of the clones. The remaining sequenced clones could not be placed on the map by this method because they were too short, or they contained too many small initial sequence contigs to yield enough restriction fragments, or possibly because their sequences were not represented in the fingerprint database.

An independent approach to placing sequenced clones on the physical map used the database of end sequences from fingerprinted BACs (Table 1). Sequenced clones could typically be reliably mapped if they contained multiple matches to BAC ends, with all corresponding to clones from a single genomic region (multiple matches were required as a safeguard against errors known to exist in the BAC end database and against repeated sequences). This approach provided useful placement information for 22,566 sequenced clones.

Altogether, we could assign 25,403 sequenced clones to fingerprint clone contigs by combining in silico digestion and BAC end sequence match data. To place most of the remaining sequenced clones, we exploited information about sequence overlap or BAC-end paired links of these clones with already positioned clones. This left only a few, mostly small, sequenced clones that could not be placed (152 sequenced clones containing 5.5 Mb of sequence out of 29,298 sequenced clones containing more than 4,260 Mb of sequence); these are being localized by radiation hybrid mapping of STSs derived from their sequences.
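The in silico digestion step described above can be illustrated with a short sketch. It is not the consortium's matching procedure. The recognition site (AAGCTT, cut after the first base) is chosen here purely for illustration, the function names are hypothetical, and the score is a plain count of predicted fragments that find an experimental band within the quoted 1.5% sizing tolerance.

```python
def internal_fragments(contig, site="AAGCTT", cut_offset=1):
    """Sizes of fragments lying between consecutive cut sites within one contig.

    Fragments that run off either end of the contig are incomplete and are skipped,
    which is why a draft clone yields only a partial fragment list.
    """
    cuts = []
    pos = contig.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = contig.find(site, pos + 1)
    return [b - a for a, b in zip(cuts, cuts[1:])]


def count_matches(predicted, observed, tolerance=0.015, min_bp=600, max_bp=12000):
    """Count predicted fragments with an unused observed band within the tolerance."""
    remaining = sorted(observed)
    matched = 0
    for size in sorted(predicted):
        if not (min_bp <= size <= max_bp):
            continue  # outside the size range where sizing is reliable
        for i, band in enumerate(remaining):
            if abs(band - size) <= tolerance * size:
                matched += 1
                del remaining[i]  # each experimental band may be used only once
                break
    return matched


# Toy contig with three sites, giving two complete internal fragments.
toy_contig = "AAGCTT" + "C" * 700 + "AAGCTT" + "G" * 994 + "AAGCTT"
predicted = internal_fragments(toy_contig)
toy_fingerprint = [1010, 700, 5000]  # hypothetical experimental band sizes (bp)
print(predicted, count_matches(predicted, toy_fingerprint))
```

A clone whose contigs are short, or that contains few sites, produces few complete internal fragments, which matches the reasons given in the text for clones that could not be placed by this method.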
The fingerprint clone contigs were then mapped to chromosomal locations, using sequence matches to mapped STSs from four human radiation hybrid maps (refs 95, 99, 100), one YAC and radiation hybrid map (ref. 29), and two genetic maps (refs 101, 102), together with data from FISH (refs 86, 90, 103). The mapping was iteratively refined by comparing the order and orientation of the STSs in the fingerprint clone contigs and the various STS-based maps, to identify and refine discrepancies (Fig. 5). Small fingerprint clone contigs (< 1 Mb) were difficult to orient and, sometimes, to order using these methods.

In all, 942 fingerprint clone contigs contained sequenced clones. (An additional 304 of the 1,246 fingerprint clone contigs did not contain sequenced clones, but these tended to be extremely small and together contain less than 1% of the mapped clones. About one-third have been targeted for sequencing. A few derive from the Y chromosome, for which the map was constructed separately (ref. 89). Most of the remainder are fragments of other larger contigs or represent other artefacts. These are being eliminated in subsequent versions of the database.) Of these 942 contigs with sequenced clones, 852 (90%, containing 99.2% of the total sequence) were localized to specific chromosome locations in this way. An additional 51 fingerprint clone contigs, containing 0.5% of the sequence, could be assigned to a specific chromosome but not to a precise position.
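One simple way to picture the order-and-orientation comparison described above is to count, for each pair of STSs, whether their order within a fingerprint clone contig agrees with their order on a reference map. The sketch below is an illustration only and is not the procedure used for the draft sequence; the function names and marker positions are hypothetical.

```python
from itertools import combinations


def orientation_votes(sts_positions):
    """Count marker pairs whose order agrees or disagrees between contig and map.

    sts_positions: list of (position_in_contig, position_on_reference_map) tuples.
    """
    concordant = discordant = 0
    for (c1, m1), (c2, m2) in combinations(sts_positions, 2):
        product = (c1 - c2) * (m1 - m2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return concordant, discordant


def suggest_orientation(sts_positions):
    """Return 'forward', 'reverse' or 'undetermined' from the pairwise votes."""
    concordant, discordant = orientation_votes(sts_positions)
    if concordant > discordant:
        return "forward"
    if discordant > concordant:
        return "reverse"
    return "undetermined"


# Hypothetical markers: (Mb within contig, position on a reference map).
print(suggest_orientation([(0.1, 50.0), (0.5, 52.0), (0.9, 55.0)]))
print(suggest_orientation([(0.1, 55.0), (0.5, 52.0), (0.9, 50.0)]))
print(suggest_orientation([(0.3, 51.0)]))
```

A contig carrying a single mapped STS gives no pairs to compare and therefore no orientation, which is one way to see why small fingerprint clone contigs were difficult to orient.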
Figure 5 Positions of markers on previous maps of the genome (the Genethon genetic map (ref. 101) and Marshfield genetic map (http://research.marshfieldclinic.org/genetics/genotyping_service/mgsver2.htm), the GeneMap99 radiation hybrid map (ref. 100), and the Whitehead YAC and radiation hybrid map (ref. 29)) plotted against their derived position on the draft sequence for chromosome 2. The horizontal units are Mb but the vertical units of each map vary (cM, cR and so on) and thus all were scaled so that the entire map spans the full vertical range. Markers that map to other chromosomes are shown in the chromosome lines at the top. The data sets generally follow the diagonal, indicating that order and orientation of the marker sets on the different maps largely agree (note that the two genetic maps are completely superimposed). In a, there are two segments (bars) that are inverted in an earlier version draft sequence relative to all the other maps. b, The same chromosome after the information was used to reorient those two segments. [Plot, panels a and b: horizontal axis, position on chromosome 2 (0 to 250); vertical axis, map location; series: Genethon map, Gene map, Marshfield map, YAC map; chromosome lines at top: 1, 3–22, X, Y.]

Figure 6 The key steps (a–d) in assembling individual sequenced clones into the draft genome sequence. A1–A5 represent initial sequence contigs derived from shotgun sequencing of clone A, and B1–B6 are from clone B. [Diagram labels: end-to-end alignment: OK; alignment in middle only: not OK.]