THE HUMAN GENOME nome, and even a modest error rate can entire human genome in a single facility, dent, nonbiased view of the genome. The sec- reduce the effectiveness of assembly. In we were able to ensure uniform quality ond approach involves clustering all of the frag- ddition, maintaining the validity of mate- standards and the cost advantages associat- ments to a region or chromosome on the basis pair information is absolutely critical for ed with automation, an economy of scale, of mapping information. The clustered data the algorithms described below. Procedural and process consistency were then shredded and subjected to computa- controls were established for maintaining tional assembly. Both approaches provided es- the validity of sequence mate-pairs as se- 2 Genome Assembly Strategy and ntially the same reconstruction of assembled quencing reactions proceeded through the Characterization process, including strict rules built into the Summary. We describe in this section the two DNA sequence win proper order and orienta- LIMS. The accuracy of sequence data pro- approaches that we used to assemble the ge- greater sequence coverage(fewer gaps) and duced by the Celera process was validated nome. One method involves the computational was the principal sequence used for the analysi in the course of the Drosophila genome combination of all sequence reads with shred- phase. In addition, we document the complete- project (26). By collecting data for the ded data from Gen Bank to generate an indepen- ness and correctness of this assembly process Potential Entry Points Potential Exit Points oces Human Sample Workflow Process - sample screening Tissue Samples DNA/RNA Extraction QC: size and clarity DNA/RNA (DNA Resources] /DNA Resources, (DNA Resources) QC, size concentration QC, insert size DNA/RNA(External) Libraries (DNA Resource Library Construction library complexity /DNA Resources/ (DNA Resources] 8cNEam5o Libraries QC: titer functional test Pre-Sequencing Fluorescently Labeled DNA Resource (Pre-Sequencing Labl C: monitor statistical Fluorescently Labeled Sequencing summary data Trace Files [NT] Sequencing Lab (Pre-Sequencing Lab) (Sequencing Lab/ vector contaminant Trace Files [UNIX load QCDS quality info Post-Sequencing creening [Content Systems/ 33sE= QC: byte count, External Fragments emove duplicates Proces Pre-Assembly IContent Systems·EDA Content Systems] /Content Systemsj QC: "gatekee External Trimmed syntax, duplicates Fragments Proto I/O File Generation_ quality values. Proto l/o files y Chromosome Proto l/o Files gatekeeper"run again Assembly Team QA review Assemblies [Informatics Research/ R/C Fig. 2. Flow diagram for sequencing pipeli lected, and processed in compliance with standard operating proc and da tau ith ot Maternal and extemal entities ac t dures, with a focus on quality within and across departments. Each ntrol measures, and responsible parties are indicated and are process has defined inputs and outputs with the capability to exchang further in the text. 1308 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
nome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of matepair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency. 2 Genome Assembly Strategy and Characterization Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text. T H E H UMAN G ENOME 1308 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME d provide a comparison to the public gend 2.1 Assembly data sets equences. In the past 2 years the PFP has sequence, which was reconstructed largely by We used two independent sets of data for our focused on a product of lower quality and com- an independent BAC-by-BAC approach Our assemblies. The first was a random shotgun pleteness, but on a faster time-course, by con- assemblies effectively covered the euchromatic data set of 27.27 million reads of average length centrating on the production of Phase I data regions of the human chromosomes. More than 543 bp produced at Celera. This consisted from a 3X to 4X light-shotgun of each BAC 90% of the genome was in scaffold assemblies largely of mate-pair reads from 16 libraries clone of 100,000 bp or greater, and 25% of the ge. constructed from DNA samples taken from five We screened the bactig sequences for con- nome was in scaffolds of 10 million bp or different donors. Libraries with insert sizes of 2, taminants by using the blast algorithm 0, and 50 kbp were used. By looking at how against three data sets: (i) vector sequences mate pairs from a library were positioned inin Univec core (38), filtered for a 25-bl Shotgun sequence assembly is a classic known sequenced stretches of the genome, we match at 98% sequence identity at the ends example of an inverse problem: given a set were able to characterize the range of insert of the sequence and a 30-bp match internal of reads omly sampled from a target sizes in each library and determine a mean and to the sequence; (ii) the nonhuman portion sequence, reconstruct the order and the po- standard deviation. Table 1 details the number of the High Throughput Genomic(HTG) sition of those reads in the target Genome of reads, sequencing coverage, and clone cov- Seqences division of Gen Bank (39), fil- assembly algorithms developed for Dro- erage achieved by the data set. The clone cov- tered at 200 bp at 98%; and (iii)the non- sophila have now been extended to assemble erage is the coverage of the genome in cloned redundant nucleotide sequences from Gen- the "25-fold larger human genome. Celera as- DNA, considering the entire insert of each Bank without primate and human virus en- semblies consist of a set of contigs that are clone that has sequence from both ends. The tries, filtered at 200 bp at 98%.Whenever mapped to chromosomal locations by using amount of physical DNA coverage of the ge- 50 bp of the end of a contig, the ip up to 8 ordered and oriented into scaffolds that are then clone coverage provides a measure of the 25 bp or more of vector was found within lection of overlapping sequence reads that pro- Celera trimmed sequences gave a 51X cover- these criteria we removed 2.6 Mbp of pos- N vide a consensus reconstruction for a contigu- age of the genome, and clone coverage was sible contaminant and vector from the ous interval of the genome Mate pairs are a 3.42X, 1640X, and 18 84X for the 2, 10, and Phase 3 data, 61.0 Mbp from the Phase 1 entral component of the assembly strategy. 50-kbp libraries, respectively, for a total of and 2 data, and 16 1 Mbp from the Phase a size of gaps between consecutive contigs is The second data set was from the publicly 4363. 7 Mbp of PFP sequence data 20% ( known with reasonable precision. This is ac- funded Human Genome Project(PFP)and is finished, 75% rough-draft(Phase 1 and 2) arily derived from BAC clones (30). The and 5% single sequencing reads(Phase 0) one of which is in one contig, and the other of BAC data input to the assemblies came from a An additional 104, 018 BAC end-sequence which is in another, implies an orientation and download of GenBank on I September 2000 mate pairs were also downloaded and in- o distance between the two contigs(Fig 3). Fi-(Table 2) totaling 4443. 3 Mbp of sequence. cluded in the data sets for both assembly E nally, our assemblies did not incorporate all The data for each BAC is deposited at one of processes(18) reads into the final set of reported scaffolds. four levels of completion. Phase 0 data are a set This set of unincorporated reads is termed of generally unassembled sequencing reads 2. 2 Assembly strategies chaff, " and typically consisted of reads from from a very light shotgun of the BAC, typically Two different approaches to assembly were within highly repetitive regions, data from other less than 1x. Phase I data are unordered as- pursued. The first was a whole-genome as- organisms introduced through various routes as semblies of contigs, which we call BAC contigs sembly process that used Celera data and the found in many genome projects, and data of or bactigs. Phase 2 data are ordered assemblies PFP data in the form of additional synthetic a poor quality or with untrimmed vector. was a compart STS tioned the Celera and pfp data into sets o Mappe Genome localized to large chromosomal segments and 9 then performed ab initio shotgun assembly on 3 each set. Figure 4 gives a schematic of the overall process flow For the whole-genome assembly, the PFP data was first disassembled or shredded into a synthetic shotgun data set of 550-bp reads that form a perfect 2X covering of the bactigs. This Gap(mean std. dev. Known resulted in 16.05 million "faux reads that were sufficient to cover the genome 2.96X because of redundancy in the Bac data set, without Contig corporating the biases inherent in the PFP Consensus assembly process. The combined data set of Reads(of several haplotypes) 43.32 million reads(8X), and all associated mate-pair information, were then subjected to SNPS our whole-genome assembly algorithm to pro- BAC Fragments duce a reconstruction of the genome. Neither the location of a bac in the ernally derived reads ive different individuals(black lines)are combined to produce a assembly of bactigs was used in this process. contig and a ce (green line). Contigs are connected into scaffolds(red) by using Bactigs were are then mapped to the genome (gray line)with STS (blue star) found strong evidence that 2. 13% of them were misassembled (40). Furthermore, BAC location www.sciencemagorgSciEnceVol29116FebRuarY2001 1309
and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger. Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ;25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed “chaff,” and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector. 2.1 Assembly data sets We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.13 coverage of the genome, and clone coverage was 3.423, 16.403, and 18.843 for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.73 clone coverage. The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 13. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 33 to 43 light-shotgun of each BAC clone. We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Seqences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18). 2.2 Assembly strategies Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow. For the whole-genome assembly, the PFP data was first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that form a perfect 23 covering of the bactigs. This resulted in 16.05 million “faux” reads that were sufficient to cover the genome 2.963 because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (83), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information. T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1309 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME information was ignored because some BACs at least 2.2% of the BACs contained sequence (see below). In short, we performed a true, ab were not correctly placed on the PFP physical data that were not part of the given BAC (41), initio whole-genome assembly in which we map and because we found strong evidence that possibly as a result of sample- tracking errors took the expedient of deriving additional se- quence coverage, but not mate pairs, assembled Table 2. Gen Bank data input into assembly bactigs, or genome locality, from some exter- ally generated data. Completion phase sequence compartmentalized shotgun Center Statistics (CSA), Celera and PFP data were partitioned 0 1 and 2 3 into the largest possible chromosomal segments Whitehead Institute/ Number of accession records 2,825 363 or"components"that could be determined with MIT Center for lumber of contigs 243.786 3 confidence, and then shotgun assembly was ap- Genome Research, 194, 490, 158 1,083, 848, 245 48,829, 358 plied to each partitioned subset wherein the 1553,5 13,654 875, 618 2,202 bactig data were again shredded into faux reads 4417,055 to ensure an independent ab initio assembly of Average contig length(bp) the component. By subsetting the data in this ashington University, Number of accession records way, the overall computational effort was re- 3,232 USA 61812 duced and the effect of interchromosomal dupli- Total base pai 1,195.732561,171.788 cations was ameliorated This also resulted in a Total vector masked(bp) reconstruction of the genome that was relatively Total contaminant ma 224691,476,141 independent of the whole-genome assembly re- Average contig length(bp) 9,079 126,319 pared for consistency The quality of the parti College of Number of accession records 363 tioning into components was crucial so that N Total base pairs Total vector masked(bp) 0 1.,784,70 * 3960 gether. We constructed components from()the 218.769 Total contaminant masked Average contig length(bp) 5. 919 135.033 to Celera's data set. The BaC assemblies were Production Sequencing Number of accession records 754 obtained by a combining assembler that used the s Number of contigs 34938 4 bactigs and the 5X Celera data mapped to those Genome Institute Total base 8.680,214294.249,631 60975.328 bactigs as input. This effort was undertaken as Total vector masked(bp) 7,274 an interim step solely because the more accurate g 665,818 4,642372 118 387 and complete the scaffold for a given sequence E stretch, the more accurately one one can tile these age contig length(bp) 8422 scaffolds into contiguous components on the he Institute of Physical Number of accession records 1.149 Number of contigs 300 basis of sequence overlap and mate-pair infor- esearch(RIKEN), Total ba 018281227520093,926 mation. We further visually inspected and cu- Japan Total vector masked(bp) 2371 rated the scaffold tiling of the components to Total contaminant masked(bp) 308426 27781 further increase its acc uracy. For the final Csa Average contig length(bp) 7,093 66,978 assembly, all but the partitioning was ignored, a nger Centre, UK Number of accession records d an independent, ab initio reconstruction of Number of contigs 0 74324 2,599 the sequence in each component was obtained 8 0 689,059, 692 246, 118,000 by applying our whole-genome assembly algo- Total vector masked(bp) 427326 25,054 rithm to the partitioned, relevant Celera data and 9 otal contaminan verage contig length(bp) 0 2,066, 305 374 561 the shredded, faux reads of the partitioned, rel- 3 94697 evant bactig data. Others* Number of accession records 42 458 Number of contigs 3,458 2.3 Whole-genome assembly Total base pai 5.564879283358877246,474.157 Total vector masked 57448 27947 32136 The algorithms used for whole-genome as ..665 1791.849 sembly (WGA) of the human genome were enhancements to those used to produce the erage contig length(bp) e of the drosophila All centers combined Number of accession records Number of contigs 409628 9,137 The WGA assembler consists of a pipeline 3360047574835,72226 mposed of five prin ages: Screener Total vector masked(bp) 2438575 82, 284 Overlapper, Unitigger, Scaffolder, and Repeat Total contaminant masked 16311,664 365230 Average contig length(bp) 811 66 and marks all microsatellite repeats with less uting serang tots o shung mologyr Keio University School of Medicine: lawrence ing Alu, Line, and ribosomal DNA.Marked 914 than a 6-bp element, and screens out all H: Genome Therapeutics Corporation; GENOSCOPE me Center: known interspersed repeat elements, includ- Livermore National Laboratory: Cold Spring Harbor Laboratory: Los ALamos National Laboratory: Max-Planck Institut fuer regions get searched search; The Institute of Physical and Cherechnology Corporation: Stanford University: The Institute for Genomic screened regions do not get searched, but can rsity of Oklahoma: Universi Southwestern Medical Center, University of washingto tThe 4,405,, 825 bases contributed by all centers were be part of an overlap that involves unscreened shredded into faux reads resulting in 2.96x coverage of the genome atching segments 1310 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC (41), possibly as a result of sample-tracking errors (see below). In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data. In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or “components” that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera’s data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 53 Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data. 2.3 Whole-genome assembly The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28). The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments. Table 2. GenBank data input into assembly. Center Statistics Completion phase sequence 0 1 and 2 3 Whitehead Institute/ Number of accession records 2,825 6,533 363 MIT Center for Number of contigs 243,786 138,023 363 Genome Research, Total base pairs 194,490,158 1,083,848,245 48,829,358 USA Total vector masked (bp) 1,553,597 875,618 2,202 Total contaminant masked (bp) 13,654,482 4,417,055 98,028 Average contig length (bp) 798 7,853 134,516 Washington University, Number of accession records 19 3,232 1,300 USA Number of contigs 2,127 61,812 1,300 Total base pairs 1,195,732 561,171,788 164,214,395 Total vector masked (bp) 21,604 270,942 8,287 Total contaminant masked (bp) 22,469 1,476,141 469,487 Average contig length (bp) 562 9,079 126,319 Baylor College of Number of accession records 0 1,626 363 Medicine, USA Number of contigs 0 44,861 363 Total base pairs 0 265,547,066 49,017,104 Total vector masked (bp) 0 218,769 4,960 Total contaminant masked (bp) 0 1,784,700 485,137 Average contig length (bp) 0 5,919 135,033 Production Sequencing Number of accession records 135 2,043 754 Facility, DOE Joint Number of contigs 7,052 34,938 754 Genome Institute, Total base pairs 8,680,214 294,249,631 60,975,328 USA Total vector masked (bp) 22,644 162,651 7,274 Total contaminant masked (bp) 665,818 4,642,372 118,387 Average contig length (bp) 1,231 8,422 80,867 The Institute of Physical Number of accession records 0 1,149 300 and Chemical Number of contigs 0 25,772 300 Research (RIKEN), Total base pairs 0 182,812,275 20,093,926 Japan Total vector masked (bp) 0 203,792 2,371 Total contaminant masked (bp) 0 308,426 27,781 Average contig length (bp) 0 7,093 66,978 Sanger Centre, UK Number of accession records 0 4,538 2,599 Number of contigs 0 74,324 2,599 Total base pairs 0 689,059,692 246,118,000 Total vector masked (bp) 0 427,326 25,054 Total contaminant masked (bp) 0 2,066,305 374,561 Average contig length (bp) 0 9,271 94,697 Others* Number of accession records 42 1,894 3,458 Number of contigs 5,978 29,898 3,458 Total base pairs 5,564,879 283,358,877 246,474,157 Total vector masked (bp) 57,448 279,477 32,136 Total contaminant masked (bp) 575,366 1,616,665 1,791,849 Average contig length (bp) 931 9,478 71,277 All centers combined† Number of accession records 3,021 21,015 9,137 Number of contigs 258,943 409,628 9,137 Total base pairs 209,930,983 3,360,047,574 835,722,268 Total vector masked (bp) 1,655,293 2,438,575 82,284 Total contaminant masked (bp) 14,918,135 16,311,664 3,365,230 Average contig length (bp) 811 8,203 91,466 *Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center; Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare, Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center, University of Washington. †The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.963 coverage of the genome. T H E H UMAN G ENOME 1310 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME against every other read in search of complete 100-to 400-bp repetitive segment? nd other aggressive and thus more likely to make a The Overlapper compares every read singly interspersed Alu element mistake. For the human assembly, we contin- end-to-end overlaps of at least 40 bp and with The result of running the Unitigger was ued to use the first""Rocks"substage where no more than 6% differences in the match. thus a set of correctly assembled subcontigs all unitigs with a good, but not definitive, Because all data are scrupulously covering an estimated 73.6% of the human discriminator score are placed in a scaffold trimmed, the Overlapper can insist genome. The Scaffolder then proceeded to gap. This was done with the condition that plete overlap matches. Computing Ise mate-pair information to link these to- two or more mate pairs with one of their all overlaps took roughly 10,000 CPU hours gether into scaffolds. When there are two or reads already in the scaffold unambiguously with a suite of four-processor Alpha SMPs more mate pairs that imply that a given pair place the unitig in the given gap. We estimate with 4 gigabytes of RAM. This took 4 to 5 of U-unitigs are at a certain distance and the probability of inserting a unitig into an days in elapsed time with 40 such machines orientation with respect to each other, the incorrect gap with this strategy to be less than operating in parallel robability of this being wrong is again 10- based on a probabilistic analysis Every overlap computed above is statisti- roughly 1 in 10, assuming that mate pairs We revised the ensuing"Stones"substage cally a l-in-107 event and thus not a coinci- are false less than 2% of the time. Thus, one of the human assembly, making it more like dental event. What makes assembly combi- can with high confidence link together all the mechanism suggested in our earlier work natorially difficult is that while many over- U-unitigs that are linked by at least two 2-or (43). For each gap, every read R that is placed laps are actually sampled from overlapping 10-kbp mate pairs producing intermediate- in the gap by virtue of its mated pair M being regions of the genome, and thus imply that sized scaffolds that are then recursively in a contig of the scaffold and implying rs the sequence reads should be assembled to- linked together by confirming 50-kbp mate placement is collected. Celera's mate-pairing gether, even more overlaps are actually from pairs and BAC end sequences. This process information is correct more than 99% of the wo distinct copies of a low-copy repeated yielded scaffolds that are on the order of time. Thus, almost every, but not all, of the an error if put together. We call the former their contigs that generally correspond to re- a read does not belong it rarely agrees with N "true overlaps"and the latter"repeat-induced petitive elements and occasionally to small the remainder of the reads. Therefore, we overlaps. " The assembler must avoid choos- sequencing gaps. These scaffolds reconstruct simply assemble this set of reads within the ng repeat-induced overlaps, especially early the majority of the unique sequence within a gap, eliminating any reads that conflict with in the process. For the Drosophila assembly, we engaged more reliable than the one it replaced for the o ger.We first find all assemblies of reads that in a three-stage repeat resolution strategy Drosophila assembly; in the assembly of a s appear to be uncontested with respect to all where each stage was progressively more simulated shotgun data set of hi other reads. We call the contigs formed from these subassemblies unitigs (for uniquely as- sembled contigs). Formally, these unitigs are 5.11X Celera Reads Public Bactigs the uncontested interval subgraphs of the 21B graph of all overlaps (42). Unfortunately, al- though empirically many of these assemblies are correct(and thus involve only true over- laps), some are in fact collections of reads from several copies of a repetitive element Shredd Matcher that have been overcollapsed into a single 2.96X FaUx ubassembly. However, the overcollapsed Reads Celera-uni Bactigs Celera pairs 33sE= are easily identified because their av- eads binned by BAC) 8 overage depth is too high to be con- with the overall level of sequence WGA WGA coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique Unique BAC DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the Tiler unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very kely to be correctly assembled, of which we elect those that will consistently scaffold (see below ) and thus are again almost certain Components to be correct. We call the union of these two WGA+Shredder ets U-unitigs. Empirically, we found from 6X simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are >2 kbp WGA Assembly CSA Assembly long. We are further able to identity the Fig. 4. Architecture of Celera's two-pronged assembly strat boundary of the start of a repetitive element process performing the function indicated by its label, with the labels on arcs between ovals at the ends of a U-unitig and leverage this so describing the nature of the objects produced and/or consumed by a process. This figu that U-unitigs span more than 93% of all summarizes the discussion in the text that defines the terms and phrases used www.sciencemagorgSciEnceVol29116FebRuarY2001 1311
The Overlapper compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data are scrupulously vectortrimmed, the Overlapper can insist on complete overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines operating in parallel. Every overlap computed above is statistically a 1-in-1017 event and thus not a coincidental event. What makes assembly combinatorially difficult is that while many overlaps are actually sampled from overlapping regions of the genome, and thus imply that the sequence reads should be assembled together, even more overlaps are actually from two distinct copies of a low-copy repeated element not screened above, thus constituting an error if put together. We call the former “true overlaps” and the latter “repeat-induced overlaps.” The assembler must avoid choosing repeat-induced overlaps, especially early in the process. We achieve this objective in the Unitigger. We first find all assemblies of reads that appear to be uncontested with respect to all other reads. We call the contigs formed from these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very likely to be correctly assembled, of which we select those that will consistently scaffold (see below), and thus are again almost certain to be correct. We call the union of these two sets U-unitigs. Empirically, we found from a 63 simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are .2 kbp long. We are further able to identify the boundary of the start of a repetitive element at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all singly interspersed Alu elements and other 100-to 400-bp repetitive segments. The result of running the Unitigger was thus a set of correctly assembled subcontigs covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair information to link these together into scaffolds. When there are two or more mate pairs that imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 1010, assuming that mate pairs are false less than 2% of the time. Thus, one can with high confidence link together all U-unitigs that are linked by at least two 2- or 10-kbp mate pairs producing intermediatesized scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This process yielded scaffolds that are on the order of megabase pairs in size with gaps between their contigs that generally correspond to repetitive elements and occasionally to small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a genome. For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each stage was progressively more aggressive and thus more likely to make a mistake. For the human assembly, we continued to use the first “Rocks” substage where all unitigs with a good, but not definitive, discriminator score are placed in a scaffold gap. This was done with the condition that two or more mate pairs with one of their reads already in the scaffold unambiguously place the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect gap with this strategy to be less than 1027 based on a probabilistic analysis. We revised the ensuing “Stones” substage of the human assembly, making it more like the mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the gap by virtue of its mated pair M being in a contig of the scaffold and implying R’s placement is collected. Celera’s mate-pairing information is correct more than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in the gap, and when a read does not belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set of reads within the gap, eliminating any reads that conflict with the assembly. This operation proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data set of human chromoFig. 4. Architecture of Celera’s two-pronged assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used. T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1311 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME some 22, all stones were placed correctly. have required a computer with a 600-gigabyte tribution of each was essentially exponen The final method of resolving gaps is to RAM. By making the Overlapper and Unitigger More than 50% of all gaps were less than 500 the gap. We call this external gap"walking. computation with a maximum of instantaneous long, and no gap was >100 kbp long. Similar- We did not include the very aggressive "Peb- usage of 28 gigabytes of RAM. Moreover, the ly, more than 65% of the sequence is in contigs bles"substage described in our Drosophila incremental nature of the first three stages al- >30 kbp, more than 31% is in contigs >100 ork, which made enough mistakes so as to lowed us to continually update the state of this kbp, and the largest contig was 1. 22 Mbp long produce repeat reconstructions for long inter- of the computation as data were delivered Table 3 gives detailed summary statistics for spersed elements whose quality was only and then perform a 7-day run to complete Scaf- the structure of this assembly with a direct 99.62% correct. We decided that for the hu- folding and Repeat Resolution whenever de- comparison to the compartmentalized shotgun man genome it was philosophically better not sired. For our assembly operations, the total assembly to introduce a step that was certain to produce compute infrastructure consists of 10 four-pro- somewhat larger number of gaps of some- cluster(Compaq's ES40, Regatta) and a 16- assemby mentalized shotgun less than 99.99% accuracy. The cost was a cessor SMPs with 4 gigabytes of memory per 2.4 what larger size. processor NUMA machine with 64 gigabytes In addition to the WGA approach, we pur At the final stage of the assembly process, of memory(Compag's gS160, wildfire). The sued a localized assembly approach that w nd also at several intermediate points, a total compute for a run of the assembler was intended to subdivide the genome into seg- consensus sequence of every contig is pro- roughly 20,000 CPU hours. nents. each of which could be shotgun as- duced. Our algorithm is driven by the princi- The assembly of Celera's data, together sembled individually. We expected that this e of maximum pa y, with quality- with the shredded bactig data, produced a set of would help in resolution of large interchro- value-weighted measures for evaluating each scaffolds totaling 2.848 Gbp in span and con- mosomal duplications and base. The net effect is a Bayesian estimate of sisting of 2. 586 Gbp of sequence. The chaff, or tics for calculating U-unitigs. The compart the correct base to report at each position. set of reads not incorporated in the assembly, mentalized assembly process involved clus- N Consensus generation uses Celera data when- numbered 11.27 million(26%), which is con- tering Celera reads and bactigs into large, ever it is present In the event that no Celera sistent with our experience for Drosophila. multiple megabase regions of the genome, data cover a given region, the bac data More than 84% of the genome was covered by and then running the WGa assembler on the scaffolds >100 kbp long, and these averaged Celera data and shredded, faux reads ob- a A key element of achieving a WGA of the 91% sequence and 9% gaps with a total of tained from the bactig data human genome was to parallelize the Overlap- 2. 297 Gbp of sequence. There were a total of The first phase of the CSa strategy was to per and the central consensus sequence- con- 93, 857 gaps among the 1637 scaffolds >100 separate Celera reads into those that matched structing subroutines. In addition, memory was kbp. The average scaffold size was 1.5 Mbp, the BAC contigs for a particular PFP BAC a real issue-a straightforward application of the average contig size was 24.06 kbp, and the entry, and those that did not match any public o the software we had built for Drosophila would average gap size was 2.43 kbp, where the dis- data. Such matches must be guaranteed to E Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. Scaffold size >30 kbp >100 kbp >500 kbp >1000kbp Compartmentalized shotgun assembly o. of bp in 2748892430 2,700489906 2489357,260 including intrascaffold gap 22486891283 No of bp in contigs 653979,733 2.524251302 2491.538372 2320648201 2,10652190 1935 1,060 170033 107199 93138 92078 No. of gaps≤1kbp 72,091 69.175 67289 59915 53,354 Average scaffold size 54.217 1,395602 3.118848 Average contig size 15609 22496 23,242 25686 Average intrascaffold gap size 1832 19883 1988321 1988321 1988321 1988321 ole-genome assembly No of bp in scaffolds 2574792618 2.525334447 2,328.535 No. of bp in contigs 2,334,343339 2,297,678935 2,143002 No, of scaffolds No. of contigs 221,03 84 No of gaps 102068 96682 No. of gaps≤1kbp 132 4079 1,027,041 2846620 Average contig size(bp) 23.534 24.061 25.999 Average intrascaffold gap size 2487 2213 1,224,073 1.224073 1,224073 1.224073 1.224073 1312 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
some 22, all stones were placed correctly. The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap “walking.” We did not include the very aggressive “Pebbles” substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size. At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with qualityvalue–weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used. A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence–constructing subroutines. In addition, memory was a real issue—a straightforward application of the software we had built for Drosophila would have required a computer with a 600-gigabyte RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum of instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq’s ES40, Regatta) and a 16- processor NUMA machine with 64 gigabytes of memory (Compaq’s GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours. The assembly of Celera’s data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila. More than 84% of the genome was covered by scaffolds .100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds .100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, .62% of all gaps were less than 1 kbp long, and no gap was .100 kbp long. Similarly, more than 65% of the sequence is in contigs .30 kbp, more than 31% is in contigs .100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly. 2.4 Compartmentalized shotgun assembly In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data. The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. Scaffold size All .30 kbp .100 kbp .500 kbp .1000 kbp Compartmentalized shotgun assembly No. of bp in scaffolds 2,905,568,203 2,748,892,430 2,700,489,906 2,489,357,260 2,248,689,128 (including intrascaffold gaps) No. of bp in contigs 2,653,979,733 2,524,251,302 2,491,538,372 2,320,648,201 2,106,521,902 No. of scaffolds 53,591 2,845 1,935 1,060 721 No. of contigs 170,033 112,207 107,199 93,138 82,009 No. of gaps 116,442 109,362 105,264 92,078 81,288 No. of gaps #1 kbp 72,091 69,175 67,289 59,915 53,354 Average scaffold size (bp) 54,217 966,219 1,395,602 2,348,450 3,118,848 Average contig size (bp) 15,609 22,496 23,242 24,916 25,686 Average intrascaffold gap size (bp) 2,161 2,054 1,985 1,832 1,749 Largest contig (bp) 1,988,321 1,988,321 1,988,321 1,988,321 1,988,321 % of total contigs 100 95 94 87 79 Whole-genome assembly No. of bp in scaffolds (including intrascaffold gaps) 2,847,890,390 2,574,792,618 2,525,334,447 2,328,535,466 2,140,943,032 No. of bp in contigs 2,586,634,108 2,334,343,339 2,297,678,935 2,143,002,184 1,983,305,432 No. of scaffolds 118,968 2,507 1,637 818 554 No. of contigs 221,036 99,189 95,494 84,641 76,285 No. of gaps 102,068 96,682 93,857 83,823 75,731 No. of gaps #1 kbp 62,356 60,343 59,156 54,079 49,592 Average scaffold size (bp) 23,938 1,027,041 1,542,660 2,846,620 3,864,518 Average contig size (bp) 11,702 23,534 24,061 25,319 25,999 Average intrascaffold gap size (bp) 2,560 2,487 2,426 2,213 2,082 Largest contig (bp) 1,224,073 1,224,073 1,224,073 1,224,073 1,224,073 % of total contigs 100 90 89 83 77 T H E H UMAN G ENOME 1312 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from