The remaining 39 contigs containing 0.3% of the sequence were not positioned at all.

We then merged the sequences from overlapping sequenced clones (Fig. 6), using the computer program GigAssembler (ref. 104). The program considers nearby sequenced clones, detects overlaps between the initial sequence contigs in these clones, merges the overlapping sequences and attempts to order and orient the sequence contigs. It begins by aligning the initial sequence contigs from one clone with those from other clones in the same fingerprint clone contig on the basis of length of alignment, per cent identity of the alignment, position in the sequenced clone layout and other factors. Alignments are limited to one end of each initial sequence contig for partially overlapping contigs or to both ends of an initial sequence contig contained entirely within another; this eliminates internal alignments that may reflect repeated sequence or possible misassembly (Fig. 6b). Beginning with the highest scoring pairs, initial sequence contigs are then integrated to produce 'merged sequence contigs' (usually referred to simply as 'sequence contigs'). The program refines the arrangement of the clones within the fingerprint clone contig on the basis of the extent of sequence overlap between them and then rebuilds the sequence contigs. Next, the program selects a sequence path through the sequence contigs (Fig. 6c). It tries to use the highest quality data by preferring longer initial sequence contigs and avoiding the first and last 250 bases of initial sequence contigs where possible. Finally, it attempts to order and orient the sequence contigs by using additional information, including sequence data from paired-end plasmid and BAC reads, known messenger RNAs and ESTs, as well as additional linking information provided by centres. The sequence contigs are thereby linked together to create 'sequence-contig scaffolds' (Fig. 6d).
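The 'highest scoring pairs first' step described above can be illustrated with a small greedy sketch. This is not GigAssembler's actual implementation: the `EndOverlap` record, the `greedy_joins` function, the scores and the rule that each contig end may take part in only one join are all assumptions made for illustration, and the sketch covers only partially overlapping contigs (not contigs contained entirely within another).

```python
from typing import NamedTuple


class EndOverlap(NamedTuple):
    """Hypothetical record: an alignment between one end of each of two contigs."""
    contig_a: str
    end_a: str      # "L" or "R"
    contig_b: str
    end_b: str      # "L" or "R"
    score: float    # higher is better (e.g. combines length and per cent identity)


def greedy_joins(overlaps):
    """Accept overlaps from best to worst, using each contig end at most once."""
    used_ends = set()
    parent = {}

    def find(contig):
        # Union-find lookup, used to refuse joins inside an already merged group.
        parent.setdefault(contig, contig)
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    accepted = []
    for ov in sorted(overlaps, key=lambda o: o.score, reverse=True):
        end_a = (ov.contig_a, ov.end_a)
        end_b = (ov.contig_b, ov.end_b)
        if end_a in used_ends or end_b in used_ends:
            continue  # a better-scoring overlap already claimed this end
        root_a, root_b = find(ov.contig_a), find(ov.contig_b)
        if root_a == root_b:
            continue  # both contigs are already in the same merged group
        parent[root_b] = root_a
        used_ends.update((end_a, end_b))
        accepted.append(ov)
    return accepted


overlaps = [
    EndOverlap("A", "R", "B", "L", 0.99),
    EndOverlap("A", "R", "C", "L", 0.90),  # competes with the first for A's right end
    EndOverlap("B", "R", "C", "L", 0.95),
]
accepted = greedy_joins(overlaps)
for ov in accepted:
    print(f"{ov.contig_a}.{ov.end_a} joined to {ov.contig_b}.{ov.end_b}")
```

Because overlaps are processed in descending score order, the lower-scoring A/C overlap is discarded once A's right end has been claimed, leaving the chain A, B, C.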
The process also joins overlapping sequenced clones into sequenced-clone contigs and links sequenced-clone contigs to form sequenced-clone-contig scaffolds. A fingerprint clone contig may contain several sequenced-clone contigs, because bridging clones remain to be sequenced. The assembly contained 4,884 sequenced-clone

articles 870 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com

[Figure 7 diagram. Labels: Fingerprint clone contig; Pick clones for sequencing; Sequenced-clone contig; Sequenced clone A; Sequenced clone B; Sequence to at least draft coverage; Initial sequence contig; Merge data; Merged sequence contig; Order and orient with mRNA, paired end reads, other information; Sequence-contig scaffold; Sequenced-clone-contig scaffold.]

Figure 7 Levels of clone and sequence coverage. A 'fingerprint clone contig' is assembled by using the computer program FPC (refs 84, 451) to analyse the restriction enzyme digestion patterns of many large-insert clones. Clones are then selected for sequencing to minimize overlap between adjacent clones. For a clone to be selected, all of its restriction enzyme fragments (except the two vector-insert junction fragments) must be shared with at least one of its neighbours on each side in the contig. Once these overlapping clones have been sequenced, the set is a 'sequenced-clone contig'. When all selected clones from a fingerprint clone contig have been sequenced, the sequenced-clone contig will be the same as the fingerprint clone contig. Until then, a fingerprint clone contig may contain several sequenced-clone contigs. After individual clones (for example, A and B) have been sequenced to draft coverage and the clones have been mapped, the data are analysed by GigAssembler (Fig. 6), producing merged sequence contigs from initial sequence contigs, and linking these to form sequence-contig scaffolds (see Box 1).
Table 5 The draft genome sequence

             Sequence from clones (kb)             Sequence from contigs (kb)
             Finished    Draft       Pre-draft     Contigs containing  Deep coverage     Draft/predraft
Chromosome   clones      clones      clones        finished clones     sequence contigs  sequence contigs
All          826,441     1,734,995   131,476       958,922             840,815           893,175
1            50,851      149,027     12,356        61,001              78,773            72,461
2            46,909      167,439     7,210         53,775              81,569            86,214
3            22,350      152,840     11,057        26,959              79,649            79,638
4            15,914      134,973     17,261        19,096              66,165            82,887
5            37,973      129,581     2,160         48,895              61,387            59,431
6            75,312      76,082      6,696         93,458              28,204            36,428
7            94,845      47,328      4,047         103,188             14,434            28,597
8            14,538      102,484     7,236         16,659              47,198            60,400
9            18,401      77,648      10,864        24,030              42,653            40,230
10           16,889      99,181      11,066        21,421              54,054            51,662
11           13,162      111,092     4,352         16,145              65,147            47,314
12           32,156      84,653      7,651         37,519              43,995            42,946
13           16,818      68,983      7,136         22,191              38,319            32,429
14           58,989      27,370      565           78,302              3,267             5,355
15           2,739       67,453      3,211         3,112               34,758            35,533
16           22,987      48,997      1,143         27,751              20,892            24,484
17           29,881      36,349      6,600         33,531              14,671            24,628
18           5,128       65,284      2,352         6,656               40,947            25,160
19           28,481      26,568      369           32,228              7,188             16,003
20           54,217      5,302       976           56,534              1,065             2,896
21           33,824      0           0             33,824              0                 0
22           33,786      0           0             33,786              0                 0
X            77,630      45,100      4,941         83,796              14,056            29,820
Y            18,169      3,221       363           20,222              333               1,198
NA           2,434       1,858       844           2,446               122               2,568
UL           2,056       6,182       1,020         2,395               1,969             4,894

The table presents summary statistics for the draft genome sequence over the entire genome and by individual chromosome. NA, clones that could not be placed into the sequenced clone layout.
UL, clones that could be placed in the layout, but that could not reliably be placed on a chromosome. First three columns, data from finished clones, draft clones and predraft clones. The last three columns break the data down according to the type of sequence contig. Contigs containing finished clones represent sequence contigs that consist of finished sequence plus any (small) extensions from merged sequence contigs that arise from overlap with flanking draft clones. Deep coverage sequence contigs include sequence from two or more overlapping unfinished clones; they consist of roughly full shotgun coverage and thus are longer than the average unfinished sequence contig. Draft/predraft sequence contigs are all of the other sequence contigs in unfinished clones. Thus, the draft genome sequence consists of approximately one-third finished sequence, one-third deep coverage sequence and one-third draft/pre-draft coverage sequence. In all of the statistics, we count only nonoverlapping bases in the draft genome sequence.

© 2001 Macmillan Magazines Ltd
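The 'approximately one-third each' statement in the Table 5 legend can be checked against the 'All' row of the table. The short sketch below uses only the six genome-wide totals from that row; the variable names are illustrative.

```python
# Genome-wide totals (kb) from the 'All' row of Table 5.
clone_kb = {"finished": 826_441, "draft": 1_734_995, "predraft": 131_476}
contig_kb = {
    "contigs containing finished clones": 958_922,
    "deep coverage sequence contigs": 840_815,
    "draft/predraft sequence contigs": 893_175,
}

total_kb = sum(contig_kb.values())

# Both breakdowns partition the same nonoverlapping bases, so the totals agree.
assert total_kb == sum(clone_kb.values())

fractions = {name: kb / total_kb for name, kb in contig_kb.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The three contig categories come to roughly 36%, 31% and 33% of about 2.69 Gb, which is consistent with the one-third description.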
contigs in 942 fingerprint clone contigs. The hierarchy of contigs is summarized in Fig. 7. Initial sequence contigs are integrated to create merged sequence contigs, which are then linked to form sequence-contig scaffolds. These scaffolds reside within sequenced-clone contigs, which in turn reside within fingerprint clone contigs.

The draft genome sequence

The result of the assembly process is an integrated draft sequence of the human genome. Several features of the draft genome sequence are reported in Tables 5–7, including the proportion represented by finished, draft and predraft categories. The Tables also show the numbers and lengths of different types of contig, for each chromosome and for the genome as a whole.

The contiguity of the draft genome sequence at each level is an important feature. Two commonly used statistics have significant drawbacks for describing contiguity. The 'average length' of a contig is deflated by the presence of many small contigs comprising only a small proportion of the genome, whereas the 'length-weighted average length' is inflated by the presence of large segments of finished sequence. Instead, we chose to describe the contiguity as a property of the 'typical' nucleotide. We used a statistic called the 'N50 length', defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

The continuity of the draft genome sequence reported here and the effectiveness of assembly can be readily seen from the following: half of all nucleotides reside within an initial sequence contig of at least 21.7 kb, a sequence contig of at least 82 kb, a sequence-contig scaffold of at least 274 kb, a sequenced-clone contig of at least 826 kb and a fingerprint clone contig of at least 8.4 Mb (Tables 6, 7). The cumulative distributions for each of these measures of contiguity are shown in Fig. 8, in which the N50 values for each measure can be seen as the value at which the cumulative distributions cross 50%.
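The N50 definition given above translates directly into a few lines of code. The sketch below is a minimal illustration; the function names and the toy contig lengths are made up for the example and are not data from the draft genome sequence.

```python
def n50(lengths):
    """Largest L such that at least 50% of all bases lie in contigs of length >= L."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_length(lengths):
    """Plain average contig length (pulled down by many small contigs)."""
    return sum(lengths) / len(lengths)


def length_weighted_mean(lengths):
    """Average contig length as seen from a randomly chosen base."""
    return sum(x * x for x in lengths) / sum(lengths)


toy_lengths_kb = [80, 70, 50, 40, 30, 20, 10]  # illustrative values only
print(n50(toy_lengths_kb))
print(mean_length(toy_lengths_kb))
print(length_weighted_mean(toy_lengths_kb))
```

For the toy lengths, the two longest contigs (80 kb and 70 kb) together hold exactly half of the 300 kb total, so the N50 length is 70 kb, compared with a plain average of about 43 kb and a length-weighted average of 56 kb.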
We have also estimated the size of each chromosome, by estimating the gap sizes (see below) and the extent of missing heterochromatic sequence (refs 93, 94, 105–108) (Table 8). This is undoubtedly an oversimplification and does not adequately take into account the sequence status of each chromosome. Nonetheless, it provides a useful way to relate the draft sequence to the chromosomes.

Quality assessment

The draft genome sequence already covers the vast majority of the genome, but it remains an incomplete, intermediate product that is regularly updated as we work towards a complete finished sequence. The current version contains many gaps and errors. We therefore sought to evaluate the quality of various aspects of the current draft genome sequence, including the sequenced clones themselves, their assignment to a position in the fingerprint clone contigs, and the assembly of initial sequence contigs from the individual clones into sequence-contig scaffolds.

Nucleotide accuracy is reflected in a PHRAP score assigned to each base in the draft genome sequence and available to users through the Genome Browsers (see below) and public database entries. A summary of these scores for the unfinished portion of the genome is shown in Table 9. About 91% of the unfinished draft genome sequence has an error rate of less than 1 per 10,000 bases (PHRAP score > 40), and about 96% has an error rate of less than 1 in 1,000 bases (PHRAP > 30). These values are based only on the quality scores for the bases in the sequenced clones; they do not reflect additional confidence in the sequences that are represented in overlapping clones. The finished portion of the draft genome sequence has an error rate of less than 1 per 10,000 bases.

Individual sequenced clones. We assessed the frequency of misassemblies, which can occur when the assembly program PHRAP joins two nonadjacent regions in the clone into a single initial sequence contig.
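The PHRAP score thresholds quoted in the nucleotide accuracy paragraph above correspond to error rates through the standard logarithmic quality scale, Q = -10 log10(p), where p is the estimated probability that a base call is wrong. The following sketch shows the conversion in both directions; the function names are illustrative.

```python
import math


def error_probability(quality_score):
    """Per-base error probability implied by a quality score Q = -10 * log10(p)."""
    return 10 ** (-quality_score / 10)


def quality_score(error_prob):
    """Quality score corresponding to a per-base error probability."""
    return -10 * math.log10(error_prob)


for q in (30, 40):
    p = error_probability(q)
    print(f"score {q}: about 1 error per {round(1 / p):,} bases")
```

A score of 30 therefore corresponds to 1 error in 1,000 bases and a score of 40 to 1 error in 10,000 bases, matching the two thresholds used in the text.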
The frequency of misassemblies depends heavily on the depth and quality of coverage of each clone and the nature of the underlying sequence; thus it may vary among genomic regions and among individual centres. Most clone misassemblies are readily corrected as coverage is added during finishing, but they may have been propagated into the current version of the draft genome sequence and they justify caution for certain applications.

We estimated the frequency of misassembly by examining instances in which there was substantial overlap between a draft clone and a finished clone. We studied 83 Mb of such overlaps, involving about 9,000 initial sequence contigs. We found 5.3 instances per Mb in which the alignment of an initial sequence contig to the finished sequence failed to extend to within 200 bases

Table 6 Clone level contiguity of the draft genome sequence

             Sequenced-clone contigs     Sequenced-clone-contig scaffolds   Fingerprint clone contigs with sequence
Chromosome   Number   N50 length (kb)    Number   N50 length (kb)           Number   N50 length (kb)
All          4,884    826                2,191    2,279                     942      8,398
1            453      650                197      1,915                     106      3,537
2            348      1,028              127      3,140                     52       10,628
3            409      672                201      1,550                     73       5,077
4            384      606                163      1,659                     41       6,918
5            385      623                164      1,642                     48       5,747
6            292      814                98       3,292                     17       24,680
7            224      1,074              86       3,527                     29       20,401
8            292      542                115      1,742                     43       6,236
9            143      1,242              78       2,411                     21       29,108
10           179      1,097              105      1,952                     16       30,284
11           224      887                89       3,024                     31       9,414
12           196      1,138              76       2,717                     28       9,546
13           128      1,151              56       3,257                     13       25,256
14           54       3,079              27       8,489                     14       22,128
15           123      797                56       2,095                     19       8,274
16           159      620                92       1,317                     57       2,716
17           138      831                58       2,138                     43       2,816
18           137      709                47       2,572                     24       4,887
19           159      569                79       1,200                     51       1,534
20           42       2,318              20       6,862                     9        23,489
21           5        28,515             5        28,515                    5        28,515
22           11       23,048             11       23,048                    11       23,048
X            325      572                181      1,082                     143      1,436
Y            27       1,539              20       3,290                     8        5,135
UL           47       227                40       281                       40       281
Number and size of sequenced-clone contigs, sequenced-clone-contig scaffolds and those fingerprint clone contigs (see Box 1) that contain sequenced clones; some small fingerprint clone contigs do not as yet have associated sequence. UL, fingerprint clone contigs that could not reliably be placed on a chromosome. These length estimates are from the draft genome sequence, in which gaps between sequence contigs are arbitrarily represented with 100 Ns and gaps between sequence clone contigs with 50,000 Ns for 'bridged gaps' and 100,000 Ns for 'unbridged gaps'. These arbitrary values differ minimally from empirical estimates of gap size (see text), and using the empirically derived estimates would change the N50 lengths presented here only slightly. For unfinished chromosomes, the N50 length ranges from 1.5 to 3 times the arithmetic mean for sequenced-clone contigs, 1.5 to 3 times for sequenced-clone-contig scaffolds, and 1.5 to 6 times for fingerprint clone contigs with sequence.
of the end of the contig, suggesting a possible false join in the assembly of the initial sequence contig. In about half of these cases, the potential misassembly involved fewer than 400 bases, suggesting that a single raw sequence read may have been incorrectly joined. We found 1.9 instances per Mb in which the alignment showed an internal gap, again suggesting a possible misassembly; and 0.5 instances per Mb in which the alignment indicated that two initial sequence contigs that overlapped by at least 150 bp had not been merged by PHRAP. Finally, there were another 0.9 instances per Mb with various other problems. This gives a total of 8.6 instances per Mb of possible misassembly, with about half being relatively small issues involving a few hundred bases.

Some of the potential problems might not result from misassembly, but might reflect sequence polymorphism in the population, small rearrangements during growth of the large-insert clones, regions of low-quality sequence or matches between segmental duplications. Thus, the frequency of misassemblies may be overstated. On the other hand, the criteria for recognizing overlap between draft and finished clones may have eliminated some misassemblies.

Layout of the sequenced clones. We assessed the accuracy of the layout of sequenced clones onto the fingerprinted clone contigs by calculating the concordance between the positions assigned to a sequenced clone on the basis of in silico digestion and the position assigned on the basis of BAC end sequence data. The positions agreed in 98% of cases in which independent assignments could be made by both methods. The results were also compared with well studied regions containing both finished and draft genome sequence. These results indicated that sequenced clone order in the fingerprint map was reliable to within about half of one clone length (~100 kb).

A direct test of the layout is also provided by the draft genome sequence assembly itself.
With extensive coverage of the genome, a correctly placed clone should usually (although not always) show sequence overlap with its neighbours in the map. We found only 421 instances of 'singleton' clones that failed to overlap a neighbouring clone. Close examination of the data suggests that most of these are correctly placed, but simply do not yet overlap an adjacent sequenced clone. About 150 clones appeared to be candidates for being incorrectly placed.

Alignment of the fingerprint clone contigs. The alignment of the fingerprint clone contigs with the chromosomes was based on the radiation hybrid, YAC and genetic maps of STSs. The positions of most of the STSs in the draft genome sequence were consistent with these previous maps, but the positions of about 1.7% differed from one or more of them. Some of these disagreements may be due to errors in the layout of the sequenced clones or in the underlying

[Figure 8 plots. Panel a, Clone level continuity: cumulative percentage (0 to 100) against size (0 to 5,000 kb) for sequenced clones, sequenced-clone contigs, sequenced-clone-contig scaffolds and fingerprint clone contigs. Panel b, Sequence level continuity: cumulative percentage (0 to 100) against size (0 to 1,000 kb) for initial sequence contigs, sequence contigs and sequence-contig scaffolds.]

Figure 8 Cumulative distributions of several measures of clone level contiguity and sequence contiguity. The figures represent the proportion of the draft genome sequence contained in contigs of at most the indicated size. a, Clone level contiguity. The clones have a tight size distribution with an N50 of ~160 kb (corresponding to 50% on the cumulative distribution). Sequenced-clone contigs represent the next level of continuity, and are linked by mRNA sequences or pairs of BAC end sequences to yield the sequenced-clone-contig scaffolds.
The underlying contiguity of the layout of sequenced clones against the fingerprinted clone contigs is only partially shown at this scale. b, Sequence contiguity. The input fragments have low continuity (N50 = 21.7 kb). After merging, the sequence contigs grow to an N50 length of about 82 kb. After linking, sequence-contig scaffolds with an N50 length of about 274 kb are created.

Figure 9 Overview of features of draft human genome. The Figure shows the occurrences of twelve important types of feature across the human genome. Large grey blocks represent centromeres and centromeric heterochromatin (size not precisely to scale). Each of the feature types is depicted in a track, from top to bottom as follows. (1) Chromosome position in Mb. (2) The approximate positions of Giemsa-stained chromosome bands at the 800 band resolution. (3) Level of coverage in the draft genome sequence. Red, areas covered by finished clones; yellow, areas covered by predraft sequence. Regions covered by draft sequenced clones are in orange, with darker shades reflecting increasing shotgun sequence coverage. (4) GC content. Percentage of bases in a 20,000 base window that are C or G. (5) Repeat density. Red line, density of SINE class repeats in a 100,000-base window; blue line, density of LINE class repeats in a 100,000-base window. (6) Density of SNPs in a 50,000-base window. The SNPs were detected by sequencing and alignments of random genomic reads. Some of the heterogeneity in SNP density reflects the methods used for SNP discovery. Rigorous analysis of SNP density requires comparing the number of SNPs identified to the precise number of bases surveyed. (7) Non-coding RNA genes. Brown, functional RNA genes such as tRNAs, snoRNAs and rRNAs; light orange, RNA pseudogenes. (8) CpG islands. Green ticks represent regions of ~200 bases with CpG levels significantly higher than in the genome as a whole, and GC ratios of at least 50%. (9) Exofish ecores. Regions of homology with the pufferfish T. nigroviridis (ref. 292) are blue. (10) ESTs with at least one intron when aligned against genomic DNA are shown as black tick marks. (11) The starts of genes predicted by Genie or Ensembl are shown as red ticks. The starts of known genes from the RefSeq database (ref. 110) are shown in blue. (12) The names of genes that have been uniquely located in the draft genome sequence, characterized and named by the HGM Nomenclature Committee. Known disease genes from the OMIM database are red, other genes blue. This Figure is based on an earlier version of the draft genome sequence than analysed in the text, owing to production constraints. We are aware of various errors in the Figure, including omissions of some known genes and misplacements of others. Some genes are mapped to more than one location, owing to errors in assembly, close paralogues or pseudogenes. Manual review was performed to select the most likely location in these cases and to correct other regions. For updated information, see http://genome.ucsc.edu/ and http://www.ensembl.org/.

© 2001 Macmillan Magazines Ltd
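The N50 lengths quoted in the Figure 8 legend and in Table 7 can be illustrated with a minimal sketch. It assumes the usual definition of N50 (the largest length L such that contigs of length at least L contain half of all bases); the function name and the toy lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one contig with non-zero length")


# Toy example (hypothetical lengths, in kb).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
```

In this toy set the N50 is twice the arithmetic mean, because a few long contigs hold most of the bases. The same skew is why the N50 lengths in Table 7 exceed the corresponding mean contig lengths.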
Table 7 Sequence level contiguity of the draft genome sequence

            Initial sequence contigs   Sequence contigs           Sequence-contig scaffolds
Chromosome  Number    N50 length (kb)  Number    N50 length (kb)  Number    N50 length (kb)
All         396,913   21.7             149,821   81.9             87,757    274.3
1           37,656    16.5             12,256    59.1             5,457     278.4
2           32,280    19.9             13,228    57.3             6,959     248.5
3           38,848    15.6             15,098    37.7             8,964     167.4
4           28,600    16.0             13,152    33.0             7,402     158.9
5           30,096    20.4             10,689    72.9             6,378     241.2
6           17,472    43.6             5,547     180.3            2,554     485.0
7           12,733    86.4             4,562     335.7            2,726     591.3
8           19,042    18.1             8,984     38.2             4,631     198.9
9           15,955    20.1             6,226     55.6             3,766     216.2
10          21,762    18.7             9,126     47.9             6,886     133.0
11          29,723    14.3             8,503     40.0             4,684     193.2
12          22,050    19.1             8,422     63.4             5,526     217.0
13          13,737    21.7             5,193     70.5             2,659     300.1
14          4,470     161.4            829       1,371.0          541       2,009.5
15          13,134    15.3             5,840     30.3             3,229     149.7
16          10,297    34.4             4,916     119.5            3,337     356.3
17          10,369    22.9             4,339     90.6             2,616     248.9
18          16,266    15.3             4,461     51.4             2,540     216.1
19          6,009     38.4             2,503     134.4            1,551     375.5
20          2,884     108.6            511       1,346.7          312       813.8
21          103       340.0            5         28,515.3         5         28,515.3
22          526       113.9            11        23,048.1         11        23,048.1
X           11,062    58.8             4,607     218.6            2,610     450.7
Y           557       154.3            140       1,388.6          106       1,439.7
UL          1,282     21.4             613       46.0             297       166.4

This Table is similar to Table 6 but shows the number and N50 length for various types of sequence contig (see Box 1). See legend to Table 6 concerning treatment of gaps.
For sequence contigs in the draft genome sequence, the N50 length ranges from 1.7 to 5.5 times the arithmetic mean for initial sequence contigs, 2.5 to 8.2 times for merged sequence contigs, and 6.1 to 10 times for sequence-contig scaffolds.

Table 8 Chromosome size estimates

                        FCC gaps‡            SCC gaps‖            Sequence gaps#
Chromosome*  Sequenced  Number  Total bases  Number  Total bases  Number   Total bases  Heterochromatin  Total estimated  Previously
             bases†             in gaps§             in gaps¶              in gaps☆     and short arm    chromosome       estimated
             (Mb)               (Mb)                 (Mb)                  (Mb)         adjustments**    size†† (Mb)      chromosome
                                                                                        (Mb)                              size‡‡ (Mb)
All          2,692.9    897     152.0        4,076   142.7        145,514  80.6         212              3,289            3,286
1            212.2      104     17.7         347     12.1         11,803   6.5          30               279              263
2            221.6      50      8.5          296     10.4         12,880   7.1          3                251              255
3            186.2      71      12.1         336     11.8         14,689   8.1          3                221              214
4            168.1      39      6.6          343     12.0         12,768   7.1          3                197              203
5            169.7      46      7.8          337     11.8         10,304   5.7          3                198              194
6            158.1      15      2.6          275     9.6          5,225    2.9          3                176              183
7            146.2      27      4.6          195     6.8          4,338    2.4          3                163              171
8            124.3      41      7.0          249     8.7          8,692    4.8          3                148              155
9            106.9      19      3.2          122     4.3          6,083    3.4          22               140              145
10           127.1      14      2.4          163     5.7          8,947    5.0          3                143              144
11           128.6      29      4.9          193     6.8          8,279    4.6          3                148              144
12           124.5      26      4.4          168     5.9          8,226    4.6          3                142              143
13           92.9       12      2.0          115     4.0          5,065    2.8          16               118              114
14           86.9       13      2.2          40      1.4          775      0.4          16               107              109
15           73.4       18      3.1          104     3.6          5,717    3.2          17               100              106
16           73.1       55      9.4          102     3.6          4,757    2.6          15               104              98
17           72.8       41      7.0          95      3.3          4,261    2.4          3                88               92
18           72.9       22      3.7          113     4.0          4,324    2.4          3                86               85
19           55.4       49      8.3          108     3.8          2,344    1.3          3                72               67
20           60.5       7       1.2          33      1.2          469      0.3          3                66               72
21           33.8       4       0.1          0       0.0          0        0.0          11               45               50
22           33.8       10      1.0          0       0.0          0        0.0          13               48               56
X            127.7      141     24.0         182     6.4          4,282    2.4          3                163              164
Y            21.8       6       1.0          19      0.7          113      0.1          27               51               59
NA           5.1        0       0            134     0.0          577      0.3          0                0                0
UL           9.3        38      0            7       0.0          566      0.3          0                0                0

The total estimated chromosome size includes artefactual duplication in the draft genome sequence.
* NA, sequenced clones that could not be associated with fingerprint clone contigs. UL, clone contigs that could not be reliably placed on a chromosome.
† Total number of bases in the draft genome sequence, excluding gaps. Total length of scaffold (including gaps contained within clones) is 2.916 Gb.
‡ Gaps between those fingerprint clone contigs that contain sequenced clones excluding gaps for centromeres.
§ For unfinished chromosomes, we estimate an average size of 0.17 Mb per FCC gap, based on retrospective estimates of the clone coverage of chromosomes 21 and 22. Gap estimates for chromosomes 21 and 22 are taken from refs 93, 94.
‖ Gaps between sequenced-clone contigs within a fingerprint clone contig.
¶ For unfinished chromosomes, we estimate sequenced clone gaps at 0.035 Mb each, based on evaluation of a sample of these gaps.
# Gaps between two sequence contigs within a sequenced-clone contig.
☆ We estimate the average number of bases in sequence gaps from alignments of the initial sequence contigs of unfinished clones (see text) and extrapolation to the whole chromosome.
** Including adjustments for estimates of the sizes of the short arms of the acrocentric chromosomes 13, 14, 15, 21 and 22 (ref. 105), estimates for the centromere and heterochromatic regions of chromosomes 1, 9 and 16 (refs 106, 107) and estimates of 3 Mb for the centromere and 24 Mb for telomeric heterochromatin for the Y chromosome (ref. 108).
†† The sum of the five lengths in the preceding columns.
This is an overestimate, because the draft genome sequence contains some artefactual sequence owing to inability to correctly merge all underlying sequence contigs. The total amount of artefactual duplication varies among chromosomes; the overall amount is estimated by computational analysis to be about 100 Mb, or about 3% of the total length given, yielding a total estimated size of about 3,200 Mb for the human genome.
‡‡ Including heterochromatic regions and acrocentric short arm(s) (ref. 105).
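The arithmetic behind the Table 8 totals can be sketched as follows. The numbers are copied from the chromosome 1 row and the 'All' row of Table 8; the function and variable names are hypothetical, and the sketch only repeats the sums described in the legend rather than the consortium's own computation.

```python
# Components of the chromosome 1 estimate (Mb), copied from Table 8.
CHR1_COMPONENTS_MB = {
    "sequenced_bases": 212.2,
    "fcc_gaps": 17.7,
    "scc_gaps": 12.1,
    "sequence_gaps": 6.5,
    "heterochromatin_and_short_arm": 30.0,
}


def estimated_size_mb(components):
    """Sum the five component lengths, as the Table 8 legend describes."""
    return sum(components.values())


def size_without_artefactual_duplication(total_mb, duplication_mb):
    """Subtract the estimated artefactual duplication from a total size."""
    return total_mb - duplication_mb


chr1_total_mb = estimated_size_mb(CHR1_COMPONENTS_MB)

# Whole-genome figures from the 'All' row and the legend.
GENOME_TOTAL_MB = 3289.0
ARTEFACTUAL_DUPLICATION_MB = 100.0
genome_size_mb = size_without_artefactual_duplication(
    GENOME_TOTAL_MB, ARTEFACTUAL_DUPLICATION_MB
)
duplication_fraction = ARTEFACTUAL_DUPLICATION_MB / GENOME_TOTAL_MB
```

For chromosome 1 the five components sum to about 278.5 Mb, listed as 279 Mb in the table. Removing roughly 100 Mb (about 3%) from the 3,289 Mb total leaves 3,189 Mb, which the legend rounds to about 3,200 Mb.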
fingerprint map. However, many involve STSs that have been localized on only one or two of the previous maps or that occur as isolated discrepancies in conflict with several flanking STSs. Many of these cases are probably due to errors in the previous maps (with error rates for individual maps estimated at 1–2%; ref. 100). Others may be due to incorrect assignment of the STSs to the draft genome sequence (by the electronic polymerase chain reaction (e-PCR) computer program) or to database entries that contain sequence data from more than one clone (owing to cross-contamination).

Graphical views of the independent data sets were particularly useful in detecting problems with order or orientation (Fig. 5). Areas of conflict were reviewed and corrected if supported by the underlying data. In the version discussed here, there were 41 sequenced clones falling in 14 sequenced-clone contigs with STS content information from multiple maps that disagreed with the flanking clones or sequenced-clone contigs; the placement of these clones thus remains suspect. Four of these instances suggest errors in the fingerprint map, whereas the others suggest errors in the layout of sequenced clones. These cases are being investigated and will be corrected in future versions.

Assembly of the sequenced clones. We assessed the accuracy of the assembly by using a set of 148 draft clones comprising 22.4 Mb for which finished sequence subsequently became available (ref. 104). The initial sequence contigs lack information about order and orientation, and GigAssembler attempts to use linking data to infer such information as far as possible (ref. 104). Starting with initial sequence contigs that were unordered and unoriented, the program placed 90% of the initial sequence contigs in the correct orientation and 85% in the correct order with respect to one another. In a separate test, GigAssembler was tested on simulated draft data produced from finished sequence on chromosome 22 and similar results were obtained.
Some problems remain at all levels. First, errors in the initial sequence contigs persist in the merged sequence contigs built from them and can cause difficulties in the assembly of the draft genome sequence. Second, GigAssembler may fail to merge some overlapping sequences because of poor data quality, allelic differences or misassemblies of the initial sequence contigs; this may result in apparent local duplication of a sequence. We have estimated by various methods the amount of such artefactual duplication in the assembly from these and other sources to be about 100 Mb. On the other hand, nearby duplicated sequences may occasionally be incorrectly merged. Some sequenced clones remain incorrectly placed on the layout, as discussed above, and others (<0.5%) remain unplaced. The fingerprint map has undoubtedly failed to resolve some closely related duplicated regions, such as the Williams region and several highly repetitive subtelomeric and pericentric regions (see below). Detailed examination and sequence finishing may be required to sort out these regions precisely, as has been done with chromosome Y (ref. 89). Finally, small sequenced-clone contigs with limited or no STS landmark content remain difficult to place. Full utilization of the higher resolution radiation hybrid map (the TNG map) may help in this (ref. 95). Future targeted FISH experiments and increased map continuity will also facilitate positioning of these sequences.

Genome coverage

We next assessed the nature of the gaps within the draft genome sequence, and attempted to estimate the fraction of the human genome not represented within the current version.

Gaps in draft genome sequence coverage. There are three types of gap in the draft genome sequence: gaps within unfinished sequenced clones; gaps between sequenced-clone contigs, but within fingerprint clone contigs; and gaps between fingerprint clone contigs.
The first two types are relatively straightforward to close simply by performing additional sequencing and finishing on already identified clones. Closing the third type may require screening of additional large-insert clone libraries and possibly new technologies for the most recalcitrant regions. We consider these three cases in turn.

We estimated the size of gaps within draft clones by studying instances in which there was substantial overlap between a draft clone and a finished clone, as described above. The average gap size in these draft sequenced clones was 554 bp, although the precise estimate was sensitive to certain assumptions in the analysis. Assuming that the sequence gaps in the draft genome sequence are fairly represented by this sample, about 80 Mb or about 3% (likely range 2–4%) of sequence may lie in the 145,514 gaps within draft sequenced clones.

The gaps between sequenced-clone contigs but within fingerprint clone contigs are more difficult to evaluate directly, because the draft genome sequence flanking many of the gaps is often not precisely aligned with the fingerprinted clones. However, most are much smaller than a single BAC. In fact, nearly three-quarters of these gaps are bridged by one or more individual BACs, as indicated by linking information from BAC end sequences. We measured the sizes of a subset of gaps directly by examining restriction fragment fingerprints of overlapping clones. A study of 157 'bridged' gaps and 55 'unbridged' gaps gave an average gap size of 25 kb. Allowing for the possibility that these gaps may not be fully representative and that some restriction fragments are not included in the calculation, a more conservative estimate of gap size would be 35 kb. This would indicate that about 150 Mb or 5% of the human genome may reside in the 4,076 gaps between sequenced-clone contigs. This sequence should be readily obtained as the clones spanning them are sequenced.
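The two gap totals quoted above follow from multiplying a gap count by an average gap size. The sketch below only reproduces that multiplication with the figures given in the text; the function and variable names are hypothetical.

```python
def total_gap_mb(number_of_gaps, mean_gap_bp):
    """Estimate total missing sequence (Mb) as gap count times mean gap size."""
    return number_of_gaps * mean_gap_bp / 1_000_000


# Gaps within draft sequenced clones: 145,514 gaps averaging 554 bp.
within_clone_mb = total_gap_mb(145_514, 554)

# Gaps between sequenced-clone contigs: 4,076 gaps, using either the
# measured average (25 kb) or the more conservative estimate (35 kb).
between_contigs_measured_mb = total_gap_mb(4_076, 25_000)
between_contigs_conservative_mb = total_gap_mb(4_076, 35_000)
```

The first product is about 80.6 Mb, matching the 'about 80 Mb' in the text. The conservative 35 kb estimate gives about 142.7 Mb, which the text rounds to about 150 Mb; the measured 25 kb average would give about 102 Mb.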
The size of the gaps between fingerprint clone contigs was estimated by comparing the fingerprint maps to the essentially completed chromosomes 21 and 22. The analysis shows that the fingerprinted BAC clones in the global database cover 97–98% of the sequenced portions of those chromosomes (ref. 86). The published sequences of these chromosomes also contain a few small gaps (5 and 11, respectively) amounting to some 1.6% of the euchromatic sequence, and do not include the heterochromatic portion. This suggests that the gaps between contigs in the fingerprint map contain about 4% of the euchromatic genome. Experience with closure of such gaps on chromosomes 20 and 7 suggests that many of these gaps are less than one clone in length and will be closed by clones from other libraries. However, recovery of sequence from these gaps represents the most challenging aspect of producing a complete finished sequence of the human genome.

As another measure of the representation of the BAC libraries, Riethman (ref. 109) has found BAC or cosmid clones that link to telomeric half-YACs or to the telomeric sequence itself for 40 of the 41 non-satellite telomeres. Thus, the fingerprint map appears to have no substantial gaps in these regions. Many of the pericentric regions are also represented, but analysis is less complete here (see below).

Representation of random raw sequences. In another approach to measuring coverage, we compared a collection of random raw sequence reads to the existing draft genome sequence. In principle,

Table 9 Distribution of PHRAP scores in the draft genome sequence

PHRAP score  Percentage of bases in the draft genome sequence
0–9          0.6
10–19        1.3
20–29        2.2
30–39        4.8
40–49        8.1
50–59        8.7
60–69        9.0
70–79        12.1
80–89        17.3
>90          35.9
PHRAP scores are a logarithmically based representation of the error probability. A PHRAP score of X corresponds to an error probability of 10^(-X/10). Thus, PHRAP scores of 20, 30 and 40 correspond to accuracy of 99%, 99.9% and 99.99%, respectively. PHRAP scores are derived from quality scores of the underlying sequence reads used in sequence assembly. See http://www.genome.washington.edu/UWGC/analysistools/phrap.htm.
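The relation between PHRAP score and accuracy stated in the Table 9 legend can be written out as a short sketch. The function names are hypothetical. The table values are copied from Table 9, and storing the open-ended last bin with a lower bound of 90 is an assumption made here for illustration.

```python
def phrap_error_probability(score):
    """Error probability for a PHRAP score X, using 10^(-X/10)."""
    return 10 ** (-score / 10)


def phrap_accuracy(score):
    """Per-base accuracy implied by a PHRAP score."""
    return 1 - phrap_error_probability(score)


# Table 9 as (lower bound of score bin, percentage of bases).
# The final bin is the open-ended '>90' row (lower bound 90 is assumed).
TABLE_9 = [
    (0, 0.6), (10, 1.3), (20, 2.2), (30, 4.8), (40, 8.1),
    (50, 8.7), (60, 9.0), (70, 12.1), (80, 17.3), (90, 35.9),
]


def percent_of_bases_at_or_above(threshold):
    """Sum the Table 9 percentages for bins starting at or above threshold."""
    return sum(percent for lower, percent in TABLE_9 if lower >= threshold)


accuracy_at_20 = phrap_accuracy(20)
accuracy_at_40 = phrap_accuracy(40)
percent_score_40_plus = percent_of_bases_at_or_above(40)
```

With the Table 9 values, about 91.1% of bases fall in bins with a PHRAP score of 40 or more, that is, an estimated accuracy of at least 99.99%.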