THE HUMAN GENOME roperly place a Celera read, so all reads were assembly took place but not enough Celera Chimeric or contaminating sequence (from other part of the genome) would not be ted into the reassembly of the com- it did not belong there. In pplication of th sulted in an average ue sequence in each assembly ing of an average of 58.I con ermine the size 873 bp. Basically, some small bases of each assembly that were www.sciencemagorg science vol 291 16 february 2001 1313
properly place a Celera read, so all reads were first masked against a library of common repetitive elements, and only matches of at least 40 bp to unmasked portions of the read constituted a hit. Of Celera’s 27.27 million reads, 20.76 million matched a bactig and another 0.62 million reads, which did not have any matches, were nonetheless identified as belonging in the region of the bactig’s BAC because their mate matched the bactig. Of the remaining reads, 2.92 million were completely screened out and so could not be matched, but the other 2.97 million reads had unmasked sequence totaling 1.189 Gbp that were not found in the GenBank data set. Because the Celera data are 5.113 redundant, we estimate that 240 Mbp of unique Celera sequence is not in the GenBank data set. In the next step of the CSA process, a combining assembler took the relevant 53 Celera reads and bactigs for a BAC entry, and produced an assembly of the combined data for that locale. These high-quality sequence reconstructions were a transient result whose utility was simply to provide more reliable information for the purposes of their tiling into sets of overlapping and adjacent scaffold sequences in the next step. In outline, the combining assembler first examines the set of matching Celera reads to determine if there are excessive pileups indicative of unscreened repetitive elements. Wherever these occur, reads in the repeat region whose mates have not been mapped to consistent positions are removed. Then all sets of mate pairs that consistently imply the same relative position of two bactigs are bundled into a link and weighted according to the number of mates in the bundle. A “greedy” strategy then attempts to order the bactigs by selecting bundles of mate-pairs in order of their weight. A selected mate-pair bundle can tie together two formative scaffolds. It is incorporated to form a single scaffold only if it is consistent with the majority of links between contigs of the scaffold. Once scaffolding is complete, gaps are filled by the “Stones” strategy described above for the WGA assembler. The GenBank data for the Phase 1 and 2 BACs consisted of an average of 19.8 bactigs per BAC of average size 8099 bp. Application of the combining assembler resulted in individual Celera BAC assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold) consisting of an average of 8.57 contigs of average size 18,973 bp. In addition to defining order and orientation of the sequence fragments, there were 57% fewer gaps in the combined result. For Phase 0 data, the average GenBank entry consisted of 91.52 reads of average length 784 bp. Application of the combining assembler resulted in an average of 54.8 scaffolds consisting of an average of 58.1 contigs of average size 873 bp. Basically, some small amount of assembly took place, but not enough Celera data were matched to truly assemble the 0.53 to 13 data set represented by the typical Phase 0 BACs. The combining assembler was also applied to the Phase 3 BACs for SNP identification, confirmation of assembly, and localization of the Celera reads. The phase 0 data suggest that a combined wholegenome shotgun data set and 13 light-shotgun of BACs will not yield good assembly of BAC regions; at least 33 light-shotgun of each BAC is needed. The 5.89 million Celera fragments not matching the GenBank data were assembled with our whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were .5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence. All scaffolds .5 kbp were forwarded along with all scaffolds produced by the combining assembler to the subsequent tiling phase. At this stage, we typically had one or two scaffolds for every BAC region constituting at least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds. The next step in developing the genome components was to determine the order and overlap tiling of these BAC and Celera-unique scaffolds across the genome. For this, we used Celera’s 50-kbp mate-pairs information, and BAC-end pairs (18) and sequence tagged site (STS) markers (44) to provide longrange guidance and chromosome separation. Given the relatively manageable number of scaffolds, we chose not to produce this tiling in a fully automated manner, but to compute an initial tiling with a good heuristic and then use human curators to resolve discrepancies or missed join opportunities. To this end, we developed a graphical user interface that displayed the graph of tiling overlaps and the evidence for each. A human curator could then explore the implication of mapped STS data, dot-plots of sequence overlap, and a visual display of the mate-pair evidence supporting a given choice. The result of this process was a collection of “components,” where each component was a tiled set of BAC and Celera-unique scaffolds that had been curator-approved. The process resulted in 3845 components with an estimated span of 2.922 Gbp. In order to generate the final CSA, we assembled each component with the WGA algorithm. As was done in the WGA process, the bactig data were shredded into a synthetic 23 shotgun data set in order to give the assembler the freedom to independently assemble the data. By using faux reads rather than bactigs, the assembly algorithm could correct errors in the assembly of bactigs and remove chimeric content in a PFP data entry. Chimeric or contaminating sequence (from another part of the genome) would not be incorporated into the reassembly of the component because it did not belong there. In effect, the previous steps in the CSA process served only to bring together Celera fragments and PFP data relevant to a large contiguous segment of the genome, wherein we applied the assembler used for WGA to produce an ab initio assembly of the region. WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span and consisting of 2.654 Gbp of sequence. The chaff, or set of reads not incorporated into the assembly, numbered 6.17 million, or 22%. More than 90.0% of the genome was covered by scaffolds spanning .100 kbp long, and these averaged 92.2% sequence and 7.8% gaps with a total of 2.492 Gbp of sequence. There were a total of 105,264 gaps among the 107,199 contigs that belong to the 1940 scaffolds spanning .100 kbp. The average scaffold size was 1.4 Mbp, the average contig size was 23.24 kbp, and the average gap size was 2.0 kbp where each distribution of sizes was exponential. As such, averages tend to be underrepresentative of the majority of the data. Figure 5 shows a histogram of the bases in scaffolds of various size ranges. Consider also that more than 49% of all gaps were ,500 bp long, more than 62% of all gaps were ,1 kbp, and all gaps are ,100 kbp long. Similarly, more than 73% of the sequence is in contigs . 30 kbp, more than 49% is in contigs .100 kbp, and the largest contig was 1.99 Mbp long. Table 3 provides summary statistics for the structure of this assembly with a direct comparison to the WGA assembly. 2.5 Comparison of the WGA and CSA scaffolds Having obtained two assemblies of the human genome via independent computational processes (WGA and CSA), we compared scaffolds from the two assemblies as another means of investigating their completeness, consistency, and contiguity. From each assembly, a set of reference scaffolds containing at least 1000 fragments (Celera sequencing reads or bactig shreds) was obtained; this amounted to 2218 WGA scaffolds and 1717 CSA scaffolds, for a total of 2.087 Gbp and 2.474 Gbp. The sequence of each reference scaffold was compared to the sequence of all scaffolds from the other assembly with which it shared at least 20 fragments or at least 20% of the fragments of the smaller scaffold. For each such comparison, all matches of at least 200 bp with at most 2% mismatch were tabulated. From this tabulation, we estimated the amount of unique sequence in each assembly in two ways. The first was to determine the number of bases of each assembly that were T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1313 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME not covered by a matching segment in the The CSa assembly was a few percentage In order to determine the effectiveness of other assembly. Some 82.5 Mbp of the Wga points better in terms of coverage and slightly the fingerprint maps and gm99 for mappin (3.95%)was not covered by the Csa, where- more consistent than the WGA, because it scaffolds, we first examined the reliability of as 204.5 Mbp(8.26%)of the CSa was not was in effect performing a few thousand shot- these maps by comparison with large scaf covered by the WGA. This estimate did not gun assemblies of megabase-sized problems, folds. Only 1% of the StS markers on the 10 require any consistency of the assemblies or whereas the wGa is performing a shotgun largest scaffolds ( those >9 Mbp)were any uniqueness of the matching segments. assembly of a gigabase-sized problem. When mapped on a different chromosome on Thus, another analysis was conducted one considers the increase of two-and-a-half GM99. Two percent of the STS markers dis- hich matches of less than 1 kbp between a orders of magnitude in problem size, the in- agreed in position by more than five frame- air of scaffolds were excluded unless they formation loss between the two is remarkably work bins. However, for the fingerprint rere confirmed by other matches having a small Because CSa was logistically easier to maps, a 2% chromosome discrepancy was consistent order and orientation. This gives deliver and the better of the two results avail- observed, and on average 23. 8% of bac some measure of consistent coverage: 1.982 able at the time when downstream analyses locations in the scaffold sequence disagreed Gbp (95.00%)of the WGa is covered by the needed to be begun, all subsequent analysis with fingerprint map placement by more than CSA, and 2. 169 Gbp(87.69%)of the CSA is was performed on this assembly five BACs. When further examining the covered by the wga by this more stringent ource of discrepancy, it was found that most measure 2.6 Mapping scaffolds to the genome of the discrepancy came from 4 of the The comparison of WGa to CSa also The final step in assembling the genome was to scaffolds, indicating this there is variati mitted evaluation of scaffolds for structur- order and orient the scaffolds on the chromo- the quality of either the map or the scaffolds. al inconsistencies. We looked for instances in somes. We first grouped scaffolds together on All four scaffolds were assembled, as well as o hich a large section of a scaffold from one the basis of their order in the components from the other six, as judged by clone coverage Q other assembly, but failed to match over the by examining residual mate-pairing data be- ancy rate to GM99, and thus we concluded full length of the overlap implied by the tween the scaffolds. We next mapped the scaf- that the fingerprint map global order in these matching segments. An initial set of candi- fold groups onto the chromosome using physi- cases was not reliable. Smaller scaffolds had dates was identified automatically, and then cal mapping data. This step depends on having a higher discordance rate with GM99(4.21% a each candidate was inspected by hand From reliable high-resolution map information such of STSs were discordant by more than five this process, we identified 31 instances in that each scaffold will overlap multiple mark- framework bins), but a lower discordance rate c which the assemblies appear to disagree in a ers. There are two genome-wide types of map with the fingerprint maps(11% of BAcs s In addition, we evaluated local inconsis- nome-wide STS maps, GeneMap99(GM99) fold construction was better supported by s is in error and why encies of order or orientation. The following has the most markers and therefore was most long-range mate pairs in larger scaffolds than results exclude cases in which one contig in useful for mapping scaffolds. The two different in small scaffolds one assembly corresponds to more than one mapping approaches are complementary to one We created two orderings of Celera sca overlapping contig in the other assembly (as another. The fingerprint maps should have bet- folds on the basis of the markers(BAC or long as the order and orientation of the latter ter local order because they were built by com- STS)on these maps. Where the order o agrees with the positions they match in the parison of overlapping BAC clones. On the scaffolds agreed between GM99 and the involved segments on the order of hundreds long-range order, because the framework mark- confidence thatthat order was correct:; these s of base pairs and rarely >l kbp. We found a ers were derived from well-validated genetic scaffolds were termed "anchor scaffolds. "0 total of 295 kbp(0.012%)in the CSA assem- maps. Both types of maps were used Only scaffolds with a low overall discrepancy blies that were locally inconsistent with the reference for human curation of the compo- rate with both maps were considered anchor 9 WGA assemblies, whereas 2.108 Mbp nents that were the input to the regional assem- scaffolds Scaffolds in GM99 bins were al- 3 (0. 11%)in the WGa assembly were incon- bly, but they did not determine the order of lowed to permute in their order to match sistent with the CSa assembly. sequences produced by the assembler WashU ordering, provided they did not vio- late their framework orders. Orientation of individual scaffolds was determined by the presence of multiple mapped markers with 45% onsistent order. Scaffolds with only one marker have insufficient information to as- sign orientation. We found 70. 1%of the ge nome in anchored scaffolds. more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution than the washU number of scaffolds without Sts matches could be ordered relative to the an chored scaffolds because they included se- quence from the same or adjacent BACs on the WashU map. On the other hand, because <30kb30-50kb50-100kb100-500kb0.5-1Mb15Mb510Mb>10Mb of occasional wa Scaffold Size crepancies, a number of scaffolds determined Fig 5 Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total to be "unmappable"on the WashU map could equence is indicated be ordered relative to the anchored scaffold 1314 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
not covered by a matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%) was not covered by the CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the WGA. This estimate did not require any consistency of the assemblies or any uniqueness of the matching segments. Thus, another analysis was conducted in which matches of less than 1 kbp between a pair of scaffolds were excluded unless they were confirmed by other matches having a consistent order and orientation. This gives some measure of consistent coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and 2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more stringent measure. The comparison of WGA to CSA also permitted evaluation of scaffolds for structural inconsistencies. We looked for instances in which a large section of a scaffold from one assembly matched only one scaffold from the other assembly, but failed to match over the full length of the overlap implied by the matching segments. An initial set of candidates was identified automatically, and then each candidate was inspected by hand. From this process, we identified 31 instances in which the assemblies appear to disagree in a nonlocal fashion. These cases are being further evaluated to determine which assembly is in error and why. In addition, we evaluated local inconsistencies of order or orientation. The following results exclude cases in which one contig in one assembly corresponds to more than one overlapping contig in the other assembly (as long as the order and orientation of the latter agrees with the positions they match in the former). Most of these small rearrangements involved segments on the order of hundreds of base pairs and rarely .1 kbp. We found a total of 295 kbp (0.012%) in the CSA assemblies that were locally inconsistent with the WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were inconsistent with the CSA assembly. The CSA assembly was a few percentage points better in terms of coverage and slightly more consistent than the WGA, because it was in effect performing a few thousand shotgun assemblies of megabase-sized problems, whereas the WGA is performing a shotgun assembly of a gigabase-sized problem. When one considers the increase of two-and-a-half orders of magnitude in problem size, the information loss between the two is remarkably small. Because CSA was logistically easier to deliver and the better of the two results available at the time when downstream analyses needed to be begun, all subsequent analysis was performed on this assembly. 2.6 Mapping scaffolds to the genome The final step in assembling the genome was to order and orient the scaffolds on the chromosomes. We first grouped scaffolds together on the basis of their order in the components from CSA. These grouped scaffolds were reordered by examining residual mate-pairing data between the scaffolds. We next mapped the scaffold groups onto the chromosome using physical mapping data. This step depends on having reliable high-resolution map information such that each scaffold will overlap multiple markers. There are two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC clones developed at Washington University (45). Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most useful for mapping scaffolds. The two different mapping approaches are complementary to one another. The fingerprint maps should have better local order because they were built by comparison of overlapping BAC clones. On the other hand, GM99 should have a more reliable long-range order, because the framework markers were derived from well-validated genetic maps. Both types of maps were used as a reference for human curation of the components that were the input to the regional assembly, but they did not determine the order of sequences produced by the assembler. In order to determine the effectiveness of the fingerprint maps and GM99 for mapping scaffolds, we first examined the reliability of these maps by comparison with large scaffolds. Only 1% of the STS markers on the 10 largest scaffolds (those .9 Mbp) were mapped on a different chromosome on GM99. Two percent of the STS markers disagreed in position by more than five framework bins. However, for the fingerprint maps, a 2% chromosome discrepancy was observed, and on average 23.8% of BAC locations in the scaffold sequence disagreed with fingerprint map placement by more than five BACs. When further examining the source of discrepancy, it was found that most of the discrepancy came from 4 of the 10 scaffolds, indicating this there is variation in the quality of either the map or the scaffolds. All four scaffolds were assembled, as well as the other six, as judged by clone coverage analysis, and showed the same low discrepancy rate to GM99, and thus we concluded that the fingerprint map global order in these cases was not reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of STSs were discordant by more than five framework bins), but a lower discordance rate with the fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs). This observation agrees with the clone coverage analysis (46) that Celera scaffold construction was better supported by long-range mate pairs in larger scaffolds than in small scaffolds. We created two orderings of Celera scaffolds on the basis of the markers (BAC or STS) on these maps. Where the order of scaffolds agreed between GM99 and the WashU BAC map, we had a high degree of confidence that that order was correct; these scaffolds were termed “anchor scaffolds.” Only scaffolds with a low overall discrepancy rate with both maps were considered anchor scaffolds. Scaffolds in GM99 bins were allowed to permute in their order to match WashU ordering, provided they did not violate their framework orders. Orientation of individual scaffolds was determined by the presence of multiple mapped markers with consistent order. Scaffolds with only one marker have insufficient information to assign orientation. We found 70.1% of the genome in anchored scaffolds, more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution than the WashU map, a number of scaffolds without STS matches could be ordered relative to the anchored scaffolds because they included sequence from the same or adjacent BACs on the WashU map. On the other hand, because of occasional WashU global ordering discrepancies, a number of scaffolds determined to be “unmappable” on the WashU map could be ordered relative to the anchored scaffolds Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total sequence is indicated. T H E H UMAN G ENOME 1314 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME ith GM99. These scaffolds were termed 2.7 Assembly and validation analysis ness is to measure the content of an independent ordered scaffolds. We found that 13.9% of We analyzed the assembly of the genome set of sequence data in the assembly. We com- the assembly could be ordered by these ad- from the perspectives of pleteness pared 48, 938 STS markers from Genemap99 ditional methods. and thus 84.0% of the (amount of coverage of the genome) and (51) to the scaffolds. Because these markers ome was ordered unambiguously. correctness(the structural accuracy of the were not used in the assembly processes, they Next, all scaffolds that could be placed, order and orientation and the consensus se- provided a truly independent measure of com- but not ordered, between anchors were as- quence of the assembly). pleteness ePCR (53)and BLAST (54)were signed to the interval between the anchored Completeness Completeness is defined as used to locate STSs on the assembled genome. affolds and were deemed to be"bound- the percentage of the euchromatic sequence We found 44, 524(91%)of the STSs in the ed"between them. For example, small scaf- represented in the assembly. This cannot be mapped genome. An additional 2648 markers folds having STS hits from the same Gene- known with absolute certainty until the eu- (5.4%) were found by searching the unas- Map bin or hitting the same BAC cannot be chromatin sequence has been completed. sembled data or"chaff. We identified 1283 ordered relative to each other, but can be However, it is possible to estimate complete- STS markers(2.6%)not found in either Celera assigned a placement boundary relative to ness on the basis of (i) the estimated sizes of sequence or BAc data as of September 2000, other anchored or ordered scaffolds. The intrascaffold gaps;(ii) coverage of the two raising the possibility that these markers may remaining scaffolds either had no localiza- published chromosomes, 21 and 22(48, 49): not be of human origin. If that were the tion information, conflicting information, and (iii) analysis of the percentage of an the Celera assembled sequence would represent or could only be assigned to a generic independent set of random sequences(STs 93.4% of the human genome and the unas- chromosome location. Using the above ap- markers) contained in the assembly. The sembled data 5.5%. for a total of 98.9% cover roaches, 98% of the genome was an- whole-genome libraries contain heterochro- age. Similarly, we compared CSA ag matic sequence and, although no attempt has 36,678 TNG radiation hybrid markers(5a) 8 Finally, we assigned a location for each been made to assemble it, there may be in- using the same method. We found that 32,371 8 bored, ordered, or bounded spreading out the scaffolds per chromosome. gions of heterochromatin as were observed in CSA scaffolds, with 2055 markers(5.6%) We assumed that the remaining unmapped Drosophila(50, 51) found in the remainder. This gave a 94%cov- scaffolds, constituting 2% of the genome The sequences of human chromosomes 21 erage of the genome through another genome- were distributed evenly across the genome. and 22 have been completed to high quality wide survey By dividing the sum of unmapped scaffold and published(48, 49). Although this se- Correctness Correctness is defined as the o lengths with the sum of the number of quence served as input to the assembler, the structural and sequence accuracy of the as- 5 apped scaffolds, we arrived at an estimate finished sequence was shredded into a shot- embly. Because the source sequences for the of interscaffold gap of 1483 bp. This gap was gun data set so that the assembler had the Celera data and the gen Bank data are from used to separate all the scaffolds on each opportunity to assemble it differently from different individuals, we could not directly o chromosome and to assign an offset in the the original sequence in the case of structural compare the consensus sequence of the chromosome polymorphisms or assembly errors in the During the scaffold-mapping effort, we en- BAC data. In particular, the assembler must Table Table 4. Summary of scaffold mapping Scaffolds tional quality assessment and validation analy- scale of components(generally multimega- of confidence(anchored scaffolds have the highest sis. At least 978(3% of 33, 173) BACs were base in size), and so this comparison reveals confidence; unmapped scaffolds have the lowest) believed to have sequence data from more than the level to which the assembler resolves Anchored scaffolds were consistently ordered by one location in the genome (7). This is con- repeats In certain areas, the assembly struc- folds were consistently ordered by at least one of 5 the following: the WashU BAC map, GM99,o above in the Assembly Strategies chromosomes 21 and 22(see below). The thus o, ositions within the CSa assembly and "finished"sequence differently on the basis maps, but their placements were adjacent to a 9 STSs to unique locations in the asembly be. 22 sequences. We examined the reasons why below b affolds had, at most, a chromosome Likewise, it was not always possible to assign more segments than the chromosome 21 and mapped The scaffold subcategories are given cause of genome duplications, repetitive ele- there are more gaps in the Celera sequence than in chromosomes 21 and 22 and expect Because of the time required for an ex- that they may be typical of gaps in other scaffold Number Length(bp) haustive search for a perfect overlap, CSa regions of the genome. In the Celera assem generated 21, 607 intrascaffold gaps where bly, there are 25 scaffolds, each containing at the mate-pair data suggested that the contigs least 10 kb of sequence, that collectively span 1,5261,860,676,67670 should overlap, but no overlap was found. 94.3% of chromosome 21. Sixty-two scaf- 1,246185208864570 These gaps were defined as a fixed 50 bp in folds span 95.7% of chromosome 22. The length and make up 18.6% of the total total length of the gaps remaining in the 839329633,16 12 116, 442 gaps in the CSa assembly. Celera assembly for these two chromosomes Unoriented 1,162 39,602.691 2 We chose not to use the order of exons is 3. 4 Mbp. These gap sequences were ana- bounde 38241368753,46314 implied in cDNA or EST data as a way of lyzed by RepeatMasker and by searching 7,453274,536,42410 ordering scaffolds. The rationale for not us- against the entire genome assembly (52). Unoriented 30, 788 94,217,039 4 ng this data was that doing so would have About 50% of the gap sequence consisted of Unmapped 1182355313,7372 biased certain regions of the assembly by common repetitive elements identified by Re 28125058440.1 arranging scaffolds to fit the transcript data peatMasker; more than half of the remainder ind made validation of both the assembly and was lower copy number repeat elements gene definition processes more difficult A more global way of assessing complete www.sciencemagorgSciEnceVol29116FebRuarY2001 1315
with GM99. These scaffolds were termed “ordered scaffolds.” We found that 13.9% of the assembly could be ordered by these additional methods, and thus 84.0% of the genome was ordered unambiguously. Next, all scaffolds that could be placed, but not ordered, between anchors were assigned to the interval between the anchored scaffolds and were deemed to be “bounded” between them. For example, small scaffolds having STS hits from the same GeneMap bin or hitting the same BAC cannot be ordered relative to each other, but can be assigned a placement boundary relative to other anchored or ordered scaffolds. The remaining scaffolds either had no localization information, conflicting information, or could only be assigned to a generic chromosome location. Using the above approaches, ;98% of the genome was anchored, ordered, or bounded. Finally, we assigned a location for each scaffold placed on the chromosome by spreading out the scaffolds per chromosome. We assumed that the remaining unmapped scaffolds, constituting 2% of the genome, were distributed evenly across the genome. By dividing the sum of unmapped scaffold lengths with the sum of the number of mapped scaffolds, we arrived at an estimate of interscaffold gap of 1483 bp. This gap was used to separate all the scaffolds on each chromosome and to assign an offset in the chromosome. During the scaffold-mapping effort, we encountered many problems that resulted in additional quality assessment and validation analysis. At least 978 (3% of 33,173) BACs were believed to have sequence data from more than one location in the genome (47). This is consistent with the bactig chimerism analysis reported above in the Assembly Strategies section. These BACs could not be assigned to unique positions within the CSA assembly and thus could not be used for ordering scaffolds. Likewise, it was not always possible to assign STSs to unique locations in the assembly because of genome duplications, repetitive elements, and pseudogenes. Because of the time required for an exhaustive search for a perfect overlap, CSA generated 21,607 intrascaffold gaps where the mate-pair data suggested that the contigs should overlap, but no overlap was found. These gaps were defined as a fixed 50 bp in length and make up 18.6% of the total 116,442 gaps in the CSA assembly. We chose not to use the order of exons implied in cDNA or EST data as a way of ordering scaffolds. The rationale for not using this data was that doing so would have biased certain regions of the assembly by rearranging scaffolds to fit the transcript data and made validation of both the assembly and gene definition processes more difficult. 2.7 Assembly and validation analysis We analyzed the assembly of the genome from the perspectives of completeness (amount of coverage of the genome) and correctness (the structural accuracy of the order and orientation and the consensus sequence of the assembly). Completeness. Completeness is defined as the percentage of the euchromatic sequence represented in the assembly. This cannot be known with absolute certainty until the euchromatin sequence has been completed. However, it is possible to estimate completeness on the basis of (i) the estimated sizes of intrascaffold gaps; (ii) coverage of the two published chromosomes, 21 and 22 (48, 49); and (iii) analysis of the percentage of an independent set of random sequences (STS markers) contained in the assembly. The whole-genome libraries contain heterochromatic sequence and, although no attempt has been made to assemble it, there may be instances of unique sequence embedded in regions of heterochromatin as were observed in Drosophila (50, 51). The sequences of human chromosomes 21 and 22 have been completed to high quality and published (48, 49). Although this sequence served as input to the assembler, the finished sequence was shredded into a shotgun data set so that the assembler had the opportunity to assemble it differently from the original sequence in the case of structural polymorphisms or assembly errors in the BAC data. In particular, the assembler must be able to resolve repetitive elements at the scale of components (generally multimegabase in size), and so this comparison reveals the level to which the assembler resolves repeats. In certain areas, the assembly structure differs from the published versions of chromosomes 21 and 22 (see below). The consequence of the flexibility to assemble “finished” sequence differently on the basis of Celera data resulted in an assembly with more segments than the chromosome 21 and 22 sequences. We examined the reasons why there are more gaps in the Celera sequence than in chromosomes 21 and 22 and expect that they may be typical of gaps in other regions of the genome. In the Celera assembly, there are 25 scaffolds, each containing at least 10 kb of sequence, that collectively span 94.3% of chromosome 21. Sixty-two scaffolds span 95.7% of chromosome 22. The total length of the gaps remaining in the Celera assembly for these two chromosomes is 3.4 Mbp. These gap sequences were analyzed by RepeatMasker and by searching against the entire genome assembly (52). About 50% of the gap sequence consisted of common repetitive elements identified by RepeatMasker; more than half of the remainder was lower copy number repeat elements. A more global way of assessing completeness is to measure the content of an independent set of sequence data in the assembly. We compared 48,938 STS markers from Genemap99 (51) to the scaffolds. Because these markers were not used in the assembly processes, they provided a truly independent measure of completeness. ePCR (53) and BLAST (54) were used to locate STSs on the assembled genome. We found 44,524 (91%) of the STSs in the mapped genome. An additional 2648 markers (5.4%) were found by searching the unassembled data or “chaff.” We identified 1283 STS markers (2.6%) not found in either Celera sequence or BAC data as of September 2000, raising the possibility that these markers may not be of human origin. If that were the case, the Celera assembled sequence would represent 93.4% of the human genome and the unassembled data 5.5%, for a total of 98.9% coverage. Similarly, we compared CSA against 36,678 TNG radiation hybrid markers (55a) using the same method. We found that 32,371 markers (88%) were located in the mapped CSA scaffolds, with 2055 markers (5.6%) found in the remainder. This gave a 94% coverage of the genome through another genomewide survey. Correctness. Correctness is defined as the structural and sequence accuracy of the assembly. Because the source sequences for the Celera data and the GenBank data are from different individuals, we could not directly compare the consensus sequence of the asTable 4. Summary of scaffold mapping. Scaffolds were mapped to the genome with different levels of confidence (anchored scaffolds have the highest confidence; unmapped scaffolds have the lowest). Anchored scaffolds were consistently ordered by the WashU BAC map and GM99. Ordered scaffolds were consistently ordered by at least one of the following: the WashU BAC map, GM99, or component tiling path. Bounded scaffolds had order conflicts between at least two of the external maps, but their placements were adjacent to a neighboring anchored or ordered scaffold. Unmapped scaffolds had, at most, a chromosome assignment. The scaffold subcategories are given below each category. Mapped scaffold category Number Length (bp) % Total length Anchored 1,526 1,860,676,676 70 Oriented 1,246 1,852,088,645 70 Unoriented 280 8,588,031 0.3 Ordered 2,001 369,235,857 14 Oriented 839 329,633,166 12 Unoriented 1,162 39,602,691 2 Bounded 38,241 368,753,463 14 Oriented 7,453 274,536,424 10 Unoriented 30,788 94,217,039 4 Unmapped 11,823 55,313,737 2 Known 281 2,505,844 0.1 chromosome Unknown chromosome 11,542 52,807,893 2 T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1315 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME embly against other finished sequence for those that were correct (Table 5). The stan- 5 September 2000(30, 55b).In this latter determining sequencing accuracy at the nu- dard deviations for all Celera libraries were case, Celera mate pairs had to be mapped to leotide level, although this has been done for quite small, less than 15% of the insert the PFP assembly. To avoid mapping errors identifying polymorphisms as described in length, with the exception of a few 50-kbp due to high-fidelity repeats, the only pairs Section 6. The accuracy of the consensus libraries. The 2- and 10-kbp libraries con- mapped were those for which both reads quence is at least 99.96% on the basis of a tained less than 2% invalid mate pairs, where- matched at only one location with less than statistical estimate derived from the quality as the 50-kbp libraries were somewhat higher 6% differences a threshold was set such that (10%). Thus, although the mate-pair infor- sets of five or more simultaneously invalid The structural consistency of the assembly mation was not perfect, its accuracy was such mate pairs indicated a potential breakpoint, can be measured by mate-pair analysis. In a that measuring valid, misoriented, and mis- where the construction of the two assemblies correct assembl ated pair of se- separated pairs with respect to a given assem- differed. The graphic comparison of the Csa quencing reads should be located on the con- bly was deemed to be a reliable instrument chromosome 21 assembly with the published sensus sequence with the correct separation for validation purposes, especially when sev- sequence( Fig 6A)serves as a validation of and orientation between the pairs. A pair is eral mate pairs confirm or deny an ordering. this methodology. Blue tick marks in the termed"valid"when the reads are in the The clone coverage of the genome was panels indicate breakpoints. There were a correct orientation and the distance between 39x, meaning that any given base pair was, similar (small) number of breakpoints them is within the mean 3 standard devi- on average, contained in 39 clones or, equiv- both chromosome sequences. The exception ations of the distribution of insert sizes of the alently, spanned by 39 mate-paired reads. was 12 sets of scaffolds in the Celera assem- library from which the pair was sampled. A Areas of low clone coverage or areas with a bly (a total of 3% of the chromosome length air is termed"misoriented"when the reads high proportion of invalid mate pairs would in 212 single-contig scaffolds) that were 2 are not correctly oriented, and is termed"mis- indicate potential assembly problems. We mapped to the wrong positions because they R separated"when the distance between the computed the coverage of each base in the were too small to be mapped reliably. Figures reads is not in the correct range but the reads assembly by valid mate pairs(Table 6). In 6 and 7 and Table 6 illustrate the mate-pair o are correctly oriented. The mean t the stan- summary, for scaffolds >30 kbp in length, differences and breakpoints between the two dard deviation of each library used by the less than 1% of the Celera assembly was in assemblies. There was a higher percentage in o assembler was determined as described regions of less than 3X clone coverage. Thus, misoriented and misseparated mate pairs above. To validate these, we examined all more than 99% of the assembly, including the large-insert libraries(50 kbp and bac 8 reads mapped to the finished sequence of order and orientation, is strongly supported ends) than in the small-insert libraries in both chromosome 21 (48)and determined how by this measure alone assemblies(Table 6). The large-insert librar- 5 any incorrect mate pairs there were as a We examined the locations and number of ies are more likely to identify discrepancies result of laboratory tracking errors and chi- all misoriented and misseparated mates. In ply because they span a larger segment of merism(two different segments of the ge- addition to doing this analysis on the CSa the genome. The graphic comparison be- o nome cloned into the same plasmid), and how assembly(as of 1 October 2000), we also tween the two assemblies for chromosomes E tight the distribution of insert sizes was for performed a study of the PFP assembly as of (Fig. 6, B and c) shows that there are many Table 5. Mate-pair validation. Celera fragment es were mapped to of mate pairs tested). If the two mates had incorrect relative orienta- the published sequence of chromosome 21. Each mate pair uniquely tion or placement, they were considered invalid(number of invalid mate lapped was evaluated for correct orientation and placement (number pairs) Chromosome 21 Genome 33sE= Library SD mate (bp) tested (%) 2 kbp 2081 3.642 38 2,082 1913 79 8029 1923 118 2,166 34567890 11385 4.319 11370 7,355 3197 9635 5573 10223 928 64888 2.747 5504 5.834 52034 7312 14.1 ,871 51,498 6588 52,282 7454 14.3 7454 1234567 46616 737 158 215 45,418 55788 2,244 11 53.062 27.1 48931 9813 20.1 47845 4.774 47924 106027 27.778 4.8 54973 16415 119 176,50 2.7 2 1316 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
sembly against other finished sequence for determining sequencing accuracy at the nucleotide level, although this has been done for identifying polymorphisms as described in Section 6. The accuracy of the consensus sequence is at least 99.96% on the basis of a statistical estimate derived from the quality values of the underlying reads. The structural consistency of the assembly can be measured by mate-pair analysis. In a correct assembly, every mated pair of sequencing reads should be located on the consensus sequence with the correct separation and orientation between the pairs. A pair is termed “valid” when the reads are in the correct orientation and the distance between them is within the mean 6 3 standard deviations of the distribution of insert sizes of the library from which the pair was sampled. A pair is termed “misoriented” when the reads are not correctly oriented, and is termed “misseparated” when the distance between the reads is not in the correct range but the reads are correctly oriented. The mean 6 the standard deviation of each library used by the assembler was determined as described above. To validate these, we examined all reads mapped to the finished sequence of chromosome 21 (48) and determined how many incorrect mate pairs there were as a result of laboratory tracking errors and chimerism (two different segments of the genome cloned into the same plasmid), and how tight the distribution of insert sizes was for those that were correct (Table 5). The standard deviations for all Celera libraries were quite small, less than 15% of the insert length, with the exception of a few 50-kbp libraries. The 2- and 10-kbp libraries contained less than 2% invalid mate pairs, whereas the 50-kbp libraries were somewhat higher (;10%). Thus, although the mate-pair information was not perfect, its accuracy was such that measuring valid, misoriented, and misseparated pairs with respect to a given assembly was deemed to be a reliable instrument for validation purposes, especially when several mate pairs confirm or deny an ordering. The clone coverage of the genome was 393, meaning that any given base pair was, on average, contained in 39 clones or, equivalently, spanned by 39 mate-paired reads. Areas of low clone coverage or areas with a high proportion of invalid mate pairs would indicate potential assembly problems. We computed the coverage of each base in the assembly by valid mate pairs (Table 6). In summary, for scaffolds .30 kbp in length, less than 1% of the Celera assembly was in regions of less than 33 clone coverage. Thus, more than 99% of the assembly, including order and orientation, is strongly supported by this measure alone. We examined the locations and number of all misoriented and misseparated mates. In addition to doing this analysis on the CSA assembly (as of 1 October 2000), we also performed a study of the PFP assembly as of 5 September 2000 (30, 55b). In this latter case, Celera mate pairs had to be mapped to the PFP assembly. To avoid mapping errors due to high-fidelity repeats, the only pairs mapped were those for which both reads matched at only one location with less than 6% differences. A threshold was set such that sets of five or more simultaneously invalid mate pairs indicated a potential breakpoint, where the construction of the two assemblies differed. The graphic comparison of the CSA chromosome 21 assembly with the published sequence (Fig. 6A) serves as a validation of this methodology. Blue tick marks in the panels indicate breakpoints. There were a similar (small) number of breakpoints on both chromosome sequences. The exception was 12 sets of scaffolds in the Celera assembly (a total of 3% of the chromosome length in 212 single-contig scaffolds) that were mapped to the wrong positions because they were too small to be mapped reliably. Figures 6 and 7 and Table 6 illustrate the mate-pair differences and breakpoints between the two assemblies. There was a higher percentage of misoriented and misseparated mate pairs in the large-insert libraries (50 kbp and BAC ends) than in the small-insert libraries in both assemblies (Table 6). The large-insert libraries are more likely to identify discrepancies simply because they span a larger segment of the genome. The graphic comparison between the two assemblies for chromosome 8 (Fig. 6, B and C) shows that there are many Table 5. Mate-pair validation. Celera fragment sequences were mapped to the published sequence of chromosome 21. Each mate pair uniquely mapped was evaluated for correct orientation and placement (number of mate pairs tested). If the two mates had incorrect relative orientation or placement, they were considered invalid (number of invalid mate pairs). Library type Library no. Chromosome 21 Genome Mean insert size (bp) SD (bp) SD/ mean (%) No. of mate pairs tested No. of invalid mate pairs % invalid Mean insert size (bp) SD (bp) SD/ mean (%) 2 kbp 1 2,081 106 5.1 3,642 38 1.0 2,082 90 4.3 2 1,913 152 7.9 28,029 413 1.5 1,923 118 6.1 3 2,166 175 8.1 4,405 57 1.3 2,162 158 7.3 10 kbp 4 11,385 851 7.5 4,319 80 1.9 11,370 696 6.1 5 14,523 1,875 12.9 7,355 156 2.1 14,142 1,402 9.9 6 9,635 1,035 10.7 5,573 109 2.0 9,606 934 9.7 7 10,223 928 9.1 34,079 399 1.2 10,190 777 7.6 50 kbp 8 64,888 2,747 4.2 16 1 6.3 65,500 5,504 8.4 9 53,410 5,834 10.9 914 170 18.6 53,311 5,546 10.4 10 52,034 7,312 14.1 5,871 569 9.7 51,498 6,588 12.8 11 52,282 7,454 14.3 2,629 213 8.1 52,282 7,454 14.3 12 46,616 7,378 15.8 2,153 215 10.0 45,418 9,068 20.0 13 55,788 10,099 18.1 2,244 249 11.1 53,062 10,893 20.5 14 39,894 5,019 12.6 199 7 3.5 36,838 9,988 27.1 BES 15 48,931 9,813 20.1 144 10 6.9 47,845 4,774 10.0 16 48,130 4,232 8.8 195 14 7.2 47,924 4,581 9.6 17 106,027 27,778 26.2 330 16 4.8 152,000 26,600 17.5 18 160,575 54,973 34.2 155 8 5.2 161,750 27,000 16.7 19 164,155 19,453 11.9 642 44 6.9 176,500 19,500 11.05 Sum 102,894 2,768 2.7 (mean 5 2.7) T H E H UMAN G ENOME 1316 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME ore breakpoints for the PFP assembly than tive splicing and alternative transcription ini- be done with reasonable accuracy when a for the Celera assembly. Figure 7 shows the tiation and termination sites. Our cells are full-length cDNA has been sequenced or a breakpoint map(blue tick marks) for both able to discern within the billions of base highly homologous protein sequence is assemblies of each chromosome in a side-by- pairs of the genomic DNa the signals for known. De novo gene prediction, although side fashion. The order and orientation of initiating transcription and for splicing to- less accurate, is the only way to find gene Celera's assembly shows substantially fewer gether exons separated by a few or hundreds that are not represented by homologous pro- breakpoints except on the two finished chro- of thousands of base pairs. The first step in teins or ESTs. The following section de mosomes. Figure 7 also depicts large gaps characterizing the genome is to define the scribes the methods we have developed to (10 kbp) in both assemblies as red tick structure of each gene and each transcription address these problems for the prediction of marks. In the CSa assembly, the size of all unit protein-coding genes gaps have been estimated on the basis of the The number of protein-coding genes in We have developed a rule-based expert sys- mate-pair data Breakpoints can be caused by mammals has been controversial from the tem, called Otto, to identify and characterize structural polymorphisms, because the two outset. Initial estimates based on reassocia- genes in the human genome(60). Otto attempts assemblies were derived from different hu- tion data placed it between 30,000 to 40,000, to simulate in software the process that a human man genomes. They also reflect the unfin- whereas later estimates from the brain were annotator uses to identify a gene and refine its ished nature of both genome assemblies. 100,000(56). More recent data from both structure. In the process of annotating a region the corporate and public sectors, based on of the genome, a human curator examines the 3 Gene prediction and annotation extrapolations from EST, CpG island, and evidence provided by the computational pipe. Summary. To enumerate the gene inventory, transcript density-based extrapolations, have line(described below) and examines how var we developed an integrated, evidence-based not reduced this variance. The highest recent ious types of evidence relate to one another. A approach named Otto. The evidence used to number of 142, 634 genes emanates from a curator puts different levels of confidence in includes regions conserved between the based on a combination of est data and the certain patterns of evidence to support gene N mouse and human genomes, similarity to association of ESTs with CpG islands (57). annotation. For example, a curator may ex ESTs or other mRNA-derived data, or simi- In stark contrast are three quite different, and amine homology to a number of ESts and larity to other proteins. A comparison of otto much lower estimates: one of -35,000 genes evaluate whether or not they can be connect- a (combined Otto-RefSeq and Otto homology) derived with genome-wide est data and ed into a longer, virtual mRNA. The curator 0 with Genscan, a standard gene-prediction al- sampling procedures in conjunction with would also evaluate the strength of the simi- gorithm,showed greater sensitivity(0.78 ver- chromosome 22 data (58): another of 28,000 larity and the contiguity of the match, in a sus 0.50)and specificity (0.93 versus 0.63)of to 34, 000 genes derived with a comparative essence asking whether any ESTs cross Otto in the ability to define gene structure. methodology involving sequence conserva- splice- junctions and whether the edges of Otto-predicted genes were complemented tion between humans and the puffer fish Te- putative exons have consensus splice sites. with a set of genes from three gene-prediction troodon nigroviridis (59): and a figure of This kind of manual annotation process was E programs that exhibited weaker, but still sig- 35,000 genes, which was derived simply by used to annotate the Drosophila genome nificant, evidence that they may be ex- extrapolating from the density of 770 known The Otto system can promote observed pressed. Conservative criteria, requiring at and predicted genes in the 67 Mbp of chro- evidence to a gene annotation in one of two least two lines of evidence, were used to mosomes 21 and 22, to the approximately ways. First, if the evidence includes a high define a set of 26, 383 genes with good con- bp euchromatic genome. quality match to the sequence of a known fidence that were used for more detailed anal The problem of computational identifica- gene [here defined as a human gene repre Extensive manual curation to establish pre- sequence can be divided into two phases. The database(6)], then Otto can promote this to s cise characterization of gene structure will be first is to partition the sequence into segments a gene annotation In the second method, Otto 8 necessary to improve the results from this that are likely to correspond to individual evaluates a broad spectrum of evidence and genes. This is not trivial and is a weakness of determines if this evidence is adequate to 9 ost de novo gene-finding algorithms support promotion to a g 3.1 Automated gene annotation also critical to determining the number of These processes are described below a gene is a locus of cotranscribed genes in the human gene inventory. The sec- Initially, gene boundaries are predicted on single gene may give rise to multip ond challenge is to construct a gene model the basis of examination of sets of overlap- scripts, and thus multiple distinct that reflects the probable structure of the ping protein and EST matches generated by a with multiple functions, by means of transcript(s)encoded in the region. This can computational pipeline ( 62). This pipeline searches the scaffold sequences against pro- Table 6. Genome-wide mate pair analysis of compartmentalized shotgun(CSA)and PFP assemblies. tein, EST, and genome-sequence databases to define regions of sequence similarity and runs three de novo gene-prediction programs To identify likely gene boundaries, re- Genome % gions of the genome were partitioned by Otto is- on the basis of sequence matches identified pirated paratedf by BLaST. Each of the database sequen matched in the region under analysis compared by an algorithm that takes into 223 13.5 account both coordinates of the matching se- .1 193 type (e.g 73 5.9 protein, EST, and so forth). The results were DataforindividualchromosomescanbefoundinWebfig3onScienceOnlineatwww.sciencemagorg/cgi/content/usedtogroupthematchesintobinsofrelated fuU/291/5507/1304/D MAtes are misseparated if their distance is >3 SD from the mean library size. sequences that may define a gene and identify www.sciencemagorgSciEnceVol29116FebRuarY2001 1317
more breakpoints for the PFP assembly than for the Celera assembly. Figure 7 shows the breakpoint map (blue tick marks) for both assemblies of each chromosome in a side-byside fashion. The order and orientation of Celera’s assembly shows substantially fewer breakpoints except on the two finished chromosomes. Figure 7 also depicts large gaps (.10 kbp) in both assemblies as red tick marks. In the CSA assembly, the size of all gaps have been estimated on the basis of the mate-pair data. Breakpoints can be caused by structural polymorphisms, because the two assemblies were derived from different human genomes. They also reflect the unfinished nature of both genome assemblies. 3 Gene Prediction and Annotation Summary. To enumerate the gene inventory, we developed an integrated, evidence-based approach named Otto. The evidence used to increase the likelihood of identifying genes includes regions conserved between the mouse and human genomes, similarity to ESTs or other mRNA-derived data, or similarity to other proteins. A comparison of Otto (combined Otto-RefSeq and Otto homology) with Genscan, a standard gene-prediction algorithm, showed greater sensitivity (0.78 versus 0.50) and specificity (0.93 versus 0.63) of Otto in the ability to define gene structure. Otto-predicted genes were complemented with a set of genes from three gene-prediction programs that exhibited weaker, but still significant, evidence that they may be expressed. Conservative criteria, requiring at least two lines of evidence, were used to define a set of 26,383 genes with good confidence that were used for more detailed analysis presented in the subsequent sections. Extensive manual curation to establish precise characterization of gene structure will be necessary to improve the results from this initial computational approach. 3.1 Automated gene annotation A gene is a locus of cotranscribed exons. A single gene may give rise to multiple transcripts, and thus multiple distinct proteins with multiple functions, by means of alternative splicing and alternative transcription initiation and termination sites. Our cells are able to discern within the billions of base pairs of the genomic DNA the signals for initiating transcription and for splicing together exons separated by a few or hundreds of thousands of base pairs. The first step in characterizing the genome is to define the structure of each gene and each transcription unit. The number of protein-coding genes in mammals has been controversial from the outset. Initial estimates based on reassociation data placed it between 30,000 to 40,000, whereas later estimates from the brain were .100,000 (56). More recent data from both the corporate and public sectors, based on extrapolations from EST, CpG island, and transcript density–based extrapolations, have not reduced this variance. The highest recent number of 142,634 genes emanates from a report from Incyte Pharmaceuticals, and is based on a combination of EST data and the association of ESTs with CpG islands (57). In stark contrast are three quite different, and much lower estimates: one of ;35,000 genes derived with genome-wide EST data and sampling procedures in conjunction with chromosome 22 data (58); another of 28,000 to 34,000 genes derived with a comparative methodology involving sequence conservation between humans and the puffer fish Tetraodon nigroviridis (59); and a figure of 35,000 genes, which was derived simply by extrapolating from the density of 770 known and predicted genes in the 67 Mbp of chromosomes 21 and 22, to the approximately 3-Gbp euchromatic genome. The problem of computational identification of transcriptional units in genomic DNA sequence can be divided into two phases. The first is to partition the sequence into segments that are likely to correspond to individual genes. This is not trivial and is a weakness of most de novo gene-finding algorithms. It is also critical to determining the number of genes in the human gene inventory. The second challenge is to construct a gene model that reflects the probable structure of the transcript(s) encoded in the region. This can be done with reasonable accuracy when a full-length cDNA has been sequenced or a highly homologous protein sequence is known. De novo gene prediction, although less accurate, is the only way to find genes that are not represented by homologous proteins or ESTs. The following section describes the methods we have developed to address these problems for the prediction of protein-coding genes. We have developed a rule-based expert system, called Otto, to identify and characterize genes in the human genome (60). Otto attempts to simulate in software the process that a human annotator uses to identify a gene and refine its structure. In the process of annotating a region of the genome, a human curator examines the evidence provided by the computational pipeline (described below) and examines how various types of evidence relate to one another. A curator puts different levels of confidence in different types of evidence and looks for certain patterns of evidence to support gene annotation. For example, a curator may examine homology to a number of ESTs and evaluate whether or not they can be connected into a longer, virtual mRNA. The curator would also evaluate the strength of the similarity and the contiguity of the match, in essence asking whether any ESTs cross splice-junctions and whether the edges of putative exons have consensus splice sites. This kind of manual annotation process was used to annotate the Drosophila genome. The Otto system can promote observed evidence to a gene annotation in one of two ways. First, if the evidence includes a highquality match to the sequence of a known gene [here defined as a human gene represented in a curated subset of the RefSeq database (61)], then Otto can promote this to a gene annotation. In the second method, Otto evaluates a broad spectrum of evidence and determines if this evidence is adequate to support promotion to a gene annotation. These processes are described below. Initially, gene boundaries are predicted on the basis of examination of sets of overlapping protein and EST matches generated by a computational pipeline (62). This pipeline searches the scaffold sequences against protein, EST, and genome-sequence databases to define regions of sequence similarity and runs three de novo gene-prediction programs. To identify likely gene boundaries, regions of the genome were partitioned by Otto on the basis of sequence matches identified by BLAST. Each of the database sequences matched in the region under analysis was compared by an algorithm that takes into account both coordinates of the matching sequence, as well as the sequence type (e.g., protein, EST, and so forth). The results were used to group the matches into bins of related sequences that may define a gene and identify Table 6. Genome-wide mate pair analysis of compartmentalized shotgun (CSA) and PFP assemblies.* Genome library CSA PFP % valid % misoriented % misseparated† % valid % misoriented % misseparated† 2 kbp 98.5 0.6 1.0 95.7 2.0 2.3 10 kbp 96.7 1.0 2.3 81.9 9.6 8.6 50 kbp 93.9 4.5 1.5 64.2 22.3 13.5 BES 94.1 2.1 3.8 62.0 19.3 18.8 Mean 97.4 1.0 1.6 87.3 6.8 5.9 *Data for individual chromosomes can be found in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/ full/291/5507/1304/DC1. †Mates are misseparated if their distance is .3 SD from the mean library size. T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1317 on September 27, 2009 www.sciencemag.org Downloaded from